
Performance Observations & Execution Characteristics

The DataForge engine has been validated across local and cloud execution environments using real-world datasets and production-representative configurations. All results reflect sustained throughput under controlled conditions and are reproducible with comparable inputs.

Performance Summary

Validated results across execution environments and target systems.

  • 2,516,818 rows/sec · 86.0 MB/s: Local PostgreSQL ingestion (75.8M rows · 30.1s elapsed · API execution path · warmed system conditions)
  • 883,017 rows/sec: Cloud SQL (Enterprise Plus · 8 vCPU / 64 GB), executed via Cloud Run Jobs
  • ~15,000 rows/sec: Same Cloud SQL configuration, executed via Cloud Run Services (CPU throttled in the post-response lifecycle)
  • ~300,000 rows/sec: Local SQL Server ingestion, C# native implementation (SqlBulkCopy)
  • 249,353 rows/sec: Local SQL Server ingestion, Go implementation
  • ~8× throughput advantage: PostgreSQL vs SQL Server under comparable conditions and identical source data

Verified Benchmark Record

Sustained end-to-end ingestion over the full dataset under warmed system conditions.

  • 75,814,101 rows inserted
  • 30.1s elapsed time
  • 2,516,818 rows/sec
  • 86.0 MB/s throughput
  • Local PostgreSQL ingestion
  • API execution path
  • Sustained run over full dataset — no synthetic batching
  • Executed under warmed system conditions
  • Previous recorded run: ~2,160,869 rows/sec · ~35.1s elapsed
  • Current result: 2,516,818 rows/sec · 30.1s elapsed
  • Delta: +16.5% throughput · −14% elapsed time
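The stated deltas follow directly from the two recorded runs; a quick arithmetic check using the figures above:

```python
# Recorded runs from the benchmark record above (rows/sec, elapsed seconds).
prev_rate, prev_elapsed = 2_160_869, 35.1
curr_rate, curr_elapsed = 2_516_818, 30.1

throughput_gain = (curr_rate / prev_rate - 1) * 100       # percent change in rate
elapsed_change = (curr_elapsed / prev_elapsed - 1) * 100  # percent change in time (negative = faster)

print(f"{throughput_gain:+.1f}% throughput")  # +16.5%
print(f"{elapsed_change:.0f}% elapsed time")  # -14%
```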

The observed improvement is consistent with enhanced write-ahead logging (WAL) efficiency and improved buffer pool utilization due to warmed execution state. These factors influence the destination system behavior, not the DataForge engine itself.

Performance gains are realized as the target system becomes more efficient. The ingestion engine maintains a stable execution profile.

This benchmark reflects sustained, end-to-end ingestion performance under realistic conditions and is representative of achievable throughput on properly configured systems.

Key Observations

01

The System Is Not Compute-Bound

Across all environments, high throughput is sustained while CPU utilization remains low relative to output and disk I/O pressure stays minimal under streaming conditions.

The limiting factor is not processing capacity, but data movement and destination system behavior.
02

Execution Model Determines Realized Throughput

Cloud results demonstrate a significant divergence between execution modes against identical infrastructure:

  • 883K rows/sec: Cloud Run Jobs (sustained CPU allocation)
  • ~15K rows/sec: Cloud Run Services (lifecycle-throttled after the 202 response)
Throughput is governed by execution model constraints, not infrastructure alone.
03

Destination System Characteristics Define Upper Bounds

The engine's performance ceiling is determined by the characteristics of the target system, not the ingestion pipeline itself.

  • PostgreSQL supports 2.5M+ rows/sec under warmed, optimized conditions via COPY FROM STDIN — single streaming protocol call per batch
  • SQL Server throughput is constrained by TDS bulk copy protocol overhead and UTF-16 encoding at the driver layer
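Part of why COPY FROM STDIN is cheap per batch: the client renders rows into PostgreSQL's tab-delimited COPY text format and streams the whole buffer in a single protocol call. The sketch below (names are illustrative, not from the DataForge codebase) builds that buffer with the standard library; the one-call-per-batch step would then be a driver call such as psycopg2's `copy_expert`.

```python
import io

def rows_to_copy_buffer(rows):
    """Render rows into PostgreSQL COPY text format: tab-delimited
    columns, newline-terminated records, \\N for NULL, with embedded
    backslashes, tabs, and newlines escaped."""
    buf = io.StringIO()
    for row in rows:
        fields = []
        for value in row:
            if value is None:
                fields.append(r"\N")
            else:
                s = str(value)
                s = (s.replace("\\", "\\\\")
                       .replace("\t", "\\t")
                       .replace("\n", "\\n"))
                fields.append(s)
        buf.write("\t".join(fields) + "\n")
    buf.seek(0)
    return buf

# One streaming protocol call per batch, e.g. with psycopg2 (table and
# column names hypothetical):
#   cur.copy_expert("COPY opinions (id, body) FROM STDIN", rows_to_copy_buffer(batch))
```

The contrast with row-at-a-time INSERTs is the point: the per-row cost collapses to formatting bytes into a stream.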
04

Implementation Layer Impacts Throughput Efficiency

Observed differences between implementations on SQL Server under equivalent conditions:

  • ~300K rows/sec: C# · SqlBulkCopy · native protocol integration
  • 249K rows/sec: Go · go-mssqldb · driver and batching overhead
The engine remains constant. The execution layer determines how efficiently its performance is expressed.
05

Local vs Cloud Performance Characteristics

Local execution provides minimal network overhead, direct I/O access, and no container lifecycle constraints — producing the highest achievable throughput ceiling.

Cloud execution introduces network transfer overhead and service model constraints. Despite these differences, properly configured cloud execution remains within the same order of magnitude as local performance.

The 59× gap between Cloud Run Services and Jobs at constant hardware is attributable entirely to CPU allocation policy — not schema, data, or network.
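The multiplier falls directly out of the two measured rates:

```python
jobs_rate = 883_017     # Cloud Run Jobs, rows/sec (measured)
services_rate = 15_000  # Cloud Run Services, rows/sec (approximate)

gap = jobs_rate / services_rate
print(f"~{gap:.0f}x")  # ~59x, on identical hardware
```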
06

Concurrency and Scaling Behavior

Performance improves with concurrency: overhead is amortized across workloads, throughput increases without proportional infrastructure expansion, and per-unit cost decreases as utilization rises.

  • Each Cloud Run Job handles one file at full throughput
  • The concurrency ladder scales linearly against the storage write ceiling
  • Infrastructure footprint remains stable as concurrency increases
Scaling is achieved through concurrency, not infrastructure growth.
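One way to picture the scaling claim, under the illustrative assumptions of a fixed per-job rate and a hard storage write ceiling (the ceiling value below is hypothetical): aggregate throughput grows linearly with concurrent jobs until the storage layer saturates, while the infrastructure footprint stays constant.

```python
PER_JOB_RATE = 883_017        # rows/sec per Cloud Run Job (from the cloud benchmark above)
STORAGE_CEILING = 10_000_000  # rows/sec, hypothetical storage write ceiling

def aggregate_throughput(concurrent_jobs):
    """Linear scaling against a fixed write ceiling: each job runs at
    full rate until the storage layer, not the compute layer, binds."""
    return min(concurrent_jobs * PER_JOB_RATE, STORAGE_CEILING)

for n in (1, 4, 8, 16):
    print(n, aggregate_throughput(n))
# Throughput rises with n; past roughly 11 concurrent jobs the storage
# ceiling, not infrastructure, becomes the limit.
```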

A different approach to throughput.

The distinction is architectural, not incremental.

Traditional systems

Increase throughput by adding infrastructure — more nodes, more services, more coordination layers. Each addition introduces latency, failure surface, and operational cost. The infrastructure is the answer to every throughput question.

vs
DataForge

Increases throughput by increasing concurrency within a stable execution boundary — amortizing overhead, not multiplying it. The infrastructure is fixed; the throughput scales with utilization.

Operational Implications

Prefer sustained compute allocation

Use execution models that allow full CPU access for the duration of the job. Jobs over Services. Bare-metal over throttled containers. The mode is as important as the hardware.

Minimize unnecessary data movement

Network overhead is real and measurable. Colocate the execution environment with the target system where possible. In same-region deployments, testing confirmed no throughput difference between public IP and VPC private IP connectivity.

Align target system selection with requirements

PostgreSQL and SQL Server carry different throughput ceilings under the same workload. System selection is a performance decision. The 8× gap exists at the protocol layer before DataForge is in the picture.

Treat runtime as an optimization layer

Language and driver selection affect how efficiently the engine's output is expressed. Native protocol integration outperforms abstraction layers at scale. The engine is constant; the delivery mechanism is a variable.

All results:

  • Derived from real-world datasets (CourtListener public corpus)
  • Executed under observable, repeatable conditions
  • Reproducible with comparable configurations and hardware class

DataForge consistently delivers high-throughput data movement across environments, with performance bounded primarily by external system constraints rather than internal processing limits.

Ready to talk throughput?

Pilot discussions, investor conversations, enterprise architecture review, or technical deep-dives.