DataForge Benchmark Runbook
Step-by-step instructions for running the DataForge benchmark harness against your own PostgreSQL instance. Covers prerequisites, installation, configuration, execution, and result interpretation.
Engine: dataforge-0.4.x · Harness: bench + bench-report · Requires PostgreSQL 15+
Prerequisites
Installation
The benchmark kit ships as a tarball. Extract it and set execute permissions on the binaries.
Verify the tarball checksums before extracting:

```shell
sha256sum dataforge-bench-20260326-2cbe1b1.tar.gz
# expected: 1b0a3d0b13257eab17b495fa3a2000a66d07b28ad2ceb9a6a44784f3510b438f
sha256sum citation-map-2025-10-31.tar
# expected: 248a8863f2bcdbb13468831db4210532b695af731bd3fcc6e643757874c62ae3
```
```shell
tar -xzf dataforge-bench-20260326-2cbe1b1.tar.gz
cd dataforge-bench-20260326-2cbe1b1/
chmod +x bench bench-report
./bench --version
```
The kit contains:
- bench — ingestion benchmark binary
- bench-report — report generator (reads run output JSON, produces HTML)
- bench.yaml — pre-built configuration template
- A link to this runbook
The DataForge API server starts automatically when bench runs. No separate service startup is required.
Configuration
Edit bench.yaml to point at your PostgreSQL instance. The remaining defaults are tuned for the standard benchmark suite.
```yaml
database:
  host: localhost
  port: 5432
  name: dataforge_bench
  user: bench_user
  password: ""  # set via BENCH_DB_PASSWORD env var

source:
  file: ./data/courtlistener_opinions_2024.csv
  format: csv

benchmark:
  concurrency_presets: [1, 10, 20]   # workers per run
  staging_strategy: per_worker_copy  # each worker gets its own file copy
  output_dir: ./runs/
  tag: ""                            # optional label for this run session
```
Staging strategy: per_worker_copy gives each worker its own copy of the source file, eliminating read I/O contention between workers. This is the correct strategy for characterizing ingestion throughput. It requires additional disk space (N workers × source file size).
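The extra disk requirement is straightforward to budget. A minimal sketch, assuming a hypothetical 2 GiB source file (substitute the actual size of your source, e.g. from `stat -c %s`):

```shell
# Estimate staging disk space for per_worker_copy.
# The 2 GiB source size below is a hypothetical placeholder.
WORKERS=20
SOURCE_BYTES=$((2 * 1024 * 1024 * 1024))   # hypothetical 2 GiB source file
NEEDED_BYTES=$((WORKERS * SOURCE_BYTES))
echo "staging space needed: $((NEEDED_BYTES / 1024 / 1024 / 1024)) GiB"
```

With 20 workers and a 2 GiB source this comes to 40 GiB of staging space on top of the original file.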
Concurrency presets:
- [1] — Baseline characterization. Isolates the engine's single-worker throughput. Use this as your reference point.
- [1, 10] — Baseline + concurrency validation. Shows how the engine scales under parallel load without saturating your storage.
- [1, 10, 20] — Full suite including saturation test. The 20-worker run will approach or reach your storage write ceiling. Disk utilization is expected to be high.
Create the benchmark database and user before the first run:

```shell
psql -U postgres -c "CREATE DATABASE dataforge_bench;"
psql -U postgres -c "CREATE USER bench_user WITH PASSWORD 'your_password';"
psql -U postgres -c "GRANT ALL PRIVILEGES ON DATABASE dataforge_bench TO bench_user;"
```
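Before kicking off a run, it can save time to confirm the new role can actually connect. A hedged sketch (the psql line is commented out so the snippet is safe to paste; uncomment it to actually connect):

```shell
# Hypothetical connectivity smoke test for the benchmark role.
PGHOST=localhost PGPORT=5432 PGUSER=bench_user PGDATABASE=dataforge_bench
CONN="postgresql://${PGUSER}@${PGHOST}:${PGPORT}/${PGDATABASE}"
echo "connection string: $CONN"
# psql "$CONN" -c "SELECT 1;"   # uncomment to test the connection for real
```

If `SELECT 1` fails here, fix authentication before blaming the benchmark harness.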
Running the Benchmark
Run the full benchmark suite using the configured presets:
```shell
export BENCH_DB_PASSWORD=your_password
./bench run --config bench.yaml --tag "my-run-$(date +%Y%m%d)"
```
To run a single concurrency level (e.g., baseline only):
```shell
./bench run --config bench.yaml --concurrency 1
```
Each run produces a JSON result file in ./runs/ named c{N}.json (e.g., c1.json, c10.json). To generate the HTML report:
```shell
./bench-report --runs ./runs/ --out ./bench_report.html --tag "my-run"
```
Approximate run times on the reference hardware: 1-worker ~30s · 10-worker ~140s · 20-worker ~290s. Your times will vary based on hardware, PostgreSQL configuration, and storage class.
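For planning a benchmark session, the full three-preset suite can be budgeted from those reference timings (staging and report generation add overhead on top):

```shell
# Rough wall-clock budget for the full suite, using the reference timings.
TOTAL=$((30 + 140 + 290))
echo "full suite: ~${TOTAL}s (~$((TOTAL / 60)) minutes), plus staging time"
```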
Interpreting Results
Rows/sec — The primary throughput metric. For single-worker runs, this is the engine's clean ingestion rate into your PostgreSQL configuration. For multi-worker runs, the aggregate figure reflects total throughput across all concurrent jobs.
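The metric itself is simple arithmetic: total rows ingested divided by wall-clock seconds. A worked example with hypothetical numbers (not taken from the reference table):

```shell
# Hypothetical: 1,000,000 rows ingested in 25 seconds of wall time.
ROWS=1000000
WALL_SECONDS=25
awk -v r="$ROWS" -v w="$WALL_SECONDS" 'BEGIN { printf "%.0f rows/sec\n", r / w }'
```

This is useful for sanity-checking a report: if the aggregate rows/sec in the JSON does not roughly equal total rows over wall time, something in the run accounting is off.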
Disk utilization % — Values below 50% indicate disk is not yet the bottleneck. As you add workers, watch this metric. When utilization approaches 90–100%, your storage device is the limiting factor — not the engine. Throughput will plateau or decline slightly, but integrity should remain unaffected.
Write queue depth — Average I/O queue depth across the run. Values above 8–10 indicate the storage subsystem is absorbing more writes than it can service immediately. This is expected behavior at high concurrency and is not a sign of data loss risk.
CPU utilization — DataForge is primarily I/O bound, not CPU bound. High CPU% (90%+) at low disk utilization suggests a CPU-constrained database configuration (e.g., insufficient shared_buffers, synchronous_commit overhead). Review PostgreSQL configuration if you observe this pattern.
Delta — Row-count difference between source and destination after each run. This must be zero. Any non-zero delta indicates a configuration or connectivity issue. The engine does not silently drop rows.
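A manual spot-check of the delta can be sketched as follows. The row counts below are placeholder values; in practice SRC_ROWS would come from counting source rows (e.g. `wc -l` on the CSV minus the header) and DST_ROWS from a `count(*)` query against the destination table, whose name is engine-specific:

```shell
# Sketch of a manual delta check with placeholder counts.
SRC_ROWS=75814101   # e.g. from: wc -l on the source CSV, minus header
DST_ROWS=75814101   # e.g. from: psql -t -c "SELECT count(*) FROM <table>;"
DELTA=$((SRC_ROWS - DST_ROWS))
if [ "$DELTA" -eq 0 ]; then
  echo "delta: 0 (OK)"
else
  echo "delta: $DELTA (investigate configuration/connectivity)"
fi
```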
Stage 4 validation — Each run ends with a four-point check: row count, schema conformance, checksum, and job completion status. All must pass. If any validation fails, the run result is flagged and the report highlights the failure.
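If you want to script a quick scan for failed validations across run files, the pattern might look like the sketch below. The `"validations"` field name and the `"pass"` status strings are assumptions for illustration only, not the documented bench output schema; check an actual run file for the real field names.

```shell
# Write a sample run file with an assumed schema, then scan it for failures.
cat > /tmp/c1_example.json <<'EOF'
{"validations": {"row_count": "pass", "schema": "pass", "checksum": "pass", "job_status": "pass"}}
EOF
python3 - <<'PY'
import json
checks = json.load(open("/tmp/c1_example.json"))["validations"]
failed = [name for name, status in checks.items() if status != "pass"]
print("all validations passed" if not failed else "FAILED: %s" % failed)
PY
```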
Expected Results Reference
Reference results were collected on an AMD Ryzen 9 7950X / 63.2 GB RAM / NVMe SSD / PostgreSQL 18.0 over a local socket. Your results will vary based on hardware class, storage type, and PostgreSQL configuration.
| Concurrency | Total Rows | Wall Time | Agg. Rows/sec | CPU Avg | Disk Util | Integrity |
|---|---|---|---|---|---|---|
| 1 worker | 75,814,101 | 30.1s | 2,516,818 | ~12% | ~22% | PASS |
| 10 workers | 758,141,010 | 139.7s | 5,426,774 | 39.3% | 41.5% | PASS |
| 20 workers | 1,516,282,020 | 291.4s | 5,202,719 | ~80% avg (98.3% peak) | 54.5% (1,954 MB/s peak) | PASS |
Contact
For questions about benchmark configuration, result interpretation, or to discuss your findings:
Include your bench-report HTML or the ./runs/ directory contents when describing unexpected results. The JSON run files contain the full execution context needed to diagnose configuration-specific behavior.
Want to talk throughput?
Pilot discussions, investor conversations, enterprise architecture review, or technical deep-dives.