
generate-import-fixtures: add TPC-H fixture generation tool #167473

Draft
sravotto wants to merge 4 commits into cockroachdb:master from sravotto:sr8_parquet_fixtures

Conversation

@sravotto (Contributor) commented Apr 3, 2026

Add a new tool that converts TPC-H dbgen pipe-delimited CSV files into AVRO and Parquet formats for use in IMPORT roachtests.

  • Commit 1: Introduce the tool with a pluggable output format architecture and AVRO support (OCF files with Snappy compression plus binary records and schema files).
  • Commit 2: Add Parquet as a second output format using the Apache Arrow library with Snappy compression, mapping TPC-H types to native Parquet types.
  • Commit 3: Replace the monolithic in-memory approach with streaming batched writes (10K rows/batch), reducing peak memory from 10+ GB to ~15-20 MB for large shards.
  • Commit 4: Add the region and nation dimension tables, with support for single-file (unsharded) dbgen output.

Part of: #164461

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

@trunk-io bot (Contributor) commented Apr 3, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

@blathers-crl bot commented Apr 3, 2026

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member)

This change is Reviewable

sravotto force-pushed the sr8_parquet_fixtures branch from 8aeaf37 to 5a09248 on April 3, 2026 at 14:45
mw5h and others added 4 commits on April 3, 2026 at 10:46

Add a tool to convert TPC-H pipe-delimited CSV fixtures into other file
formats for use in IMPORT roachtests. The tool is designed with a pluggable
output format architecture so new formats can be added alongside AVRO.

Currently supports AVRO output, producing both OCF (Object Container
Format) files with snappy compression and binary records files. Schema
files are also written for use with the data_as_binary_records option.

Usage:
  go run ./pkg/cmd/generate-import-fixtures \
      --format=avro \
      --input-dir=/tmp/tpch-csv \
      --output-dir=/tmp/fixtures \
      --tables=customer,supplier

Part of: cockroachdb#164461

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
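
For illustration, a minimal sketch of the OCF path described above. The PR does not say which Avro library the tool uses; this sketch assumes linkedin/goavro/v2, and writeOCF and its arguments are invented names, not the tool's actual API:

  package fixtures

  import (
      "os"

      "github.com/linkedin/goavro/v2"
  )

  // writeOCF writes rows to an Avro Object Container Format file with
  // snappy-compressed blocks.
  func writeOCF(path, schema string, rows []map[string]interface{}) error {
      f, err := os.Create(path)
      if err != nil {
          return err
      }
      defer f.Close()
      w, err := goavro.NewOCFWriter(goavro.OCFConfig{
          W:               f,
          Schema:          schema,
          CompressionName: goavro.CompressionSnappyLabel,
      })
      if err != nil {
          return err
      }
      // Append encodes a slice of native Go datums matching the schema.
      data := make([]interface{}, len(rows))
      for i, r := range rows {
          data[i] = r
      }
      return w.Append(data)
  }
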
Add Parquet as a second output format for the fixture generation tool,
alongside the existing AVRO format. Parquet files are written with Snappy
compression using the Apache Arrow Go library.

The schema maps TPC-H column types to Parquet types: Long to INT64,
Double to DOUBLE, String to BYTE_ARRAY with String logical type, and
Date to INT32 with DATE logical type (days since Unix epoch). All columns
are non-nullable, matching the TPC-H specification. The schema is
embedded in the Parquet file footer, so no separate schema file is
needed.

Epic: CRDB-62435
Release note: None
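
A minimal sketch of that mapping using the Apache Arrow Go library (the module version and the arrowTypeFor helper are assumptions, not the PR's code):

  package fixtures

  import "github.com/apache/arrow/go/v17/arrow"

  // arrowTypeFor maps a TPC-H column type to the Arrow type that the
  // parquet writer then encodes as the Parquet types listed above.
  func arrowTypeFor(tpchType string) arrow.DataType {
      switch tpchType {
      case "Long":
          return arrow.PrimitiveTypes.Int64 // Parquet INT64
      case "Double":
          return arrow.PrimitiveTypes.Float64 // Parquet DOUBLE
      case "String":
          return arrow.BinaryTypes.String // BYTE_ARRAY, String logical type
      case "Date":
          return arrow.FixedWidthTypes.Date32 // INT32, DATE logical type
      default:
          panic("unknown TPC-H type: " + tpchType)
      }
  }
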
Replace the monolithic WriteFiles approach, which loaded all rows of a
shard into memory before writing, with a streaming batched writer
pattern. The previous approach caused 10+ GB peak memory for large
shards (e.g. lineitem at SF100: ~9.4M rows/shard) due to holding the
full []map[string]interface{} slice plus format-specific typed column
arrays.

The new design introduces a FormatWriter interface with WriteBatch and
Close methods. The OutputFormat.NewWriter method opens output files and
returns a writer, and processShard reads the CSV line-by-line, parsing
rows into a reusable batch of 10,000 rows. When the batch is full it is
flushed via WriteBatch, then reset with batch[:0] to reuse the backing
array. This bounds peak memory to ~15-20 MB regardless of shard size.

For Parquet, each WriteBatch call creates a new row group, with typed
column arrays sized to the batch (10K) instead of the full shard. For
Avro, each batch is appended to the OCF writer and binary-encoded
incrementally.

The parquet writer's Close no longer explicitly closes the os.File
because the Arrow parquet library's Writer.Close already closes the
underlying sink via defer.

Epic: CRDB-62435

Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
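
The shape of that loop, as a sketch: FormatWriter, WriteBatch, Close, and the batch[:0] reset come from the commit message, while the CSV handling and the parseRow helper are assumptions made to keep the example self-contained:

  package fixtures

  import (
      "encoding/csv"
      "fmt"
      "io"
  )

  // FormatWriter is implemented once per output format (Avro, Parquet).
  type FormatWriter interface {
      WriteBatch(rows []map[string]interface{}) error
      Close() error
  }

  const batchSize = 10000

  // processShard streams one dbgen shard through the writer. The caller
  // is expected to set r.Comma = '|' for dbgen's pipe-delimited output.
  func processShard(r *csv.Reader, w FormatWriter) error {
      batch := make([]map[string]interface{}, 0, batchSize)
      for {
          record, err := r.Read()
          if err == io.EOF {
              break
          }
          if err != nil {
              return err
          }
          batch = append(batch, parseRow(record))
          if len(batch) == batchSize {
              if err := w.WriteBatch(batch); err != nil {
                  return err
              }
              batch = batch[:0] // reuse the backing array
          }
      }
      if len(batch) > 0 { // flush the final partial batch
          if err := w.WriteBatch(batch); err != nil {
              return err
          }
      }
      return w.Close()
  }

  // parseRow is a stand-in for the real per-table parsing; the actual
  // tool derives column names and types per table.
  func parseRow(record []string) map[string]interface{} {
      row := make(map[string]interface{}, len(record))
      for i, v := range record {
          row[fmt.Sprintf("col%d", i)] = v
      }
      return row
  }
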
Add the two remaining TPC-H dimension tables (region with 5 rows, nation
with 25 rows) so the tool can generate complete fixture sets.

Since dbgen produces these small tables as a single file (e.g. region.tbl)
rather than sharded files (e.g. region.tbl.1..8), extract the shard
logic into a shardFiles() function that returns one entry for single-file
tables and 8 entries for sharded tables.

Epic: CRDB-62435

Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
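
A sketch of that split, with an assumed signature (the commit message names shardFiles() and the 1..8 shard layout; the sharded flag is an invention):

  package fixtures

  import (
      "fmt"
      "path/filepath"
  )

  // shardFiles returns the dbgen input files for a table: a single path
  // for unsharded tables like region and nation, eight for sharded ones.
  func shardFiles(inputDir, table string, sharded bool) []string {
      if !sharded {
          return []string{filepath.Join(inputDir, table+".tbl")}
      }
      files := make([]string, 0, 8)
      for i := 1; i <= 8; i++ {
          files = append(files,
              filepath.Join(inputDir, fmt.Sprintf("%s.tbl.%d", table, i)))
      }
      return files
  }
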
sravotto force-pushed the sr8_parquet_fixtures branch from 5a09248 to bdf6ddc on April 3, 2026 at 14:46