generate-import-fixtures: add TPC-H fixture generation tool #167473
Draft
sravotto wants to merge 4 commits into cockroachdb:master from
Conversation
🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf. It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?
Add a tool to convert TPC-H pipe-delimited CSV fixtures into other file
formats for use in IMPORT roachtests. The tool is designed with a pluggable
output format architecture so new formats can be added alongside AVRO.
Currently supports AVRO output, producing both OCF (Object Container
Format) files with snappy compression and binary records files. Schema
files are also written for use with the data_as_binary_records option.
Usage:

```
go run ./pkg/cmd/generate-import-fixtures \
  --format=avro \
  --input-dir=/tmp/tpch-csv \
  --output-dir=/tmp/fixtures \
  --tables=customer,supplier
```
Part of: cockroachdb#164461
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
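The pluggable output-format architecture described above could be sketched as a simple format registry. This is an illustrative assumption about the design, not the tool's actual API; the names `OutputFormat`, `registerFormat`, and `lookupFormat` are hypothetical.

```go
package main

import "fmt"

// OutputFormat converts parsed TPC-H rows into one on-disk format.
// New formats (e.g. Parquet) register themselves alongside AVRO.
type OutputFormat interface {
	// Name is the value accepted by the --format flag.
	Name() string
	// WriteFiles writes all rows of one table shard to outputDir.
	WriteFiles(table, outputDir string, rows []map[string]interface{}) error
}

var formats = map[string]OutputFormat{}

// registerFormat adds a format so the CLI can look it up by --format.
func registerFormat(f OutputFormat) {
	formats[f.Name()] = f
}

// lookupFormat resolves the --format flag value to a registered format.
func lookupFormat(name string) (OutputFormat, error) {
	f, ok := formats[name]
	if !ok {
		return nil, fmt.Errorf("unknown format %q", name)
	}
	return f, nil
}
```

With this shape, adding a new format is a new file that implements the interface and calls `registerFormat` in its `init`, leaving the CLI driver untouched.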
Add Parquet as a second output format for the fixture generation tool, alongside the existing AVRO format. Parquet files are written with Snappy compression using the Apache Arrow Go library.

The schema maps TPC-H column types to Parquet types: Long to INT64, Double to FLOAT64, String to BYTE_ARRAY with String logical type, and Date to INT32 with DATE logical type (days since Unix epoch). All columns are non-nullable, matching the TPC-H specification. The schema is embedded in the Parquet file footer, so no separate schema file is needed.

Epic: CRDB-62435

Release note: None
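The Date-to-INT32 mapping (days since the Unix epoch) is the one conversion in this schema that is easy to get wrong. A minimal sketch of it, assuming TPC-H dates arrive as `YYYY-MM-DD` strings; `daysSinceEpoch` is a hypothetical helper, not the tool's code:

```go
package main

import "time"

// daysSinceEpoch converts a TPC-H date string (e.g. "1996-01-02") into
// the INT32 day count that Parquet's DATE logical type stores.
func daysSinceEpoch(s string) (int32, error) {
	// time.Parse defaults to UTC, so the subtraction below is an
	// exact multiple of 24h with no DST adjustments.
	t, err := time.Parse("2006-01-02", s)
	if err != nil {
		return 0, err
	}
	epoch := time.Date(1970, 1, 1, 0, 0, 0, 0, time.UTC)
	return int32(t.Sub(epoch).Hours() / 24), nil
}
```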
Replace the monolithic WriteFiles approach, which loaded all rows of a
shard into memory before writing, with a streaming batched writer
pattern. The previous approach caused 10+ GB peak memory for large
shards (e.g. lineitem at SF100: ~9.4M rows/shard) due to holding the
full []map[string]interface{} slice plus format-specific typed column
arrays.
The new design introduces a FormatWriter interface with WriteBatch and
Close methods. The OutputFormat.NewWriter method opens output files and
returns a writer, and processShard reads the CSV line-by-line, parsing
rows into a reusable batch of 10,000 rows. When the batch is full it is
flushed via WriteBatch, then reset with batch[:0] to reuse the backing
array. This bounds peak memory to ~15-20 MB regardless of shard size.
For Parquet, each WriteBatch call creates a new row group, with typed
column arrays sized to the batch (10K) instead of the full shard. For
Avro, each batch is appended to the OCF writer and binary-encoded
incrementally.
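The batching loop described above can be sketched as follows. The `FormatWriter` interface and the `batch[:0]` reuse follow the commit message; the row source is modeled here as a channel for illustration, whereas the real code reads the CSV line by line.

```go
package main

// FormatWriter receives rows in bounded batches instead of a whole shard.
type FormatWriter interface {
	WriteBatch(rows []map[string]interface{}) error
	Close() error
}

const batchSize = 10000

// processRows feeds rows to w in batches of batchSize, reusing the
// batch's backing array via batch[:0] so peak memory stays bounded
// regardless of how many rows the shard contains.
func processRows(rows <-chan map[string]interface{}, w FormatWriter) error {
	batch := make([]map[string]interface{}, 0, batchSize)
	for row := range rows {
		batch = append(batch, row)
		if len(batch) == batchSize {
			if err := w.WriteBatch(batch); err != nil {
				return err
			}
			batch = batch[:0] // keep capacity, reuse backing array
		}
	}
	// Flush the final partial batch, then close the output files.
	if len(batch) > 0 {
		if err := w.WriteBatch(batch); err != nil {
			return err
		}
	}
	return w.Close()
}
```

Because `batch[:0]` keeps the slice's capacity, only one allocation of `batchSize` row slots is ever live, which is where the ~15-20 MB bound comes from.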
The parquet writer's Close no longer explicitly closes the os.File
because the Arrow parquet library's Writer.Close already closes the
underlying sink via defer.
Epic: CRDB-62435
Release note: None
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add the two remaining TPC-H dimension tables (region with 5 rows, nation with 25 rows) so the tool can generate complete fixture sets.

Since dbgen produces these small tables as a single file (e.g. region.tbl) rather than sharded files (e.g. region.tbl.1..8), extract the shard logic into a shardFiles() function that returns one entry for single-file tables and 8 entries for sharded tables.

Epic: CRDB-62435

Release note: None

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a new tool that converts TPC-H dbgen pipe-delimited CSV files into AVRO and Parquet formats for use in IMPORT roachtests.
Part of: #164461
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>