Speed-up Parquet data generation #10
Open
wolfgang-desalvador wants to merge 1 commit into mlcommons:main from
Conversation
Author
This method needs to be validated @russfellows @wvaske @FileSystemGuy, since it requires each process to be able to keep the whole 3+ GiB Parquet file in memory.
This looks like a good change. I will try this out and see if there are any further optimizations that can be made as well. Thanks, Wolfgang
FileSystemGuy approved these changes Apr 9, 2026
russfellows added a commit to russfellows/dlio_benchmark that referenced this pull request Apr 10, 2026
… into row groups

Based on mlcommons#10 (Wolfgang De Salvador). Generate all column data in one pass before the batch loop, then use pa.Table.slice() (zero-copy in Arrow) to produce each row-group batch. This reduces generation call overhead from (num_batches × num_columns) to just num_columns calls; for a file with 10 batches and 5 columns, that is a 10× reduction in gen_random_tensor calls. Improvement over the upstream PR: an explicit memory trade-off comment was added and the zero-copy slice semantics clarified.
This pull request optimizes the data generation process in parquet_generator.py by reducing redundant function calls and improving batch processing efficiency. The main change is to pre-generate all column data for the entire file before batching, which reduces overhead and leverages zero-copy slicing for batch creation.

Performance optimizations:

- All column data for the file is generated in a single pass via _generate_batch_columns or _generate_legacy_batch, reducing the number of function calls from (num_batches * num_columns) to just num_columns.
- Each batch is produced by slicing the pre-generated table (full_table.slice(...)), which is more efficient and avoids repeated data generation.