feat: thread data_size through decode pipeline #6391
westonpace merged 1 commit into lance-format:main from
Conversation
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
The decode pipeline now tracks the actual data size (in bytes) of decoded arrays from the encoding layer (`DataBlock::data_size()`) through to the final `RecordBatch`. This replaces the use of Arrow's `get_array_memory_size()` for the "batch is too large" warning, providing more accurate byte counts that don't over-report due to shared page buffers.

Changes:

- Add `data_size` field to `DecodedArray`
- Implement `DataBlock::data_size()` for `Struct` and `Dictionary` (were `todo!()`)
- Change `DecodeArrayTask::decode()` to return `(ArrayRef, u64)`
- Populate `data_size` in all 5 `StructuralDecodeArrayTask` implementations
- Update all 6 legacy `DecodeArrayTask` implementations to return `(arr, 0)`
- Thread `data_size` through `NextDecodeTask::into_batch()`
- Use `data_size` for the batch-too-large warning instead of Arrow overhead

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
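The core interface change is `DecodeArrayTask::decode()` returning a `(ArrayRef, u64)` tuple instead of a bare array. A minimal sketch of that shape, using stand-in types (the real trait lives in `lance-encoding` and produces Arrow arrays; `PrimitiveTask` here is invented for illustration):

```rust
use std::sync::Arc;

// Stand-in for arrow's ArrayRef; just a shared byte buffer here.
type ArrayRef = Arc<Vec<u8>>;

trait DecodeArrayTask {
    // After this PR, decode returns the decoded array together with its
    // actual data size in bytes (sourced from DataBlock::data_size()).
    fn decode(self: Box<Self>) -> (ArrayRef, u64);
}

// Hypothetical task type for illustration only.
struct PrimitiveTask {
    bytes: Vec<u8>,
}

impl DecodeArrayTask for PrimitiveTask {
    fn decode(self: Box<Self>) -> (ArrayRef, u64) {
        let size = self.bytes.len() as u64;
        (Arc::new(self.bytes), size)
    }
}

fn main() {
    let task: Box<dyn DecodeArrayTask> = Box::new(PrimitiveTask {
        bytes: vec![0u8; 16],
    });
    let (arr, data_size) = task.decode();
    assert_eq!(data_size, 16);
    assert_eq!(arr.len(), 16);
}
```

Returning the size alongside the array lets it flow through the pipeline without re-measuring the decoded buffers later; legacy implementations that don't track it can return `(arr, 0)`, as the change list notes.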
// thread for a long time. By spawning it as a new task, we allow Tokio's
// worker threads to keep making progress.
tokio::spawn(async move { next_task.into_batch(emitted_batch_size_warning) })
let (batch, _data_size) =
I plan on using this in a future PR soon
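The `(batch, _data_size)` destructuring in the snippet above works because `into_batch()` now returns the decoded size along with the batch. A hedged sketch of that flow, with stand-in types; the warning threshold constant and struct fields here are invented, not Lance's actual values:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-in for arrow's RecordBatch.
struct RecordBatch;

// Stand-in for the real NextDecodeTask; only the size field is modeled.
struct NextDecodeTask {
    data_size: u64,
}

// Assumed threshold for illustration; not the real limit.
const BATCH_SIZE_WARNING_THRESHOLD: u64 = 128 * 1024 * 1024;

impl NextDecodeTask {
    // Returns the batch together with its actual decoded size in bytes,
    // emitting the "batch is too large" warning at most once.
    fn into_batch(self, emitted_batch_size_warning: &AtomicBool) -> (RecordBatch, u64) {
        if self.data_size > BATCH_SIZE_WARNING_THRESHOLD
            && !emitted_batch_size_warning.swap(true, Ordering::Relaxed)
        {
            eprintln!("batch is too large: {} bytes", self.data_size);
        }
        (RecordBatch, self.data_size)
    }
}

fn main() {
    let warned = AtomicBool::new(false);
    let (_batch, size) = NextDecodeTask { data_size: 4 }.into_batch(&warned);
    assert_eq!(size, 4);
    assert!(!warned.load(Ordering::Relaxed));
}
```

Callers that only want the batch can ignore the second element (`_data_size`), while future work can consume it directly.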
Ostensibly this PR stands on its own because it makes the warning log message that we print more accurate. The true reason for the change, though, is to enable #6388 to be more accurate.
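The accuracy problem the description mentions is that per-array memory accounting double-counts shared buffers. A minimal illustration with stand-in types (Arrow's `get_array_memory_size()` exhibits the same shape of over-reporting per the PR text; `SharedArray` is invented here):

```rust
use std::sync::Arc;

// Two "arrays" that are zero-copy views over one shared page buffer.
struct SharedArray {
    buffer: Arc<Vec<u8>>,
}

impl SharedArray {
    // Naive per-array accounting: reports the whole backing buffer.
    fn memory_size(&self) -> usize {
        self.buffer.len()
    }
}

fn main() {
    let page = Arc::new(vec![0u8; 1024]);
    let a = SharedArray { buffer: page.clone() };
    let b = SharedArray { buffer: page.clone() };
    // Summing per-array sizes double-counts the single 1024-byte page:
    assert_eq!(a.memory_size() + b.memory_size(), 2048);
    // The actual bytes backing both arrays:
    assert_eq!(page.len(), 1024);
}
```

Tracking `data_size` from the encoding layer, where the true decoded byte count is known, avoids this double-counting.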
Force-pushed dbd186b to ac66494
Codecov Report: ❌ Patch coverage is
wjones127
left a comment
I'm excited for byte-size batches!
Summary
- Thread the actual data size (in bytes) from `DataBlock::data_size()` at the encoding layer through the full decode pipeline to the final `RecordBatch`
- Implement `DataBlock::data_size()` for the `Struct` and `Dictionary` variants (were `todo!()`)
- Replace `get_array_memory_size()`, which over-reports due to shared page buffers
- Change `DecodeArrayTask::decode()` to return `(ArrayRef, u64)` so data size flows through naturally

Test plan
- `lance-encoding` tests pass
- `cargo clippy -p lance-encoding --tests -- -D warnings` clean
- `cargo clippy -p lance-file --tests -- -D warnings` clean
- `cargo fmt --all -- --check` clean

🤖 Generated with Claude Code