feat: thread data_size through decode pipeline #6391
westonpace merged 1 commit into lance-format:main from
Conversation
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
The decode pipeline now tracks the actual data size (in bytes) of decoded arrays from the encoding layer (`DataBlock::data_size()`) through to the final `RecordBatch`. This replaces the use of Arrow's `get_array_memory_size()` for the "batch is too large" warning, providing more accurate byte counts that don't over-report due to shared page buffers.

Changes:

- Add `data_size` field to `DecodedArray`
- Implement `DataBlock::data_size()` for `Struct` and `Dictionary` (were `todo!()`)
- Change `DecodeArrayTask::decode()` to return `(ArrayRef, u64)`
- Populate `data_size` in all 5 `StructuralDecodeArrayTask` implementations
- Update all 6 legacy `DecodeArrayTask` implementations to return `(arr, 0)`
- Thread `data_size` through `NextDecodeTask::into_batch()`
- Use `data_size` for the batch-too-large warning instead of Arrow overhead

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
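The core interface change is `DecodeArrayTask::decode()` returning a `(ArrayRef, u64)` tuple instead of a bare array. A minimal sketch of that shape, using stand-in types (the real trait lives in `lance-encoding` and produces Arrow arrays; `PrimitiveTask` here is invented for illustration):

```rust
use std::sync::Arc;

// Stand-in for arrow's ArrayRef; just a shared byte buffer here.
type ArrayRef = Arc<Vec<u8>>;

trait DecodeArrayTask {
    // After this PR, decode returns the decoded array together with its
    // actual data size in bytes (sourced from DataBlock::data_size()).
    fn decode(self: Box<Self>) -> (ArrayRef, u64);
}

// Hypothetical task type for illustration only.
struct PrimitiveTask {
    bytes: Vec<u8>,
}

impl DecodeArrayTask for PrimitiveTask {
    fn decode(self: Box<Self>) -> (ArrayRef, u64) {
        let size = self.bytes.len() as u64;
        (Arc::new(self.bytes), size)
    }
}

fn main() {
    let task: Box<dyn DecodeArrayTask> = Box::new(PrimitiveTask {
        bytes: vec![0u8; 16],
    });
    let (arr, data_size) = task.decode();
    assert_eq!(data_size, 16);
    assert_eq!(arr.len(), 16);
}
```

Returning the size alongside the array lets it flow through the pipeline without re-measuring the decoded buffers later; legacy implementations that don't track it can return `(arr, 0)`, as the change list notes.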
// thread for a long time. By spawning it as a new task, we allow Tokio's
// worker threads to keep making progress.
tokio::spawn(async move { next_task.into_batch(emitted_batch_size_warning) })
let (batch, _data_size) =
I plan on using this in a future PR soon
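The `(batch, _data_size)` destructuring in the snippet above works because `into_batch()` now returns the decoded size along with the batch. A hedged sketch of that flow, with stand-in types; the warning threshold constant and struct fields here are invented, not Lance's actual values:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-in for arrow's RecordBatch.
struct RecordBatch;

// Stand-in for the real NextDecodeTask; only the size field is modeled.
struct NextDecodeTask {
    data_size: u64,
}

// Assumed threshold for illustration; not the real limit.
const BATCH_SIZE_WARNING_THRESHOLD: u64 = 128 * 1024 * 1024;

impl NextDecodeTask {
    // Returns the batch together with its actual decoded size in bytes,
    // emitting the "batch is too large" warning at most once.
    fn into_batch(self, emitted_batch_size_warning: &AtomicBool) -> (RecordBatch, u64) {
        if self.data_size > BATCH_SIZE_WARNING_THRESHOLD
            && !emitted_batch_size_warning.swap(true, Ordering::Relaxed)
        {
            eprintln!("batch is too large: {} bytes", self.data_size);
        }
        (RecordBatch, self.data_size)
    }
}

fn main() {
    let warned = AtomicBool::new(false);
    let (_batch, size) = NextDecodeTask { data_size: 4 }.into_batch(&warned);
    assert_eq!(size, 4);
    assert!(!warned.load(Ordering::Relaxed));
}
```

Callers that only want the batch can ignore the second element (`_data_size`), while future work can consume it directly.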
Ostensibly this PR stands on its own because it makes the warning log message that we print more accurate. The true reason for the change, though, is to enable #6388 to be more accurate.
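The accuracy problem the description mentions is that per-array memory accounting double-counts shared buffers. A minimal illustration with stand-in types (Arrow's `get_array_memory_size()` exhibits the same shape of over-reporting per the PR text; `SharedArray` is invented here):

```rust
use std::sync::Arc;

// Two "arrays" that are zero-copy views over one shared page buffer.
struct SharedArray {
    buffer: Arc<Vec<u8>>,
}

impl SharedArray {
    // Naive per-array accounting: reports the whole backing buffer.
    fn memory_size(&self) -> usize {
        self.buffer.len()
    }
}

fn main() {
    let page = Arc::new(vec![0u8; 1024]);
    let a = SharedArray { buffer: page.clone() };
    let b = SharedArray { buffer: page.clone() };
    // Summing per-array sizes double-counts the single 1024-byte page:
    assert_eq!(a.memory_size() + b.memory_size(), 2048);
    // The actual bytes backing both arrays:
    assert_eq!(page.len(), 1024);
}
```

Tracking `data_size` from the encoding layer, where the true decoded byte count is known, avoids this double-counting.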
Force-pushed dbd186b to ac66494
Codecov Report: ❌ Patch coverage is
wjones127
left a comment
I'm excited for byte-size batches!
Summary
- Thread the actual data size (in bytes) from `DataBlock::data_size()` at the encoding layer through the full decode pipeline to the final `RecordBatch`
- Implement `DataBlock::data_size()` for the `Struct` and `Dictionary` variants (were `todo!()`)
- Replace `get_array_memory_size()`, which over-reports due to shared page buffers
- Change `DecodeArrayTask::decode()` to return `(ArrayRef, u64)` so data size flows through naturally

Test plan
- `lance-encoding` tests pass
- `cargo clippy -p lance-encoding --tests -- -D warnings` clean
- `cargo clippy -p lance-file --tests -- -D warnings` clean
- `cargo fmt --all -- --check` clean

🤖 Generated with Claude Code