[branch-0.9] Cherry pick feat(datafusion): declare Hash partitioning for pure bucket-transform specs#19
Open
toutane wants to merge 3 commits into
Open
[branch-0.9] Cherry pick feat(datafusion): declare Hash partitioning for pure bucket-transform specs#19toutane wants to merge 3 commits into
toutane wants to merge 3 commits into
Conversation
6a07d9d to
92f0f2a
Compare
… specs
Extend scan-time partition detection in IcebergTableProvider so that a
default partition spec whose every field is a `Transform::Bucket(_)` is
exposed to DataFusion as `Partitioning::Hash([source_cols], n)`. This lets
the planner skip a `RepartitionExec` for GROUP BY / joins on the bucket
source column, mirroring the existing identity-transform path.
Correctness: DataFusion's `EquivalenceProperties::is_partition_satisfied`
compares `Partitioning::Hash` against `Distribution::HashPartitioned` by
expression equality only, not by the underlying hash function. Iceberg
`bucket[N]` already co-locates same-source-value rows at the file level
(same value -> same bucket index -> same files); the task distributor
sends every unique bucket index to a single DataFusion partition, so
co-location is preserved at the row level.
- bucketing.rs: add `BucketCol`, `compute_bucket_cols` (pure-bucket only,
rejects spec evolution / mixed transforms / missing source), and a
`PartitionKeys::{Identity, Bucket}` wrapper used by `bucket_tasks`.
- `bucket_tasks` now hashes the `i32` bucket-index slot (always Int32 per
spec) for the bucket variant, keeping the identity branch unchanged.
- `compute_partition_keys` tries identity first, so mixed identity+bucket
specs keep the current identity-only Hash behaviour.
- table/mod.rs::scan(): use `compute_partition_keys` + `column_exprs()`
instead of inlining the identity-only branch.
- Five new tests: pure-bucket Hash declaration, projection excluding the
source, null partition slot fallback, mixed bucket+truncate fallback,
and identity+bucket regression lock.
(cherry picked from commit d4f1170)
3cb825c to
25af0c7
Compare
…tribution `bucket_tasks` used to hash the Int32 bucket-index slot through REPARTITION_RANDOM_STATE before taking `% n_partitions`. The intent was to keep the `Partitioning::Hash` annotation aligned with DataFusion's hash convention, but composing ahash with Iceberg's `bucket[N]` already produced a function distinct from `ahash(source_value)`, so the annotation was never literally honest in either mode. DataFusion's `Partitioning::satisfaction` (datafusion-physical-expr 53.1 partitioning.rs:219-272) only checks expression equality and explicitly documents that the hash function uniformity is *assumed*, not verified. The downstream operators that consume the annotation (AggregateExec SinglePartitioned, HashJoinExec Partitioned) require row co-location by key, which Iceberg's `bucket[N]` already guarantees by being deterministic on the source value. The rehash therefore added a uniform-spread layer without any correctness benefit, but caused birthday-paradox collisions in the `N_buckets ≈ n_partitions` regime: with 8 buckets re-hashed modulo 8, ~75% of runs produced at least one empty scan partition. Replacing the rehash with a deterministic positional linearisation `((idx_1 * N_2 * ... * N_k) + ... + idx_k) % n_partitions` keeps co-location, makes the single-column case the natural Iceberg distribution (`idx % n_partitions`), and bounds the multi-column skew to ±1 task per partition. - bucketing.rs: BucketCol carries `bucket_n`; new `bucket_linear_index` replaces `bucket_hash`; `bucket_tasks` dispatches Identity (unchanged rehash, which is strictly aligned with DataFusion) vs Bucket (linear modulo). Doc comments on `compute_partition_keys` and `bucket_tasks` rewritten to explain the two regimes. - Three new unit tests: deterministic identity mapping when N == n, modulo grouping when N > n, and multi-column linearisation. (cherry picked from commit 9f9a214)
compute_bucket_cols now retains bucket spec fields whose source column survives the output projection instead of dropping the whole spec to UnknownPartitioning. The positional linearisation in bucket_linear_index already iterates only over the retained cols via spec_field_idx, so file- level co-location on a bucket-dimension subset still implies row-level co-location on that subset. Returns None only when no bucket source survives the projection. (cherry picked from commit b48f96d)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Part of QECO-1260
Cherry pick: toutane@d4f1170 and toutane@9f9a214