
[branch-0.9] Cherry pick feat(datafusion): declare Hash partitioning for pure bucket-transform specs #19

Open
toutane wants to merge 3 commits into branch-0.9 from branch-0.9-cherry-pick-10

Conversation


@toutane toutane commented May 5, 2026

Part of QECO-1260

Cherry pick: toutane@d4f1170 and toutane@9f9a214

@toutane toutane force-pushed the branch-0.9-cherry-pick-9 branch from 6a07d9d to 92f0f2a Compare May 11, 2026 11:15
Base automatically changed from branch-0.9-cherry-pick-9 to branch-0.9 May 11, 2026 11:16
… specs

Extend scan-time partition detection in IcebergTableProvider so that a
default partition spec whose every field is a `Transform::Bucket(_)` is
exposed to DataFusion as `Partitioning::Hash([source_cols], n)`. This lets
the planner skip a `RepartitionExec` for GROUP BY / joins on the bucket
source column, mirroring the existing identity-transform path.

Correctness: DataFusion's `EquivalenceProperties::is_partition_satisfied`
compares `Partitioning::Hash` against `Distribution::HashPartitioned` by
expression equality only, not by the underlying hash function. Iceberg
`bucket[N]` already co-locates same-source-value rows at the file level
(same value -> same bucket index -> same files); the task distributor
sends every unique bucket index to a single DataFusion partition, so
co-location is preserved at the row level.
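
For concreteness, the determinism this argument relies on is the spec-defined bucket transform: murmur3_x86_32 over the value's little-endian bytes (ints are widened to 64-bit longs before hashing), reduced with `(hash & Integer.MAX_VALUE) % N`. A standalone sketch, not the crate's actual implementation:

```rust
/// murmur3_x86_32, as required by the Iceberg spec's 32-bit hash appendix.
fn murmur3_x86_32(data: &[u8], seed: u32) -> u32 {
    const C1: u32 = 0xcc9e_2d51;
    const C2: u32 = 0x1b87_3593;
    let mut h = seed;
    let chunks = data.chunks_exact(4);
    let tail = chunks.remainder();
    for chunk in chunks {
        let mut k = u32::from_le_bytes(chunk.try_into().unwrap());
        k = k.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2);
        h ^= k;
        h = h.rotate_left(13).wrapping_mul(5).wrapping_add(0xe654_6b64);
    }
    // Tail bytes (unused for the 8-byte long inputs bucket[N] feeds in).
    let mut k: u32 = 0;
    for (i, &b) in tail.iter().enumerate() {
        k ^= (b as u32) << (8 * i);
    }
    if !tail.is_empty() {
        h ^= k.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2);
    }
    h ^= data.len() as u32;
    h ^= h >> 16;
    h = h.wrapping_mul(0x85eb_ca6b);
    h ^= h >> 13;
    h = h.wrapping_mul(0xc2b2_ae35);
    h ^= h >> 16;
    h
}

/// bucket[N] for an i32 source value: the int is hashed as a little-endian
/// 64-bit long, then reduced with (hash & Integer.MAX_VALUE) % N.
fn bucket_i32(v: i32, n: i32) -> i32 {
    let h = murmur3_x86_32(&(v as i64).to_le_bytes(), 0) as i32;
    (h & i32::MAX) % n
}
```

Because the transform is a pure function of the source value, equal values always land in the same bucket, and therefore in the same files and the same scan partition.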

- bucketing.rs: add `BucketCol`, `compute_bucket_cols` (pure-bucket only,
  rejects spec evolution / mixed transforms / missing source), and a
  `PartitionKeys::{Identity, Bucket}` wrapper used by `bucket_tasks`.
- `bucket_tasks` now hashes the `i32` bucket-index slot (always Int32 per
  spec) for the bucket variant, keeping the identity branch unchanged.
- `compute_partition_keys` tries identity first, so mixed identity+bucket
  specs keep the current identity-only Hash behaviour.
- table/mod.rs::scan(): use `compute_partition_keys` + `column_exprs()`
  instead of inlining the identity-only branch.
- Five new tests: pure-bucket Hash declaration, projection excluding the
  source, null partition slot fallback, mixed bucket+truncate fallback,
  and identity+bucket regression lock.

(cherry picked from commit d4f1170)
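
The shapes described in the bullet list can be sketched roughly as follows. The names come from the commit message, but the signatures and field types here are simplified assumptions, not the crate's real API:

```rust
/// Simplified stand-in for the real `BucketCol` in bucketing.rs.
struct BucketCol {
    source_idx: usize, // index of the bucket's source column in the projected schema
    bucket_n: u32,     // the N in `bucket[N]`
}

/// Wrapper consumed by `bucket_tasks`, per the list above.
enum PartitionKeys {
    Identity(Vec<usize>),
    Bucket(Vec<BucketCol>),
}

/// Identity is tried first, so mixed identity+bucket specs keep the
/// existing identity-only `Partitioning::Hash` behaviour.
fn compute_partition_keys(
    identity_cols: Option<Vec<usize>>,
    bucket_cols: Option<Vec<BucketCol>>,
) -> Option<PartitionKeys> {
    identity_cols
        .map(PartitionKeys::Identity)
        .or(bucket_cols.map(PartitionKeys::Bucket))
}
```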
@toutane toutane force-pushed the branch-0.9-cherry-pick-10 branch from 3cb825c to 25af0c7 Compare May 12, 2026 09:33
…tribution

`bucket_tasks` used to hash the Int32 bucket-index slot through
REPARTITION_RANDOM_STATE before taking `% n_partitions`. The intent was
to keep the `Partitioning::Hash` annotation aligned with DataFusion's
hash convention, but composing ahash with Iceberg's `bucket[N]` already
produced a function distinct from `ahash(source_value)`, so the
annotation was never literally honest in either mode. DataFusion's
`Partitioning::satisfy` (datafusion-physical-expr 53.1
partitioning.rs:219-272) only checks expression equality and explicitly
documents that hash-function uniformity is *assumed*, not verified.
The downstream operators that consume the annotation (AggregateExec
SinglePartitioned, HashJoinExec Partitioned) require row co-location by
key, which Iceberg's `bucket[N]` already guarantees by being
deterministic on the source value.

The rehash therefore added a uniform-spread layer without any
correctness benefit, but caused birthday-paradox collisions in the
`N_buckets ≈ n_partitions` regime: with 8 buckets re-hashed modulo 8,
~75% of runs produced at least one empty scan partition. Replacing the
rehash with a deterministic positional linearisation
`((idx_1 * N_2 * ... * N_k) + ... + idx_k) % n_partitions` keeps
co-location, makes the single-column case the natural Iceberg
distribution (`idx % n_partitions`), and bounds the multi-column skew
to ±1 task per partition.

- bucketing.rs: BucketCol carries `bucket_n`; new `bucket_linear_index`
  replaces `bucket_hash`; `bucket_tasks` dispatches Identity (unchanged
  rehash, which is strictly aligned with DataFusion) vs Bucket (linear
  modulo). Doc comments on `compute_partition_keys` and `bucket_tasks`
  rewritten to explain the two regimes.
- Three new unit tests: deterministic identity mapping when N == n,
  modulo grouping when N > n, and multi-column linearisation.

(cherry picked from commit 9f9a214)
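
The replacement mapping can be sketched as follows (a Horner-style evaluation of the positional linearisation above; the types are simplified assumptions):

```rust
/// Simplified stand-in for the real `BucketCol`; `bucket_n` is the N in `bucket[N]`.
struct BucketCol {
    bucket_n: u64,
}

/// ((idx_1 * N_2 * ... * N_k) + ... + idx_k), evaluated Horner-style:
/// each step multiplies the accumulator by the current column's N and adds
/// that column's bucket index.
fn bucket_linear_index(bucket_indices: &[u64], cols: &[BucketCol]) -> u64 {
    bucket_indices
        .iter()
        .zip(cols)
        .fold(0, |acc, (&idx, col)| acc * col.bucket_n + idx)
}

/// Deterministic task-to-partition assignment; for a single bucket column
/// this degenerates to the natural `idx % n_partitions`.
fn assign_partition(bucket_indices: &[u64], cols: &[BucketCol], n_partitions: u64) -> u64 {
    bucket_linear_index(bucket_indices, cols) % n_partitions
}
```

Since consecutive linear indices map to consecutive partitions before wrapping, the per-partition task count can differ by at most one.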
@toutane toutane marked this pull request as ready for review May 13, 2026 13:59
compute_bucket_cols now retains bucket spec fields whose source column
survives the output projection instead of dropping the whole spec to
UnknownPartitioning. The positional linearisation in bucket_linear_index
already iterates only over the retained cols via spec_field_idx, so file-
level co-location on a bucket-dimension subset still implies row-level
co-location on that subset. Returns None only when no bucket source
survives the projection.

(cherry picked from commit b48f96d)
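
The retention rule reduces to a filter over the spec's bucket fields. A hypothetical sketch (`source_field_id` stands in for however the real code identifies a bucket field's source column):

```rust
/// Simplified stand-in for the real `BucketCol` (assumed fields).
#[derive(Clone)]
struct BucketCol {
    source_field_id: i32, // id of the bucket's source column
    bucket_n: u32,        // the N in `bucket[N]`
}

/// Keep the bucket fields whose source column survives the output
/// projection; return None only when no bucket source survives.
fn retained_bucket_cols(spec: &[BucketCol], projected: &[i32]) -> Option<Vec<BucketCol>> {
    let kept: Vec<BucketCol> = spec
        .iter()
        .filter(|c| projected.contains(&c.source_field_id))
        .cloned()
        .collect();
    (!kept.is_empty()).then_some(kept)
}
```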