Skip to content

[GLUTEN-4889][VL] feat: Support approx_percentile aggregate function#11651

Open
Yizhou-Yang wants to merge 4 commits into
apache:mainfrom
Yizhou-Yang:percentile0225
Open

[GLUTEN-4889][VL] feat: Support approx_percentile aggregate function#11651
Yizhou-Yang wants to merge 4 commits into
apache:mainfrom
Yizhou-Yang:percentile0225

Conversation

@Yizhou-Yang
Copy link
Copy Markdown

@Yizhou-Yang Yizhou-Yang commented Feb 25, 2026

What

Add Velox approx_percentile support for Spark.

Why

Velox uses KLL sketch while Spark uses GK algorithm — their intermediate data formats are incompatible (KLL: 9-field StructType vs GK: single BinaryType buffer). This means fallback between Velox and Spark requires separate handling.

How

  • VeloxApproximatePercentile: A DeclarativeAggregate with 9 aggBufferAttributes matching Velox's KLL sketch layout.
  • Spark-side KLL implementation (KllSketchHelper/KllSketchAdd/KllSketchMerge/KllSketchEval): Simplified KLL operations for fallback, binary-compatible with Velox's C++ accumulator.
  • ApproxPercentileRewriteRule: Rewrites Spark's ApproximatePercentile to the Velox-compatible version.
  • All 4 fallback modes supported: Full offload, partial fallback, final fallback, full fallback.

Key decisions

  • Accuracy stored as IntegerType (Spark's original value); Velox computes epsilon = 1.0/accuracy internally.
  • KLL chosen over GK for Spark-side fallback to maintain intermediate data compatibility with Velox.

Velox dependency

facebookincubator/velox#16320


Related issue: #4889


Testing

Velox uses the KLL sketch algorithm for approx_percentile, while Spark uses the GK (Greenwald-Khanna) algorithm. Both are approximate and produce results within error bounds, but they may select different concrete values at percentile boundaries. For example, for integers 1..1000, the exact 25th percentile is 250.25 — GK returns 250 while KLL may return 251. This difference cannot be eliminated by increasing precision.

Changes Overview

graph TD
    subgraph "Root Cause"
        RC["Velox KLL sketch ≠ Spark GK algorithm<br/>Different approximate values at percentile boundaries"]
    end

    subgraph "VeloxTestSettings.scala — Excludes (4 suites)"
        E1["GlutenApproximatePercentileQuerySuite<br/><b>8 tests excluded</b>"]
        E2["GlutenDataFrameAggregateSuite<br/><b>1 test excluded</b>"]
        E3["GlutenDataFramePivotSuite<br/><b>1 test excluded</b>"]
        E4["GlutenDataFrameSuite<br/><b>1 test excluded</b>"]
    end

    subgraph "GlutenApproximatePercentileQuerySuite.scala — Overrides"
        O["<b>8 tests rewritten</b> with tolerance-based assertions<br/>(kllTolerance = 2)"]
    end

    RC --> E1 & E2 & E3 & E4
    E1 -- "excluded then re-added<br/>via testGluten" --> O
Loading

Detailed Breakdown

1. GlutenApproximatePercentileQuerySuite — 8 tests excluded & overridden

2. GlutenDataFrameAggregateSuite — 1 test excluded

Velox KLL sketch produces different results from Spark's GK algorithm, especially with low accuracy (accuracy=1) |

3. GlutenDataFramePivotSuite — 1 test excluded

different approximate values from Spark's GK algorithm in pivot context |

4. GlutenDataFrameSuite — 1 test excluded

DataFrame.summary() uses approx_percentile internally; Velox KLL sketch produces different percentile values from Spark's GK algorithm |

@github-actions github-actions Bot added CORE works for Gluten Core VELOX labels Feb 25, 2026
@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@Yizhou-Yang Yizhou-Yang changed the title feat:support gluten-level approx_percentile [GLUTEN-4889][VL] feat:support gluten-level approx_percentile Feb 25, 2026
@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@jinchengchenghh
Copy link
Copy Markdown
Contributor

Please update get-velox.sh to test your PR, then you can verify if both can work well, you may update this line https://github.com/apache/incubator-gluten/blob/5d3f7145cd7fc258aa10b434ea4ec651bd82c764/ep/build-velox/src/get-velox.sh#L28

@jinchengchenghh
Copy link
Copy Markdown
Contributor

Do we need the config? Usually we offload the function to native by default

@Yizhou-Yang
Copy link
Copy Markdown
Author

Please update get-velox.sh to test your PR, then you can verify if both can work well, you may update this line

https://github.com/apache/incubator-gluten/blob/5d3f7145cd7fc258aa10b434ea4ec651bd82c764/ep/build-velox/src/get-velox.sh#L28

added the 16320 and removed the config

@github-actions github-actions Bot added BUILD and removed CORE works for Gluten Core labels Feb 25, 2026
@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

1 similar comment
@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@github-actions github-actions Bot added the CORE works for Gluten Core label Mar 2, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 2, 2026

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 2, 2026

Run Gluten Clickhouse CI on x86

1 similar comment
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 2, 2026

Run Gluten Clickhouse CI on x86

@github-actions github-actions Bot removed the CORE works for Gluten Core label Mar 2, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 2, 2026

Run Gluten Clickhouse CI on x86

5 similar comments
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 2, 2026

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 2, 2026

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 3, 2026

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 3, 2026

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 3, 2026

Run Gluten Clickhouse CI on x86

@jinchengchenghh
Copy link
Copy Markdown
Contributor

Please update the PR description to describe the KLL Sketch is different so that we handle fallback separately.

@Yizhou-Yang
Copy link
Copy Markdown
Author

Yizhou-Yang commented Mar 3, 2026

Please update the PR description to describe the KLL Sketch is different so that we handle fallback separately.

done~

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 6, 2026

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

2 similar comments
@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@Yizhou-Yang
Copy link
Copy Markdown
Author

Yizhou-Yang commented Mar 19, 2026

@Yizhou-Yang Thanks for your implementation! Recently I am cherrying pick your PR and I find that the KLL implementation in Gluten has only one level and discards items in odd position when compacting. I am wondering if this implementation can meet the accuracy requirement. Will the relative error rate be too high?

Changelog:

  1. Rewrote KLL sketch with proper multi-level compaction — the old implementation only had one level and discarded odd-position items, which is essentially random downsampling with no error bound guarantee. Now it uses the standard KLL algorithm: items across multiple levels, level-0 inserts, sort-and-halve compaction promoting to higher levels, with each level-i item representing 2^i original values.
  2. Merge correctly combines multi-level sketches and re-compacts, instead of simple concatenation + truncation.

Why some tests are disabled:

KLL and Spark's native GK (Greenwald-Khanna) are fundamentally different algorithms. Both satisfy the 1/accuracy error bound, but they produce different concrete values at the same percentile boundary. For example, given integers 1–1000, the exact 25th percentile is 250.25 — GK returns 250, KLL may return 251. Both are correct within the error bound, but a strict equality assertion like assert(result == 250) will fail.

For those tests I added TestGluten in the approxpercentile suite. The off-by one problem can't be simply solved by adding more layers in partial result, I tried to not disable any test when developing, but the current version is the best I can do for now.

The 4 disabled tests (approx_percentile, summary, SPARK-35480, SPARK-32908) all use exact-match assertions against GK's specific output. Rather than modifying upstream Spark tests, we excluded them and added a dedicated GlutenApproximatePercentileQuerySuite with tolerance-based assertions that validate correctness for both algorithms.

There are some other tests in collect_list that assumes approx_percentile will fallback, changed to percentile.

I tested only spark35/spark35smj and their slow versions, hopefully it also passes spark33 etc...

PTAL again... @jinchengchenghh @zhztheplayer

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 9, 2026

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions Bot added the stale stale label May 9, 2026
@jinchengchenghh
Copy link
Copy Markdown
Contributor

Sorry for missing this PR, looks good to me, do you have further comments? @zhztheplayer

Copy link
Copy Markdown
Member

@zhztheplayer zhztheplayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@github-actions github-actions Bot removed the stale stale label May 12, 2026
@jinchengchenghh
Copy link
Copy Markdown
Contributor

Could you help resolve the conflict? Then we can merge it, thanks! @Yizhou-Yang

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@Yizhou-Yang
Copy link
Copy Markdown
Author

Could you help resolve the conflict? Then we can merge it, thanks! @Yizhou-Yang

ready @jinchengchenghh

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@Yizhou-Yang
Copy link
Copy Markdown
Author

@jinchengchenghh

sorry was not ready... didn't actually run local tests.
I reran all the local tests so hopefully it works now.

Clipboard_Screenshot_1779107274

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

2 similar comments
@Yizhou-Yang
Copy link
Copy Markdown
Author

Run Gluten Clickhouse CI on x86

@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@Yizhou-Yang
Copy link
Copy Markdown
Author

Yizhou-Yang commented May 20, 2026

hopefully I fixed spark 40/41 related cases:
Clipboard_Screenshot_1779245317

PTAL again... @jinchengchenghh

…smatch

- Remove incorrect 'approx_percentile' -> 'spark_approx_percentile' mapping
  in SubstraitParser.cc (Velox registers with empty prefix)
- Fix accuracy field type from DoubleType to IntegerType in
  VeloxApproxPercentile.scala to match Velox intermediate type definition
- Add testNameBlackList for 'different column types' test across all spark
  versions (KLL vs GK algorithm produces off-by-one results)
- Add testGluten override with tolerance-based assertions for the
  'different column types' test
ClickHouse backend has its own approx_percentile implementation that
differs from Velox's KLL sketch. The testGluten overrides in
GlutenApproximatePercentileQuerySuite are specifically designed for
Velox's KLL sketch behavior and should not run on ClickHouse backend.

Add excludeGlutenTest entries in ClickHouseTestSettings for all
spark versions (33/34/35/40/41) to skip these Velox-specific tests.
KLL sketch OOM in KllSketchHelper.merge:
- Expand worklevels capacity to max(ub, provisionalNumLevels) + 8 to avoid
  out-of-bounds writes when generalCompress promotes a new top level.
- Add bound checks in generalCompress when writing inLevels(level+2) and
  when promoting currentNumLevels, preventing memory corruption that
  inflated targetItemCount to GiB scale.
- Defensively clamp tmpNumItems, finalNumItems, finalNumLevels and
  finalCapacity against MAX_KLL_BUFFER_SIZE (1 MiB doubles, ~8 MiB).
  In valid sketches these clamps are no-ops; only corrupted intermediate
  state (which previously triggered SparkOutOfMemoryError requesting
  11.4 GiB) is bounded.

Exclude UT cases that legitimately differ between Velox KLL and Spark GK:
- spark40/41: SPARK-32908 (requires Vanilla spark resource files,
  same reason as spark33/34/35).
- spark40/41 GlutenDataFrameStatSuite: 'approximate quantile 2: test
  relativeError greater than 1' (KLL=510 vs GK=524 on synthetic dataset).
- spark33/34 VeloxSQLQueryTestSettings: disable describe-table-column.sql
  whose ANALYZE COLUMNS histogram bin boundaries depend on percentile_approx
  results, which differ between KLL and GK on small datasets.
@github-actions
Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

1 similar comment
@Yizhou-Yang
Copy link
Copy Markdown
Author

Run Gluten Clickhouse CI on x86

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLICKHOUSE CORE works for Gluten Core VELOX

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants