[GLUTEN-4889][VL] feat: Support approx_percentile aggregate function by Yizhou-Yang · Pull Request #11651 · apache/gluten

Yizhou-Yang · 2026-02-25T02:58:03Z

What

Add Velox approx_percentile support for Spark.

Why

Velox uses KLL sketch while Spark uses GK algorithm — their intermediate data formats are incompatible (KLL: 9-field StructType vs GK: single BinaryType buffer). This means fallback between Velox and Spark requires separate handling.

How

VeloxApproximatePercentile: A DeclarativeAggregate with 9 aggBufferAttributes matching Velox's KLL sketch layout.
Spark-side KLL implementation (KllSketchHelper/KllSketchAdd/KllSketchMerge/KllSketchEval): Simplified KLL operations for fallback, binary-compatible with Velox's C++ accumulator.
ApproxPercentileRewriteRule: Rewrites Spark's ApproximatePercentile to the Velox-compatible version.
All 4 fallback modes supported: Full offload, partial fallback, final fallback, full fallback.
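
The four modes can be read as the combinations of which engine runs the partial stage and which runs the final stage. The sketch below enumerates them. The mapping of mode names to engine pairs is an interpretation of the list above, and `Engine` and `fallback_mode` are hypothetical names, not Gluten APIs.

```python
from enum import Enum
from itertools import product


class Engine(Enum):
    VELOX = "velox"
    SPARK = "spark"


def fallback_mode(partial: Engine, final: Engine) -> str:
    """Name the mode for a given (partial stage, final stage) engine pair."""
    if partial is Engine.VELOX and final is Engine.VELOX:
        return "full offload"
    if partial is Engine.SPARK and final is Engine.VELOX:
        return "partial fallback"
    if partial is Engine.VELOX and final is Engine.SPARK:
        return "final fallback"
    return "full fallback"


# Every pair has to exchange the same intermediate buffer layout,
# which is why the Spark side needs a KLL-compatible implementation.
ALL_MODES = {(p, f): fallback_mode(p, f) for p, f in product(Engine, repeat=2)}
```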

Key decisions

Accuracy stored as IntegerType (Spark's original value); Velox computes epsilon = 1.0/accuracy internally.
KLL chosen over GK for Spark-side fallback to maintain intermediate data compatibility with Velox.
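
As a rough illustration of the accuracy decision, the sketch below converts Spark's integer accuracy into epsilon and into an approximate rank error in rows. The helper names are hypothetical, and the rank-error figure is only an approximation of the bound, not a statement about either implementation's exact guarantee.

```python
def epsilon_from_accuracy(accuracy: int) -> float:
    """Convert Spark's integer accuracy into a relative error epsilon."""
    if accuracy <= 0:
        raise ValueError("accuracy must be a positive integer")
    return 1.0 / accuracy


def max_rank_error(row_count: int, accuracy: int) -> float:
    """Approximate rank error, in rows, allowed for a given accuracy."""
    return row_count * epsilon_from_accuracy(accuracy)
```

With Spark's default accuracy of 10000, `epsilon_from_accuracy(10000)` gives a relative error of 1e-4, so a million-row input tolerates a rank error of roughly 100 rows.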

Velox dependency

facebookincubator/velox#16320

Related issue: #4889

Testing

Velox uses the KLL sketch algorithm for approx_percentile, while Spark uses the GK (Greenwald-Khanna) algorithm. Both are approximate and produce results within error bounds, but they may select different concrete values at percentile boundaries. For example, for integers 1..1000, the exact 25th percentile is 250.25 — GK returns 250 while KLL may return 251. This difference cannot be eliminated by increasing precision.
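
A minimal sketch of what a tolerance-based assertion looks like for the 1..1000 example, written in Python rather than the Scala used by the actual suites. `assert_within_tolerance` is a hypothetical helper; the tolerance of 2 mirrors the `kllTolerance` value mentioned in this PR.

```python
import statistics

KLL_TOLERANCE = 2  # mirrors the kllTolerance value mentioned in this PR


def assert_within_tolerance(actual: float, expected: float,
                            tolerance: float = KLL_TOLERANCE) -> None:
    """Fail unless `actual` is within `tolerance` of `expected`."""
    assert abs(actual - expected) <= tolerance, (
        f"{actual} not within {tolerance} of {expected}"
    )


values = list(range(1, 1001))
# The default 'exclusive' method yields 250.25 for the first quartile of 1..1000.
exact_p25 = statistics.quantiles(values, n=4)[0]

assert_within_tolerance(250, exact_p25)  # value reported for GK in the description
assert_within_tolerance(251, exact_p25)  # value KLL may return
```

A strict equality check against 250 would accept only the first of these two results, while the tolerance check accepts both.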

Changes Overview

graph TD
    subgraph "Root Cause"
        RC["Velox KLL sketch ≠ Spark GK algorithm<br/>Different approximate values at percentile boundaries"]
    end

    subgraph "VeloxTestSettings.scala — Excludes (4 suites)"
        E1["GlutenApproximatePercentileQuerySuite<br/><b>8 tests excluded</b>"]
        E2["GlutenDataFrameAggregateSuite<br/><b>1 test excluded</b>"]
        E3["GlutenDataFramePivotSuite<br/><b>1 test excluded</b>"]
        E4["GlutenDataFrameSuite<br/><b>1 test excluded</b>"]
    end

    subgraph "GlutenApproximatePercentileQuerySuite.scala — Overrides"
        O["<b>8 tests rewritten</b> with tolerance-based assertions<br/>(kllTolerance = 2)"]
    end

    RC --> E1 & E2 & E3 & E4
    E1 -- "excluded then re-added<br/>via testGluten" --> O

Detailed Breakdown

1. `GlutenApproximatePercentileQuerySuite` — 8 tests excluded & overridden

2. `GlutenDataFrameAggregateSuite` — 1 test excluded

Velox KLL sketch produces different results from Spark's GK algorithm, especially with low accuracy (accuracy=1).

3. `GlutenDataFramePivotSuite` — 1 test excluded

Velox KLL sketch produces different approximate values from Spark's GK algorithm in pivot context.

4. `GlutenDataFrameSuite` — 1 test excluded

DataFrame.summary() uses approx_percentile internally; Velox KLL sketch produces different percentile values from Spark's GK algorithm.

github-actions · 2026-02-25T02:58:31Z

Run Gluten Clickhouse CI on x86

jinchengchenghh · 2026-02-25T10:21:16Z

Please update get-velox.sh to test your PR, then you can verify if both can work well, you may update this line https://github.com/apache/incubator-gluten/blob/5d3f7145cd7fc258aa10b434ea4ec651bd82c764/ep/build-velox/src/get-velox.sh#L28

jinchengchenghh · 2026-02-25T10:23:27Z

Do we need the config? Usually we offload the function to native by default

Yizhou-Yang · 2026-02-25T12:00:06Z

> Please update get-velox.sh to test your PR, then you can verify if both can work well, you may update this line
> https://github.com/apache/incubator-gluten/blob/5d3f7145cd7fc258aa10b434ea4ec651bd82c764/ep/build-velox/src/get-velox.sh#L28

Added 16320 to get-velox.sh and removed the config.

jinchengchenghh · 2026-03-03T11:33:09Z

Please update the PR description to describe the KLL Sketch is different so that we handle fallback separately.

Yizhou-Yang · 2026-03-03T11:54:58Z

> Please update the PR description to describe the KLL Sketch is different so that we handle fallback separately.

done~

Yizhou-Yang · 2026-03-19T09:14:27Z

> @Yizhou-Yang Thanks for your implementation! I have recently been cherry-picking your PR and found that the KLL implementation in Gluten has only one level and discards items in odd positions when compacting. I am wondering whether this implementation can meet the accuracy requirement. Will the relative error rate be too high?

Changelog:

Rewrote KLL sketch with proper multi-level compaction — the old implementation only had one level and discarded odd-position items, which is essentially random downsampling with no error bound guarantee. Now it uses the standard KLL algorithm: items across multiple levels, level-0 inserts, sort-and-halve compaction promoting to higher levels, with each level-i item representing 2^i original values.

Merge correctly combines multi-level sketches and re-compacts, instead of simple concatenation + truncation.
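
The changelog above can be sketched as follows. This is a simplified Python illustration of multi-level sort-and-halve compaction and level-wise merge, not the Scala code in `KllSketchHelper`. `SimpleKll` and its methods are hypothetical names, and it assumes one fixed capacity per level, whereas standard KLL gives lower levels smaller capacities.

```python
import random


class SimpleKll:
    """Illustrative multi-level KLL-style sketch; not Gluten's KllSketchHelper."""

    def __init__(self, capacity: int = 200, seed: int = 0):
        # Assumption: one fixed capacity per level. Real KLL shrinks lower levels.
        self.capacity = capacity
        self.levels = [[]]  # levels[i] holds items that each stand for 2**i inputs
        self.rng = random.Random(seed)

    def add(self, value: float) -> None:
        """New values always enter at level 0."""
        self.levels[0].append(value)
        self._compact()

    def merge(self, other: "SimpleKll") -> None:
        """Combine level by level, then re-compact any level that overflowed."""
        for i, buf in enumerate(other.levels):
            while len(self.levels) <= i:
                self.levels.append([])
            self.levels[i].extend(buf)
        self._compact()

    def _compact(self) -> None:
        level = 0
        while level < len(self.levels):
            buf = self.levels[level]
            if len(buf) > self.capacity:
                buf.sort()
                # Keep one item behind when the count is odd so weight is conserved.
                leftover = [buf.pop()] if len(buf) % 2 else []
                offset = self.rng.randint(0, 1)  # random parity avoids a fixed bias
                promoted = buf[offset::2]        # sort-and-halve
                if level + 1 == len(self.levels):
                    self.levels.append([])
                self.levels[level + 1].extend(promoted)
                self.levels[level] = leftover
            level += 1

    def total_weight(self) -> int:
        """Sum of item weights; equals the number of values added."""
        return sum(len(buf) << i for i, buf in enumerate(self.levels))

    def quantile(self, q: float) -> float:
        """Return the first retained item whose cumulative weight reaches q * n."""
        weighted = sorted((v, 1 << i) for i, buf in enumerate(self.levels) for v in buf)
        if not weighted:
            raise ValueError("empty sketch")
        target = q * sum(w for _, w in weighted)
        running = 0
        for value, weight in weighted:
            running += weight
            if running >= target:
                return value
        return weighted[-1][0]
```

Because every compaction halves an even number of items and doubles their weight, `total_weight()` always equals the number of values added, both after inserts and after a merge. Simple concatenation followed by truncation does not preserve that invariant.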

Why some tests are disabled:

KLL and Spark's native GK (Greenwald-Khanna) are fundamentally different algorithms. Both satisfy the 1/accuracy error bound, but they produce different concrete values at the same percentile boundary. For example, given integers 1–1000, the exact 25th percentile is 250.25 — GK returns 250, KLL may return 251. Both are correct within the error bound, but a strict equality assertion like assert(result == 250) will fail.

For those tests I added testGluten overrides in GlutenApproximatePercentileQuerySuite. The off-by-one problem can't be solved simply by adding more layers to the partial result. I tried not to disable any test during development, but the current version is the best I can do for now.

The 4 disabled tests (approx_percentile, summary, SPARK-35480, SPARK-32908) all use exact-match assertions against GK's specific output. Rather than modifying upstream Spark tests, we excluded them and added a dedicated GlutenApproximatePercentileQuerySuite with tolerance-based assertions that validate correctness for both algorithms.

Some other tests in collect_list assumed that approx_percentile would fall back; I changed them to use percentile.

I only tested spark35/spark35smj and their slow versions; hopefully it also passes spark33, etc.

PTAL again... @jinchengchenghh @zhztheplayer

github-actions · 2026-05-09T02:14:40Z

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

jinchengchenghh · 2026-05-11T09:12:02Z

Sorry for missing this PR, looks good to me, do you have further comments? @zhztheplayer

zhztheplayer

👍

jinchengchenghh · 2026-05-12T09:12:17Z

Could you help resolve the conflict? Then we can merge it, thanks! @Yizhou-Yang

Yizhou-Yang · 2026-05-15T12:18:30Z

> Could you help resolve the conflict? Then we can merge it, thanks! @Yizhou-Yang

ready @jinchengchenghh

Yizhou-Yang · 2026-05-18T12:28:37Z

@jinchengchenghh

Sorry, it was not actually ready; I hadn't run the local tests.
I have now rerun all the local tests, so hopefully it works.

Yizhou-Yang · 2026-05-20T02:49:15Z

Hopefully this fixes the spark40/41 related cases:

PTAL again... @jinchengchenghh

…smatch
- Remove incorrect 'approx_percentile' -> 'spark_approx_percentile' mapping in SubstraitParser.cc (Velox registers with empty prefix)
- Fix accuracy field type from DoubleType to IntegerType in VeloxApproxPercentile.scala to match Velox intermediate type definition
- Add testNameBlackList for 'different column types' test across all spark versions (KLL vs GK algorithm produces off-by-one results)
- Add testGluten override with tolerance-based assertions for the 'different column types' test

ClickHouse backend has its own approx_percentile implementation that differs from Velox's KLL sketch. The testGluten overrides in GlutenApproximatePercentileQuerySuite are specifically designed for Velox's KLL sketch behavior and should not run on ClickHouse backend. Add excludeGlutenTest entries in ClickHouseTestSettings for all spark versions (33/34/35/40/41) to skip these Velox-specific tests.

KLL sketch OOM in KllSketchHelper.merge:
- Expand worklevels capacity to max(ub, provisionalNumLevels) + 8 to avoid out-of-bounds writes when generalCompress promotes a new top level.
- Add bound checks in generalCompress when writing inLevels(level+2) and when promoting currentNumLevels, preventing memory corruption that inflated targetItemCount to GiB scale.
- Defensively clamp tmpNumItems, finalNumItems, finalNumLevels and finalCapacity against MAX_KLL_BUFFER_SIZE (1 MiB doubles, ~8 MiB). In valid sketches these clamps are no-ops; only corrupted intermediate state (which previously triggered SparkOutOfMemoryError requesting 11.4 GiB) is bounded.

Exclude UT cases that legitimately differ between Velox KLL and Spark GK:
- spark40/41: SPARK-32908 (requires Vanilla spark resource files, same reason as spark33/34/35).
- spark40/41 GlutenDataFrameStatSuite: 'approximate quantile 2: test relativeError greater than 1' (KLL=510 vs GK=524 on synthetic dataset).
- spark33/34 VeloxSQLQueryTestSettings: disable describe-table-column.sql whose ANALYZE COLUMNS histogram bin boundaries depend on percentile_approx results, which differ between KLL and GK on small datasets.
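
The defensive clamping described in the OOM fix can be illustrated with a small sketch. It assumes that "1 MiB doubles" means 2^20 double slots (about 8 MiB); `clamp_item_count` and `buffer_bytes` are hypothetical helpers, not functions from the PR.

```python
BYTES_PER_DOUBLE = 8
MAX_KLL_BUFFER_SIZE = 1 << 20  # assumed to mean 2**20 double slots (~8 MiB)


def clamp_item_count(requested: int, limit: int = MAX_KLL_BUFFER_SIZE) -> int:
    """Bound a computed item count to [0, limit]; a no-op for valid sketches."""
    return max(0, min(requested, limit))


def buffer_bytes(item_count: int) -> int:
    """Bytes needed to hold `item_count` doubles after clamping."""
    return clamp_item_count(item_count) * BYTES_PER_DOUBLE
```

Under this assumption a corrupted count can never request more than 8 MiB, and any valid count passes through unchanged.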

github-actions Bot added CORE works for Gluten Core VELOX labels Feb 25, 2026

Yizhou-Yang mentioned this pull request Feb 25, 2026

feat: Add Spark approx_percentile aggregate function facebookincubator/velox#16320

Closed

Yizhou-Yang changed the title ~~feat:support gluten-level approx_percentile~~ [GLUTEN-4889][VL] feat:support gluten-level approx_percentile Feb 25, 2026

github-actions Bot added the CLICKHOUSE label Feb 25, 2026

github-actions Bot added BUILD and removed CORE works for Gluten Core labels Feb 25, 2026

github-actions Bot added the CORE works for Gluten Core label Mar 2, 2026

Yizhou-Yang force-pushed the percentile0225 branch from 61dc2ca to c76cb84 Compare March 2, 2026 03:37

github-actions Bot removed the CORE works for Gluten Core label Mar 2, 2026

Yizhou-Yang force-pushed the percentile0225 branch from 1b402de to b0679ad Compare March 6, 2026 03:08

Yizhou-Yang requested a review from jinchengchenghh March 18, 2026 07:50

Yizhou-Yang force-pushed the percentile0225 branch from e9a5d18 to 48c34b5 Compare March 18, 2026 11:30

github-actions Bot added the stale stale label May 9, 2026

zhztheplayer approved these changes May 11, 2026

github-actions Bot removed the stale stale label May 12, 2026

Yizhou-Yang force-pushed the percentile0225 branch from 48c34b5 to 88fa675 Compare May 14, 2026 12:07

Yizhou-Yang force-pushed the percentile0225 branch from 88fa675 to 37ca1bf Compare May 18, 2026 02:15

Yizhou-Yang added 4 commits May 23, 2026 09:33

[GLUTEN-4889][VL] feat: Support approx_percentile aggregate function

26276b3

Yizhou-Yang force-pushed the percentile0225 branch from 1e00ccf to 5dd180e Compare May 23, 2026 01:42
