[GLUTEN-4889][VL] feat: Support approx_percentile aggregate function #11651
Yizhou-Yang wants to merge 4 commits into
Please update get-velox.sh to test your PR so that you can verify both changes work well together. You may update this line: https://github.com/apache/incubator-gluten/blob/5d3f7145cd7fc258aa10b434ea4ec651bd82c764/ep/build-velox/src/get-velox.sh#L28
Do we need the config? Usually we offload the function to native by default.
Added facebookincubator/velox#16320 and removed the config.
Force-pushed: 61dc2ca to c76cb84
Please update the PR description to explain that the KLL sketch is different, which is why we handle fallback separately.
Done.
Force-pushed: 1b402de to b0679ad
Force-pushed: e9a5d18 to 48c34b5
PTAL again... @jinchengchenghh @zhztheplayer
This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.
Sorry for missing this PR. It looks good to me. Do you have further comments? @zhztheplayer
Could you help resolve the conflict? Then we can merge it, thanks! @Yizhou-Yang
Force-pushed: 48c34b5 to 88fa675
Ready @jinchengchenghh
Force-pushed: 88fa675 to 37ca1bf
Hopefully I fixed the spark40/41 related cases. PTAL again... @jinchengchenghh
…smatch
- Remove incorrect 'approx_percentile' -> 'spark_approx_percentile' mapping in SubstraitParser.cc (Velox registers with empty prefix)
- Fix accuracy field type from DoubleType to IntegerType in VeloxApproxPercentile.scala to match Velox intermediate type definition
- Add testNameBlackList for 'different column types' test across all spark versions (KLL vs GK algorithm produces off-by-one results)
- Add testGluten override with tolerance-based assertions for the 'different column types' test
ClickHouse backend has its own approx_percentile implementation that differs from Velox's KLL sketch. The testGluten overrides in GlutenApproximatePercentileQuerySuite are specifically designed for Velox's KLL sketch behavior and should not run on ClickHouse backend. Add excludeGlutenTest entries in ClickHouseTestSettings for all spark versions (33/34/35/40/41) to skip these Velox-specific tests.
KLL sketch OOM in KllSketchHelper.merge:
- Expand worklevels capacity to max(ub, provisionalNumLevels) + 8 to avoid out-of-bounds writes when generalCompress promotes a new top level.
- Add bound checks in generalCompress when writing inLevels(level+2) and when promoting currentNumLevels, preventing memory corruption that inflated targetItemCount to GiB scale.
- Defensively clamp tmpNumItems, finalNumItems, finalNumLevels and finalCapacity against MAX_KLL_BUFFER_SIZE (1 Mi doubles, ~8 MiB). In valid sketches these clamps are no-ops; only corrupted intermediate state (which previously triggered SparkOutOfMemoryError requesting 11.4 GiB) is bounded.

Exclude UT cases that legitimately differ between Velox KLL and Spark GK:
- spark40/41: SPARK-32908 (requires Vanilla spark resource files, same reason as spark33/34/35).
- spark40/41 GlutenDataFrameStatSuite: 'approximate quantile 2: test relativeError greater than 1' (KLL=510 vs GK=524 on synthetic dataset).
- spark33/34 VeloxSQLQueryTestSettings: disable describe-table-column.sql whose ANALYZE COLUMNS histogram bin boundaries depend on percentile_approx results, which differ between KLL and GK on small datasets.
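The defensive sizing described in the OOM fix above can be illustrated with a minimal sketch. This is not the code in KllSketchHelper: the object name, the method names, and the reading of MAX_KLL_BUFFER_SIZE as 2^20 doubles (about 8 MiB) are assumptions made for illustration only.

```scala
// Hypothetical illustration; not the actual KllSketchHelper implementation.
object KllClampSketch {
  // Assumed cap: 2^20 doubles, i.e. 8 MiB at 8 bytes per double.
  val MaxKllBufferSize: Int = 1 << 20

  /** Clamp a possibly corrupted item count into [0, MaxKllBufferSize]. */
  def clampItemCount(requested: Long): Int =
    math.max(0L, math.min(requested, MaxKllBufferSize.toLong)).toInt

  /** Work-array length with headroom, following max(ub, provisionalNumLevels) + 8. */
  def workLevelsCapacity(ub: Int, provisionalNumLevels: Int): Int =
    math.max(ub, provisionalNumLevels) + 8
}
```

For a valid sketch the clamp returns its input unchanged, so it only takes effect when an intermediate count has been corrupted into an implausibly large value.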
Force-pushed: 1e00ccf to 5dd180e


What
Add Velox approx_percentile support for Spark.

Why
Velox uses the KLL sketch while Spark uses the GK algorithm, so their intermediate data formats are incompatible (KLL: 9-field StructType vs GK: single BinaryType buffer). This means fallback between Velox and Spark requires separate handling.

How
- VeloxApproximatePercentile: A DeclarativeAggregate with 9 aggBufferAttributes matching Velox's KLL sketch layout.
- KllSketchHelper / KllSketchAdd / KllSketchMerge / KllSketchEval: Simplified KLL operations for fallback, binary-compatible with Velox's C++ accumulator.
- ApproxPercentileRewriteRule: Rewrites Spark's ApproximatePercentile to the Velox-compatible version.

Key decisions
- Accuracy type: IntegerType (Spark's original value); Velox computes epsilon = 1.0/accuracy internally.

Velox dependency
facebookincubator/velox#16320
Related issue: #4889

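The relationship epsilon = 1.0/accuracy noted under Key decisions can be shown with a short sketch. The object and method names are hypothetical, and the rank-error estimate (epsilon times the row count) is the usual reading of a relative-error bound, not something taken from the Velox source.

```scala
// Hypothetical helper; names are illustrative and not part of Gluten or Velox.
object AccuracyToEpsilon {
  /** Relative error implied by an integer accuracy: epsilon = 1.0 / accuracy. */
  def epsilon(accuracy: Int): Double = {
    require(accuracy > 0, s"accuracy must be positive, got $accuracy")
    1.0 / accuracy
  }

  /** Rough rank-error budget for n input rows, assuming error scales as epsilon * n. */
  def maxRankError(accuracy: Int, n: Long): Double = epsilon(accuracy) * n
}
```

With Spark's default accuracy of 10000, `AccuracyToEpsilon.epsilon(10000)` gives a relative error of 1.0e-4, while `accuracy = 1` gives an epsilon of 1.0, which is why low-accuracy tests diverge most visibly between the two algorithms.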
Testing
Velox uses the KLL sketch algorithm for approx_percentile, while Spark uses the GK (Greenwald-Khanna) algorithm. Both are approximate and produce results within error bounds, but they may select different concrete values at percentile boundaries. For example, for integers 1..1000, the exact 25th percentile is 250.25; GK returns 250 while KLL may return 251. This difference cannot be eliminated by increasing precision.

Changes Overview
graph TD
  subgraph "Root Cause"
    RC["Velox KLL sketch ≠ Spark GK algorithm<br/>Different approximate values at percentile boundaries"]
  end
  subgraph "VeloxTestSettings.scala - Excludes (4 suites)"
    E1["GlutenApproximatePercentileQuerySuite<br/><b>8 tests excluded</b>"]
    E2["GlutenDataFrameAggregateSuite<br/><b>1 test excluded</b>"]
    E3["GlutenDataFramePivotSuite<br/><b>1 test excluded</b>"]
    E4["GlutenDataFrameSuite<br/><b>1 test excluded</b>"]
  end
  subgraph "GlutenApproximatePercentileQuerySuite.scala - Overrides"
    O["<b>8 tests rewritten</b> with tolerance-based assertions<br/>(kllTolerance = 2)"]
  end
  RC --> E1 & E2 & E3 & E4
  E1 -- "excluded then re-added<br/>via testGluten" --> O

Detailed Breakdown
1. GlutenApproximatePercentileQuerySuite: 8 tests excluded & overridden.
2. GlutenDataFrameAggregateSuite: 1 test excluded. Reason: Velox KLL sketch produces different results from Spark's GK algorithm, especially with low accuracy (accuracy=1).
3. GlutenDataFramePivotSuite: 1 test excluded. Reason: different approximate values from Spark's GK algorithm in pivot context.
4. GlutenDataFrameSuite: 1 test excluded. Reason: DataFrame.summary() uses approx_percentile internally; Velox KLL sketch produces different percentile values from Spark's GK algorithm.
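
The tolerance-based overrides mentioned above (kllTolerance = 2) replace exact equality with a bounded difference. A minimal sketch of that idea in plain Scala follows; the helper names and the Long type are assumptions, and the real overrides in GlutenApproximatePercentileQuerySuite may be structured differently.

```scala
// Hypothetical helpers; only the value kllTolerance = 2 comes from the PR description.
object ToleranceAssertions {
  val kllTolerance: Long = 2L

  /** True when `actual` is within `tolerance` of `expected`. */
  def withinTolerance(actual: Long, expected: Long, tolerance: Long = kllTolerance): Boolean =
    math.abs(actual - expected) <= tolerance

  /** Fails with a descriptive message instead of requiring exact equality. */
  def assertApprox(actual: Long, expected: Long, tolerance: Long = kllTolerance): Unit =
    assert(
      withinTolerance(actual, expected, tolerance),
      s"expected $expected +/- $tolerance but got $actual")
}
```

Under this check, both a GK-style answer of 250 and a KLL-style answer of 251 pass against a reference of 250, while a value such as 253 would still fail.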