feat: Enable DPP support with native_datafusion scan #3060
Conversation
Query Plans (selectivity=0.01)
Note: Spark uses `FileScan parquet` (the standard Spark scan).
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```diff
@@            Coverage Diff             @@
##             main    #3060      +/-   ##
============================================
- Coverage   56.12%   54.59%    -1.54%
- Complexity    976     1262      +286
============================================
  Files         119      167       +48
  Lines       11743    15522     +3779
  Branches     2251     2574      +323
============================================
+ Hits         6591     8474     +1883
- Misses       4012     5830     +1818
- Partials     1140     1218       +78
```
Force-pushed from 58f60e7 to 1b6f816.
```scala
// Add runtime filter bounds if available
// These are pushed down from join operators to enable I/O reduction
addRuntimeFilterBounds(scan, nativeScanBuilder)
```
The dataFilters have been pushed down to the native scan.
datafusion-comet/spark/src/main/scala/org/apache/comet/serde/operator/CometNativeScan.scala
Lines 116 to 124 in 69b2260
```scala
val dataFilters = new ListBuffer[Expr]()
for (filter <- scan.supportedDataFilters) {
  exprToProto(filter, scan.output) match {
    case Some(proto) => dataFilters += proto
    case _ =>
      logWarning(s"Unsupported data filter $filter")
  }
}
nativeScanBuilder.addAllDataFilters(dataFilters.asJava)
```
Thanks for sharing @wForget - it looks like these changes were not required.
Filters are pushed down, but native filtering is not implemented, right? I think those fall back to Spark.
> Filters are pushed down, but native filtering is not implemented, right? I think those fall back to Spark.
There are several possibilities:
- Runtime filters have been pushed down to the native Parquet reader, but the reader is not enabled or does not support them. (Parquet row-group filters will be ineffective, but the subsequent Filter operator will still be effective.)
- Runtime filters have been generated, but they caused the Spark plan to fall back to vanilla Spark. (As far as I know, Comet's support for subqueries may not be complete.)
- Runtime filters were not generated correctly.
Thanks for sharing. I tried executing native runtime filters but did not see any improvement.
The benchmark shows Comet is slower than Spark for these workloads. This could be due to:
- JNI overhead: significant cost crossing the JVM/native boundary.
- Spark's vectorized reader: highly optimized for in-memory Parquet reading.
- Runtime filters not effective: filters may not be pruning row groups because:
  - the benchmark Parquet files don't have bloom filters written, or
  - the data sits in only a few row groups (small files).
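To illustrate the last point, here is a minimal, self-contained sketch (the names `RowGroupStats` and `groupsToRead` are illustrative, not Comet or Parquet API) of min/max row-group pruning: a reader can skip a row group for an equality predicate only when the value falls outside the group's `[min, max]` statistics, so a single large row group covering the whole value range can never be skipped.

```scala
// Hypothetical model of Parquet row-group statistics (not the real API).
final case class RowGroupStats(min: Long, max: Long)

object RowGroupPruning {
  // Row groups that must still be read for a predicate `col = value`:
  // a group survives if `value` lies inside its [min, max] range.
  def groupsToRead(groups: Seq[RowGroupStats], value: Long): Seq[RowGroupStats] =
    groups.filter(g => value >= g.min && value <= g.max)

  def main(args: Array[String]): Unit = {
    // Many small row groups: the equality filter skips most of them.
    val manyGroups = Seq(RowGroupStats(0, 99), RowGroupStats(100, 199), RowGroupStats(200, 299))
    println(groupsToRead(manyGroups, 150).size) // only the second group survives

    // One big row group covering the whole range: nothing can be skipped.
    val oneGroup = Seq(RowGroupStats(0, 299))
    println(groupsToRead(oneGroup, 150).size) // the single group must still be read
  }
}
```

This is why benchmark files written as a few large row groups show little or no I/O reduction from runtime filters, regardless of whether the pushdown itself works.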
Force-pushed from 2b5c120 to 1add4cf.
```scala
withInfo(scanExec, s"Full native scan disabled because ${COMET_EXEC_ENABLED.key} disabled")
}
```

```scala
// Native DataFusion doesn't support subqueries/dynamic pruning
```
With or without this if condition I am seeing the same benchmark results; I am not sure when this is used.
Update: Although I could not achieve what I was trying to do, the takeaway is: even for non-partitioned tables, where DPP operates via dataFilters (not partitionFilters), Comet's native DataFusion scan shows massive I/O reduction through efficient filter pushdown.
Ref #3053
Rationale for this change
Dynamic Partition Pruning (DPP) was previously disabled for native_datafusion scan due to subquery handling concerns. However, Spark's DPP mechanism already evaluates dynamic filters and provides pre-filtered partition lists via dynamicallySelectedPartitions. This PR enables DPP support, achieving up to 301x I/O reduction by leveraging Spark's existing DPP infrastructure with native scan.
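A minimal sketch of the idea (the names `PartitionDir` and `dynamicallySelected` are illustrative, loosely echoing Spark's `dynamicallySelectedPartitions` on `FileSourceScanExec`; this is not Comet code): Spark evaluates the DPP subquery on the JVM side and the scan only ever receives the surviving partition list, so the native reader itself needs no subquery support.

```scala
// Hypothetical model of a partitioned fact table on disk.
final case class PartitionDir(key: Int, files: Seq[String])

object DppSketch {
  // All partitions of the fact table.
  val allPartitions: Seq[PartitionDir] =
    (0 until 10).map(k => PartitionDir(k, Seq(s"part=$k/data.parquet")))

  // Keys produced by the broadcast side of the join (the DPP subquery result).
  // Spark applies this pruning before handing partitions to the scan, so the
  // native side only ever sees the pre-filtered list.
  def dynamicallySelected(dppKeys: Set[Int]): Seq[PartitionDir] =
    allPartitions.filter(p => dppKeys.contains(p.key))

  def main(args: Array[String]): Unit = {
    val survivors = dynamicallySelected(Set(2, 7))
    println(survivors.map(_.key)) // only 2 of 10 partitions reach the scan
  }
}
```

Under this division of labor, enabling DPP with `native_datafusion` is safe: the dynamic filter is resolved entirely in the JVM, and the native scan simply reads fewer files.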
What changes are included in this PR?
How are these changes tested?
- Unit tests
- Add CometDPPSuite, a test suite validating DPP with the native scan
- Add CometDPPBenchmark, a benchmark showing I/O reduction metrics
Run the benchmark using