What is the problem the feature request solves?
Runtime filtering is a high-impact optimization for join-heavy workloads: by filtering data at scan time, it can dramatically reduce I/O (often 90% or more on selective joins) and improve query performance.
Runtime filters are lightweight data structures (IN sets, Min/Max bounds, Bloom filters) built during the hash join build phase and pushed down to scan operators so that non-matching data is filtered out before it is read from storage.
Example:

```sql
-- User writes standard SQL
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA';
```
- Comet detects the join opportunity
- Builds a filter from the `customers` table during the join build phase
- Applies the filter to the `orders` scan automatically
- The query runs faster with less I/O
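Conceptually, the steps above amount to injecting a derived predicate into the probe-side scan. A minimal Python sketch of the idea (the tables, values, and filter logic are illustrative only, not Comet's actual implementation):

```python
# Illustrative sketch of runtime filtering; not Comet's actual code.

# Build side: customers, with the query's country = 'USA' predicate.
customers = [
    {"id": 1, "name": "Alice", "country": "USA"},
    {"id": 2, "name": "Bob", "country": "DE"},
    {"id": 3, "name": "Carol", "country": "USA"},
]

# Hash join build phase: collect the join keys that survive the
# build-side filter. This set becomes the runtime IN filter.
in_filter = {c["id"] for c in customers if c["country"] == "USA"}

# Probe side: the runtime filter is applied at scan time, so rows that
# cannot possibly match are skipped before they reach the join operator.
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},
]
scanned = [o for o in orders if o["customer_id"] in in_filter]

print([o["order_id"] for o in scanned])  # [10, 12]
```

Only orders 10 and 12 survive the scan; order 11 never reaches the join because its customer is not in the build-side key set.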
Filter Types
- IN Filter (small cardinality, <1000)
- Min/Max Filter (numeric/date types)
- Bloom Filter (large cardinality, future; probabilistic data structure)
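As a rough illustration of the Bloom filter variant, here is a minimal sketch (the hash scheme and sizes are arbitrary; a real implementation would size the bit array from the expected key count and the configured false-positive rate):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter sketch: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into a single int

    def _positions(self, key):
        # Derive num_hashes bit positions from a salted SHA-256 of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

bf = BloomFilter()
for customer_id in [1, 3, 7]:
    bf.add(customer_id)

print(bf.might_contain(3))    # True
print(bf.might_contain(999))  # almost certainly False
```

Because a Bloom filter admits false positives but no false negatives, it is safe to use for scan-time pruning: it may let a few non-matching rows through to the join, but it never discards a matching row.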
The system should automatically select the optimal filter type:
- Numeric/date → Min/Max Filter (most efficient)
- Small cardinality → IN Filter
- Large cardinality → Bloom Filter
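The selection heuristic above could be sketched as follows (the function name, type checks, and threshold handling are placeholders; the real logic would be driven by build-side statistics):

```python
import datetime

# Hypothetical selection logic; the threshold mirrors the proposed
# spark.comet.runtimeFilter.inFilterThreshold default of 1000.
IN_FILTER_THRESHOLD = 1000

def choose_filter(build_keys):
    """Pick a runtime filter type from the build-side join keys."""
    distinct = set(build_keys)
    # Numeric/date keys admit a cheap Min/Max range filter.
    if all(isinstance(k, (int, float, datetime.date)) for k in distinct):
        return ("minmax", min(distinct), max(distinct))
    # Small cardinality: an exact IN set is cheap to build and probe.
    if len(distinct) < IN_FILTER_THRESHOLD:
        return ("in", distinct)
    # Large cardinality: fall back to a Bloom filter (future work).
    return ("bloom", None)

print(choose_filter([5, 1, 9]))      # ('minmax', 1, 9)
print(choose_filter(["a", "b"])[0])  # in
```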
Users should be able to control runtime filters via Spark configuration:
```scala
// Enable/disable runtime filters
spark.conf.set("spark.comet.runtimeFilter.enabled", true)

// Adjust thresholds
spark.conf.set("spark.comet.runtimeFilter.inFilterThreshold", 1000)
spark.conf.set("spark.comet.runtimeFilter.bloomFilterFpp", 0.01)
```
Describe the potential solution
No response
Additional context
No response