Skip to content

[Proposal] Runtime Filters for DataFusion Comet #3053

@Shekharrajak

Description

@Shekharrajak

What is the problem the feature request solves?

Runtime Filters is a high-performance optimization that reduces I/O by 90% and improves query performance for join-heavy workloads. The feature filters data at scan time using lightweight data structures constructed during hash join build phases.

Runtime filters are lightweight data structures (IN sets, Min/Max bounds, Bloom filters) built during hash join build phases and pushed down to scan operators to filter data before reading from storage.

Example:

-- User writes standard SQL
SELECT o.*, c.name 
FROM orders o 
JOIN customers c ON o.customer_id = c.id 
WHERE c.country = 'USA';
  1. Comet will detect join opportunity
  2. Builds filter from customers table during join build
  3. Applies filter to orders scan automatically
  4. Query runs faster with less I/O

Filter Types

  1. IN Filter (Small cardinality <1000)

  2. Min/Max Filter (Numeric/date types)

    • Min and max bounds
  3. Bloom Filter (Large cardinality, future)

    • Probabilistic data structure

The system should automatically selects the optimal filter type:

  • Numeric/date → Min/Max Filter (most efficient)
  • Small cardinality → IN Filter
  • Large cardinality → Bloom Filter

Users should be able to control runtime filters via Spark configuration:

// Enable/disable runtime filters
spark.conf.set("spark.comet.runtimeFilter.enabled", true)

// Adjust thresholds
spark.conf.set("spark.comet.runtimeFilter.inFilterThreshold", 1000)
spark.conf.set("spark.comet.runtimeFilter.bloomFilterFpp", 0.01)

Describe the potential solution

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions