BooleanQuery: narrow bulk scoring when FILTER matches primary index sort by iprithv · Pull Request #15991 · apache/lucene

iprithv · 2026-04-28T18:13:41Z

Description

Implements #15139 : When a BooleanQuery has a FILTER clause that constrains the primary index sort field, rewrite to an internal query that can restrict the main clause's BulkScorer to a contiguous doc ID range derived from the filter, when it is safe to do so.

The optimization currently applies to:

sorted numeric range filters on the primary index sort field
exact term filters on a primary SortedSetSortField when doc values and postings agree on a dense range

If the range cannot be proven safe for a leaf, execution falls back to the original boolean query's bulk scorer for that leaf.

Benchmarking

Focused JMH benchmark: PrimaryIndexSortFilterBenchmark

Benchmark	Baseline (`origin/main`)	Candidate	Delta
`benchmarkSortedTopScores`	`1.315 ± 0.013 ops/ms`	`2.286 ± 0.032 ops/ms`	`~+74%`
`benchmarkSortedComplete`	`1.491 ± 0.020 ops/ms`	`2.145 ± 0.137 ops/ms`	`~+44%`
`benchmarkSortedCompleteNoScores`	`3.249 ± 0.027 ops/ms`	`4.774 ± 0.035 ops/ms`	`~+47%`
`benchmarkUnsortedTopScores`	`1.306 ± 0.014 ops/ms`	`1.334 ± 0.009 ops/ms`	`~+2%`

These numbers compare origin/main with the same benchmark class copied in against this branch, using the benchmark's full JMH annotations.

Signed-off-by: prithvi <prithvisivasankar@gmail.com>

iverase · 2026-05-07T13:42:08Z

I found the current implementation a bit confusing with all those type checks, and it doesn't seem fully implemented for all query types.

I wonder if we should implement this in a more java idiomatic way, for example by introducing a new base class for queries that can be used when filtering on a primary sort, something like:

public abstract class PrimarySortAlignedQuery extends Query {

  /** Field constrained by this query. */
  public abstract String getField();

  /** Whether a dense doc id interval can be derived on this leaf from primary sort layout. */
  public abstract boolean supportsDenseLeafRange(LeafReaderContext context) throws IOException;

  /** Matching docs as {@code [minDoc, maxDoc)} on this leaf, or {@code null} if unknown / not dense. */
  public abstract DocIdRange denseDocIdRangeOrNull(LeafReaderContext context) throws IOException;
}

iprithv · 2026-05-09T08:42:42Z

@iverase Agree the old shape might be hard to follow. I did a small refactor, I moved the logic that decides when a FILTER matches the primary sort field and we can restrict bulk scoring to a doc id range behind a tiny PrimarySortAlignable hook plus a helper for TermQuery, so the behaviour stays the same but the type checks aren’t scattered through FilteredOnPrimaryIndexSortFieldQuery. Thanks!

iverase · 2026-05-09T11:14:41Z

I think we want to implement this for other type of queries, let say SortedNumericDocValuesRangeQuery and SortedSetDocValuesRangeQuery when they have skippers or PointRangeQuery.

I don't think it is a good idea to have a helper ,why are you not implement the interfaces directly on the queries? it makes extending the behaviour for other queries easier IMO.

iprithv · 2026-05-09T17:08:11Z

@iverase Sure, all query types now implement PrimarySortAlignable directly. Thanks!

iverase · 2026-05-10T12:36:57Z

Tthanks. Looking at the test, it looks quite complex and a bit unexpected to have just one test when we are adding support for several queries. Would it be possible to rework the test so we can have a base class (a test case) that can be implemented by all the queries supporting this optimization?

iprithv · 2026-05-10T15:00:37Z

@iverase Sure, done. Added base class and per query subclasses so each optimiser supported query lives in its own test class. Thanks!

iverase · 2026-05-11T10:20:48Z

Thanks, this makes any query that can support this feature implement it just by implementing the interface and adding the test case.

One thing I am unhappy with the test is that the exercise very little randomization, which is one of the strengths of lucene testing framework. Looking at the name of the tests, I presume those have been generated by an agent, and one thing I notice is that agent are very good to hardcode a test to check a specific layout but they fail on building randomized tests. Do you think we can improve that?

iverase · 2026-05-11T10:35:52Z

+  String getField();
+
+  /** Whether this filter may participate in the optimization on the given index. */
+  boolean canOptimize(IndexSearcher searcher) throws IOException;


I am confused about what this method should do because every implementation you provided does a different thing. All of them checks that the field is part of the primary sort but then:

TermQuery: checks that the sort field is of a specific type and it is a singleton.

PointRangeQuery: That the field has actually points with dimension 1.

IndexSortSortedNumericDocValuesRangeQuery: no further checks.

I have the feeling what we might want to check always is that the index is dense (all documents have values) and all documents are single value. This means if the segment has deletions, we might not be able to use this optimization, or can we?

@iverase The density and single value checks are actually already there, but they live in denseDocIdRangeOrNull rather than canOptimize. The reason is that canOptimize is called once at the index level (across all segments) while density is a per leaf property, a segment that went through a merge without forceMerge might be dense, another might not. So canOptimize is intentionally a query eligibility check, and denseDocIdRangeOrNull is where each leaf either confirms a tight dense range or returns null and falls back.

Also, TermQuery.denseDocIdRangeOrNull guards with docFreq == range.maxDoc() - range.minDoc(), and IndexSortSortedNumericDocValuesRangeQuery checks pointValues.getDocCount() == reader.maxDoc() before returning a dense range (otherwise it falls back to sparse). So the safety net is there, it's just not visible from canOptimize.

Also on deletions, IndexSortSortedNumericDocValuesRangeQuery.getDocIdSetIteratorOrNull already bails early with hasDeletions() == false, so deleted segments don't participate in the optimization.

It's not obvious which checks belong in canOptimize vs denseDocIdRangeOrNull. I'll add a note to the interface javadoc if you prefer, to make that boundary explicit. Thanks!

IMHO it is not a good design to have checks scattered in different methods of the API.

I wonder if we should get rid of this method all together and always rewrite a query if a segments contains a primary sort on a field that is PrimarySortAlignable and it has no deletions?

makes sense, sure. Done. Thanks!

iverase

I think this is very close, I added a comment about code structure.

I noticed in some cases the tests are not testing the changes in this PR. Can we make sure we check the query gets rewritten as expected? In addition, could we add some test for sparse index too?

Thanks!

iverase · 2026-05-12T09:40:40Z

+  }
+
+  /** Filtered search returns exactly the expected number of hits. */
+  public void testFilteredHitCount() throws IOException {


I was playing with. the tests a bit an notice this is not testing this PR as the boolean query doesn't get rewritten into a FilteredOnPrimaryIndexSortFieldQuery because the MatchAllDocsQuery gets removed.

Can we make sure we are actually rewriting the query into a FilteredOnPrimaryIndexSortFieldQuery in all the tests that we are expecting that?

sure, done. Thanks!

iverase · 2026-05-12T09:45:01Z

+   * @return the dense doc-id range, or {@code null} if a safe range cannot be determined
+   * @lucene.internal
+   */
+  public static DocIdRange denseDocIdRangeOrNullForSortedNumericBounds(


This method feels not to belong here, maybe move it to PrimarySortAlignables like you have done for terms?

sure, moved it. thanks!

romseygeek · 2026-05-13T09:39:03Z

Contiguous term filters should have docIdRunEnd() implementations that effectively do this with a DenseConjunctionBulkScorer, I think? This is a lot of complex code and I'm not convinced it's going to improve performance that much, to be honest - I think we should concentrate on better cost() and docIdRunEnd() implementations.

iverase · 2026-05-13T12:18:22Z

I agree with @romseygeek that this is adding a lot of complexity and maybe we can get the same optimization by implementing properly cost() and docIdRunEnd().

For example I noticed in SortedSetDocValuesRangeQuery that the cost() was always the numver of documents in the doc values. I open #16061 to compute a better cost for dense primary sort fields.

iprithv · 2026-05-13T12:42:13Z

@romseygeek docIDRunEnd() is consumed today by DenseConjunctionBulkScorer, ReqExclBulkScorer, and SkipBlockRangeIterator. MaxScoreBulkScorer, ConjunctionBulkScorer, BlockMaxConjunctionBulkScorer and DefaultBulkScorer all use leap-frog advance() and ignore run-end. (grep docIDRunEnd under lucene/core/src/java/org/apache/lucene/search/ to verify)

DenseConjunctionBulkScorer is selected from two places, ConstantScoreScorerSupplier.bulkScorer() and BooleanScorerSupplier.requiredBulkScorer(). And the boolean path is gated on requiredScoring.isEmpty(). The shape this PR targets (scoring MUST + primary-sort-aligned FILTER, e.g. BM25 text query with a date/category filter) puts the MUST scorer in requiredScoring, so that branch is bypassed. We land in ConjunctionBulkScorer (for COMPLETE / COMPLETE_NO_SCORES) or MaxScoreBulkScorer (for TOP_SCORES), neither of which currently reads docIDRunEnd(). BooleanQuery.rewriteNoScoring() would flip MUST→FILTER, but it's only invoked from ConstantScoreQuery#rewrite, not from searcher.search(..., TopFieldCollectorManager(...)), so even the no scores case still keeps the MUST in requiredScoring.

For the term-filter case you mentioned specifically: PostingsEnum inherits the default docIDRunEnd() = docID()+1, so even if all docs in a primary-sort segment match the term contiguously, today's postings iterator doesn't expose that run to DenseConjunctionBulkScorer.

Covering this shape in the approach you mentioned would need these:

Postings (or a wrapper) exposing real docIDRunEnd() on primary-sort-aligned terms,
MaxScoreBulkScorer and ConjunctionBulkScorer learning to consume docIDRunEnd() for range-skipping
For COMPLETE_NO_SCORES, a path that converts MUST→FILTER outside of ConstantScoreQuery#rewrite.

Even after these it still leaves TOP_SCORES uncovered, which is where this PR shows the biggest delta (+74% in benchmarkSortedTopScores; the rest are +44% to +47% across the COMPLETE / COMPLETE_NO_SCORES modes).

I agree that making cost() and docIDRunEnd() the durable primitives and letting bulk scorers act on them is the right long term architecture, a more comprehensive docIDRunEnd() investment would let this PR's wrapper shrink or disappear over time. This PR is meant to be complementary rather than a replacement for that work.

Regarding complexity, the scope grew from review feedback to implement the interface directly on PointRangeQuery, SortedNumericDocValuesRangeQuery, and SortedSetDocValuesRangeQuery in addition to the original numeric range path. Thanks!

iprithv · 2026-05-13T12:49:13Z

@iverase @romseygeek Thanks both. I'll close this for now. Thanks for the thorough review!

romseygeek · 2026-05-14T09:18:33Z

Our latest PostingsEnum implementations do actually implement docIdRunEnd() where they can (eg see Lucene104PostingsReader). MaxScoreBulkScorer currently has a TODO about using bitsets for filters that I think would get us a lot of benefits here, so maybe that's something to concentrate on as an immediate next step?

restrict bulk scoring for primary index sort FILTER

958c4f0

Signed-off-by: prithvi <prithvisivasankar@gmail.com>

github-actions Bot added the module:core/search label Apr 28, 2026

Updated CHANGES

0d19e0d

Signed-off-by: prithvi <prithvisivasankar@gmail.com>

github-actions Bot added this to the 11.0.0 milestone Apr 28, 2026

Updated CHANGES

9df46b8

Signed-off-by: prithvi <prithvisivasankar@gmail.com>

iprithv force-pushed the opt-primary-sort-filter branch from fe1fe81 to 9df46b8 Compare April 28, 2026 18:21

iprithv added 5 commits April 29, 2026 01:10

Fix primary sort term filter fallback

0239e64

Updated tests

3994d5b

Signed-off-by: prithvi <prithvisivasankar@gmail.com>

Fix google-java-format violations

daa750b

Merge branch 'main' into opt-primary-sort-filter

3c38b8c

Merge branch 'main' into opt-primary-sort-filter

5b7d97a

iprithv added 2 commits May 9, 2026 10:20

review changes

bf14e29

Fix Google Java Format

6d10eb9

Merge branch 'main' into opt-primary-sort-filter

ada757a

iprithv added 3 commits May 9, 2026 22:17

Implement PrimarySortAlignable directly on query classes per review

764f84c

Merge branch 'main' into opt-primary-sort-filter

22f4a62

check

c9ab62b

iprithv added 2 commits May 9, 2026 23:51

tweaks

e0cf2cb

Fix flaky test

e659b50

iprithv added 2 commits May 10, 2026 19:42

review changes

423d693

Lint issue fix

d3ad3b2

iprithv force-pushed the opt-primary-sort-filter branch from 44ed317 to d3ad3b2 Compare May 10, 2026 14:44

iverase reviewed May 11, 2026

View reviewed changes

iprithv added 2 commits May 11, 2026 23:50

review changes

178f45c

lint,google format check

482c9a4

iprithv requested a review from iverase May 11, 2026 19:05

iverase requested changes May 12, 2026

View reviewed changes

review changes

63db125

iprithv requested a review from iverase May 12, 2026 13:22

iprithv closed this May 13, 2026

Conversation

iprithv commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Benchmarking

Uh oh!

iverase commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iprithv commented May 9, 2026

Uh oh!

iverase commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iprithv commented May 9, 2026

Uh oh!

iverase commented May 10, 2026

Uh oh!

iprithv commented May 10, 2026

Uh oh!

iverase commented May 11, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

iverase left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

romseygeek commented May 13, 2026

Uh oh!

iverase commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iprithv commented May 13, 2026

Uh oh!

iprithv commented May 13, 2026

Uh oh!

romseygeek commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

iprithv commented Apr 28, 2026 •

edited

Loading

iverase commented May 7, 2026 •

edited

Loading

iverase commented May 9, 2026 •

edited

Loading

iverase commented May 13, 2026 •

edited

Loading