Skip to content

BooleanQuery: narrow bulk scoring when FILTER matches primary index sort#15991

Closed
iprithv wants to merge 21 commits into
apache:mainfrom
iprithv:opt-primary-sort-filter
Closed

BooleanQuery: narrow bulk scoring when FILTER matches primary index sort#15991
iprithv wants to merge 21 commits into
apache:mainfrom
iprithv:opt-primary-sort-filter

Conversation

@iprithv
Copy link
Copy Markdown
Contributor

@iprithv iprithv commented Apr 28, 2026

Description

Implements #15139 : When a BooleanQuery has a FILTER clause that constrains the primary index sort field, rewrite to an internal query that can restrict the main clause's BulkScorer to a contiguous doc ID range derived from the filter, when it is safe to do so.

The optimization currently applies to:

  • sorted numeric range filters on the primary index sort field
  • exact term filters on a primary SortedSetSortField when doc values and postings agree on a dense range

If the range cannot be proven safe for a leaf, execution falls back to the original boolean query's bulk scorer for that leaf.

Benchmarking

Focused JMH benchmark: PrimaryIndexSortFilterBenchmark

Benchmark Baseline (origin/main) Candidate Delta
benchmarkSortedTopScores 1.315 ± 0.013 ops/ms 2.286 ± 0.032 ops/ms ~+74%
benchmarkSortedComplete 1.491 ± 0.020 ops/ms 2.145 ± 0.137 ops/ms ~+44%
benchmarkSortedCompleteNoScores 3.249 ± 0.027 ops/ms 4.774 ± 0.035 ops/ms ~+47%
benchmarkUnsortedTopScores 1.306 ± 0.014 ops/ms 1.334 ± 0.009 ops/ms ~+2%

These numbers compare origin/main with the same benchmark class copied in against this branch, using the benchmark's full JMH annotations.

Signed-off-by: prithvi <prithvisivasankar@gmail.com>
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
@github-actions github-actions Bot added this to the 11.0.0 milestone Apr 28, 2026
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
@iprithv iprithv force-pushed the opt-primary-sort-filter branch from fe1fe81 to 9df46b8 Compare April 28, 2026 18:21
@iverase
Copy link
Copy Markdown
Contributor

iverase commented May 7, 2026

I found the current implementation a bit confusing with all those type checks, and it doesn't seem fully implemented for all query types.

I wonder if we should implement this in a more java idiomatic way, for example by introducing a new base class for queries that can be used when filtering on a primary sort, something like:

public abstract class PrimarySortAlignedQuery extends Query {

  /** Field constrained by this query. */
  public abstract String getField();

  /** Whether a dense doc id interval can be derived on this leaf from primary sort layout. */
  public abstract boolean supportsDenseLeafRange(LeafReaderContext context) throws IOException;

  /** Matching docs as {@code [minDoc, maxDoc)} on this leaf, or {@code null} if unknown / not dense. */
  public abstract DocIdRange denseDocIdRangeOrNull(LeafReaderContext context) throws IOException;
}

@iprithv
Copy link
Copy Markdown
Contributor Author

iprithv commented May 9, 2026

@iverase Agree the old shape might be hard to follow. I did a small refactor, I moved the logic that decides when a FILTER matches the primary sort field and we can restrict bulk scoring to a doc id range behind a tiny PrimarySortAlignable hook plus a helper for TermQuery, so the behaviour stays the same but the type checks aren’t scattered through FilteredOnPrimaryIndexSortFieldQuery. Thanks!

@iverase
Copy link
Copy Markdown
Contributor

iverase commented May 9, 2026

I think we want to implement this for other type of queries, let say SortedNumericDocValuesRangeQuery and SortedSetDocValuesRangeQuery when they have skippers or PointRangeQuery.

I don't think it is a good idea to have a helper ,why are you not implement the interfaces directly on the queries? it makes extending the behaviour for other queries easier IMO.

@iprithv
Copy link
Copy Markdown
Contributor Author

iprithv commented May 9, 2026

@iverase Sure, all query types now implement PrimarySortAlignable directly. Thanks!

@iverase
Copy link
Copy Markdown
Contributor

iverase commented May 10, 2026

Tthanks. Looking at the test, it looks quite complex and a bit unexpected to have just one test when we are adding support for several queries. Would it be possible to rework the test so we can have a base class (a test case) that can be implemented by all the queries supporting this optimization?

@iprithv iprithv force-pushed the opt-primary-sort-filter branch from 44ed317 to d3ad3b2 Compare May 10, 2026 14:44
@iprithv
Copy link
Copy Markdown
Contributor Author

iprithv commented May 10, 2026

@iverase Sure, done. Added base class and per query subclasses so each optimiser supported query lives in its own test class. Thanks!

@iverase
Copy link
Copy Markdown
Contributor

iverase commented May 11, 2026

Thanks, this makes any query that can support this feature implement it just by implementing the interface and adding the test case.

One thing I am unhappy with the test is that the exercise very little randomization, which is one of the strengths of lucene testing framework. Looking at the name of the tests, I presume those have been generated by an agent, and one thing I notice is that agent are very good to hardcode a test to check a specific layout but they fail on building randomized tests. Do you think we can improve that?

String getField();

/** Whether this filter may participate in the optimization on the given index. */
boolean canOptimize(IndexSearcher searcher) throws IOException;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused about what this method should do because every implementation you provided does a different thing. All of them checks that the field is part of the primary sort but then:

  • TermQuery: checks that the sort field is of a specific type and it is a singleton.
  • PointRangeQuery: That the field has actually points with dimension 1.
  • IndexSortSortedNumericDocValuesRangeQuery: no further checks.

I have the feeling what we might want to check always is that the index is dense (all documents have values) and all documents are single value. This means if the segment has deletions, we might not be able to use this optimization, or can we?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iverase The density and single value checks are actually already there, but they live in denseDocIdRangeOrNull rather than canOptimize. The reason is that canOptimize is called once at the index level (across all segments) while density is a per leaf property, a segment that went through a merge without forceMerge might be dense, another might not. So canOptimize is intentionally a query eligibility check, and denseDocIdRangeOrNull is where each leaf either confirms a tight dense range or returns null and falls back.

Also, TermQuery.denseDocIdRangeOrNull guards with docFreq == range.maxDoc() - range.minDoc(), and IndexSortSortedNumericDocValuesRangeQuery checks pointValues.getDocCount() == reader.maxDoc() before returning a dense range (otherwise it falls back to sparse). So the safety net is there, it's just not visible from canOptimize.

Also on deletions, IndexSortSortedNumericDocValuesRangeQuery.getDocIdSetIteratorOrNull already bails early with hasDeletions() == false, so deleted segments don't participate in the optimization.

It's not obvious which checks belong in canOptimize vs denseDocIdRangeOrNull. I'll add a note to the interface javadoc if you prefer, to make that boundary explicit. Thanks!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO it is not a good design to have checks scattered in different methods of the API.

I wonder if we should get rid of this method all together and always rewrite a query if a segments contains a primary sort on a field that is PrimarySortAlignable and it has no deletions?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, sure. Done. Thanks!

@iprithv iprithv requested a review from iverase May 11, 2026 19:05
Copy link
Copy Markdown
Contributor

@iverase iverase left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is very close, I added a comment about code structure.

I noticed in some cases the tests are not testing the changes in this PR. Can we make sure we check the query gets rewritten as expected? In addition, could we add some test for sparse index too?

Thanks!

}

/** Filtered search returns exactly the expected number of hits. */
public void testFilteredHitCount() throws IOException {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was playing with. the tests a bit an notice this is not testing this PR as the boolean query doesn't get rewritten into a FilteredOnPrimaryIndexSortFieldQuery because the MatchAllDocsQuery gets removed.

Can we make sure we are actually rewriting the query into a FilteredOnPrimaryIndexSortFieldQuery in all the tests that we are expecting that?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, done. Thanks!

* @return the dense doc-id range, or {@code null} if a safe range cannot be determined
* @lucene.internal
*/
public static DocIdRange denseDocIdRangeOrNullForSortedNumericBounds(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method feels not to belong here, maybe move it to PrimarySortAlignables like you have done for terms?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, moved it. thanks!

@iprithv iprithv requested a review from iverase May 12, 2026 13:22
@romseygeek
Copy link
Copy Markdown
Contributor

Contiguous term filters should have docIdRunEnd() implementations that effectively do this with a DenseConjunctionBulkScorer, I think? This is a lot of complex code and I'm not convinced it's going to improve performance that much, to be honest - I think we should concentrate on better cost() and docIdRunEnd() implementations.

@iverase
Copy link
Copy Markdown
Contributor

iverase commented May 13, 2026

I agree with @romseygeek that this is adding a lot of complexity and maybe we can get the same optimization by implementing properly cost() and docIdRunEnd().

For example I noticed in SortedSetDocValuesRangeQuery that the cost() was always the numver of documents in the doc values. I open #16061 to compute a better cost for dense primary sort fields.

@iprithv
Copy link
Copy Markdown
Contributor Author

iprithv commented May 13, 2026

@romseygeek docIDRunEnd() is consumed today by DenseConjunctionBulkScorer, ReqExclBulkScorer, and SkipBlockRangeIterator. MaxScoreBulkScorer, ConjunctionBulkScorer, BlockMaxConjunctionBulkScorer and DefaultBulkScorer all use leap-frog advance() and ignore run-end. (grep docIDRunEnd under lucene/core/src/java/org/apache/lucene/search/ to verify)

DenseConjunctionBulkScorer is selected from two places, ConstantScoreScorerSupplier.bulkScorer() and BooleanScorerSupplier.requiredBulkScorer(). And the boolean path is gated on requiredScoring.isEmpty(). The shape this PR targets (scoring MUST + primary-sort-aligned FILTER, e.g. BM25 text query with a date/category filter) puts the MUST scorer in requiredScoring, so that branch is bypassed. We land in ConjunctionBulkScorer (for COMPLETE / COMPLETE_NO_SCORES) or MaxScoreBulkScorer (for TOP_SCORES), neither of which currently reads docIDRunEnd(). BooleanQuery.rewriteNoScoring() would flip MUST→FILTER, but it's only invoked from ConstantScoreQuery#rewrite, not from searcher.search(..., TopFieldCollectorManager(...)), so even the no scores case still keeps the MUST in requiredScoring.

For the term-filter case you mentioned specifically: PostingsEnum inherits the default docIDRunEnd() = docID()+1, so even if all docs in a primary-sort segment match the term contiguously, today's postings iterator doesn't expose that run to DenseConjunctionBulkScorer.

Covering this shape in the approach you mentioned would need these:

  1. Postings (or a wrapper) exposing real docIDRunEnd() on primary-sort-aligned terms,
  2. MaxScoreBulkScorer and ConjunctionBulkScorer learning to consume docIDRunEnd() for range-skipping
  3. For COMPLETE_NO_SCORES, a path that converts MUST→FILTER outside of ConstantScoreQuery#rewrite.

Even after these it still leaves TOP_SCORES uncovered, which is where this PR shows the biggest delta (+74% in benchmarkSortedTopScores; the rest are +44% to +47% across the COMPLETE / COMPLETE_NO_SCORES modes).

I agree that making cost() and docIDRunEnd() the durable primitives and letting bulk scorers act on them is the right long term architecture, a more comprehensive docIDRunEnd() investment would let this PR's wrapper shrink or disappear over time. This PR is meant to be complementary rather than a replacement for that work.

Regarding complexity, the scope grew from review feedback to implement the interface directly on PointRangeQuery, SortedNumericDocValuesRangeQuery, and SortedSetDocValuesRangeQuery in addition to the original numeric range path. Thanks!

@iprithv
Copy link
Copy Markdown
Contributor Author

iprithv commented May 13, 2026

@iverase @romseygeek Thanks both. I'll close this for now. Thanks for the thorough review!

@iprithv iprithv closed this May 13, 2026
@romseygeek
Copy link
Copy Markdown
Contributor

Our latest PostingsEnum implementations do actually implement docIdRunEnd() where they can (eg see Lucene104PostingsReader). MaxScoreBulkScorer currently has a TODO about using bitsets for filters that I think would get us a lot of benefits here, so maybe that's something to concentrate on as an immediate next step?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants