Add SIMD-accelerated bulk range evaluation for dense numeric doc values #16050

Open
sgup432 wants to merge 3 commits into apache:main from sgup432:simd_doc_values_range

Contributor

@sgup432 sgup432 commented May 12, 2026

Description

Numeric range queries on dense fields use DocValuesRangeIterator, which is a TwoPhaseIterator that uses SkipBlockRangeIterator as an approximation. This works well, but for MAYBE blocks (where values partially overlap the query range), it still falls back to per-doc evaluation: each doc is checked individually via values.advance(doc) + values.longValue() + range comparison.

Since DocValuesRangeIterator is a TwoPhaseIterator, DenseConjunctionBulkScorer routes it through the leap-frog path (see here) and intoBitSet() is never called. This means SIMD is never used for MAYBE block evaluation, even though the underlying storage for dense fields is a packed long[] that's ideal for vectorized comparison.

PR changes

For dense singleton numeric fields with a skip index, replace DocValuesRangeIterator with a new BatchDocValuesRangeIterator, which is a plain DocIdSetIterator (not a TwoPhaseIterator). This forces DenseConjunctionBulkScorer to call intoBitSet() on it directly, enabling the bitset intersection path. I am open to suggestions on whether this is the right approach.
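As a rough illustration of why the bitset path matters: once each clause can fill a bitset for a window of doc IDs, a conjunction reduces to a word-wise AND. The sketch below is a simplified stand-in and not Lucene's DenseConjunctionBulkScorer; the class name, the method names, and the plain `long[]` arrays standing in for doc values are all assumptions made for the example.

```java
// Hypothetical stand-in: plain arrays instead of Lucene doc values and FixedBitSet.
final class WindowIntersectionSketch {

    /** Fills one bit per doc of the window [windowBase, windowBase + windowLen) whose value is in [min, max]. */
    static long[] intoBitSet(long[] values, int windowBase, int windowLen, long min, long max) {
        long[] words = new long[(windowLen + 63) >>> 6];
        for (int i = 0; i < windowLen; i++) {
            long v = values[windowBase + i];
            if (v >= min && v <= max) {
                // Java masks the shift distance of a long to its low 6 bits, so this sets bit (i % 64).
                words[i >>> 6] |= 1L << i;
            }
        }
        return words;
    }

    /** Intersects two window bitsets word by word and counts the surviving docs. */
    static int intersectAndCount(long[] a, long[] b) {
        int count = 0;
        for (int w = 0; w < a.length; w++) {
            count += Long.bitCount(a[w] & b[w]);
        }
        return count;
    }
}
```

With a leap-frog conjunction each candidate doc is advanced and verified individually; here the per-clause work is a tight loop over a window and the intersection touches one word per 64 docs.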

This PR also adds support to do SIMD-accelerated bulk range evaluation for MAYBE (partial overlap) blocks, which seem to be the most expensive case when running range queries through doc values.

To support this, we made the following changes:

  • Add NumericDocValues.rangeIntoBitSet(fromDoc, toDoc, minValue, maxValue, bitSet, offset): a new bulk API with a per-doc fallback default. Lucene90DocValuesProducer overrides this for dense fields to dispatch to the vectorization layer.

  • Add a DocValuesRangeSupport interface with two implementations:

    • PanamaDocValuesRangeSupport — SIMD implementation using the Panama Vector API (LongVector.SPECIES_PREFERRED). Evaluates multiple values per CPU instruction using vectorized range comparisons.
    • DefaultDocValuesRangeSupport — scalar tight loop fallback.
  • VectorizationProvider.getDocValuesRangeSupport() returns the appropriate implementation at startup.
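To make the contract of the bulk API concrete, here is a minimal sketch of what a scalar per-doc fallback with these parameters could look like. It is an assumption-laden stand-in, not the PR's code: a `long[]` replaces the doc values, `java.util.BitSet` replaces FixedBitSet, and the class name is hypothetical. The bounds are treated as inclusive and `toDoc` as exclusive, matching the scalar loops quoted later in the review.

```java
import java.util.BitSet;

// Hypothetical stand-in for a scalar rangeIntoBitSet fallback.
final class ScalarRangeSketch {

    /**
     * For every doc in [fromDoc, toDoc) whose value lies in [minValue, maxValue],
     * sets bit (doc - offset) in bitSet.
     */
    static void rangeIntoBitSet(
            long[] values, int fromDoc, int toDoc, long minValue, long maxValue, BitSet bitSet, int offset) {
        for (int d = fromDoc; d < toDoc; d++) {
            long v = values[d];
            if (v >= minValue && v <= maxValue) {
                bitSet.set(d - offset);
            }
        }
    }
}
```

The `offset` parameter lets the caller reuse a small window bitset: bit 0 corresponds to doc `offset` rather than doc 0.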

Benchmarks

MultiFieldDocValuesRangeBenchmark (c5.2xlarge, AVX-512)

Mode: Throughput (ops/s, higher is better)
JVM args: --add-modules=jdk.incubator.vector
Warmup: 3 x 3s, Measurement: 5 x 5s, Fork: 1
| Data Pattern | docCount | Fields | Baseline (ops/s) | Optimized (ops/s) | Change |
|---|---|---|---|---|---|
| random | 1M | 1 | 59.99 | 208.27 | +247% |
| random | 1M | 3 | 34.83 | 69.30 | +99% |
| random | 1M | 5 | 29.40 | 65.10 | +121% |
| random | 10M | 1 | 6.12 | 25.16 | +311% |
| random | 10M | 3 | 3.41 | 8.38 | +146% |
| random | 10M | 5 | 2.82 | 7.45 | +164% |
| clustered | 1M | 1 | 6231.86 | 8584.63 | +38% |
| clustered | 1M | 3 | 9142.82 | 35488.66 | +288% |
| clustered | 1M | 5 | 7072.30 | 32583.89 | +361% |
| clustered | 10M | 1 | 685.27 | 1253.04 | +83% |
| clustered | 10M | 3 | 8314.53 | 23913.65 | +188% |
| clustered | 10M | 5 | 8855.14 | 12703.13 | +43% |

Throughput improves in every configuration measured, from +38% (clustered, 1M docs, 1 field) to +361% (clustered, 1M docs, 5 fields).

@github-actions github-actions Bot added this to the 10.5.0 milestone May 12, 2026
Contributor Author

sgup432 commented May 12, 2026

@romseygeek Do you mind taking a look at this?

Contributor

@romseygeek romseygeek left a comment

Thanks @sgup432, this looks great! I think we need some more comprehensive testing, and I left some notes on the API itself. I think I'd like @benwtrent or @uschindler's opinions on the vectorization code as that's not something I'm very familiar with.

int toDoc,
long minValue,
long maxValue,
org.apache.lucene.util.FixedBitSet bitSet,
Contributor

Does this need to be explicitly a FixedBitSet or can we use BitSet in the signature instead?

Member

I wondered about this as well, but @romseygeek, DocIdSetIterator.intoBitSet takes a FixedBitSet. I think if we are going to require a FixedBitSet, we need to adjust our logic to be way, way faster and take advantage of the fact that we know it's a FixedBitSet :)

Member

Please can we actually import org.apache.lucene.util.FixedBitSet and just use FixedBitSet here :)

Contributor Author

Yeah using FixedBitSet was intentional. As @benwtrent mentioned, FixedBitSet is the type used by all existing intoBitSet methods in Lucene which is why I kept it like that.

if (blockIterator.getMatch() == SkipBlockRangeIterator.Match.YES) {
return doc = blockDoc;
}

Contributor

Let's assert that we're not in a YES_IF_PRESENT block

return doc = NO_MORE_DOCS;
}
docToCheck = blockDoc;
if (blockIterator.getMatch() == SkipBlockRangeIterator.Match.YES) {
Contributor

Also assert here

bitSet.set(blockStart - offset, blockEnd - offset);
break;

case YES_IF_PRESENT:
Contributor

We're not expecting YES_IF_PRESENT here, are we? If we are, then we need to take it into account in advance(); if not, we should just assert false here.

IndexWriterConfig iwc = new IndexWriterConfig();
iwc.setCodec(new Lucene104Codec());
IndexWriter w = new IndexWriter(dir, iwc);
Random r = new Random(42);
Contributor

This should use random() to get the test seed

public void testSingleFieldRangeCorrectness() throws Exception {
Query q = SortedNumericDocValuesField.newSlowRangeQuery("age", 20, 40);
int count = searcher.count(q);
assertTrue("Should find some docs in range [20,40]", count > 0);
Contributor

I don't think we can assert this with the randomly generated values? We could conceivably get all docs with value 1 on some (admittedly unlikely) seed.

*/
public class TestSkipBlockRangeIteratorIntoBitSet extends LuceneTestCase {

private static final int DOC_COUNT = 50_000;
Contributor

This seems like a lot of docs?

* <p>Key behavioral notes:
*
* <ul>
* <li>Single-field range with a second clause (e.g., MatchAllDocsQuery): goes through {@code
Contributor

I don't think we're testing for this case? In addition, it needs to be a restrictive filter of some kind, as MatchAllDocsQuery will get rewritten away by BQ.

* rangeIntoBitSet()}.
* </ul>
*/
public class TestSkipBlockRangeIteratorIntoBitSet extends LuceneTestCase {
Contributor

I think we should be doing some lower-level testing here, specifically of the intoBitSet call - you can look at TestSkipBlockRangeIterator to get an idea of what to check.

// Only use SIMD if vector length >= 4 (AVX-256 or better).
// On 128-bit SIMD (2 longs), the scratch buffer overhead outweighs the benefit.
if (vectorLen < 4) {
// Scalar fallback: tight loop that JIT can auto-vectorize
Contributor

Great optimization! Sharing my thoughts: the comments are not consistent across this PR. Sometimes they claim the JIT may auto-vectorize, sometimes that the JIT can.

I'm curious whether the JIT can really do this with a virtual method call, control flow, and bitset operations. I'd guess it's hard; otherwise the hand-written vectorization of the range check in this PR wouldn't be necessary.


// Only use SIMD if vector length >= 4 (AVX-256 or better).
// On 128-bit SIMD (2 longs), the scratch buffer overhead outweighs the benefit.
if (vectorLen < 4) {
Contributor

Has it been validated that the 128-bit SIMD case is not worth vectorizing?

Also, with my limited knowledge: if small fromDoc-to-toDoc ranges can reach this code, the scalar path might actually win. Would it be better to fall back to the scalar version based on doc range size rather than vector width?

scratch[i] = values.get(d + i);
}
LongVector v = LongVector.fromArray(LONG_SPECIES, scratch, 0);
VectorMask<Long> inRange =
Contributor

I wonder whether loop unrolling for SIMD could speed this up further (sample). I suspect that if we profiled this, the bottleneck would be the serial values.get(d + i) gather from packed values. If we could read more packed values with fewer loop iterations and run the range checks across more CPU pipelines in parallel, that would be a win, but it needs a performance test to confirm.

int base = d - offset;
while (maskBits != 0) {
int bit = Long.numberOfTrailingZeros(maskBits);
bitSet.set(base + bit);
Contributor

The vectorized comparison is great, but here we loop per bit to update the bitset. Since the docs are consecutive, maskBits already holds exactly the bits we want, and its maximum value is 0xFF on AVX-512 (8 lanes). We could OR the mask directly into the bitset word(s) in constant time (at most two word writes) with fewer branches. A sample method in FixedBitSet would be:

public void orMask(int startBit, long mask, int maskLen) {
    int wordIndex = startBit >> 6;
    int bitOffset = startBit & 63;
    if (bitOffset + maskLen <= 64) {
        bits[wordIndex] |= mask << bitOffset;
    } else {
        bits[wordIndex]     |= mask << bitOffset;
        bits[wordIndex + 1] |= mask >>> (64 - bitOffset);
    }
}
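For anyone who wants to check the suggestion in isolation, the sketch below restates it against a plain `long[]` and compares it with the per-bit loop it would replace. The class and the `setPerBit` helper are hypothetical names for this example; the sketch assumes `maskLen <= 64`, that `mask` has no bits set at or above `maskLen`, and that the target array is long enough for the spill word.

```java
// Hypothetical standalone version of the suggested orMask, operating on raw bitset words.
final class OrMaskSketch {

    /** ORs the low maskLen bits of mask into bits, starting at bit index startBit. */
    static void orMask(long[] bits, int startBit, long mask, int maskLen) {
        int wordIndex = startBit >> 6;
        int bitOffset = startBit & 63;
        bits[wordIndex] |= mask << bitOffset;
        if (bitOffset + maskLen > 64) {
            // The mask straddles a word boundary: spill the high part into the next word.
            // bitOffset is at least 1 here, so the shift distance is in [1, 63].
            bits[wordIndex + 1] |= mask >>> (64 - bitOffset);
        }
    }

    /** Reference behaviour: set each matching bit individually, as the per-bit loop does. */
    static void setPerBit(long[] bits, int startBit, long mask) {
        while (mask != 0) {
            int index = startBit + Long.numberOfTrailingZeros(mask);
            bits[index >> 6] |= 1L << index;
            mask &= mask - 1; // clear the lowest set bit
        }
    }
}
```

Both paths produce the same words; the difference is that `orMask` does a fixed one or two writes per vector regardless of how many lanes matched.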

Member

@benwtrent benwtrent left a comment

I am surprised to see such good numbers with so many perf opportunities still left to try!

Good idea :)

.getDocValuesRangeSupport();

// Static helper so anonymous inner classes can call DocValuesRangeSupport from the outer class
static void rangeIntoBitSetVectorized(
Member

nit, the assumption is that it is vectorized, but it might be the "default" implementation. can we just name this rangeIntoBitSet? Or something other than vectorized.

int offset) {
// Scalar tight loop — JIT may auto-vectorize this on modern JVMs.
for (int d = fromDoc; d < toDoc; d++) {
long v = values.get(d);
Member

This tells me we might eventually want an int count = values.get(int[] docIds, long[] dest);

That is a larger change, but I suspect there is perf to be gained lower level just decoding the long values.
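One possible shape for such a bulk read, sketched purely as an assumption: the interface name, the single-value `get(int)` method, and the default implementation below are hypothetical and do not exist in Lucene. The default simply loops; a real packed-values reader would presumably override it to decode several values per step.

```java
// Hypothetical bulk-read API; nothing here is an existing Lucene interface.
final class BulkGetSketch {

    interface BulkLongValues {
        /** Returns the value for a single doc. */
        long get(int doc);

        /**
         * Reads the values for docIds into dest and returns how many were written.
         * Default: per-doc loop, bounded by the shorter of the two arrays.
         */
        default int get(int[] docIds, long[] dest) {
            int n = Math.min(docIds.length, dest.length);
            for (int i = 0; i < n; i++) {
                dest[i] = get(docIds[i]);
            }
            return n;
        }
    }
}
```

Returning the count keeps the caller's loop bound explicit when the destination buffer is smaller than the doc ID batch.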


// Only use SIMD if vector length >= 4 (AVX-256 or better).
// On 128-bit SIMD (2 longs), the scratch buffer overhead outweighs the benefit.
if (vectorLen < 4) {
Member

Have you benchmarked this to indicate no improvement here?

LongVector v = LongVector.fromArray(LONG_SPECIES, scratch, 0);
VectorMask<Long> inRange =
v.compare(VectorOperators.GE, minValue).and(v.compare(VectorOperators.LE, maxValue));
long maskBits = inRange.toLong();
Member

It's a huge shame to throw away the maskBits, which is already encoded as a long, especially when we know the bit set is a FixedBitSet and we have access to FixedBitSet.getBits ;)

Comment on lines +83 to +88
for (int d = loopBound; d < toDoc; d++) {
long v = values.get(d);
if (v >= minValue && v <= maxValue) {
bitSet.set(d - offset);
}
}
Member

I wonder if we will hit windows of density, where v passes our predicate for multiple docs in a row. In that case, we could take advantage of FixedBitSet.set(int startIndex, int endIndex) which would provide a substantial speed up in those dense regions.

This same idea goes for the default, etc. versions.
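A minimal sketch of that run-coalescing idea, under stated assumptions: `java.util.BitSet.set(from, to)` stands in for FixedBitSet.set(int startIndex, int endIndex), a `long[]` stands in for the doc values, and the class name and the returned call count are hypothetical additions for the example.

```java
import java.util.BitSet;

// Hypothetical stand-in: coalesces consecutive matching docs into one range set.
final class RunCoalescingSketch {

    /**
     * Sets bit (doc - offset) for every doc in [fromDoc, toDoc) with a value in [min, max],
     * issuing one ranged set per run of consecutive matches. Returns the number of ranged sets.
     */
    static int rangeIntoBitSetRuns(
            long[] values, int fromDoc, int toDoc, long min, long max, BitSet bitSet, int offset) {
        int rangeCalls = 0;
        int d = fromDoc;
        while (d < toDoc) {
            if (values[d] < min || values[d] > max) {
                d++; // not a match, keep scanning
                continue;
            }
            int runStart = d;
            do {
                d++;
            } while (d < toDoc && values[d] >= min && values[d] <= max);
            bitSet.set(runStart - offset, d - offset); // end index is exclusive
            rangeCalls++;
        }
        return rangeCalls;
    }
}
```

Whether this beats per-doc sets depends on how long the runs are in practice; on data with short runs the extra bookkeeping could cost more than it saves, so it would need benchmarking.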

*
* @lucene.internal
*/
public interface DocValuesRangeSupport {
Member

I think this support path, etc. all matches our existing patterns. Seems OK to me.
