Add SIMD-accelerated bulk range evaluation for dense numeric doc values #16050
sgup432 wants to merge 3 commits into
Conversation
@romseygeek Do you mind taking a look at this?
romseygeek left a comment
Thanks @sgup432, this looks great! I think we need some more comprehensive testing, and I left some notes on the API itself. I think I'd like @benwtrent or @uschindler's opinions on the vectorization code as that's not something I'm very familiar with.
int toDoc,
long minValue,
long maxValue,
org.apache.lucene.util.FixedBitSet bitSet,
Does this need to be explicitly a FixedBitSet or can we use BitSet in the signature instead?
I wondered about this as well, but @romseygeek, DocIdSetIterator.intoBitSet takes a FixedBitSet. I think if we are going to require a FixedBitSet, we need to adjust our logic to be way, way faster and take advantage of the fact that we know it's a FixedBitSet :)
Please can we actually import org.apache.lucene.util.FixedBitSet and just use FixedBitSet here :)
Yeah using FixedBitSet was intentional. As @benwtrent mentioned, FixedBitSet is the type used by all existing intoBitSet methods in Lucene which is why I kept it like that.
if (blockIterator.getMatch() == SkipBlockRangeIterator.Match.YES) {
  return doc = blockDoc;
}
Let's assert that we're not in a YES_IF_PRESENT block
return doc = NO_MORE_DOCS;
}
docToCheck = blockDoc;
if (blockIterator.getMatch() == SkipBlockRangeIterator.Match.YES) {

bitSet.set(blockStart - offset, blockEnd - offset);
break;

case YES_IF_PRESENT:
We're not expecting YES_IF_PRESENT here are we? If so then we need to take it into account in advance(), if not we should just assert false here.
IndexWriterConfig iwc = new IndexWriterConfig();
iwc.setCodec(new Lucene104Codec());
IndexWriter w = new IndexWriter(dir, iwc);
Random r = new Random(42);
This should use random() to get the test seed
public void testSingleFieldRangeCorrectness() throws Exception {
  Query q = SortedNumericDocValuesField.newSlowRangeQuery("age", 20, 40);
  int count = searcher.count(q);
  assertTrue("Should find some docs in range [20,40]", count > 0);
I don't think we can assert this with the randomly generated values? We could conceivably get all docs with value 1 on some (admittedly unlikely) seed.
 */
public class TestSkipBlockRangeIteratorIntoBitSet extends LuceneTestCase {

  private static final int DOC_COUNT = 50_000;
This seems like a lot of docs?
 * <p>Key behavioral notes:
 *
 * <ul>
 *   <li>Single-field range with a second clause (e.g., MatchAllDocsQuery): goes through {@code
I don't think we're testing for this case? In addition, it needs to be a restrictive filter of some kind, as MatchAllDocsQuery will get rewritten away by BQ.
 *       rangeIntoBitSet()}.
 * </ul>
 */
public class TestSkipBlockRangeIteratorIntoBitSet extends LuceneTestCase {
I think we should be doing some lower-level testing here, specifically of the intoBitSet call - you can look at TestSkipBlockRangeIterator to get an idea of what to check.
// Only use SIMD if vector length >= 4 (AVX-256 or better).
// On 128-bit SIMD (2 longs), the scratch buffer overhead outweighs the benefit.
if (vectorLen < 4) {
  // Scalar fallback: tight loop that JIT can auto-vectorize
Great optimization! One thought: the comments are not consistent across this PR — sometimes they claim the JIT may auto-vectorize, sometimes that it can.
I'm also curious: can the JIT really do this with a virtual method call, control flow, and a bitset operation in the loop? I'd guess it's hard; otherwise the hand-written vectorization for the range check in this PR wouldn't be necessary.
// Only use SIMD if vector length >= 4 (AVX-256 or better).
// On 128-bit SIMD (2 longs), the scratch buffer overhead outweighs the benefit.
if (vectorLen < 4) {
Curious: has the 128-bit SIMD case been validated as not worth vectorizing?
Also, with my limited knowledge: if small fromDoc-to-toDoc ranges reach this path, the scalar version might actually win. Would it be better to fall back to the scalar version based on doc range size rather than vector width?
scratch[i] = values.get(d + i);
}
LongVector v = LongVector.fromArray(LONG_SPECIES, scratch, 0);
VectorMask<Long> inRange =
Wondering if loop unrolling for SIMD can speed this up further (sample)? I suspect that if we were to profile this, the bottleneck might be the serial values.get(d + i) gather from packed values. If we could read more compact values with fewer loop iterations, and parallelize the range check across more CPU-level pipelines, that would be a win — but we'd need performance tests to vet it.
int base = d - offset;
while (maskBits != 0) {
  int bit = Long.numberOfTrailingZeros(maskBits);
  bitSet.set(base + bit);
The vectorized comparison is great, but here we do a per-bit loop for the bitset update. Since docs are consecutive, maskBits already stores the exact bits we want, and its max value is 0xFF on AVX-512 (8 lanes). We could OR the mask directly into the bitset word(s) in constant time — at most two word writes, with fewer branches. A sample method in FixedBitSet would be:
public void orMask(int startBit, long mask, int maskLen) {
  int wordIndex = startBit >> 6;
  int bitOffset = startBit & 63;
  // The low part of the mask always lands in the first word.
  bits[wordIndex] |= mask << bitOffset;
  // If the mask straddles a word boundary, spill the high part into the next word.
  if (bitOffset + maskLen > 64) {
    bits[wordIndex + 1] |= mask >>> (64 - bitOffset);
  }
}
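To sanity-check that the proposed orMask matches the existing per-bit loop, here is a self-contained sketch using a plain long[] in place of FixedBitSet.getBits(); the class name and test harness are illustrative only:

```java
import java.util.Arrays;
import java.util.Random;

// Verifies that a word-level orMask produces the same bits as the PR's
// per-bit Long.numberOfTrailingZeros loop, over random masks and offsets.
public class OrMaskSketch {
  static void orMask(long[] bits, int startBit, long mask, int maskLen) {
    int wordIndex = startBit >> 6;
    int bitOffset = startBit & 63;
    bits[wordIndex] |= mask << bitOffset;
    if (bitOffset + maskLen > 64) { // mask straddles a word boundary
      bits[wordIndex + 1] |= mask >>> (64 - bitOffset);
    }
  }

  static void perBit(long[] bits, int base, long maskBits) {
    // The current per-bit loop from the PR, adapted to a raw long[].
    while (maskBits != 0) {
      int bit = Long.numberOfTrailingZeros(maskBits);
      int idx = base + bit;
      bits[idx >> 6] |= 1L << (idx & 63);
      maskBits &= maskBits - 1; // clear lowest set bit
    }
  }

  public static void main(String[] args) {
    Random r = new Random(42);
    for (int i = 0; i < 1000; i++) {
      long[] a = new long[4];
      long[] b = new long[4];
      int start = r.nextInt(4 * 64 - 8); // leave room for an 8-lane mask
      long mask = r.nextInt(256);        // AVX-512: 8 lanes -> mask <= 0xFF
      orMask(a, start, mask, 8);
      perBit(b, start, mask);
      if (!Arrays.equals(a, b)) throw new AssertionError("mismatch at iteration " + i);
    }
    System.out.println("orMask matches per-bit loop");
  }
}
```
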
benwtrent left a comment
I am surprised to see such good numbers with so many perf opportunities still left to try!
Good idea :)
.getDocValuesRangeSupport();

// Static helper so anonymous inner classes can call DocValuesRangeSupport from the outer class
static void rangeIntoBitSetVectorized(
Nit: the name assumes it is vectorized, but it might be the "default" implementation. Can we just name this rangeIntoBitSet, or something other than "vectorized"?
int offset) {
// Scalar tight loop — JIT may auto-vectorize this on modern JVMs.
for (int d = fromDoc; d < toDoc; d++) {
  long v = values.get(d);
This tells me we eventually might actually want an int count = values.get(int[] docIds, long[] dest);
That is a larger change, but I suspect there is perf to be gained at a lower level just in decoding the long values.
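A rough sketch of what such a bulk accessor's default (per-doc) implementation might look like. None of these names exist in Lucene today; IntToLongFunction stands in for NumericDocValues, and codecs could override this with batched decoding of the packed storage:

```java
import java.util.Arrays;
import java.util.function.IntToLongFunction;

// Hypothetical bulk-get API: default implementation reads per doc, so
// existing per-doc codecs work unchanged; optimized codecs can override.
public class BulkGetSketch {
  static int get(IntToLongFunction values, int[] docIds, long[] dest) {
    for (int i = 0; i < docIds.length; i++) {
      dest[i] = values.applyAsLong(docIds[i]);
    }
    return docIds.length; // number of values filled into dest
  }

  public static void main(String[] args) {
    long[] dest = new long[3];
    // Stand-in "doc values": value = docId * 10
    int count = get(d -> d * 10L, new int[] {1, 5, 7}, dest);
    System.out.println(count + " " + Arrays.toString(dest)); // 3 [10, 50, 70]
  }
}
```
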
// Only use SIMD if vector length >= 4 (AVX-256 or better).
// On 128-bit SIMD (2 longs), the scratch buffer overhead outweighs the benefit.
if (vectorLen < 4) {
Have you benchmarked this to indicate no improvement here?
LongVector v = LongVector.fromArray(LONG_SPECIES, scratch, 0);
VectorMask<Long> inRange =
    v.compare(VectorOperators.GE, minValue).and(v.compare(VectorOperators.LE, maxValue));
long maskBits = inRange.toLong();
It's a huge shame to throw away the maskBits, which is already encoded as a long, especially when we know the bit set is a FixedBitSet and we have access to FixedBitSet.getBits ;)
for (int d = loopBound; d < toDoc; d++) {
  long v = values.get(d);
  if (v >= minValue && v <= maxValue) {
    bitSet.set(d - offset);
  }
}
I wonder if we will hit windows of density, where v passes our predicate for multiple docs in a row. In that case, we could take advantage of FixedBitSet.set(int startIndex, int endIndex) which would provide a substantial speed up in those dense regions.
This same idea goes for the default, etc. versions.
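A hedged sketch of the dense-run idea: detect runs of consecutive matching docs and update them with one ranged call instead of per-doc sets (FixedBitSet.set(int, int) in the real code). A boolean[] stands in for the bitset here, and all names are illustrative:

```java
import java.util.Arrays;

// Run-length scalar loop: scan forward through a window of density and
// mark the whole run at once, rather than setting bits one doc at a time.
public class DenseRunSketch {
  static void rangeIntoBits(
      long[] values, int fromDoc, int toDoc, long min, long max, boolean[] bits, int offset) {
    int d = fromDoc;
    while (d < toDoc) {
      if (values[d] < min || values[d] > max) {
        d++;
        continue;
      }
      int runStart = d; // start of a run of in-range docs
      while (d < toDoc && values[d] >= min && values[d] <= max) {
        d++;
      }
      // One ranged update for the run [runStart, d), analogous to
      // FixedBitSet.set(runStart - offset, d - offset).
      for (int i = runStart; i < d; i++) {
        bits[i - offset] = true;
      }
    }
  }

  public static void main(String[] args) {
    long[] vals = {5, 20, 21, 22, 9, 30};
    boolean[] bits = new boolean[6];
    rangeIntoBits(vals, 0, 6, 10, 25, bits, 0);
    // Docs 1..3 form a dense run inside [10, 25].
    System.out.println(Arrays.toString(bits));
  }
}
```
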
 *
 * @lucene.internal
 */
public interface DocValuesRangeSupport {
I think this support path, etc. all matches our existing patterns. Seems OK to me.
Description
Numeric range queries on dense fields use DocValuesRangeIterator, which is a TwoPhaseIterator that uses SkipBlockRangeIterator as an approximation. This works well, but for MAYBE blocks (where values partially overlap the query range), it still falls back to per-doc evaluation: each doc is checked individually via values.advance(doc) + values.longValue() + range comparison.
Since DocValuesRangeIterator is a TwoPhaseIterator, DenseConjunctionBulkScorer routes it through the leap-frog path (see here) and intoBitSet() is never called. This means SIMD is never used for MAYBE block evaluation, even though the underlying storage for dense fields is a packed long[] that's ideal for vectorized comparison.

PR changes
For dense singleton numeric fields with a skip index, replace DocValuesRangeIterator with a new BatchDocValuesRangeIterator, which is a plain DocIdSetIterator (not a TwoPhaseIterator). This was added to force DenseConjunctionBulkScorer to call intoBitSet() on it directly, enabling the bitset intersection path. I am open to suggestions on whether this is the right approach.

This PR also adds SIMD-accelerated bulk range evaluation for MAYBE (partial overlap) blocks, which seem to be the most expensive case when running range queries through doc values. For this we added the changes below:
Add NumericDocValues.rangeIntoBitSet(fromDoc, toDoc, minValue, maxValue, bitSet, offset): a new bulk API with a per-doc fallback default. Lucene90DocValuesProducer overrides this for dense fields to dispatch to the vectorization layer.

Add a DocValuesRangeSupport interface with two implementations: VectorizationProvider.getDocValuesRangeSupport() returns the appropriate implementation at startup.

Benchmarks
MultiFieldDocValuesRangeBenchmark (c5.2xlarge, AVX-512)
The numbers look great across the board!