Add SIMD-accelerated bulk range evaluation for dense numeric doc values #16050
sgup432 wants to merge 3 commits into
Conversation
@romseygeek Do you mind taking a look at this?
romseygeek left a comment
Thanks @sgup432, this looks great! I think we need some more comprehensive testing, and I left some notes on the API itself. I think I'd like @benwtrent or @uschindler's opinions on the vectorization code as that's not something I'm very familiar with.
int toDoc,
long minValue,
long maxValue,
org.apache.lucene.util.FixedBitSet bitSet,
Does this need to be explicitly a FixedBitSet or can we use BitSet in the signature instead?
I wondered about this as well, but @romseygeek, DocIdSetIterator.intoBitSet takes a FixedBitSet. I think if we are going to require a FixedBitSet, we need to adjust our logic to be way, way faster and take advantage of the fact that we know it's a FixedBitSet :)
Please can we actually import org.apache.lucene.util.FixedBitSet and just use FixedBitSet here :)
Yeah using FixedBitSet was intentional. As @benwtrent mentioned, FixedBitSet is the type used by all existing intoBitSet methods in Lucene which is why I kept it like that.
if (blockIterator.getMatch() == SkipBlockRangeIterator.Match.YES) {
  return doc = blockDoc;
}
Let's assert that we're not in a YES_IF_PRESENT block
return doc = NO_MORE_DOCS;
}
docToCheck = blockDoc;
if (blockIterator.getMatch() == SkipBlockRangeIterator.Match.YES) {

bitSet.set(blockStart - offset, blockEnd - offset);
break;

case YES_IF_PRESENT:
We're not expecting YES_IF_PRESENT here are we? If so then we need to take it into account in advance(), if not we should just assert false here.
IndexWriterConfig iwc = new IndexWriterConfig();
iwc.setCodec(new Lucene104Codec());
IndexWriter w = new IndexWriter(dir, iwc);
Random r = new Random(42);
This should use random() to get the test seed
public void testSingleFieldRangeCorrectness() throws Exception {
  Query q = SortedNumericDocValuesField.newSlowRangeQuery("age", 20, 40);
  int count = searcher.count(q);
  assertTrue("Should find some docs in range [20,40]", count > 0);
I don't think we can assert this with the randomly generated values? We could conceivably get all docs with value 1 on some (admittedly unlikely) seed.
 */
public class TestSkipBlockRangeIteratorIntoBitSet extends LuceneTestCase {

  private static final int DOC_COUNT = 50_000;
This seems like a lot of docs?
 * <p>Key behavioral notes:
 *
 * <ul>
 *   <li>Single-field range with a second clause (e.g., MatchAllDocsQuery): goes through {@code
I don't think we're testing for this case? In addition, it needs to be a restrictive filter of some kind, as MatchAllDocsQuery will get rewritten away by BQ.
 *       rangeIntoBitSet()}.
 * </ul>
 */
public class TestSkipBlockRangeIteratorIntoBitSet extends LuceneTestCase {
I think we should be doing some lower-level testing here, specifically of the intoBitSet call - you can look at TestSkipBlockRangeIterator to get an idea of what to check.
// Only use SIMD if vector length >= 4 (AVX-256 or better).
// On 128-bit SIMD (2 longs), the scratch buffer overhead outweighs the benefit.
if (vectorLen < 4) {
  // Scalar fallback: tight loop that JIT can auto-vectorize
Great optimization! One thought: the comments are not consistent across this PR — sometimes they claim the JIT may auto-vectorize, sometimes that it can.
I'm also curious: can the JIT really do this with a virtual method call, control flow, and a bitset operation in the loop? I'd guess it's hard; otherwise the hand-written vectorization for the range check in this PR wouldn't be necessary.
// Only use SIMD if vector length >= 4 (AVX-256 or better).
// On 128-bit SIMD (2 longs), the scratch buffer overhead outweighs the benefit.
if (vectorLen < 4) {
Curious: has the 128-bit SIMD case been validated as not worth vectorizing?
Also, with my limited knowledge: if small fromDoc-to-toDoc ranges reach this path, the scalar version might actually win. Would it be better to fall back to the scalar version based on doc range size rather than vector width?
scratch[i] = values.get(d + i);
}
LongVector v = LongVector.fromArray(LONG_SPECIES, scratch, 0);
VectorMask<Long> inRange =
Wondering if loop unrolling for SIMD can speed this up further (sample)? I suspect that if we were to profile this, the bottleneck might be the serial values.get(d + i) gather from packed values. If we could read more compact values with fewer loop iterations, and parallelize the range check across more CPU-level pipelines, that would be a win — but we'd need performance tests to vet it.
int base = d - offset;
while (maskBits != 0) {
  int bit = Long.numberOfTrailingZeros(maskBits);
  bitSet.set(base + bit);
The vectorized comparison is great, but here we do a per-bit loop for the bitset update. Since docs are consecutive, maskBits already stores the exact bits we want, and its max value is 0xFF on AVX-512 (8 lanes). We could OR the mask directly into the bitset word(s) in constant time — at most two word writes, with fewer branches. A sample method in FixedBitSet would be:
public void orMask(int startBit, long mask, int maskLen) {
  int wordIndex = startBit >> 6;
  int bitOffset = startBit & 63;
  // The low part of the mask always lands in the first word.
  bits[wordIndex] |= mask << bitOffset;
  // If the mask straddles a word boundary, spill the high part into the next word.
  if (bitOffset + maskLen > 64) {
    bits[wordIndex + 1] |= mask >>> (64 - bitOffset);
  }
}
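To sanity-check that the proposed orMask matches the existing per-bit loop, here is a self-contained sketch using a plain long[] in place of FixedBitSet.getBits(); the class name and test harness are illustrative only:

```java
import java.util.Arrays;
import java.util.Random;

// Verifies that a word-level orMask produces the same bits as the PR's
// per-bit Long.numberOfTrailingZeros loop, over random masks and offsets.
public class OrMaskSketch {
  static void orMask(long[] bits, int startBit, long mask, int maskLen) {
    int wordIndex = startBit >> 6;
    int bitOffset = startBit & 63;
    bits[wordIndex] |= mask << bitOffset;
    if (bitOffset + maskLen > 64) { // mask straddles a word boundary
      bits[wordIndex + 1] |= mask >>> (64 - bitOffset);
    }
  }

  static void perBit(long[] bits, int base, long maskBits) {
    // The current per-bit loop from the PR, adapted to a raw long[].
    while (maskBits != 0) {
      int bit = Long.numberOfTrailingZeros(maskBits);
      int idx = base + bit;
      bits[idx >> 6] |= 1L << (idx & 63);
      maskBits &= maskBits - 1; // clear lowest set bit
    }
  }

  public static void main(String[] args) {
    Random r = new Random(42);
    for (int i = 0; i < 1000; i++) {
      long[] a = new long[4];
      long[] b = new long[4];
      int start = r.nextInt(4 * 64 - 8); // leave room for an 8-lane mask
      long mask = r.nextInt(256);        // AVX-512: 8 lanes -> mask <= 0xFF
      orMask(a, start, mask, 8);
      perBit(b, start, mask);
      if (!Arrays.equals(a, b)) throw new AssertionError("mismatch at iteration " + i);
    }
    System.out.println("orMask matches per-bit loop");
  }
}
```
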
benwtrent left a comment
I am surprised to see such good numbers with so many perf opportunities still left to try!
Good idea :)
.getDocValuesRangeSupport();

// Static helper so anonymous inner classes can call DocValuesRangeSupport from the outer class
static void rangeIntoBitSetVectorized(
Nit: the name assumes it is vectorized, but it might be the "default" implementation. Can we just name this rangeIntoBitSet, or something other than "vectorized"?
int offset) {
// Scalar tight loop — JIT may auto-vectorize this on modern JVMs.
for (int d = fromDoc; d < toDoc; d++) {
  long v = values.get(d);
This tells me we eventually might actually want an int count = values.get(int[] docIds, long[] dest);
That is a larger change, but I suspect there is perf to be gained at a lower level just in decoding the long values.
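A rough sketch of what such a bulk accessor's default (per-doc) implementation might look like. None of these names exist in Lucene today; IntToLongFunction stands in for NumericDocValues, and codecs could override this with batched decoding of the packed storage:

```java
import java.util.Arrays;
import java.util.function.IntToLongFunction;

// Hypothetical bulk-get API: default implementation reads per doc, so
// existing per-doc codecs work unchanged; optimized codecs can override.
public class BulkGetSketch {
  static int get(IntToLongFunction values, int[] docIds, long[] dest) {
    for (int i = 0; i < docIds.length; i++) {
      dest[i] = values.applyAsLong(docIds[i]);
    }
    return docIds.length; // number of values filled into dest
  }

  public static void main(String[] args) {
    long[] dest = new long[3];
    // Stand-in "doc values": value = docId * 10
    int count = get(d -> d * 10L, new int[] {1, 5, 7}, dest);
    System.out.println(count + " " + Arrays.toString(dest)); // 3 [10, 50, 70]
  }
}
```
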
// Only use SIMD if vector length >= 4 (AVX-256 or better).
// On 128-bit SIMD (2 longs), the scratch buffer overhead outweighs the benefit.
if (vectorLen < 4) {
Have you benchmarked this to indicate no improvement here?
LongVector v = LongVector.fromArray(LONG_SPECIES, scratch, 0);
VectorMask<Long> inRange =
    v.compare(VectorOperators.GE, minValue).and(v.compare(VectorOperators.LE, maxValue));
long maskBits = inRange.toLong();
It's a huge shame to throw away the maskBits, which is already encoded as a long, especially when we know the bit set is a FixedBitSet and we have access to FixedBitSet.getBits ;)
for (int d = loopBound; d < toDoc; d++) {
  long v = values.get(d);
  if (v >= minValue && v <= maxValue) {
    bitSet.set(d - offset);
  }
}
I wonder if we will hit windows of density, where v passes our predicate for multiple docs in a row. In that case, we could take advantage of FixedBitSet.set(int startIndex, int endIndex) which would provide a substantial speed up in those dense regions.
This same idea goes for the default, etc. versions.
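A hedged sketch of the dense-run idea: detect runs of consecutive matching docs and update them with one ranged call instead of per-doc sets (FixedBitSet.set(int, int) in the real code). A boolean[] stands in for the bitset here, and all names are illustrative:

```java
import java.util.Arrays;

// Run-length scalar loop: scan forward through a window of density and
// mark the whole run at once, rather than setting bits one doc at a time.
public class DenseRunSketch {
  static void rangeIntoBits(
      long[] values, int fromDoc, int toDoc, long min, long max, boolean[] bits, int offset) {
    int d = fromDoc;
    while (d < toDoc) {
      if (values[d] < min || values[d] > max) {
        d++;
        continue;
      }
      int runStart = d; // start of a run of in-range docs
      while (d < toDoc && values[d] >= min && values[d] <= max) {
        d++;
      }
      // One ranged update for the run [runStart, d), analogous to
      // FixedBitSet.set(runStart - offset, d - offset).
      for (int i = runStart; i < d; i++) {
        bits[i - offset] = true;
      }
    }
  }

  public static void main(String[] args) {
    long[] vals = {5, 20, 21, 22, 9, 30};
    boolean[] bits = new boolean[6];
    rangeIntoBits(vals, 0, 6, 10, 25, bits, 0);
    // Docs 1..3 form a dense run inside [10, 25].
    System.out.println(Arrays.toString(bits));
  }
}
```
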
 *
 * @lucene.internal
 */
public interface DocValuesRangeSupport {
I think this support path, etc. all matches our existing patterns. Seems OK to me.
Description
Numeric range queries on dense fields use DocValuesRangeIterator, which is a TwoPhaseIterator that uses SkipBlockRangeIterator as an approximation. This works well, but for MAYBE blocks (where values partially overlap the query range), it still falls back to per-doc evaluation: each doc is checked individually via values.advance(doc) + values.longValue() + range comparison.
Since DocValuesRangeIterator is a TwoPhaseIterator, DenseConjunctionBulkScorer routes it through the leap-frog path (see here) and intoBitSet() is never called. This means SIMD is never used for MAYBE block evaluation, even though the underlying storage for dense fields is a packed long[] that's ideal for vectorized comparison.

PR changes
For dense singleton numeric fields with a skip index, replace DocValuesRangeIterator with a new BatchDocValuesRangeIterator, which is a plain DocIdSetIterator (not a TwoPhaseIterator). This was added to force DenseConjunctionBulkScorer to call intoBitSet() on it directly, enabling the bitset intersection path. I am open to suggestions on whether this is the right approach.

This PR also adds SIMD-accelerated bulk range evaluation for MAYBE (partial overlap) blocks, which seem to be the most expensive case when running range queries through doc values. For this we added the changes below:
Add NumericDocValues.rangeIntoBitSet(fromDoc, toDoc, minValue, maxValue, bitSet, offset): a new bulk API with a per-doc fallback default. Lucene90DocValuesProducer overrides this for dense fields to dispatch to the vectorization layer.

Add a DocValuesRangeSupport interface with two implementations: VectorizationProvider.getDocValuesRangeSupport() returns the appropriate implementation at startup.

Benchmarks
MultiFieldDocValuesRangeBenchmark (c5.2xlarge, AVX-512)
The numbers look great across the board!