DiversifyingChildren speedup - siblings expansion#16034
Conversation
Hi @benwtrent, from your conversation:
Sorry for the long description of this draft PR! Let me know what you think about this :)
```java
assert scores != null && scores.length >= eps.length;
scorer.bulkScore(eps, scores, eps.length);
results.incVisitedCount(eps.length);
float[] siblingScores = null;
```
So, let's not do it in the entry point exploration. I think just doing max there is the best way.
Hi @benwtrent,
From what @alessandrobenedetti and I have seen, scoreEntryPoints() is executed only once (multiple times only with a seeded query in knn), since we have:
org.apache.lucene.util.hnsw.AbstractHnswGraphSearcher#search
search() calling findBestEntryPoint() + searchLevel()
org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel
and searchLevel() (which calls scoreEntryPoints()) is called only on level 0
It shouldn't add too much work here, right?
Moving this to the search-only part would break some assertions on visits that we would need to manage otherwise...
scoreEntryPoints is about getting to the best approximate option in the bottom layer. I don't think it adds too much work, but I wonder whether it is helping at all; I suspect it isn't, and thus it shouldn't be done.
```java
// Fetch siblings BEFORE collect() so the parent is not yet in the heap
int[] siblings = null;
```
Let's reuse scratch space (it won't add much, but we definitely shouldn't be creating a new int[] on every sibling expansion).
I would adjust siblings = expander.pendingSiblingOrdinals(node, visited, siblings); to allow reusing the scratch space, or expanding it and then reusing it later.
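That suggestion might look roughly like the sketch below. Everything here is a hypothetical simplification: `SiblingScratch`, the `parentOf` array, the boolean `visited` array, and the `count` out-parameter stand in for the real join structure, Lucene's `BitSet`, and `ArrayUtil`-style growth.

```java
import java.util.Arrays;

// Hypothetical sketch: reuse a caller-provided scratch array instead of
// allocating a new int[] on every sibling expansion.
class SiblingScratch {
  // Parent id per ordinal; stands in for the real parent/child join structure.
  private final int[] parentOf;

  SiblingScratch(int[] parentOf) {
    this.parentOf = parentOf;
  }

  /**
   * Writes the unvisited siblings of {@code node} into {@code scratch},
   * growing it only when needed. The number of valid entries is written to
   * {@code count[0]} so the same array can be reused across calls.
   */
  int[] pendingSiblingOrdinals(int node, boolean[] visited, int[] scratch, int[] count) {
    int n = 0;
    for (int ord = 0; ord < parentOf.length; ord++) {
      if (ord != node && parentOf[ord] == parentOf[node] && !visited[ord]) {
        if (scratch == null || n == scratch.length) {
          // Grow geometrically, as Lucene's ArrayUtil.grow would.
          scratch = (scratch == null) ? new int[4] : Arrays.copyOf(scratch, scratch.length * 2);
        }
        scratch[n++] = ord;
      }
    }
    count[0] = n;
    return scratch; // caller keeps this reference for the next expansion
  }
}
```

The caller holds on to the returned array and passes it back on the next expansion, so steady-state expansions allocate nothing.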
Addressed, let me know if the new implementation is what you expected.
@aruggero thank you for the first pass here! Hmmm, it is frustrating that bulk scoring the children doesn't help much. I guess it makes sense, as we are forcing more scoring per node. I do think we shouldn't do the expansion when gathering the entry point. I wonder if there is a "hybrid" scenario where we don't expand until the collector queue is full... Interesting numbers for sure!
```java
// HNSW nodes are identified by their ordinal (the position in the flat vector store). So when the searcher returns
// ordinal k as a graph node, docToOrd[docId] = k being correct means docIdToOrdinal will find the right HNSW node
// for any sibling docId.
private int[] buildDocToOrd(LeafReaderContext context) throws IOException {
```
Sorry for the big comment above @benwtrent
Just to remember why some choices were made :)
Could you check if this method is correctly implemented?
We should be inside a single segment at this point, right?
And both ordinals and docids are segment-related (not globally unique), right?
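For reference, a minimal sketch of the mapping such a method builds, assuming (as in Lucene's flat vector storage) that ordinal k is assigned to the k-th document in the segment, in docId order, that has a vector. The sorted `docsWithVector` array is a hypothetical stand-in for iterating the segment's vector values; both docIds and ordinals here are segment-relative.

```java
import java.util.Arrays;

// Hypothetical sketch of a segment-local docId -> ordinal map.
// Iterating docs-with-vectors in docId order yields ordinals 0, 1, 2, ...
class DocToOrd {
  static int[] buildDocToOrd(int maxDoc, int[] docsWithVector /* sorted segment docIds */) {
    int[] docToOrd = new int[maxDoc];
    Arrays.fill(docToOrd, -1); // -1 marks docs without a vector
    for (int ord = 0; ord < docsWithVector.length; ord++) {
      docToOrd[docsWithVector[ord]] = ord;
    }
    return docToOrd;
  }
}
```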
Here are the new benchmark results, thanks to the scratch space reuse:
The main change worth calling out:
We still have a significant overhead in general.
@aruggero wow, thank you a bunch for all the benchmarks and work! Man, it is frustrating how this doesn't give us ANYTHING out of the box :( I suppose it makes sense. The benefit of HNSW is that it's a very sparse graph with big jumps, so multiplying the ops per connection is nasty. I was sort of holding out hope that we would get a cheeky perf bump :( Especially since bulk scoring is usually much faster (since vectors are also near each other in memory) and paragraphs/sub-vectors should be near one another in space already. While I am not sure this is something we should provide OOTB for all diverse children queries, it does seem neat to be able to provide "give me parent docs based on the FURTHEST child", as this forces even the most irrelevant child to be considered.
@benwtrent, even if this implementation is not worth contributing, @alessandrobenedetti and I were wondering if maybe the benchmark is?
Do you mean like sorting the results by the "best-worst" child of a parent?
Sibling Expansion for DiversifyingChildrenKnnQuery HNSW Search
Summary
This contribution introduces sibling expansion as an optimization for KNN vector search over parent-child document relationships (i.e., DiversifyingChildrenFloatKnnVectorQuery / DiversifyingChildrenByteKnnVectorQuery). When the HNSW graph searcher encounters a child node belonging to a newly discovered parent, all siblings of that parent (other children of the same parent not yet visited) are immediately scored and collected — without requiring further graph traversal to reach them. This improves recall for nested-document use cases where multiple child vectors share a parent, as siblings that are close in the document structure may not be well connected in the HNSW graph.
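The behavior described above can be sketched as a toy searcher step. The interface shape, method names, and the -1 list terminator below are illustrative simplifications, not the actual PR code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the PR's actual code): before collecting a
// candidate node, the searcher asks the collector for that node's
// unvisited siblings and scores/queues them first.
class SiblingAwareStep {
  interface ChildrenSiblingExpansion {
    /** Unvisited sibling ordinals of {@code node}; -1-terminated here for simplicity. */
    int[] pendingSiblingOrdinals(int node, boolean[] visited);
  }

  static List<Integer> visitOrder(int node, boolean[] visited, ChildrenSiblingExpansion exp) {
    List<Integer> order = new ArrayList<>();
    if (exp != null) {
      for (int sib : exp.pendingSiblingOrdinals(node, visited)) {
        if (sib < 0) break;  // hypothetical end-of-list marker
        visited[sib] = true;
        order.add(sib);      // siblings are scored and queued first
      }
    }
    visited[node] = true;
    order.add(node);         // then the triggering node itself is collected
    return order;
  }
}
```

The point of scoring siblings before collecting the triggering node is that the parent's best child can set the parent's score even when the graph never reaches that child directly.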
Changes
New interfaces (lucene/core)
- ChildrenSiblingExpansion — implemented by KnnCollector instances that support ordinal-level sibling expansion during HNSW search. The searcher calls pendingSiblingOrdinals() before collecting a node to get the unvisited siblings to score immediately.
- DocSiblingExpansion — doc-ID-level companion used by OrdinalTranslatedKnnCollector to bridge between HNSW ordinal space and collector doc-ID space.
Core HNSW searcher (AbstractHnswGraphSearcher)
- When the collector implements ChildrenSiblingExpansion, siblings are bulk-scored and inserted into the candidate queue before the triggering node is collected.
- Expansion only proceeds as far as the visitLimit allows.
Join module (lucene/join)
- DiversifyingNearestChildrenKnnCollector now implements DocSiblingExpansion, returning the sibling doc IDs for a given child.
- DiversifyingNearestChildrenKnnCollectorManager builds a docId-to-ordinal mapping used to translate siblings from doc-ID space back to vector ordinals.
- OrdinalTranslatedKnnCollector wires the two interfaces together.
- topDocs(): simplified the heap update path (the unnecessary upHeap branch was removed) and changed the downHeap return type from int to void.
Docid-to-Ordinal Cache
Sibling expansion requires translating child document IDs to HNSW vector ordinals at search time. To avoid rebuilding this mapping on every query, a segment-level docId-to-ordinal cache is introduced in DiversifyingNearestChildrenKnnCollectorManager. It is populated lazily on the first query against a given segment+field combination and evicted automatically via addClosedListener when the segment closes — no manual lifecycle management required.
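The caching scheme above can be sketched like this. The close hook is simplified to a plain Runnable registration (Lucene's real IndexReader.CacheHelper.addClosedListener takes a listener keyed by the reader's cache key), and the key composition and class name are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Illustrative per-segment cache: built lazily on first use, evicted when
// the segment signals close. Keyed by (segment, field) via a composed key.
class SegmentOrdCache {
  private final Map<String, int[]> cache = new ConcurrentHashMap<>();

  int[] get(String segmentKey, String field,
            Supplier<int[]> build, Consumer<Runnable> addClosedListener) {
    String key = segmentKey + "#" + field;
    return cache.computeIfAbsent(key, k -> {
      // Register eviction once, when the entry is first created.
      addClosedListener.accept(() -> cache.remove(k));
      return build.get();
    });
  }

  int size() { return cache.size(); }
}
```

computeIfAbsent gives the lazy population for free: the expensive docId-to-ordinal build runs only for the first query to touch a given segment+field, and later queries get the cached array.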
Tests
- TestDiversifyingChildrenKnnSiblingExpansion — comprehensive test suite covering correctness, early termination, and recall improvement with sibling expansion enabled vs. disabled.
- DiversifyingChildrenKnnCollectorTestCase — shared test case base class.
Benchmarks
- DiversifyingChildrenKnnQueryBenchmark — JMH benchmark covering multiple children-per-parent and k configurations, with and without the ordinal cache, to quantify the performance impact of sibling expansion.
Results
Benchmarks were run with JMH (Mode.AverageTime, 3 forks × 5 measurement iterations × 1 s each, 4 warmup iterations) on a single-segment index of 5,000 parents, 128-dimensional float vectors, DOT_PRODUCT similarity.
Three sibling-correlation scenarios are measured:
The table below compares main (no sibling expansion) against this branch (sibling). Lower is better (ms/op).
Key observations:
Final Considerations
Sibling expansion is always slower in these results because it always adds work.
For every candidate child found via HNSW, sibling expansion additionally looks up the parent's remaining unvisited siblings, scores them, and inserts them into the candidate queue.
That's extra distance computations and memory accesses on every candidate, regardless of whether siblings are useful.
Good siblings help HNSW terminate earlier by quickly raising the competitive score threshold:
In the best case, finding one good child means all its siblings score well too, rapidly raising the threshold and letting HNSW skip exploring many nodes. The expansion cost is partially offset by shorter traversal.
In the worst case, siblings score poorly, so the threshold rises slowly, HNSW explores more nodes, and you pay the expansion cost on every candidate — pure overhead with no benefit.
The problem without sibling expansion
HNSW explores the vector graph and finds candidate children. When it finds child C1 of parent P1, it scores P1 with C1's score. But P1 might have another child, C2, that scores much higher, and HNSW might never visit C2 because it's not a graph neighbor of C1.
Without sibling expansion, a parent can be under-scored or entirely missed because HNSW happened to find its weakest child first. The final ranking of parents is wrong.
Benefits:
"Cons":
If HNSW has already found P1 via C1, sibling expansion just refines P1's score. It doesn't help discover P2, P3, ... Pk faster. You're spending time on a parent already in the result set instead of exploring graph edges toward new undiscovered parents.
Looking at the numbers, sibling expansion is always slower — even in the best scenario. The early termination benefit never outweighs the sibling-checking cost.
The real benefit is recall/precision, not speed.