DiversifyingChildren speedup - siblings expansion #16034

Draft
aruggero wants to merge 22 commits into apache:main from SeaseLtd:diversifyingImprovemntInspectChildrenSibiling
Conversation

@aruggero aruggero commented May 5, 2026

Sibling Expansion for DiversifyingChildrenKnnQuery HNSW Search

Summary

This contribution introduces sibling expansion as an optimization for KNN vector search over parent-child document relationships (i.e., DiversifyingChildrenFloatKnnVectorQuery / DiversifyingChildrenByteKnnVectorQuery).

When the HNSW graph searcher encounters a child node belonging to a newly discovered parent, all siblings of that parent (other children of the same parent not yet visited) are immediately scored and collected — without requiring further graph traversal to reach them. This improves recall for nested document use cases where multiple child vectors share a parent, as siblings that are close in the document structure may not be well-connected in the HNSW graph.

Changes

New interfaces (lucene/core)

  • ChildrenSiblingExpansion — implemented by KnnCollector instances that support ordinal-level sibling expansion during HNSW search. The searcher calls pendingSiblingOrdinals() before collecting a node to get unvisited siblings to score immediately.
  • DocSiblingExpansion — doc-ID-level companion used by OrdinalTranslatedKnnCollector to bridge between HNSW ordinal space and collector doc-ID space.
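A minimal sketch of what the ordinal-level contract could look like. The interface name and `pendingSiblingOrdinals` come from the description above; the exact signature, the `BitSet` visited-set, and the `BlockSiblings` toy implementation (which assumes each parent owns a contiguous block of child ordinals) are assumptions, not the actual patch:

```java
import java.util.Arrays;
import java.util.BitSet;

public class SiblingExpansionSketch {

  // Hypothetical ordinal-level contract implemented by collectors that
  // support sibling expansion during HNSW search.
  interface ChildrenSiblingExpansion {
    // Returns the ordinals of candidateOrd's siblings that have not been
    // visited yet; scratch may be reused to avoid per-call allocation.
    int[] pendingSiblingOrdinals(int candidateOrd, BitSet visited, int[] scratch);
  }

  // Toy implementation: parent i owns ordinals [i*c, (i+1)*c).
  static final class BlockSiblings implements ChildrenSiblingExpansion {
    private final int childrenPerParent;

    BlockSiblings(int childrenPerParent) {
      this.childrenPerParent = childrenPerParent;
    }

    @Override
    public int[] pendingSiblingOrdinals(int candidateOrd, BitSet visited, int[] scratch) {
      int start = (candidateOrd / childrenPerParent) * childrenPerParent;
      int[] out =
          (scratch != null && scratch.length >= childrenPerParent)
              ? scratch
              : new int[childrenPerParent];
      int n = 0;
      for (int ord = start; ord < start + childrenPerParent; ord++) {
        if (ord != candidateOrd && !visited.get(ord)) {
          out[n++] = ord;
        }
      }
      return Arrays.copyOf(out, n);
    }
  }
}
```

With 4 children per parent, asking for the siblings of ordinal 4 when ordinal 5 is already visited yields ordinals 6 and 7.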

Core HNSW searcher (AbstractHnswGraphSearcher)

  • Added sibling scoring logic: if the collector implements ChildrenSiblingExpansion, siblings are bulk-scored and inserted into the candidate queue before the triggering node is collected.
  • Respects the visit budget: only as many siblings are scored as the remaining visitLimit allows.
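The clamp in the second bullet can be sketched in isolation (`visitLimit` mirrors the KnnCollector budget; the helper itself is hypothetical):

```java
import java.util.Arrays;

public class VisitBudget {
  // Keep only as many siblings as the remaining visit budget allows;
  // an exhausted budget means no sibling is scored at all.
  static int[] clampToBudget(int[] siblings, long visitLimit, long visitedCount) {
    long remaining = visitLimit - visitedCount;
    if (remaining <= 0) {
      return new int[0];
    }
    return Arrays.copyOf(siblings, (int) Math.min(siblings.length, remaining));
  }
}
```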

Join module (lucene/join)

  • DiversifyingNearestChildrenKnnCollector now implements DocSiblingExpansion, returning the sibling doc IDs for a given child.
  • DiversifyingNearestChildrenKnnCollectorManager builds a docId-to-ordinal mapping used to translate siblings from doc-ID space back to vector ordinals.
  • OrdinalTranslatedKnnCollector wires the two interfaces together.
  • Minor cleanup: removed a dead while (heap.size() > k()) loop in topDocs(), simplified the heap update path (unnecessary upHeap branch removed), and changed downHeap return type from int to void.

Docid-to-Ordinal Cache
Sibling expansion requires translating child document IDs to HNSW vector ordinals at search time. To avoid rebuilding this mapping on every query, a segment-level docId-to-ordinal cache is introduced in DiversifyingNearestChildrenKnnCollectorManager.
It is populated lazily on the first query against a given segment+field combination and evicted automatically via addClosedListener when the segment closes — no manual lifecycle management required.
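The lazy-population-plus-eviction lifecycle can be sketched with a plain concurrent map. `IndexReader.CacheHelper.addClosedListener` is the real Lucene hook this mirrors; the string key and everything else below are simplified stand-ins:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.function.Supplier;

public class DocToOrdCacheSketch {
  // The real key would be (segment core cache key, field); a String stands in here.
  private final Map<String, int[]> cache = new ConcurrentHashMap<>();

  // Built on first access for a segment+field; the registered listener evicts
  // the entry when the segment closes, so there is no manual lifecycle.
  int[] get(String segmentAndField, Supplier<int[]> build, Consumer<Runnable> addClosedListener) {
    return cache.computeIfAbsent(
        segmentAndField,
        key -> {
          addClosedListener.accept(() -> cache.remove(key));
          return build.get();
        });
  }

  int size() {
    return cache.size();
  }
}
```

Running the close listener (here simulated by collecting it into a list) drops the entry, just as segment close would.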

Tests

  • TestDiversifyingChildrenKnnSiblingExpansion — comprehensive test suite covering correctness, early termination, and recall improvement with sibling expansion enabled vs. disabled.
  • DiversifyingChildrenKnnCollectorTestCase — shared test case base class.

Benchmarks

  • DiversifyingChildrenKnnQueryBenchmark — JMH benchmark covering multiple children-per-parent and k configurations, with and without the ordinal cache, to quantify the performance impact of sibling expansion.

Results

Benchmarks were run with JMH (Mode.AverageTime, 3 forks × 5 measurement iterations × 1 s each, 4 warmup iterations) on a single-segment index of 5,000 parents, 128-dimensional float vectors, DOT_PRODUCT similarity.

Three sibling-correlation scenarios are measured:

  • best — siblings nearly identical (noise = 0.05); early HNSW termination is expected.
  • standard — siblings moderately correlated (noise = 0.30); realistic production case.
  • worst — siblings fully random; expansion fires but provides no recall benefit (pure overhead measurement).

The table below compares main (no sibling expansion) against this branch (sibling). Lower is better (ms/op).

| children | k | correlation | main (ms/op) | sibling (ms/op) | overhead |
|---:|---:|---|---:|---:|---:|
| 4 | 10 | best | 0.070 ± 0.003 | 0.074 ± 0.003 | +5.7% |
| 4 | 10 | standard | 0.053 ± 0.003 | 0.060 ± 0.004 | +13.2% |
| 4 | 10 | worst | 0.050 ± 0.005 | 0.056 ± 0.003 | +12.0% |
| 4 | 100 | best | 0.400 ± 0.013 | 0.448 ± 0.008 | +12.0% |
| 4 | 100 | standard | 0.251 ± 0.012 | 0.321 ± 0.004 | +27.9% |
| 4 | 100 | worst | 0.270 ± 0.026 | 0.328 ± 0.011 | +21.5% |
| 8 | 10 | best | 0.101 ± 0.005 | 0.106 ± 0.007 | +5.0% |
| 8 | 10 | standard | 0.065 ± 0.003 | 0.084 ± 0.004 | +29.2% |
| 8 | 10 | worst | 0.064 ± 0.003 | 0.086 ± 0.005 | +34.4% |
| 8 | 100 | best | 0.642 ± 0.019 | 0.711 ± 0.041 | +10.7% |
| 8 | 100 | standard | 0.330 ± 0.027 | 0.499 ± 0.030 | +51.2% |
| 8 | 100 | worst | 0.307 ± 0.016 | 0.512 ± 0.028 | +66.8% |
| 16 | 10 | best | 0.147 ± 0.004 | 0.165 ± 0.008 | +12.2% |
| 16 | 10 | standard | 0.080 ± 0.004 | 0.128 ± 0.007 | +60.0% |
| 16 | 10 | worst | 0.075 ± 0.005 | 0.125 ± 0.007 | +66.7% |
| 16 | 100 | best | 0.985 ± 0.053 | 1.211 ± 0.047 | +22.9% |
| 16 | 100 | standard | 0.568 ± 0.022 | 0.815 ± 0.032 | +43.5% |
| 16 | 100 | worst | 0.496 ± 0.021 | 0.863 ± 0.046 | +74.0% |

Key observations:

  • Best-case overhead is low (~5–12%): when siblings are nearly identical, HNSW early termination kicks in after a parent is discovered, partially offsetting the cost of scoring extra nodes.
  • Overhead grows with children-per-parent: scoring more siblings per parent naturally increases latency; with 16 children, the overhead reaches ~67–74% in the worst case.
  • Overhead grows with k: larger result sets require exploring more of the graph before early termination can trigger, leaving more room for sibling scoring to accumulate.
  • Worst-case overhead is the upper bound: the worst scenario (fully random siblings) represents the theoretical ceiling — sibling expansion fires on every parent discovery but contributes nothing to recall. Real-world data almost always falls between standard and best.
  • This benchmark measures latency only, not recall. The primary motivation for sibling expansion is improving recall for correlated child vectors, a trade-off that is not captured here.

Final Considerations

Sibling expansion is always slower in these results because it always adds work.

For every candidate child found via HNSW, sibling expansion additionally:

  • Reads all sibling vectors from the index
  • Computes their dot-product scores
  • Tries to insert them into the top-k heap

That's extra distance computations and memory accesses on every candidate, regardless of whether siblings are useful.
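The three bullets above amount to the following per-candidate loop; `dot`, the `vectors` array, and the plain `PriorityQueue` are stand-ins for the actual scorer and heap, not Lucene API:

```java
import java.util.PriorityQueue;

public class ExpansionCost {
  static float dot(float[] a, float[] b) {
    float s = 0f;
    for (int i = 0; i < a.length; i++) {
      s += a[i] * b[i];
    }
    return s;
  }

  // For every sibling of a candidate: read its vector, compute the score,
  // and try to insert it into the top-k min-heap -- all extra work that is
  // paid whether or not the sibling turns out to be competitive.
  static void expand(int[] siblings, float[][] vectors, float[] query,
                     PriorityQueue<Float> topK, int k) {
    for (int ord : siblings) {
      topK.offer(dot(vectors[ord], query));
      if (topK.size() > k) {
        topK.poll(); // drop the current worst to keep only the best k
      }
    }
  }
}
```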
Good siblings help HNSW terminate earlier by quickly raising the competitive score threshold:

| Scenario | Siblings score | Threshold rises | HNSW early exit | Net overhead |
|---|---|---|---|---|
| best | High (nearly identical) | Fast | Yes | Small (~5–12%) |
| standard | Moderate | Moderate | Partial | Medium (~13–60%) |
| worst | Random/low | Barely | No | Large (~12–74%) |

In the best case, finding one good child means all its siblings score well too, rapidly raising the threshold and letting HNSW skip exploring many nodes. The expansion cost is partially offset by shorter traversal.
In the worst case, siblings score poorly, so the threshold rises slowly, HNSW explores more nodes, and you pay the expansion cost on every candidate — pure overhead with no benefit.

The problem without sibling expansion
HNSW explores the vector graph and finds candidate children. When it finds child C1 of parent P1, it scores P1 with C1's score. But P1 might have another child, C2, that scores much higher, and HNSW might never visit C2 because it's not a graph neighbor of C1.

Without sibling expansion, a parent can be under-scored or entirely missed because HNSW happened to find its weakest child first. The final ranking of parents is wrong.
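A tiny numeric illustration of that failure mode, with made-up scores and the parent score defined as the max over the children HNSW actually visited:

```java
public class ParentScoring {
  // Max-child diversification, but restricted to the children
  // the graph search happened to visit.
  static float parentScore(float[] childScores, boolean[] visited) {
    float best = Float.NEGATIVE_INFINITY;
    for (int i = 0; i < childScores.length; i++) {
      if (visited[i]) {
        best = Math.max(best, childScores[i]);
      }
    }
    return best;
  }
}
```

If P1's children score {0.3, 0.9} but HNSW only visits the first child, P1 is scored 0.3; with both children visited the true 0.9 is recovered.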

Benefits:

  • With sibling expansion, when a parent enters the result set, it's guaranteed to be scored by its actual best child, giving correct relative ranking between parents.
  • Better parent scores raise the HNSW competitive threshold faster, which is exactly why the best scenario (identical siblings) shows the smallest overhead — finding any child immediately gives the true parent score, and HNSW can prune more aggressively.

"Cons":
If HNSW has already found P1 via C1, sibling expansion just refines P1's score. It doesn't help discover P2, P3, ... Pk faster. You're spending time on a parent already in the result set instead of exploring graph edges toward new undiscovered parents.
Looking at the numbers, sibling expansion is always slower — even in the best scenario. The early termination benefit never outweighs the sibling-checking cost.

The real benefit is recall/precision, not speed.


aruggero commented May 5, 2026

Hi @benwtrent,
I worked on the sibling expansion topic you discussed via email with @alessandrobenedetti.

From your conversation:

> I wonder if we can cheat and when we find a nearest child, we simply gather and score ALL children of a parent ord, expecting them all to be near and bulk collecting and scoring them. This sort of dynamic exploration would allow min, max, and average score exploration (at some extra graph exploration cost). It might even make the baseline max score exploration faster. This will take some refactoring. I think if we made the KnnCollector interface keep track of the visited set, it could be done. It also unlocks things in Elastic search & Open search as we periodically want the nearest top paragraphs for each nearest parent doc.

Sorry for the long description of this draft PR!
I reported all the changes made for the implementation, and mostly the benchmark results.
It seems that this approach can improve the precision/recall of the returned results, but not the overall time required for the computation.

Let me know what you think about this :)

```java
assert scores != null && scores.length >= eps.length;
scorer.bulkScore(eps, scores, eps.length);
results.incVisitedCount(eps.length);
float[] siblingScores = null;
```
Member:
So, let's not do it in the entry point exploration. I think just doing max there is the best way.

Author:

Hi @benwtrent,
From what @alessandrobenedetti and I had seen, scoreEntryPoints() is done only once (multiple times only with a seeded query in knn), since we have:

org.apache.lucene.util.hnsw.AbstractHnswGraphSearcher#search
search() calling findBestEntryPoint() + searchLevel()

org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel
and searchLevel() (which calls scoreEntryPoints()) is called only on level 0

It shouldn't add too much work here, right?

Moving this to the search-only part would break some assertions on visits that we would need to manage otherwise...

Member:

scoreEntryPoints is about getting to the best approximate option in the bottom layer. I don't think it adds too much work, but I wonder whether it is helping at all; I suspect it isn't, and thus it shouldn't be done.

Comment on lines +351 to +352:

```java
// Fetch siblings BEFORE collect() so the parent is not yet in the heap
int[] siblings = null;
```
Member:

Let's reuse scratch space (it won't add much, but we definitely shouldn't be creating a new int[] on every sibling expansion.)

I would adjust siblings = expander.pendingSiblingOrdinals(node, visited, siblings); to allow reusing scratch space or expanding it and then reusing it later.

Author:

Addressed, let me know if the new implementation is what you expected.

@benwtrent (Member):

@aruggero thank you for the first pass here!

Hmmm, it is frustrating that bulk scoring the children doesn't help much. I guess it makes sense as we are forcing more scoring per node.

I do think we shouldn't do the expansion when gathering the entry point.

I wonder if there is a "hybrid" scenario where we don't expand until the collector queue is full....

Interesting numbers for sure!

```java
// HNSW nodes are identified by their ordinal (the position in the flat vector store). So when
// the searcher returns ordinal k as a graph node, docToOrd[docId] = k being correct means
// docIdToOrdinal will find the right HNSW node for any sibling docId.
private int[] buildDocToOrd(LeafReaderContext context) throws IOException {
```
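For reference, the inversion this method needs can be sketched in a self-contained way. The real implementation would walk the segment's vector values; `docIdsInOrdinalOrder` below is a hypothetical stand-in for that iteration:

```java
import java.util.Arrays;

public class DocToOrdSketch {
  // Vectors are stored densely by ordinal (0..n-1) while child docIds can be
  // sparse within the segment; invert the ord->doc order into a doc-indexed
  // lookup where -1 means "no vector for this doc".
  static int[] buildDocToOrd(int[] docIdsInOrdinalOrder, int maxDoc) {
    int[] docToOrd = new int[maxDoc];
    Arrays.fill(docToOrd, -1);
    for (int ord = 0; ord < docIdsInOrdinalOrder.length; ord++) {
      docToOrd[docIdsInOrdinalOrder[ord]] = ord;
    }
    return docToOrd;
  }
}
```

Both inputs are segment-local, which is why the mapping can be cached per segment.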
Author:

Sorry for the big comment above @benwtrent
Just to remember why some choices were made :)
Could you check if this method is correctly implemented?

We should be inside a single segment at this point, right?
And both ordinals and docids are segment-related (not globally unique), right?

@github-actions github-actions Bot added this to the 11.0.0 milestone May 8, 2026

aruggero commented May 11, 2026

Here are the new benchmark results, thanks to the scratch space reuse:

| children | k | correlation | main (ms/op) | sibling (ms/op) | overhead |
|---:|---:|---|---:|---:|---:|
| 4 | 10 | best | 0.070 ± 0.003 | 0.072 ± 0.003 | +2.9% |
| 4 | 10 | standard | 0.053 ± 0.003 | 0.058 ± 0.004 | +9.4% |
| 4 | 10 | worst | 0.050 ± 0.005 | 0.057 ± 0.003 | +14.0% |
| 4 | 100 | best | 0.400 ± 0.013 | 0.452 ± 0.016 | +13.0% |
| 4 | 100 | standard | 0.251 ± 0.012 | 0.309 ± 0.012 | +23.1% |
| 4 | 100 | worst | 0.270 ± 0.026 | 0.305 ± 0.009 | +13.0% |
| 8 | 10 | best | 0.101 ± 0.005 | 0.109 ± 0.006 | +7.9% |
| 8 | 10 | standard | 0.065 ± 0.003 | 0.078 ± 0.003 | +20.0% |
| 8 | 10 | worst | 0.064 ± 0.003 | 0.080 ± 0.003 | +25.0% |
| 8 | 100 | best | 0.642 ± 0.019 | 0.716 ± 0.017 | +11.5% |
| 8 | 100 | standard | 0.330 ± 0.027 | 0.486 ± 0.027 | +47.3% |
| 8 | 100 | worst | 0.307 ± 0.016 | 0.488 ± 0.028 | +59.0% |
| 16 | 10 | best | 0.147 ± 0.004 | 0.151 ± 0.008 | +2.7% |
| 16 | 10 | standard | 0.080 ± 0.004 | 0.109 ± 0.007 | +36.3% |
| 16 | 10 | worst | 0.075 ± 0.005 | 0.107 ± 0.002 | +42.7% |
| 16 | 100 | best | 0.985 ± 0.053 | 1.144 ± 0.040 | +16.1% |
| 16 | 100 | standard | 0.568 ± 0.022 | 0.858 ± 0.071 | +51.1% |
| 16 | 100 | worst | 0.496 ± 0.021 | 0.880 ± 0.052 | +77.4% |
| Scenario | Siblings score | Threshold rises | HNSW early exit | Previous overhead | Current overhead |
|---|---|---|---|---|---|
| best | High (nearly identical) | Fast | Yes | ~5–12% | ~3–16% |
| standard | Moderate | Moderate | Partial | ~13–60% | ~9–51% |
| worst | Random/low | Barely | No | ~12–74% | ~13–77% |

The main change worth calling out:

  • standard improved meaningfully at the top end (60% → 51%) thanks to scratch space reuse — that's the case most representative of real-world data.
  • The best lower bound dropped to 3% (nearly free for well-correlated siblings with small k).
  • The worst upper bound nudged up slightly (74% → 77%), but that's within benchmark noise at children=16, k=100.

We still have a significant overhead in general.

@benwtrent (Member):

@aruggero wow, thank you a bunch for all the benchmarks and work!

man, it is frustrating how this doesn't give us ANYTHING out of the box :( I suppose it makes sense. The benefit of HNSW is that it's a very sparse graph with big jumps, so multiplying the ops per connection is nasty. I was sort of holding out hope that we would get a cheeky perf bump :( Especially since bulk scoring is usually much faster (since vectors are also near each other in memory) and paragraphs/sub-vectors should be near one another in space already.

While I am not sure this is something we should provide OOTB for all diverse children queries, it does seem neat to be able to provide "give me parent docs based on the FURTHEST child"; this forces even the most irrelevant child to be considered.

@aruggero (Author):

@benwtrent, even if this implementation is not worth contributing, @alessandrobenedetti and I were wondering if maybe the benchmark is?

@aruggero (Author):

> @aruggero wow, thank you a bunch for all the benchmarks and work!
>
> man, it is frustrating how this doesn't give us ANYTHING out of the box :( I suppose it makes sense. The benefit of HNSW is that it's a very sparse graph with big jumps, so multiplying the ops per connection is nasty. I was sort of holding out hope that we would get a cheeky perf bump :( Especially since bulk scoring is usually much faster (since vectors are also near each other in memory) and paragraphs/sub-vectors should be near one another in space already.
>
> While I am not sure this is something we should provide OOTB for all diverse children queries, it does seem neat to be able to provide "give me parent docs based on the FURTHEST child"; this forces even the most irrelevant child to be considered.

Do you mean like sorting the results by the "best-worst" child of a parent?
E.g., the worst child of parent 1 is better than the worst child of parent 2 and therefore should come earlier in the result list?
