DiversifyingChildren speedup - siblings expansion #16034

Draft
aruggero wants to merge 22 commits into apache:main from SeaseLtd:diversifyingImprovemntInspectChildrenSibiling
Conversation

@aruggero aruggero commented May 5, 2026

Sibling Expansion for DiversifyingChildrenKnnQuery HNSW Search

Summary

This contribution introduces sibling expansion as an optimization for KNN vector search over parent-child document relationships (i.e., DiversifyingChildrenFloatKnnVectorQuery / DiversifyingChildrenByteKnnVectorQuery).

When the HNSW graph searcher encounters a child node belonging to a newly discovered parent, all siblings of that parent (other children of the same parent not yet visited) are immediately scored and collected — without requiring further graph traversal to reach them. This improves recall for nested document use cases where multiple child vectors share a parent, as siblings that are close in the document structure may not be well-connected in the HNSW graph.

Changes

New interfaces (lucene/core)

  • ChildrenSiblingExpansion — implemented by KnnCollector instances that support ordinal-level sibling expansion during HNSW search. The searcher calls pendingSiblingOrdinals() before collecting a node to get unvisited siblings to score immediately.
  • DocSiblingExpansion — doc-ID-level companion used by OrdinalTranslatedKnnCollector to bridge between HNSW ordinal space and collector doc-ID space.
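A minimal sketch of what the ordinal-level contract could look like. The interface name and `pendingSiblingOrdinals` come from the description above; the exact signature, the `BitSet` visited-set, and the `BlockSiblings` toy implementation (which assumes each parent owns a contiguous block of child ordinals) are assumptions, not the actual patch:

```java
import java.util.Arrays;
import java.util.BitSet;

public class SiblingExpansionSketch {

  // Hypothetical ordinal-level contract implemented by collectors that
  // support sibling expansion during HNSW search.
  interface ChildrenSiblingExpansion {
    // Returns the ordinals of candidateOrd's siblings that have not been
    // visited yet; scratch may be reused to avoid per-call allocation.
    int[] pendingSiblingOrdinals(int candidateOrd, BitSet visited, int[] scratch);
  }

  // Toy implementation: parent i owns ordinals [i*c, (i+1)*c).
  static final class BlockSiblings implements ChildrenSiblingExpansion {
    private final int childrenPerParent;

    BlockSiblings(int childrenPerParent) {
      this.childrenPerParent = childrenPerParent;
    }

    @Override
    public int[] pendingSiblingOrdinals(int candidateOrd, BitSet visited, int[] scratch) {
      int start = (candidateOrd / childrenPerParent) * childrenPerParent;
      int[] out =
          (scratch != null && scratch.length >= childrenPerParent)
              ? scratch
              : new int[childrenPerParent];
      int n = 0;
      for (int ord = start; ord < start + childrenPerParent; ord++) {
        if (ord != candidateOrd && !visited.get(ord)) {
          out[n++] = ord;
        }
      }
      return Arrays.copyOf(out, n);
    }
  }
}
```

With 4 children per parent, asking for the siblings of ordinal 4 when ordinal 5 is already visited yields ordinals 6 and 7.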

Core HNSW searcher (AbstractHnswGraphSearcher)

  • Added sibling scoring logic: if the collector implements ChildrenSiblingExpansion, siblings are bulk-scored and inserted into the candidate queue before the triggering node is collected.
  • Respects the visit budget: only as many siblings are scored as the remaining visitLimit allows.
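The clamp in the second bullet can be sketched in isolation (`visitLimit` mirrors the KnnCollector budget; the helper itself is hypothetical):

```java
import java.util.Arrays;

public class VisitBudget {
  // Keep only as many siblings as the remaining visit budget allows;
  // an exhausted budget means no sibling is scored at all.
  static int[] clampToBudget(int[] siblings, long visitLimit, long visitedCount) {
    long remaining = visitLimit - visitedCount;
    if (remaining <= 0) {
      return new int[0];
    }
    return Arrays.copyOf(siblings, (int) Math.min(siblings.length, remaining));
  }
}
```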

Join module (lucene/join)

  • DiversifyingNearestChildrenKnnCollector now implements DocSiblingExpansion, returning the sibling doc IDs for a given child.
  • DiversifyingNearestChildrenKnnCollectorManager builds a docId-to-ordinal mapping used to translate siblings from doc-ID space back to vector ordinals.
  • OrdinalTranslatedKnnCollector wires the two interfaces together.
  • Minor cleanup: removed a dead while (heap.size() > k()) loop in topDocs(), simplified the heap update path (unnecessary upHeap branch removed), and changed downHeap return type from int to void.

Docid-to-Ordinal Cache
Sibling expansion requires translating child document IDs to HNSW vector ordinals at search time. To avoid rebuilding this mapping on every query, a segment-level docId-to-ordinal cache is introduced in DiversifyingNearestChildrenKnnCollectorManager.
It is populated lazily on the first query against a given segment+field combination and evicted automatically via addClosedListener when the segment closes — no manual lifecycle management required.
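The lazy-population-plus-eviction lifecycle can be sketched with a plain concurrent map. `IndexReader.CacheHelper.addClosedListener` is the real Lucene hook this mirrors; the string key and everything else below are simplified stand-ins:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.function.Supplier;

public class DocToOrdCacheSketch {
  // The real key would be (segment core cache key, field); a String stands in here.
  private final Map<String, int[]> cache = new ConcurrentHashMap<>();

  // Built on first access for a segment+field; the registered listener evicts
  // the entry when the segment closes, so there is no manual lifecycle.
  int[] get(String segmentAndField, Supplier<int[]> build, Consumer<Runnable> addClosedListener) {
    return cache.computeIfAbsent(
        segmentAndField,
        key -> {
          addClosedListener.accept(() -> cache.remove(key));
          return build.get();
        });
  }

  int size() {
    return cache.size();
  }
}
```

Running the close listener (here simulated by collecting it into a list) drops the entry, just as segment close would.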

Tests

  • TestDiversifyingChildrenKnnSiblingExpansion — comprehensive test suite covering correctness, early termination, and recall improvement with sibling expansion enabled vs. disabled.
  • DiversifyingChildrenKnnCollectorTestCase — shared test case base class.

Benchmarks

  • DiversifyingChildrenKnnQueryBenchmark — JMH benchmark covering multiple children-per-parent and k configurations, with and without the ordinal cache, to quantify the performance impact of sibling expansion.

Results

Benchmarks were run with JMH (Mode.AverageTime, 3 forks × 5 measurement iterations × 1 s each, 4 warmup iterations) on a single-segment index of 5,000 parents, 128-dimensional float vectors, DOT_PRODUCT similarity.

Three sibling-correlation scenarios are measured:

  • best — siblings nearly identical (noise = 0.05); early HNSW termination is expected.
  • standard — siblings moderately correlated (noise = 0.30); realistic production case.
  • worst — siblings fully random; expansion fires but provides no recall benefit (pure overhead measurement).

The table below compares main (no sibling expansion) against this branch (sibling). Lower is better (ms/op).

| children | k | correlation | main (ms/op) | sibling (ms/op) | overhead |
|---:|---:|---|---:|---:|---:|
| 4 | 10 | best | 0.070 ± 0.003 | 0.074 ± 0.003 | +5.7% |
| 4 | 10 | standard | 0.053 ± 0.003 | 0.060 ± 0.004 | +13.2% |
| 4 | 10 | worst | 0.050 ± 0.005 | 0.056 ± 0.003 | +12.0% |
| 4 | 100 | best | 0.400 ± 0.013 | 0.448 ± 0.008 | +12.0% |
| 4 | 100 | standard | 0.251 ± 0.012 | 0.321 ± 0.004 | +27.9% |
| 4 | 100 | worst | 0.270 ± 0.026 | 0.328 ± 0.011 | +21.5% |
| 8 | 10 | best | 0.101 ± 0.005 | 0.106 ± 0.007 | +5.0% |
| 8 | 10 | standard | 0.065 ± 0.003 | 0.084 ± 0.004 | +29.2% |
| 8 | 10 | worst | 0.064 ± 0.003 | 0.086 ± 0.005 | +34.4% |
| 8 | 100 | best | 0.642 ± 0.019 | 0.711 ± 0.041 | +10.7% |
| 8 | 100 | standard | 0.330 ± 0.027 | 0.499 ± 0.030 | +51.2% |
| 8 | 100 | worst | 0.307 ± 0.016 | 0.512 ± 0.028 | +66.8% |
| 16 | 10 | best | 0.147 ± 0.004 | 0.165 ± 0.008 | +12.2% |
| 16 | 10 | standard | 0.080 ± 0.004 | 0.128 ± 0.007 | +60.0% |
| 16 | 10 | worst | 0.075 ± 0.005 | 0.125 ± 0.007 | +66.7% |
| 16 | 100 | best | 0.985 ± 0.053 | 1.211 ± 0.047 | +22.9% |
| 16 | 100 | standard | 0.568 ± 0.022 | 0.815 ± 0.032 | +43.5% |
| 16 | 100 | worst | 0.496 ± 0.021 | 0.863 ± 0.046 | +74.0% |

Key observations:

  • Best-case overhead is low (~5–12%): when siblings are nearly identical, HNSW early termination kicks in after a parent is discovered, partially offsetting the cost of scoring extra nodes.
  • Overhead grows with children-per-parent: scoring more siblings per parent naturally increases latency; with 16 children, the overhead reaches ~67–74% in the worst case.
  • Overhead grows with k: larger result sets require exploring more of the graph before early termination can trigger, leaving more room for sibling scoring to accumulate.
  • Worst-case overhead is the upper bound: the worst scenario (fully random siblings) represents the theoretical ceiling — sibling expansion fires on every parent discovery but contributes nothing to recall. Real-world data almost always falls between standard and best.
  • This benchmark measures latency only, not recall. The primary motivation for sibling expansion is improving recall for correlated child vectors, a trade-off that is not captured here.

Final Considerations

Sibling expansion is always slower in these results because it always adds work.

For every candidate child found via HNSW, sibling expansion additionally:

  • Reads all sibling vectors from the index
  • Computes their dot-product scores
  • Tries to insert them into the top-k heap

That's extra distance computations and memory accesses on every candidate, regardless of whether siblings are useful.
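The three bullets above amount to the following per-candidate loop; `dot`, the `vectors` array, and the plain `PriorityQueue` are stand-ins for the actual scorer and heap, not Lucene API:

```java
import java.util.PriorityQueue;

public class ExpansionCost {
  static float dot(float[] a, float[] b) {
    float s = 0f;
    for (int i = 0; i < a.length; i++) {
      s += a[i] * b[i];
    }
    return s;
  }

  // For every sibling of a candidate: read its vector, compute the score,
  // and try to insert it into the top-k min-heap -- all extra work that is
  // paid whether or not the sibling turns out to be competitive.
  static void expand(int[] siblings, float[][] vectors, float[] query,
                     PriorityQueue<Float> topK, int k) {
    for (int ord : siblings) {
      topK.offer(dot(vectors[ord], query));
      if (topK.size() > k) {
        topK.poll(); // drop the current worst to keep only the best k
      }
    }
  }
}
```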
Good siblings help HNSW terminate earlier by quickly raising the competitive score threshold:

| Scenario | Siblings score | Threshold rises | HNSW early exit | Net overhead |
|---|---|---|---|---|
| best | High (nearly identical) | Fast | Yes | Small (~5–12%) |
| standard | Moderate | Moderate | Partial | Medium (~13–60%) |
| worst | Random/low | Barely | No | Large (~12–74%) |

In the best case, finding one good child means all its siblings score well too, rapidly raising the threshold and letting HNSW skip exploring many nodes. The expansion cost is partially offset by shorter traversal.
In the worst case, siblings score poorly, so the threshold rises slowly, HNSW explores more nodes, and you pay the expansion cost on every candidate — pure overhead with no benefit.

The problem without sibling expansion
HNSW explores the vector graph and finds candidate children. When it finds child C1 of parent P1, it scores P1 with C1's score. But P1 might have another child, C2, that scores much higher, and HNSW might never visit C2 because it's not a graph neighbor of C1.

Without sibling expansion, a parent can be under-scored or entirely missed because HNSW happened to find its weakest child first. The final ranking of parents is wrong.
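A tiny numeric illustration of that failure mode, with made-up scores and the parent score defined as the max over the children HNSW actually visited:

```java
public class ParentScoring {
  // Max-child diversification, but restricted to the children
  // the graph search happened to visit.
  static float parentScore(float[] childScores, boolean[] visited) {
    float best = Float.NEGATIVE_INFINITY;
    for (int i = 0; i < childScores.length; i++) {
      if (visited[i]) {
        best = Math.max(best, childScores[i]);
      }
    }
    return best;
  }
}
```

If P1's children score {0.3, 0.9} but HNSW only visits the first child, P1 is scored 0.3; with both children visited the true 0.9 is recovered.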

Benefits:

  • With sibling expansion, when a parent enters the result set, it's guaranteed to be scored by its actual best child, giving correct relative ranking between parents.
  • Better parent scores raise the HNSW competitive threshold faster, which is exactly why the best scenario (identical siblings) shows the smallest overhead — finding any child immediately gives the true parent score, and HNSW can prune more aggressively.

"Cons":
If HNSW has already found P1 via C1, sibling expansion just refines P1's score. It doesn't help discover P2, P3, ... Pk faster. You're spending time on a parent already in the result set instead of exploring graph edges toward new undiscovered parents.
Looking at the numbers, sibling expansion is always slower — even in the best scenario. The early termination benefit never outweighs the sibling-checking cost.

The real benefit is recall/precision, not speed.


aruggero commented May 5, 2026

Hi @benwtrent,
I worked on the sibling expansion topic you discussed via email with @alessandrobenedetti.

From your conversation:

> I wonder if we can cheat and when we find a nearest child, we simply gather and score ALL children of a parent ord, expecting them all to be near and bulk collecting and scoring them. This sort of dynamic exploration would allow min, max, and average score exploration (at some extra graph exploration cost). It might even make the baseline max score exploration faster. This will take some refactoring. I think if we made the KnnCollector interface keep track of the visited set, it could be done. It also unlocks things in Elastic search & Open search as we periodically want the nearest top paragraphs for each nearest parent doc.

Sorry for the long description of this draft PR!
I reported all the changes made for the implementation, and mostly the benchmark results.
It seems that this approach can improve the precision/recall of the returned results, but not the overall time required for the computation.

Let me know what you think about this :)

```java
assert scores != null && scores.length >= eps.length;
scorer.bulkScore(eps, scores, eps.length);
results.incVisitedCount(eps.length);
float[] siblingScores = null;
```
Member:
So, let's not do it in the entry point exploration. I think just doing max there is the best way.

Author:

Hi @benwtrent,
From what @alessandrobenedetti and I had seen, scoreEntryPoints() is done only once (multiple times only with a seeded query in knn), since we have:

org.apache.lucene.util.hnsw.AbstractHnswGraphSearcher#search
search() calling findBestEntryPoint() + searchLevel()

org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel
and searchLevel() (which calls scoreEntryPoints()) is called only on level 0

It shouldn't add too much work here, right?

Moving this to the search-only part would break some assertions on visits that we would need to manage otherwise...

Member:

scoreEntryPoints is about getting to the best approximate option in the bottom layer. I don't think it adds too much work, but I wonder whether it is helping at all; I suspect it isn't, and thus it shouldn't be done.

Comment on lines +351 to +352:

```java
// Fetch siblings BEFORE collect() so the parent is not yet in the heap
int[] siblings = null;
```
Member:

Let's reuse scratch space (it won't add much, but we definitely shouldn't be creating a new int[] on every sibling expansion.)

I would adjust siblings = expander.pendingSiblingOrdinals(node, visited, siblings); to allow reusing scratch space or expanding it and then reusing it later.

Author:

Addressed, let me know if the new implementation is what you expected.

@benwtrent (Member):

@aruggero thank you for the first pass here!

Hmmm, it is frustrating that bulk scoring the children doesn't help much. I guess it makes sense as we are forcing more scoring per node.

I do think we shouldn't do the expansion when gathering the entry point.

I wonder if there is a "hybrid" scenario where we don't expand until the collector queue is full....

Interesting numbers for sure!

```java
// HNSW nodes are identified by their ordinal (the position in the flat vector store). So when
// the searcher returns ordinal k as a graph node, docToOrd[docId] = k being correct means
// docIdToOrdinal will find the right HNSW node for any sibling docId.
private int[] buildDocToOrd(LeafReaderContext context) throws IOException {
```
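For reference, the inversion this method needs can be sketched in a self-contained way. The real implementation would walk the segment's vector values; `docIdsInOrdinalOrder` below is a hypothetical stand-in for that iteration:

```java
import java.util.Arrays;

public class DocToOrdSketch {
  // Vectors are stored densely by ordinal (0..n-1) while child docIds can be
  // sparse within the segment; invert the ord->doc order into a doc-indexed
  // lookup where -1 means "no vector for this doc".
  static int[] buildDocToOrd(int[] docIdsInOrdinalOrder, int maxDoc) {
    int[] docToOrd = new int[maxDoc];
    Arrays.fill(docToOrd, -1);
    for (int ord = 0; ord < docIdsInOrdinalOrder.length; ord++) {
      docToOrd[docIdsInOrdinalOrder[ord]] = ord;
    }
    return docToOrd;
  }
}
```

Both inputs are segment-local, which is why the mapping can be cached per segment.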
Author:

Sorry for the big comment above @benwtrent
Just to remember why some choices were made :)
Could you check if this method is correctly implemented?

We should be inside a single segment at this point, right?
And both ordinals and docids are segment-related (not globally unique), right?

@github-actions github-actions Bot added this to the 11.0.0 milestone May 8, 2026

aruggero commented May 11, 2026

Here are the new benchmark results, thanks to the scratch space reuse:

| children | k | correlation | main (ms/op) | sibling (ms/op) | overhead |
|---:|---:|---|---:|---:|---:|
| 4 | 10 | best | 0.070 ± 0.003 | 0.072 ± 0.003 | +2.9% |
| 4 | 10 | standard | 0.053 ± 0.003 | 0.058 ± 0.004 | +9.4% |
| 4 | 10 | worst | 0.050 ± 0.005 | 0.057 ± 0.003 | +14.0% |
| 4 | 100 | best | 0.400 ± 0.013 | 0.452 ± 0.016 | +13.0% |
| 4 | 100 | standard | 0.251 ± 0.012 | 0.309 ± 0.012 | +23.1% |
| 4 | 100 | worst | 0.270 ± 0.026 | 0.305 ± 0.009 | +13.0% |
| 8 | 10 | best | 0.101 ± 0.005 | 0.109 ± 0.006 | +7.9% |
| 8 | 10 | standard | 0.065 ± 0.003 | 0.078 ± 0.003 | +20.0% |
| 8 | 10 | worst | 0.064 ± 0.003 | 0.080 ± 0.003 | +25.0% |
| 8 | 100 | best | 0.642 ± 0.019 | 0.716 ± 0.017 | +11.5% |
| 8 | 100 | standard | 0.330 ± 0.027 | 0.486 ± 0.027 | +47.3% |
| 8 | 100 | worst | 0.307 ± 0.016 | 0.488 ± 0.028 | +59.0% |
| 16 | 10 | best | 0.147 ± 0.004 | 0.151 ± 0.008 | +2.7% |
| 16 | 10 | standard | 0.080 ± 0.004 | 0.109 ± 0.007 | +36.3% |
| 16 | 10 | worst | 0.075 ± 0.005 | 0.107 ± 0.002 | +42.7% |
| 16 | 100 | best | 0.985 ± 0.053 | 1.144 ± 0.040 | +16.1% |
| 16 | 100 | standard | 0.568 ± 0.022 | 0.858 ± 0.071 | +51.1% |
| 16 | 100 | worst | 0.496 ± 0.021 | 0.880 ± 0.052 | +77.4% |
| Scenario | Siblings score | Threshold rises | HNSW early exit | Previous overhead | Current overhead |
|---|---|---|---|---|---|
| best | High (nearly identical) | Fast | Yes | ~5–12% | ~3–16% |
| standard | Moderate | Moderate | Partial | ~13–60% | ~9–51% |
| worst | Random/low | Barely | No | ~12–74% | ~13–77% |

The main change worth calling out:

  • standard improved meaningfully at the top end (60% → 51%) thanks to scratch space reuse — that's the case most representative of real-world data.
  • The best lower bound dropped to 3% (nearly free for well-correlated siblings with small k).
  • The worst upper bound nudged up slightly (74% → 77%), but that's within benchmark noise at children=16, k=100.

We still have a significant overhead in general.

@benwtrent (Member):

@aruggero wow, thank you a bunch for all the benchmarks and work!

man, it is frustrating how this doesn't give us ANYTHING out of the box :( I suppose it makes sense. The benefit of HNSW is that it's a very sparse graph with big jumps, so multiplying the ops per connection is nasty. I was sort of holding out hope that we would get a cheeky perf bump :( Especially since bulk scoring is usually much faster (since vectors are also near each other in memory) and paragraphs/sub-vectors should be near one another in space already.

While I am not sure this is something we should provide OOTB for all diverse children queries, it does seem neat to be able to provide "give me parent docs based on the FURTHEST child"; this forces even the most irrelevant child to be considered.

@aruggero (Author):

@benwtrent, even if this implementation is not worth contributing, @alessandrobenedetti and I were wondering if maybe the benchmark is?

@aruggero (Author):

> @aruggero wow, thank you a bunch for all the benchmarks and work!
>
> man, it is frustrating how this doesn't give us ANYTHING out of the box :( I suppose it makes sense. The benefit of HNSW is that it's a very sparse graph with big jumps, so multiplying the ops per connection is nasty. I was sort of holding out hope that we would get a cheeky perf bump :( Especially since bulk scoring is usually much faster (since vectors are also near each other in memory) and paragraphs/sub-vectors should be near one another in space already.
>
> While I am not sure this is something we should provide OOTB for all diverse children queries, it does seem neat to be able to provide "give me parent docs based on the FURTHEST child"; this forces even the most irrelevant child to be considered.

Do you mean like sorting the results by the "best-worst" child of a parent?
E.g., the worst child of parent 1 is better than the worst child of parent 2 and therefore should come earlier in the result list?
