Optimizes reducelanes in diversityCalculation of PQVectors, for Euclidean function#514
Optimizes reducelanes in diversityCalculation of PQVectors, for Euclidean function#514shyla226 wants to merge 5 commits intodatastax:mainfrom
Conversation
a991b33 to
dd1c332
Compare
|
This PR does successfully remove the calls to reduceLanes from the inside of the loop iterating over the subspaces the Euclidean case. I do have a concern about applying this pattern to one specific case but leaving the handling of other cases as is, leaving the code with 2 different implementations for the same problem. Some questions:
|
|
@MarkWolters, thank you very much for the feedback. I can make the changes to dotProduct methods and to function calls in scoreFunctionFor() |
|
@shyla226 please be aware that in order to pass the automated GitHub action regression test any pull request must come from a branch inside the datastax/jvector repository and not a fork of the repository. So while I am willing to review the PR from a fork and provide commentary, in order for it to pass the requirements for merging it will have to be a branch and not a fork. |
|
@MarkWolters, Sounds good! I will create a branch |
|
@shyla226 I don't think you'll have sufficient permissions to push a new branch to the repository. For now you can keep the changes on a fork and they can be reviewed there while we come up with a solution to this issue. |
|
@MarkWolters , I have added new functions to include dotproduct & cosine for diversityFunction. |
|
Thanks @shyla226, I will review the updated code asap |
|
Thank you very much @MarkWolters . With the changes, when M=64, dot product shows ~30% reduction in latency & COSINE shows ~43% reduction |
|
@MarkWolters sorry, accidentally closed it! I will update the test setup used in the description above. |
|
@MarkWolters, I have addressed the issue with the 2 failed checks and tested the code changes on Windows server |
|
@shyla226 apologies for not addressing this sooner. I have created a branch named |
|
@MarkWolters Thank you very much. I will rebase the changes. |
c2a8068 to
8e85edb
Compare
| return pqScoreDotProduct_64(codebooks, subvectorSizesAndOffsets, node1Chunk, node1Offset, node2Chunk, node2Offset, subspaceCount); | ||
| } | ||
|
|
||
| float pqScoreCosine_512(VectorFloat<?>[] codebooks, int[][] subvectorSizesAndOffsets, ByteSequence<?> node1Chunk, int node1Offset, ByteSequence<?> node2Chunk, int node2Offset, int subspaceCount) { |
There was a problem hiding this comment.
This name is a mis-leading, probably better to rename this to pqScoreCosine_preferred?
| int length1 = centroidIndex1 * centroidLength; | ||
| int length2 = centroidIndex2 * centroidLength; | ||
|
|
||
| if (centroidLength == FloatVector.SPECIES_PREFERRED.length()) { |
There was a problem hiding this comment.
Why do we need this special case? It introduces a lot of duplicated code across all the functions. Wouldn’t the loop below already handle this scenario correctly?
| else if (subvectorSizesAndOffsets[0][0] >= FloatVector.SPECIES_128.length()) { | ||
| return pqScoreEuclidean_128( codebooks, subvectorSizesAndOffsets, node1Chunk, node1Offset, node2Chunk, node2Offset, subspaceCount); | ||
| } | ||
| return pqScoreEuclidean_64( codebooks, subvectorSizesAndOffsets, node1Chunk, node1Offset, node2Chunk, node2Offset, subspaceCount); |
There was a problem hiding this comment.
We currently have 4 separate implementations (pqScoreEuclidean_64/_128/_256/_512) that are structurally identical aside from the FloatVector.SPECIES_* used. This is a lot of duplicated hot-path code and increases the risk of fixes/optimizations diverging across variants.
Could we fold these into a single generic-ish implementation that takes the species as a parameter, e.g.:
pqScoreEuclideanImpl(VectorSpecies<Float> species, ...)pqScoreEuclidean(...)selectsspecies(SPECIES_PREFERRED,SPECIES_256,SPECIES_128,SPECIES_64) once and dispatches to the shared impl.
This keeps the vectorized logic in one place and avoids the confusing _512 naming given it actually uses SPECIES_PREFERRED on some paths. It should also make it much easier to keep Euclidean/Dot/Cosine implementations consistent.
|
Probably would have been better to comment in #623, apologies for the duplicate review. Ignore the comments here. |
…ged the code for clarity 3) Added documentation
This pull request introduces improvement to Euclidean similarity function in
PQVectors.diversityFunctionFor. From flamegraph, it is observed that considerable amount of time is spent injdk/incubator/vector/FloatVector.reducelanesTemplate. This is mainly becauseFloatVector.reducelanes()is expensive and it is being called inside aforloop (viaVectorUtil.squareL2Distance). Modification in this pull request moves call toreduceLanes()outside theforloop.Change proposed here was tested with the benchmark,
PQDistanceCalculationBenchmark.diversityCalculationWith this benchmark, ~18% reduction in time was observed when M=64 and ~22% when M=192.
Code modifications:
pqDiversityEuclideaninVectorUtilSupportand its corresponding implementationsforloop inPQVectors.diversityFunctionForand moved it intopqDiversityEuclideanFloatVector.reducelanes()outside theforloopTest setup:
Jvector version : main branch (as of 2025-08-28)
JDK version : openjdk version "24.0.2" 2025-07-15
Platform : INTEL(R) XEON(R) PLATINUM 8592+
Benchmark : PQDistanceCalculationBenchmark.diversityCalculation
New changes
I have modified the code to include dot product & cosine functions and implemented similar changes for scoreFunctionFor.
With the changes applied to scoreFunctionFor, when M=64: dot product shows ~30% reduction in latency & cosine shows ~43% reduction.
I can add data points for other subspace counts, if required.
Code modifications:
pqScoreEuclidean, pqScoreDotProduct, pqScoreCosineinVectorUtilSupportand its corresponding implementations fordiversityFunctionForscoreFunctionForforloop inPQVectors.diversityFunctionForandPQVectors.scoreFunctionForand moved them into respective functions inPanamaVectorUtilSupportFloatVector.reducelanes()outside theforloopMutablePQVectorsto test thisTest setup:
Jvector version : main branch (as of 2025-10-22)
JDK version : openjdk version "25.0.1" 2025-10-21
Platform : Intel(R) Xeon(R) 6979P
Benchmark :
PQDistanceCalculationMutableVectorBenchmark(Added this by replicatingPQDistanceCalculationBenchmarkto measure performance for MutablePQVectors)