Implement data-blind scalar quantization by mccullocht · Pull Request #16030 · apache/lucene

mccullocht · 2026-05-04T05:02:31Z

Add an option to the quantization format to enable or disable centering (enabled by default). When centering is disabled, we also stop writing the float vectors, which can lead to significant storage savings. Merges get special handling: we check that all of the input uses the same encoding, and transcode if some of the input is float vectors.
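
To illustrate what centering changes, here is a minimal sketch of per-vector 8-bit scalar quantization with an optional center. It is not the OSQ code in this PR; `CenteringSketch`, `Quantized`, `quantize`, and `dequantize` are hypothetical names, and the min/max interval is a simplifying assumption. The point is that with `center == null` the codes depend only on the vector itself, so nothing has to be recomputed from float vectors when the data set (and therefore its center) changes.

```java
final class CenteringSketch {
  /** Hypothetical holder: 8-bit codes plus the per-vector interval needed to decode them. */
  static final class Quantized {
    final byte[] codes;
    final float lower;
    final float upper;

    Quantized(byte[] codes, float lower, float upper) {
      this.codes = codes;
      this.lower = lower;
      this.upper = upper;
    }
  }

  /** Quantizes v to 8 bits per dimension. Pass center == null for the uncentered mode. */
  static Quantized quantize(float[] v, float[] center) {
    float[] shifted = new float[v.length];
    float lower = Float.POSITIVE_INFINITY;
    float upper = Float.NEGATIVE_INFINITY;
    for (int i = 0; i < v.length; i++) {
      // Centered mode subtracts a data-dependent center; uncentered mode uses v as-is.
      shifted[i] = center == null ? v[i] : v[i] - center[i];
      lower = Math.min(lower, shifted[i]);
      upper = Math.max(upper, shifted[i]);
    }
    float step = (upper - lower) / 255f;
    byte[] codes = new byte[v.length];
    for (int i = 0; i < v.length; i++) {
      // A constant vector has step == 0; map every component to code 0 in that case.
      int q = step == 0f ? 0 : Math.round((shifted[i] - lower) / step);
      codes[i] = (byte) Math.max(0, Math.min(255, q));
    }
    return new Quantized(codes, lower, upper);
  }

  /** Approximate inverse of quantize; center must match the one used to quantize. */
  static float[] dequantize(Quantized q, float[] center) {
    float step = (q.upper - q.lower) / 255f;
    float[] out = new float[q.codes.length];
    for (int i = 0; i < out.length; i++) {
      float x = q.lower + (q.codes[i] & 0xFF) * step;
      out[i] = center == null ? x : x + center[i];
    }
    return out;
  }
}
```

In the centered mode the decoder needs the same center that was used to encode, which is why a change of center forces requantization from the original floats.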

Large portions of this change were generated using Claude Code. I reviewed, tweaked, and tested the code before putting it up for review.

This change is made as a new codec because the format changes to drop the center vector when centering is disabled. This is not strictly necessary, since we could write a zero vector instead, but I have plans to make other format changes related to data blindness; see #16029.

luceneutil results: 1M Cohere vectors, 8-bit quantization.
before:

recall  latency(ms)  netCPU  avgCpuCount     nDoc  searchType  topK  fanout  resultSimilarity  decay  resultCount  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  filterStrategy  filterSelectivity  overSample  vec_disk(MB)  vec_RAM(MB)  bp-reorder  indexType
 0.974        2.304   2.297        0.997  1000000         KNN   100     100               N/A    N/A      100.000       64        250     8 bits     8619    132.85       7527.40          235.00             1         5047.27            null                N/A       1.000      4898.071      991.821       false       HNSW

after:

recall  latency(ms)  netCPU  avgCpuCount     nDoc  searchType  topK  fanout  resultSimilarity  decay  resultCount  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  filterStrategy  filterSelectivity  overSample  vec_disk(MB)  vec_RAM(MB)  bp-reorder  indexType
 0.972        2.281   2.274        0.997  1000000         KNN   100     100               N/A    N/A      100.000       64        250     8 bits     8612    143.06       6990.07          160.33             1         1140.98            null                N/A       1.000      4898.071      991.821       false       HNSW

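The harness extrapolates vector size from the input size, which is why vec_disk is identical in both runs, so believe the on-disk index_size number instead: 5047.27 MB before and 1140.98 MB after, about 4x smaller. Force merge is faster (235.00 s before, 160.33 s after) since we don't have to re-quantize vectors on merge. Recall is very similar (0.974 before, 0.972 after), but YMMV.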

Allow callers to disable centering at the format level, which also disables writing of float vectors since they are no longer needed. Includes a path to handle a mix of centered and uncentered segments as input. In this case the uncentered segments, which have no float vectors, will be dequantized and requantized, but this case should be relatively uncommon. Includes OSQ changes to allow a zero vector for COSINE if the vector is not a unit vector. Maybe fix this in upstream callers?
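
The merge handling described above can be sketched as a per-segment decision. This is a rough illustration only, not the merge code in this PR: `MergePlanSketch`, `Encoding`, `Action`, and `actionFor` are hypothetical names, and the assumption that centered input is always requantized from its float vectors is inferred from the description rather than taken from the implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class MergePlanSketch {
  /** How an input segment stores its vectors (hypothetical labels). */
  enum Encoding { CENTERED_WITH_FLOATS, UNCENTERED_QUANTIZED_ONLY }

  /** What the merge does with the vectors of one input segment (hypothetical labels). */
  enum Action { COPY_QUANTIZED, QUANTIZE_FROM_FLOATS, DEQUANTIZE_THEN_REQUANTIZE }

  static Action actionFor(Encoding input, boolean targetCentered) {
    if (input == Encoding.UNCENTERED_QUANTIZED_ONLY) {
      // No float vectors are available. If the output is also uncentered the codes are
      // still valid and can be copied; otherwise the only source is the dequantized codes.
      return targetCentered ? Action.DEQUANTIZE_THEN_REQUANTIZE : Action.COPY_QUANTIZED;
    }
    // Assumption: input with float vectors is transcoded from the floats, with or
    // without a center depending on the target configuration.
    return Action.QUANTIZE_FROM_FLOATS;
  }

  /** Applies actionFor to every input segment, preserving order. */
  static List<Action> plan(List<Encoding> inputs, boolean targetCentered) {
    List<Action> actions = new ArrayList<>();
    for (Encoding e : inputs) {
      actions.add(actionFor(e, targetCentered));
    }
    return actions;
  }
}
```

Under these assumptions, the cheap path is the homogeneous uncentered merge, where every segment resolves to `COPY_QUANTIZED`; the lossy `DEQUANTIZE_THEN_REQUANTIZE` path only appears when uncentered input is merged into a centered output.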

mccullocht added 7 commits May 2, 2026 21:51

audit of format + reader

af00705

fix check

fc84a81

fix field writer to better match lucene99 raw writer

f37bdab

move to lucene105

818fe85

change wire format to omit center on uncentered data sets

153c28c

propagate parameter to hnsw codec

27c5861

mccullocht added this to the 10.5.0 milestone May 4, 2026

github-actions Bot added the module:core/codecs label May 4, 2026

ch-ch-ch-changes

b9c9cc6

mccullocht wants to merge 8 commits into apache:main from mccullocht:sq-data-blind
