diff --git a/proposed/0033-block-turboquant.md b/proposed/0033-block-turboquant.md index a41d244..1338e83 100644 --- a/proposed/0033-block-turboquant.md +++ b/proposed/0033-block-turboquant.md @@ -6,14 +6,19 @@ ## Summary -We propose modifying the TurboQuant vector quantization encoding to use uniform -64-dimensional blocks with per-block rotations instead of a single full-dimension -SRHT. Both the MSE quantization and the QJL correction operate per-block, -completely eliminating zero-padding waste and the associated QJL variance -inflation for non-power-of-2 dimensions. The lowest-level TurboQuant array -operates only on unit-norm vectors, with norms externalized. In a second stage, -the block structure enables a PDX-style dimension-major code layout for -SIMD-friendly vectorized distance computation. +We propose modifying the [TurboQuant vector quantization encoding][current-impl] +to use uniform blocks of a tunable power-of-2 size B (to be experimentally +determined; expected range 64-256) with per-block SORF rotations for the MSE +stage. The QJL correction uses per-block Gaussian projection matrices as +prescribed by [1], not SORF — this eliminates the need for power-of-2 padding in +the QJL stage. The lowest-level TurboQuant array operates only on unit-norm +vectors, with norms externalized. In a second stage, a PDX-style +dimension-major code layout within groups of 64 vectors enables SIMD-friendly +vectorized distance computation. The TQ block size B (for quantization quality) +and the PDX vector group size (always 64, for SIMD utilization) are independent +parameters. + +[current-impl]: https://github.com/vortex-data/vortex/pull/7167 ## Background @@ -30,21 +35,30 @@ embeddings. It works by: 3. Optionally adding a 1-bit QJL (Quantized Johnson-Lindenstrauss) correction on the residual for unbiased inner product estimation (Theorem 2 in [1]). 
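To make the quantization step of the pipeline concrete, here is a minimal Rust sketch of per-coordinate codebook quantization. The helper names and the toy codebook are ours, not the actual implementation; the real codebook comes from Max-Lloyd fitting to the post-rotation marginal distribution.

```rust
/// Quantize each rotated coordinate to its nearest centroid in a shared
/// codebook. The centroids passed in are illustrative placeholders.
fn nearest_centroid(x: f32, centroids: &[f32]) -> u8 {
    let mut best = 0usize;
    let mut best_dist = f32::INFINITY;
    for (i, &c) in centroids.iter().enumerate() {
        let dist = (x - c).abs();
        if dist < best_dist {
            best_dist = dist;
            best = i;
        }
    }
    best as u8
}

/// Encode a rotated (unit-norm) vector to per-coordinate codes.
fn encode_mse(rotated: &[f32], centroids: &[f32]) -> Vec<u8> {
    rotated.iter().map(|&x| nearest_centroid(x, centroids)).collect()
}

/// Decode codes back to the quantized reconstruction (before inverse rotation).
fn decode_mse(codes: &[u8], centroids: &[f32]) -> Vec<f32> {
    codes.iter().map(|&c| centroids[c as usize]).collect()
}
```

With a toy 2-bit codebook `[-0.3, -0.1, 0.1, 0.3]`, `encode_mse(&[0.25, -0.05], …)` yields codes `[3, 1]`, which decode to `[0.3, -0.1]`.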
-The paper prescribes a full random orthogonal rotation via QR decomposition of a -Gaussian matrix — O(d²) storage and O(d²) per-vector. Our current -implementation substitutes a Structured Random Hadamard Transform (SRHT) for -O(d) storage and O(d log d) per-vector, following the well-established use of -SRHT as a fast approximate random rotation [2, 3]. +The paper prescribes a full random orthogonal rotation (QR of Gaussian) for the +MSE stage — O(d²) storage and O(d²) per-vector. For the QJL stage, the paper +uses a random Gaussian projection matrix S with i.i.d. N(0,1) entries (not an +orthogonal rotation); this distinction matters for the unbiasedness proof. + +Our [current implementation][current-impl] substitutes a 3-round Structured +Orthogonal Random Features (SORF) transform `HD₃·HD₂·HD₁` [5] for both the MSE +rotation and the QJL projection, giving O(d) storage and O(d log d) per-vector. +The 3-round SORF construction was introduced for kernel approximation [5] and +approximates a random orthogonal matrix. Note that this is distinct from the +single-round SRHT (`R·H·D`) analyzed by Tropp [3] and the FJLT (`P·H·D`) of +Ailon-Chazelle [2], both of which are dimensionality-reducing projections. Using +SORF for QJL (where the paper prescribes Gaussian S) is an additional +approximation that may affect the unbiasedness guarantee of Theorem 2. ### Current limitations -The SRHT requires power-of-2 input dimension. For non-power-of-2 dimensions +The SORF requires power-of-2 input dimension. For non-power-of-2 dimensions (e.g., 768-d embeddings from common transformer models), the input is zero-padded to the next power of 2 (1024). This causes: - **33% wasted storage** for 768-d vectors: 1024 quantized codes stored per vector when only 768 carry information. -- **QJL variance inflation**: The QJL correction also uses a padded SRHT. Of +- **QJL variance inflation**: The QJL correction also uses a padded SORF. 
Of
the 1024 QJL sign bits, 256 encode information about the zero-padded region
rather than the actual residual. The estimator remains unbiased in expectation,
but the wasted measurement budget inflates variance. Empirically, this
@@ -65,41 +79,50 @@ of 64 is empirically optimal across AVX-512, AVX2, and NEON architectures [4].

## Proposal

-### Stage 1: 64-dim block decomposition
+### Stage 1: Tunable block decomposition

-Partition d dimensions into `⌈d/64⌉` blocks of 64 dimensions each. The last
-block ("straggler") may have fewer than 64 dimensions — see Straggler handling
-below. Each full block gets an independent 64-dim SRHT rotation.
+Partition d dimensions into `⌈d/B⌉` blocks of B dimensions each, where B is a
+power-of-2 block size (configurable; expected range 64-256, to be determined
+experimentally). The last block ("straggler") may have fewer than B dimensions
+— see Straggler handling below. Each full block gets an independent B-dim SORF
+rotation for the MSE stage.
+
+Larger B gives better quantization quality (more concentrated coordinate
+distribution, better SORF mixing) and fewer per-block norms to store; its costs
+are larger worst-case straggler padding (up to B-1 dimensions) and larger
+shared matrices for the dense-rotation and Gaussian-QJL variants. Smaller B
+means more blocks and more norm overhead, and each block's wider coordinate
+distribution sits at a less favorable point on the distortion curve. The
+optimal B should be determined experimentally — see Experimental plan.

**Key properties:**

-- **64 is a power of 2**, so every full block has zero padding waste. Common
- embedding dimensions (128, 256, 384, 512, 768, 1024, 1536, 2048, 3072, 4096)
- are all divisible by 64, so the straggler case is rare.
+- **B is a power of 2**, so every full block has zero SORF padding waste.
+ Common embedding dimensions are divisible by 64 and 128; most are also
+ divisible by 256. The straggler case is rare for well-chosen B.
- **One shared centroid set** for full blocks. All full blocks use the same
- 64-dim marginal distribution `(1 - x²)^(30.5)`, so a single Max-Lloyd
- codebook serves every full block.
+ B-dim marginal distribution, so a single Max-Lloyd codebook serves every full + block. - **Unit-norm assumption.** The TurboQuant array operates only on pre-normalized sub-vectors. Per-block norms are externalized (see Norm architecture below). - This simplifies the encoding and enables efficient quantized-domain operations - without norm lookups. -- **Per-block rotation signs.** Each block's SRHT is independent (different - seed). Signs are 3 × 64 = 192 bits = 24 bytes per block. For 768-d: 12 × 24 - = 288 bytes total (less than the current 3 × 1024 = 384 bytes for a single - padded SRHT). +- **Per-block SORF rotation signs.** Each block's MSE SORF is independent + (different seed). Signs are 3 × B bits per block. For d=768 with B=128: + 6 blocks × 3 × 128 = 2304 bits = 288 bytes. #### Straggler handling -If `d % 64 ≠ 0`, the final block has `d mod 64` dimensions. The straggler block -uses a **dense random orthogonal matrix** (QR of Gaussian) instead of SRHT, -since SRHT requires power-of-2 dimension. At straggler dimensions ≤ 63, the -dense rotation is at most 63×63×4 = 15.9 KB — negligible. The straggler also -gets its own centroid set (computed for its actual dimension), since the -coordinate distribution differs from the 64-dim full blocks. +If `d % B ≠ 0`, the final block has `d mod B` dimensions. Initially, the +straggler is **zero-padded to B** and uses the same SORF and centroids as all +other blocks. This is the simplest approach and reuses all existing +infrastructure. The padding waste is bounded by B-1 dimensions — much smaller +than the current full-dimension padding (up to d-1 dimensions). + +As a follow-up experiment, we will test using a **dense random orthogonal +matrix** (QR of Gaussian) at the straggler's actual dimension with its own +centroid set. This eliminates the straggler's padding waste entirely and may +allow increasing B (since the straggler cost is bounded). 
This requires separate +storage for the straggler rotation matrix and centroids. #### Zero-norm sub-vectors -When splitting a vector into 64-dim blocks, some blocks may have zero norm +When splitting a vector into B-dim blocks, some blocks may have zero norm (e.g., sparse embeddings, or dimensions that are structurally unused). The encoding must handle ‖xₖ‖ = 0 explicitly: skip rotation and quantization for that block, store norm = 0, and decode as all zeros. The codes for a zero-norm @@ -126,92 +149,98 @@ The Pythagorean identity `Σ_k ‖xₖ‖² = ‖x‖²` ensures the weights sum Blocks with ‖xₖ‖ = 0 contribute zero error and are excluded from the sum. **Practical MSE vs. theoretical bound.** The bound `(√3π/2)/4^b` is a worst-case -upper bound that is dimension-independent. However, the **actual** MSE may -depend on dimension. At higher dimensions (d=768), the coordinate distribution -after rotation is more concentrated (variance ~1/768), and the Max-Lloyd -quantizer exploits this concentration — the actual MSE can be well below the -bound. At d=64, the distribution is wider (variance ~1/64), and the quantizer -has less concentration to exploit. - -The block decomposition preserves the theoretical bound but may give worse -**practical** MSE than a single full-dimension rotation, because each 64-dim +upper bound that is dimension-independent. However, the **actual** MSE depends +on the block dimension B. At higher dimensions, the coordinate distribution +after rotation is more concentrated (variance ~1/B), and the Max-Lloyd quantizer +exploits this concentration — the actual MSE can be well below the bound. At +smaller B, the distribution is wider and the quantizer has less concentration to +exploit. + +Block decomposition preserves the theoretical bound but may give worse +**practical** MSE than a single full-dimension rotation, because each B-dim block operates at a less favorable point on the distortion curve. 
This must be -quantified empirically: compare actual normalized MSE for d=64 blocks vs. -d=768 single-rotation at the same bit width. If the practical gap is -significant (e.g., >2x), a larger block size (128 or 256) would narrow it. - -**SRHT approximation caveat.** Theorems 1 and 2 in [1] are proved for true -random orthogonal matrices (QR of Gaussian), not SRHT. Our use of SRHT is a -structured approximation that does not strictly satisfy the theorems' -assumptions. This is the same approximation the current single-SRHT -implementation makes — block decomposition does not introduce a new gap. However, -the approximation quality depends on dimension: at d=1024 the 3-round SRHT -closely approximates a random rotation, while at d=64 the approximation is -weaker (fewer mixing stages). This means the Max-Lloyd centroids (computed for -the analytical Beta distribution) may be slightly suboptimal for the actual SRHT -coordinate distribution at d=64, and the QJL scale factor `√(π/2)/B` may not -exactly cancel the SRHT's expected absolute coordinate value. - -Empirical validation is needed to confirm that: (a) the MSE bound holds in -practice at d=64 with SRHT, (b) the QJL bias at d=64 is acceptable, and (c) -3 rounds of D·H is sufficient mixing at d=64. The number of SRHT rounds directly -affects mixing quality: at d=1024, 3 rounds provides 3 × log₂(1024) = 30 -mixing stages, while at d=64 it provides only 3 × log₂(64) = 18. The coordinate -distribution should be tested at 3, 4, and 5 rounds to determine the minimum -sufficient for near-uniform mixing at d=64. - -If the 3-round SRHT proves insufficient at d=64, fallback options include: - -- **More SRHT rounds.** Increasing from 3 to 4 or 5 rounds of D·H improves - mixing quality at modest O(d log d) cost. -- **Larger block size.** Using B=128 instead of B=64 — still a power of 2, still - divides all common embedding dimensions, and the SRHT approximation is better - at d=128 than d=64. 
PDX Stage 2 would need to handle the dimension-block vs. - vector-block size mismatch (128-dim blocks with 64-vector groups). -- **Dense rotation at d=64.** Use a 64×64 random orthogonal matrix (QR of - Gaussian) instead of SRHT for all blocks. At 64×64×4 = 16 KB this is - negligible storage. This gives exact theoretical guarantees at the cost of - O(d²) per-block rotation instead of O(d log d). For d=64 the difference is - small: 4096 FLOPs (matmul) vs. ~384 FLOPs (SRHT). Note: if using the dense - fallback, each block should still have an **independent** rotation matrix - (generated from different seeds). Sharing a single rotation matrix across - blocks introduces correlation between quantization errors, which doesn't - affect the MSE bound but weakens the QJL independence assumption — the QJL - rotation must be independent of the MSE rotation for Theorem 2's unbiasedness - guarantee. - -#### Per-block QJL correction - -The QJL correction also operates per-block, using an independent 64-dim -rotation per block (distinct from the MSE rotation). For each block, the QJL -computes `sign(Sₖ × rₖ)` where `rₖ = xₖ - x̂ₖ` is the per-block residual and -`Sₖ` is the block's QJL rotation. The overall inner product estimator is: +quantified empirically (see Experimental plan). + +**SORF approximation caveat.** Theorems 1 and 2 in [1] are proved for true +random orthogonal matrices (QR of Gaussian), not SORF. The 3-round SORF +construction `HD₃·HD₂·HD₁` [5] is a structured approximation. The +approximation quality depends on dimension: 3 rounds provides 3 × log₂(B) +mixing stages (18 at B=64, 21 at 128, 24 at 256, 30 at 1024). Empirical +validation is needed for each candidate B — see Experimental plan. + +**Fallback: dense rotation.** If SORF proves insufficient at the chosen B, use a +B × B random orthogonal matrix (QR of Gaussian) instead. Storage at B=128: +128×128×4 = 64 KB per block. With k blocks for MSE: total = k × 64 KB. 
For
+d=768, B=128 (k=6): 384 KB of MSE rotation matrices. This is significant for
+small columns (1K vectors at 3 MB uncompressed ≈ 13% overhead) but amortizes to
+negligible for large columns (100K+ vectors). Each block must have an
+**independent** rotation matrix (different seeds) to avoid correlation between
+quantization errors across blocks.
+
+#### QJL correction
+
+The QJL stage in the TurboQuant paper [1] uses a random Gaussian projection
+matrix S ∈ ℝ^(d×d) with i.i.d. N(0,1) entries — **not** an orthogonal rotation.
+The unbiasedness proof (Lemma 4 in [1]) and the scale factor `√(π/2)/d` are
+derived specifically for Gaussian S. The [current implementation][current-impl]
+uses SORF instead, which is an additional approximation.
+
+The QJL correction has several possible strategies, differing in projection
+type (Gaussian vs SORF), scope (per-block vs full-dimension), and whether QJL
+is used at all. The choice significantly affects correctness guarantees,
+variance, storage, and speed.
+
+**Per-block Gaussian QJL** is the paper-correct approach: each block gets an
+independent Gaussian projection Sₖ ∈ ℝ^(B×B) with i.i.d. N(0,1) entries. Since
+Gaussian matrices work at any dimension, no padding is needed for the QJL stage.
+Each block's QJL is provably unbiased by Lemma 4 in [1], and the sum over
+blocks is also unbiased: `E[<r̂, y>] = <r, y>`, where r̂ is the QJL
+reconstruction of the residual r. However, the per-block variance is **d/B
+times higher** than full-dimension QJL:

```
-<x̃, y> = Σ_k <x̃ₖ, yₖ>
+Per-block (B-dim blocks): Var[<r̂, y>] ≤ (π / (2B)) × ‖y‖²
+Full-dim (d-dim):         Var[<r̂, y>] ≤ (π / (2d)) × ‖y‖²
```

-Each block's correction is an independent unbiased estimate of `<rₖ, yₖ>`, so
-the sum is unbiased: `E[<x̃, y>] = <x, y>`. Theorem 2 of [1] applies
-per-block because the QJL rotation `Sₖ` is independent of the MSE quantization
-noise in block k.
-
-Compared to a single full-dimension QJL with padding (768→1024), per-block QJL:
-
-- **Wastes zero bits.** All 768 sign bits encode actual residual information,
- vs.
768 useful out of 1024 for the padded approach (25% waste eliminated). -- **Eliminates variance inflation.** No measurement budget is diluted by - zero-padded coordinates. -- **Reuses the same rotation infrastructure** as the MSE stage (64-dim blocks), - simplifying the implementation. -- **Trades cross-block mixing for efficiency.** A full-dimension rotation would - mix residual information across blocks, potentially reducing variance. But - this benefit is outweighed by the concrete information loss from padding waste - in non-power-of-2 dimensions. +At d=768, B=128: 6× more variance. Storage: B×B×4 bytes per block (384 KB for +k=6 at B=128). Encode/decode cost: O(B²) matmul per block. + +**Per-block SORF QJL** substitutes a B-dim SORF (`HD₃·HD₂·HD₁` [5]) for the +Gaussian matrix. This is NOT theoretically justified — Lemma 4 requires Gaussian +or Haar-orthogonal S, and SORF is only an approximation to Haar measure. +However, the [current implementation][current-impl] already uses SORF for QJL at +d=1024 with acceptable results (~11% mean relative error for power-of-2 dims), +demonstrating practical viability. The tradeoff vs Gaussian is compelling: +O(B log B) speed (5× faster at B=128), O(B) storage (over 1000× less). Quality +at B=128 needs validation — with only 21 mixing stages, the approximation to +Haar measure is weaker than at d=1024 (30 stages). + +**Full-dimension padded SORF QJL** applies a single SORF at the padded +dimension (e.g., 1024 for d=768) to the full residual vector `r = x - x̂`, +matching the [current implementation][current-impl]. The higher dimension gives +better SORF-to-Haar convergence (30 mixing stages at d=1024 vs 21 at B=128) and +full-dimension variance `(π/(2d))·‖y‖²`, but wastes `(padded_d - d)/padded_d` +of the sign bits on zero-padded coordinates (25% at 768→1024). This approach +requires computing the full residual from all blocks before applying QJL, +adding a full-dimension decode step to the encode path. 
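All of these strategies share the same encode/decode shape: sign bits of a projected residual plus the residual norm in, a scaled `Projᵀ × s` correction out. A minimal Rust sketch at a toy block size B = 4 follows; the identity matrix stands in for a seeded per-block draw (a real implementation would sample Sₖ i.i.d. N(0,1) or apply SORF), and all names here are ours.

```rust
const B: usize = 4; // toy block size for illustration only

fn sign(x: f32) -> f32 { if x >= 0.0 { 1.0 } else { -1.0 } }

/// Proj × v.
fn matvec(m: &[[f32; B]; B], v: &[f32; B]) -> [f32; B] {
    let mut out = [0.0; B];
    for i in 0..B {
        for j in 0..B {
            out[i] += m[i][j] * v[j];
        }
    }
    out
}

/// Projᵀ × v (transpose apply, used on the decode side).
fn matvec_t(m: &[[f32; B]; B], v: &[f32; B]) -> [f32; B] {
    let mut out = [0.0; B];
    for i in 0..B {
        for j in 0..B {
            out[j] += m[i][j] * v[i];
        }
    }
    out
}

/// Encode: sign bits of Proj × r plus the residual norm γ = ‖r‖.
fn qjl_encode(proj: &[[f32; B]; B], r: &[f32; B]) -> ([f32; B], f32) {
    let p = matvec(proj, r);
    let mut signs = [0.0; B];
    for i in 0..B {
        signs[i] = sign(p[i]);
    }
    let gamma = r.iter().map(|x| x * x).sum::<f32>().sqrt();
    (signs, gamma)
}

/// Decode: r̂ = (√(π/2) / B) × γ × Projᵀ × s, matching the correction formula.
fn qjl_decode(proj: &[[f32; B]; B], signs: &[f32; B], gamma: f32) -> [f32; B] {
    let scale = (std::f32::consts::PI / 2.0).sqrt() / B as f32 * gamma;
    let st = matvec_t(proj, signs);
    let mut out = [0.0; B];
    for i in 0..B {
        out[i] = scale * st[i];
    }
    out
}
```

With the identity projection and residual `[3, -4, 0, 0]` (γ = 5), decode returns `(√(π/2)/4) × 5 × [1, -1, 1, 1]`: each corrected coordinate points in the residual's direction, and the scale factor is where the Gaussian-vs-SORF distinction matters.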
+
+**MSE-only (no QJL)** skips the QJL correction entirely. Inner product
+estimates are biased, but this may be acceptable for ANN ranking, where
+relative ordering matters more than absolute accuracy. It eliminates all QJL
+storage and computation.
+
+**QJL strategy options** (to be experimentally compared):
+
+| Strategy             | Theoretical       | Variance       | Padding waste | Storage         | Speed                |
+| -------------------- | ----------------- | -------------- | ------------- | --------------- | -------------------- |
+| Per-block Gaussian   | Correct (Lemma 4) | (π/(2B))·‖y‖²  | None          | k×B²×32 bits    | O(B²) per block      |
+| Per-block SORF       | Approximate       | ~(π/(2B))·‖y‖² | None          | k×3×B bits      | O(B log B) per block |
+| Full-dim padded SORF | Approximate       | ~(π/(2d))·‖y‖² | (pad-d)/pad   | 3×padded_d bits | O(d log d)           |
+| MSE-only             | N/A               | N/A            | N/A           | None            | 0                    |

#### Norm architecture

-The TurboQuant array itself operates only on unit-norm 64-dim sub-vectors. Norms
+The TurboQuant array itself operates only on unit-norm B-dim sub-vectors. Norms
are externalized into a separate child array, following the pattern established
by the NormVector encoding (PR #7251).

@@ -229,85 +258,64 @@ encoding to them. The cascading compressor treats norms like any other float
column and is free to re-encode them with ALP, Pco, FastLanes, or other float
compression schemes.

-**Benefits of externalizing norms:**
-
-- **L2 norm**: Read the block norms array directly, compute `√(Σ_k ‖xₖ‖²)`.
- This is O(k) per vector — modest cost compared to full decompression.
-- **Cosine similarity**: Requires per-block norm weighting (see below), but the
- norms tensor can be read once and reused for all pairs.
-- **Dot product**: `<a, b> = Σ_k ‖aₖ‖ · ‖bₖ‖ · <âₖ, b̂ₖ>` where `<âₖ, b̂ₖ>`
- is the quantized unit-norm dot product for block k. The norms tensor provides
- the weights.
-

#### Quantized-domain operations with per-block norms

-With block decomposition and externalized norms, quantized cosine similarity
-requires more work than the current single-block approach.
-
-**Quantized dot product:**
-
-```
-<a, b> ≈ Σ_k ‖aₖ‖ · ‖bₖ‖ · Σ_j centroids[code_aₖ[j]] · centroids[code_bₖ[j]]
-```
-
-Per-block: compute the unit-norm quantized dot product (sum over 64 centroid
-products), then weight by both vectors' block norms.
-
-**Quantized cosine similarity:**
-
-```
-cos(a, b) ≈ <a, b> / (‖a‖ · ‖b‖)
-          = Σ_k ‖aₖ‖ · ‖bₖ‖ · unit_dotₖ
-            / (√(Σ_k ‖aₖ‖²) · √(Σ_k ‖bₖ‖²))
-```
-
-This requires reading all block norms for both vectors, computing per-block
-weighted dot products, and normalizing by global norms. The norms tensor should
-be read once per scan query and cached.
+All quantized-domain operations require reading the block norms for both
+operands. This is a cost increase vs. the [current implementation][current-impl],
+which stores a single global norm per vector.

-For the PDX scan kernel (Stage 2), the per-block norm weighting integrates
-naturally: after accumulating the unit-norm dot product for each block of 64
-dimensions, multiply by the query and data block norms before moving to the
-next block.
+- **L2 norm**: Read block norms, compute `√(Σ_k ‖xₖ‖²)`. O(k) per vector — a
+  regression from the current O(1) single-norm read, but modest compared to
+  full decompression.
+- **Dot product**: `<a, b> ≈ Σ_k ‖aₖ‖·‖bₖ‖ · Σ_j centroids[code_aₖ[j]] ·
+  centroids[code_bₖ[j]]`. Per-block: compute the unit-norm quantized dot
+  product (sum of B centroid products), then weight by both vectors' block
+  norms.
+- **Cosine similarity**: `cos(a, b) ≈ (Σ_k ‖aₖ‖·‖bₖ‖·unit_dotₖ) /
+  (√(Σ_k ‖aₖ‖²) · √(Σ_k ‖bₖ‖²))`. Requires global norms reconstructed from
+  block norms. The norms tensor should be read once per scan query and cached.
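A scalar reference implementation of the norm-weighted quantized dot product can be sketched as follows. This is a Rust illustration with toy sizes; the function and parameter names are ours, not the actual Vortex kernel.

```rust
/// <a, b> ≈ Σ_k ‖aₖ‖·‖bₖ‖ · Σ_j centroids[code_aₖ[j]] · centroids[code_bₖ[j]]
///
/// `codes_*` hold `block_size` codes per block, concatenated over blocks;
/// `norms_*` hold one externalized norm per block.
fn quantized_dot(
    codes_a: &[u8],
    norms_a: &[f32],
    codes_b: &[u8],
    norms_b: &[f32],
    centroids: &[f32],
    block_size: usize,
) -> f32 {
    let mut total = 0.0;
    for (blk, (na, nb)) in norms_a.iter().zip(norms_b).enumerate() {
        // Unit-norm quantized dot product for this block.
        let mut unit_dot = 0.0;
        for j in 0..block_size {
            let idx = blk * block_size + j;
            unit_dot += centroids[codes_a[idx] as usize] * centroids[codes_b[idx] as usize];
        }
        // Weight by both operands' block norms.
        total += na * nb * unit_dot;
    }
    total
}
```

For example, with a 1-bit codebook `[-1.0, 1.0]`, two blocks of size 2, all-ones codes on both sides, and block norms `[2, 1]` and `[3, 1]`, the result is `2·3·2 + 1·1·2 = 14`.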
#### Encoding algorithm ``` -Input: x ∈ ℝ^d, bit_width b, block_size B=64 +Input: x ∈ ℝ^d, bit_width b, block_size B (power of 2) k = ⌈d/B⌉ # Block split and normalize for i in 0..k: - xᵢ = x[i*B .. min((i+1)*B, d)] + xᵢ = x[i*B .. min((i+1)*B, d)], zero-pad to B if straggler nᵢ = ‖xᵢ‖ if nᵢ > 0: - ûᵢ = xᵢ / nᵢ (unit normalize, pad to B if straggler) + ûᵢ = xᵢ / nᵢ else: - ûᵢ = zeros(B) (zero-norm block: skip quantization) + ûᵢ = zeros(B) (zero-norm block: skip quantization) -# MSE stage (per block) +# MSE stage (per block, SORF rotation) for i in 0..k: if nᵢ > 0: - rᵢ = Rotateᵢ(ûᵢ) (SRHT for full blocks, dense for straggler) - cᵢ[j] = nearest_centroid(rᵢ[j]) (shared codebook for full blocks, own for straggler) + rᵢ = SORFᵢ(ûᵢ) (3-round HD, independent per block) + cᵢ[j] = nearest_centroid(rᵢ[j]) (shared codebook) else: - cᵢ[j] = 0 (arbitrary, never used in decode) + cᵢ[j] = 0 Store: codes (k × B per vector), block_norms (k per vector), - centroids (shared), rotation params (per block, shared across vectors) + centroids (shared), SORF signs (k × 3 × B, shared across vectors) -# QJL stage (per block, optional) +# QJL stage (optional, one of four strategies) for i in 0..k: if nᵢ > 0: - x̂ᵢ = decode_mse_block(cᵢ, centroids, Rotateᵢ) (unit-norm reconstruction) - rᵢ = ûᵢ - x̂ᵢ (per-block unit-norm residual) - γᵢ = ‖rᵢ‖ (per-block residual norm) - sᵢ = sign(Rotate_qjlᵢ(rᵢ)) (independent rotation per block) + x̂ᵢ = decode_mse_block(cᵢ, centroids, SORFᵢ) + rᵢ = ûᵢ - x̂ᵢ (per-block unit-norm residual) + γᵢ = ‖rᵢ‖ + sᵢ = sign(Projᵢ × rᵢ) (see QJL strategy options) else: γᵢ = 0, sᵢ = zeros -Store: qjl_signs (k × B per vector), qjl_residual_norms (k per vector), - qjl_rotation params (per block, shared across vectors) +# Projᵢ is one of: +# - Sᵢ ∈ ℝ^(B×B) Gaussian i.i.d. 
N(0,1) (per-block Gaussian) +# - SORF_qjlᵢ at B-dim (per-block SORF) +# - SORF_qjl at padded_d-dim on full r (full-dim padded SORF) + +Store: qjl_signs, qjl_residual_norms, projection params (strategy-dependent) ``` #### Decoding algorithm @@ -315,131 +323,105 @@ Store: qjl_signs (k × B per vector), qjl_residual_norms (k per vector), ``` # MSE decode (produces unit-norm sub-vectors) for i in 0..k: - r̂ᵢ[j] = centroids[cᵢ[j]] (codebook lookup) - ûᵢ = Rotate⁻¹ᵢ(r̂ᵢ) (inverse rotation) + r̂ᵢ[j] = centroids[cᵢ[j]] + ûᵢ = SORF⁻¹ᵢ(r̂ᵢ) # QJL correction (if present) for i in 0..k: if γᵢ > 0: - correctionᵢ = (√(π/2) / Bᵢ) × γᵢ × Rotate⁻¹_qjlᵢ(sᵢ) + correctionᵢ = (√(π/2) / dim_proj) × γᵢ × Projᵢᵀ × sᵢ ûᵢ += correctionᵢ +# dim_proj = B for per-block strategies, padded_d for full-dim strategy # Denormalize and concatenate (using externalized norms) for i in 0..k: x̂ᵢ = nᵢ × ûᵢ -x̃ = concat(x̂₀, x̂₁, ..., x̂ₖ₋₁)[0..d] (truncate straggler padding) +x̃ = concat(x̂₀, x̂₁, ..., x̂ₖ₋₁)[0..d] ``` ### Stage 2: PDX dimension-major layout In a second stage, we transpose the code storage from row-major to -dimension-major within groups of 64 vectors, following the PDX layout [4]. +dimension-major within groups of 64 vectors, following the PDX layout [4]. The +64-vector group size is independent of the TQ block size B. **Row-major (Stage 1):** ``` -Vector 0: [b0_d0 b0_d1 ... b0_d63 | b1_d0 ... b1_d63 | ... ] -Vector 1: [b0_d0 b0_d1 ... b0_d63 | b1_d0 ... b1_d63 | ... ] +Vector 0: [tq0_d0 ... tq0_d(B-1) | tq1_d0 ... tq1_d(B-1) | ... ] +Vector 1: [tq0_d0 ... tq0_d(B-1) | tq1_d0 ... tq1_d(B-1) | ... ] ... ``` -All codes for a single vector are contiguous. Good for per-vector decode, bad -for vectorized scans. - **PDX (Stage 2), within each 64-vector chunk:** ``` -Block 0, dim 0: [v0 v1 v2 ... v63] -Block 0, dim 1: [v0 v1 v2 ... v63] +TQ block 0, dim 0: [v0 v1 v2 ... v63] +TQ block 0, dim 1: [v0 v1 v2 ... v63] ... -Block 0, dim 63: [v0 v1 v2 ... v63] -Block 1, dim 0: [v0 v1 v2 ... 
v63] +TQ block 0, dim (B - 1): [v0 v1 v2 ... v63] +TQ block 1, dim 0: [v0 v1 v2 ... v63] ... ``` All codes for the same dimension across 64 vectors are contiguous. The inner -loop (over vectors) has no inter-vector data dependencies and auto-vectorizes -into SIMD. +loop (over 64 vectors) has no inter-vector data dependencies and +auto-vectorizes into SIMD. TQ block boundaries affect only where norm weighting +occurs in the distance kernel — they don't affect the PDX transpose itself. -This creates a natural **64×64 tile** structure: 64 dimension-block coordinates -× 64 vectors per chunk. Each tile's codes are contiguous in memory. +This creates **B × 64 tiles**: B dimension coordinates × 64 vectors per chunk. #### Relationship to chunking -The PDX 64-vector groups are effectively chunks. The array is logically a -sequence of chunks, where each chunk contains 64 vectors (the last chunk may -be smaller). Within each chunk, codes and QJL signs are stored in -dimension-major order. - -Shared metadata (centroids, rotation signs/matrices) lives at the array level, -not per-chunk. Each chunk is a view over the same codebook and rotations. +The PDX 64-vector groups are effectively chunks. Shared metadata (centroids, +rotation signs/matrices) lives at the array level, not per-chunk. -**Open design question:** The cleanest model may be to make each 64-vector -chunk a self-contained TurboQuant array of 64-dim unit-norm vectors -(dimension-major internally), with the chunk-level array managing the sequence -of chunks and the shared metadata. This interacts with Vortex's existing -chunked/sequence array patterns and needs further design work. Key questions: +**Open design questions:** - Should slice/take produce row-major or PDX results? Row-major is simpler - (un-transpose on the fly) but loses the scan benefit. PDX is preservable - only for aligned 64-vector slices. + (un-transpose on the fly) but loses the scan benefit. 
- Should the PDX flag be a property of the encoding or a separate layout layer?
-- How does the PDX transpose interact with the cascading compressor? The
- compressor sees the codes child array — should it see the transposed or
- un-transposed version?
+- How does the PDX transpose interact with the cascading compressor?

#### Quantized distance computation

-With PDX layout and a precomputed distance lookup table, the inner loop for
-quantized dot product becomes:
-
```rust
// Precompute: dist_table[i][j] = centroids[i] * centroids[j]
// At b=4: 16×16 = 256 floats = 1KB, fits in L1 cache
-for block in 0..num_blocks {
-    let qa = query_codes[block * 64 + dim];
-    for dim in 0..64 {
-        let row = &dist_table[qa as usize];
-        for v in 0..64 { // SIMD-friendly: no inter-vector dependencies
-            distances[v] += row[codes[offset + v] as usize];
-        }
-    }
-    // Weight by query_block_norm × data_block_norms[v] for each v
+for tq_block in 0..num_tq_blocks {
+    let mut block_acc = [0.0f32; 64];           // unit-norm accumulator for this TQ block
+    for dim in 0..B {
+        let qd = query_codes[tq_block * B + dim];
+        let row = &dist_table[qd as usize];
+        let offset = (tq_block * B + dim) * 64; // dimension-major: 64 codes per dimension
+        for v in 0..64 { // SIMD-friendly: no inter-vector deps
+            block_acc[v] += row[codes[offset + v] as usize];
+        }
+    }
+    // Weight by query_block_norm × data_block_norms[v] for this TQ block
+    for v in 0..64 {
+        distances[v] += query_block_norms[tq_block] * data_block_norms[v][tq_block] * block_acc[v];
+    }
}
```

-This is a table-lookup-and-accumulate kernel — a single dependent load plus an
-FP add per element, with the table row fitting in L1. PDX [4] reports an average
-2x speedup from this layout on unquantized float vectors; with quantized codes
-the lookup table further reduces arithmetic intensity.

**Full decompression from PDX layout.** When a scan is not the use case (e.g.,
-`scalar_at` or `to_canonical`), the 64-vector chunks are un-transposed back to
-row-major before per-vector inverse rotation decoding.
+`scalar_at` or `to_canonical`), 64-vector chunks are un-transposed back to
+row-major before per-vector inverse SORF decoding.
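The group transpose and its inverse can be sketched directly. This is a plain Rust illustration with names of our choosing; `dims` is the number of code positions per vector (`num_tq_blocks × B`) and `n_vecs` is the vector group size (64, or fewer for the last group).

```rust
/// Row-major → dimension-major (PDX) within one vector group.
fn to_pdx(row_major: &[u8], n_vecs: usize, dims: usize) -> Vec<u8> {
    let mut pdx = vec![0u8; row_major.len()];
    for v in 0..n_vecs {
        for d in 0..dims {
            // All codes for dimension d across the group become contiguous.
            pdx[d * n_vecs + v] = row_major[v * dims + d];
        }
    }
    pdx
}

/// Dimension-major (PDX) → row-major, for `scalar_at` / `to_canonical` paths.
fn from_pdx(pdx: &[u8], n_vecs: usize, dims: usize) -> Vec<u8> {
    let mut row_major = vec![0u8; pdx.len()];
    for d in 0..dims {
        for v in 0..n_vecs {
            row_major[v * dims + d] = pdx[d * n_vecs + v];
        }
    }
    row_major
}
```

The two are exact inverses; after `to_pdx`, the codes for any single dimension across the group are contiguous, which is what lets the scan kernel's inner loop auto-vectorize.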
## Array layout ``` -TurboQuantArray (operates on unit-norm 64-dim sub-vectors) -├── metadata: { dimension: u32, bit_width: u32, num_blocks: u32, -│ has_qjl: bool, is_pdx: bool, has_straggler: bool } +TurboQuantArray (operates on unit-norm B-dim sub-vectors) +├── metadata: { dimension: u32, bit_width: u32, block_size: u32, +│ num_blocks: u32, has_qjl: bool, is_pdx: bool } │ │ # Per-row children (sliced/taken on row operations) -├── codes: FixedSizeListArray # len=num_rows, list_size=num_blocks×64 +├── codes: FixedSizeListArray # len=num_rows, list_size=num_blocks×B │ │ # Shared children (cloned on row operations, not sliced) ├── centroids: PrimitiveArray # len=2^(bit_width-1) [MSE codebook] -├── mse_rotation_signs: PrimitiveArray # len=(num_blocks-S)×3×64 [per-block MSE SRHT] -├── [straggler_mse_rotation]: PrimitiveArray # straggler_dim × straggler_dim dense matrix -├── [straggler_centroids]: PrimitiveArray # centroids for straggler dimension +├── mse_rotation_signs: PrimitiveArray # len=num_blocks×3×B [per-block MSE SORF] │ -│ # Optional QJL children (also per-block) -├── [qjl_signs]: FixedSizeListArray # len=num_rows, list_size=num_blocks×64 -├── [qjl_rotation_signs]: PrimitiveArray # len=(num_blocks-S)×3×64 [per-block QJL SRHT] -├── [straggler_qjl_rotation]: PrimitiveArray # straggler dense QJL matrix -│ -│ S = 1 if has_straggler else 0 +│ # Optional QJL children (per-block Gaussian) +├── [qjl_signs]: FixedSizeListArray # len=num_rows, list_size=num_blocks×B +├── [qjl_matrices]: PrimitiveArray # len=num_blocks×B×B [Gaussian i.i.d. N(0,1)] Externalized (lives in parent encoding, not in TurboQuantArray): ├── block_norms: FixedSizeListArray # len=num_rows, list_size=num_blocks @@ -452,7 +434,7 @@ in row-major or PDX-transposed layout. 
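The invariants implied by this layout can be expressed as a small Rust sketch. The struct and method names below are illustrative only, not the actual Vortex types.

```rust
/// Illustrative mirror of the proposed metadata; not the real Vortex struct.
struct TurboQuantMetadata {
    dimension: u32,
    bit_width: u32,
    block_size: u32,
    num_blocks: u32,
    has_qjl: bool,
    is_pdx: bool,
}

impl TurboQuantMetadata {
    /// block_size must be a power of 2 and num_blocks = ⌈dimension / block_size⌉.
    fn is_valid(&self) -> bool {
        self.block_size.is_power_of_two()
            && self.num_blocks == (self.dimension + self.block_size - 1) / self.block_size
    }

    /// list_size of the `codes` child: num_blocks × B (straggler zero-padded to B).
    fn codes_list_size(&self) -> u32 {
        self.num_blocks * self.block_size
    }

    /// MSE codebook entries: 2^(bit_width - 1) when QJL takes one of the bits.
    fn num_centroids(&self) -> u32 {
        let mse_bits = if self.has_qjl { self.bit_width - 1 } else { self.bit_width };
        1 << mse_bits
    }
}
```

For d=768, B=128, b=5 with QJL this gives 6 blocks, a codes list_size of 768, and 16 centroids, matching the layout above.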
## Compression ratio For f32 input at dimension d with bit width b (QJL, so b-1 MSE bits + 1 QJL -bit), k = ⌈d/64⌉ blocks, B = 64: +bit), k = ⌈d/B⌉ blocks: | Component | Bits per vector | | ------------------ | --------------- | @@ -461,135 +443,168 @@ bit), k = ⌈d/64⌉ blocks, B = 64: | Block norms | k × norm_bits | | QJL residual norms | k × norm_bits | -| Component | Shared bits | -| ------------------ | ------------ | -| Centroids | 2^(b-1) × 32 | -| MSE rotation signs | k × 3 × B | -| QJL rotation signs | k × 3 × B | +| Component | Shared bits | +| --------------------- | ------------ | +| Centroids | 2^(b-1) × 32 | +| MSE SORF signs | k × 3 × B | +| QJL Gaussian matrices | k × B² × 32 | -### Example: f32, d=768, b=5 (4-bit MSE + 1-bit QJL), 1000 vectors, k=12 +### Example: f32, d=768, b=5, B=128, N=1000 vectors, k=6 - Uncompressed: 768 × 32 × 1000 = 24,576,000 bits (3,000 KB) -- MSE codes: 12 × 64 × 4 × 1000 = 3,072,000 bits -- QJL signs: 12 × 64 × 1 × 1000 = 768,000 bits -- Block norms: 12 × 32 × 1000 = 384,000 bits -- QJL residual norms: 12 × 32 × 1000 = 384,000 bits -- Overhead: 16 × 32 + 2 × 12 × 192 = 5,120 bits -- **Total compressed: 4,613,120 bits (563 KB)** -- **Ratio: 5.3×** - -### Example: f32, d=1024, b=5, 1000 vectors, k=16 - -- Uncompressed: 1024 × 32 × 1000 = 32,768,000 bits (4,000 KB) -- MSE codes: 16 × 64 × 4 × 1000 = 4,096,000 bits -- QJL signs: 16 × 64 × 1 × 1000 = 1,024,000 bits -- Block norms: 16 × 32 × 1000 = 512,000 bits -- QJL residual norms: 16 × 32 × 1000 = 512,000 bits -- Overhead: 16 × 32 + 2 × 16 × 192 = 6,656 bits -- **Total compressed: 6,150,656 bits (750 KB)** -- **Ratio: 5.3×** - -### Comparison: current single-SRHT TurboQuant - -| Dimension | Current ratio | Block ratio | Delta | -| ----------------- | ------------- | ----------- | ----------- | -| 768 (padded→1024) | 4.7× | 5.3× | +13% better | -| 1024 (no padding) | 6.3× | 5.3× | -16% worse | - -For power-of-2 dimensions, the block approach uses more storage due to per-block 
-norms and QJL residual norms (k × 2 × 32 bits/vector overhead). At d=1024 with
-k=16 blocks, this is 1024 bits/vector = 128 bytes, which is significant.
-
-This is the fundamental tradeoff: block decomposition optimizes for
-non-power-of-2 dimensions (which dominate real embedding models) at the cost of
-modest overhead for power-of-2 dimensions.
+- MSE codes: 6 × 128 × 4 × 1000 = 3,072,000 bits
+- QJL signs: 6 × 128 × 1 × 1000 = 768,000 bits
+- Block norms: 6 × 32 × 1000 = 192,000 bits
+- QJL residual norms: 6 × 32 × 1000 = 192,000 bits
+- Shared: 512 + 2,304 + 3,145,728 = 3,148,544 bits
+- **Total compressed: 7,372,544 bits (899 KB)**
+- **Ratio: 3.3×** (QJL Gaussian matrices dominate at small N)
+
+At N=100K vectors the shared overhead amortizes to ~31 bits/vector (under 1% of
+the 4,224-bit per-vector payload), giving ratio ≈ 5.8×. At N=1M it falls to
+~3 bits/vector and the ratio stays ≈ 5.8×.
+
+### Comparison across configurations (f32, d=768, b=5)
+
+| Config                                           | B   | Ratio (N=1K) | Ratio (N=100K) | Notes                         |
+| ------------------------------------------------ | --- | ------------ | -------------- | ----------------------------- |
+| Block MSE-only                                   | 128 | 6.5×         | 6.5×           | No QJL; biased inner products |
+| Block + per-block Gaussian QJL                   | 128 | 3.3×         | 5.8×           | Unbiased; matrices amortize   |
+| [Current][current-impl] (padded SORF + SORF QJL) | —   | 4.7×         | 4.7×           | 33% padding waste             |
+
+For small columns, MSE-only or the current padded approach may be preferable.
+For large columns (the common case for embedding tables), per-block Gaussian QJL
+gives the best ratio.
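The bit accounting above can be reproduced with a short script. This is a
sketch following the per-vector and shared component tables in this section;
the `compressed_bits` helper is illustrative, not part of the implementation:

```python
import math

def compressed_bits(d, b, B, n, norm_bits=32, qjl=True):
    """Total bits for n vectors under the per-vector and shared tables above."""
    k = math.ceil(d / B)
    mse_bits = b - 1 if qjl else b        # QJL reserves 1 bit of the budget
    per_vec = k * B * mse_bits            # MSE codes
    per_vec += k * norm_bits              # block norms
    if qjl:
        per_vec += k * B                  # 1-bit QJL signs
        per_vec += k * norm_bits          # QJL residual norms
    shared = 2 ** mse_bits * 32           # f32 centroids
    shared += k * 3 * B                   # MSE SORF sign bits (3 rounds)
    if qjl:
        shared += k * B * B * 32          # per-block f32 Gaussian matrices
    return n * per_vec + shared

# d=768, b=5, B=128: k=6 blocks
total = compressed_bits(768, 5, 128, 1000)
print(total)                              # 7372544 bits
print(round(768 * 32 * 1000 / total, 1))  # ratio 3.3 (Gaussian matrices dominate)

# The shared matrices amortize as the column grows:
for n in (100_000, 1_000_000):
    print(round(768 * 32 * n / compressed_bits(768, 5, 128, n), 1))
```

Passing `qjl=False` reproduces the MSE-only accounting, where the only shared
state is the centroid table and the SORF sign bits.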
## Performance analysis

### Encode throughput

-With k blocks at d-dimensional input, encoding requires:
+With k blocks of dimension B, encoding requires per block:

-| Operation              | Per-block          | Total (k blocks) |
-| ---------------------- | ------------------ | ---------------- |
-| SRHT (3-round, B=64)   | ~384 FLOPs         | k × 384          |
-| Centroid lookup (B=64) | 64 binary searches | k × 64           |
-| QJL SRHT               | ~384 FLOPs         | k × 384          |
-| Norm computation       | 64 FMA + sqrt      | k × 65           |
+| Operation                   | FLOPs (B=128)                         |
+| --------------------------- | ------------------------------------- |
+| MSE SORF (3-round)          | 3 × 128 × log₂(128) + 3 × 128 ≈ 3,072 |
+| Centroid lookup             | 128 binary searches                   |
+| QJL Gaussian matmul (S × r) | B² = 16,384                           |
+| Norm computation            | 128 FMA + sqrt ≈ 129                  |

-For d=768, k=12: total ~20K FLOPs per vector (SRHT + QJL SRHT) vs. current
-single-SRHT approach: 2 × ~10K = ~20K FLOPs. Encode throughput should be
-comparable — the per-block overhead is offset by smaller per-block transforms.
+For d=768, k=6: MSE total ≈ 18K FLOPs, QJL matmul ≈ 98K FLOPs. The QJL
+Gaussian matmul dominates encode cost — ~5× more expensive than the
+[current][current-impl] SORF-based QJL. Acceptable for offline encoding.

### Decode throughput

-Decoding is dominated by k inverse rotations + codebook lookups:
-
-| Operation        | Per-block      | Total (k blocks) |
-| ---------------- | -------------- | ---------------- |
-| Codebook lookup  | 64 table reads | k × 64           |
-| Inverse SRHT     | ~384 FLOPs     | k × 384          |
-| QJL inverse SRHT | ~384 FLOPs     | k × 384          |
-| Denormalize      | 64 multiplies  | k × 64           |
+| Operation                        | FLOPs per block (B=128) |
+| -------------------------------- | ----------------------- |
+| Codebook lookup                  | 128 table reads         |
+| Inverse SORF                     | ≈ 3,072                 |
+| QJL Gaussian matmul (Sᵀ × signs) | B² = 16,384             |
+| Denormalize                      | 128 multiplies          |

-For d=768: ~10K FLOPs decode vs. current ~10K (single 1024-dim inverse SRHT).
-Similar throughput.
+For d=768: MSE decode ≈ 18K FLOPs, QJL decode ≈ 98K FLOPs. QJL decode is
+significantly more expensive due to the dense matmul. For scan workloads that
+only need inner products (not full reconstruction), the fused distance
+computation path avoids full decode entirely.

### Scan throughput (PDX, Stage 2)

The PDX scan kernel is memory-bandwidth bound, not compute bound. The lookup
-table fits in L1 (1 KB at b=4). The inner loop is a single dependent load +
-FP add per code byte. With 64 vectors per SIMD pass, this achieves near-peak
-memory bandwidth utilization.
+table fits in L1 (1 KB at b=4). With 64 vectors per SIMD pass, this achieves
+near-peak memory bandwidth utilization. Norm weighting adds k scalar multiplies
+per vector — negligible.

### Benchmarking plan

-Stage 1 should include benchmarks comparing:
+Stage 1:

1. Encode throughput: block TQ vs. current TQ at d=128, 768, 1024
2. Decode throughput: same dimensions
-3. Quantized cosine similarity throughput: block vs. current
-4. L2 norm readthrough latency: O(k) block norms vs. O(1) current
+3. Quantized cosine similarity: block vs. current (includes norm overhead)
+4. L2 norm readthrough: O(k) block norms vs. O(1) current
+
+Stage 2:
+
+5. PDX scan throughput vs. row-major scan at d=768, 1024
+6. Full decompression from PDX layout (includes un-transpose overhead)
+
+## Experimental plan

-Stage 2 should benchmark:
-
-5. PDX scan throughput vs. row-major scan at d=768, 1024
-6. Full decompression from PDX layout (includes un-transpose overhead)

+Before committing to specific parameters, the following experiments are needed:
+
+### MSE quality vs. block size
+
+- Compare actual normalized MSE at B ∈ {64, 128, 256} vs. full-dimension
+  single-SORF at bit widths b ∈ {2, 3, 4, 5, 8}
+- Test SORF coordinate distribution at each B: histogram vs. analytical Beta
+- Test 3, 4, and 5 SORF rounds at each B (3 rounds provides 3 × log₂(B)
+  mixing stages: 18 at B=64, 21 at 128, 24 at 256)
+
+### QJL strategy comparison
+
+Compare all four strategies at d=768 with B ∈ {64, 128, 256}:
+
+- **Per-block Gaussian QJL** (paper-correct): measure inner product bias and
+  variance. Baseline for correctness.
+- **Per-block SORF QJL**: measure bias/variance vs. per-block Gaussian at the
+  same B. Quantify the quality cost of the SORF approximation at small block
+  dimensions. Test at 3, 4, 5 SORF rounds.
+- **Full-dimension padded SORF QJL** (current approach): measure for comparison.
+  Higher dimension gives better SORF-to-Haar convergence (30 mixing stages at
+  d=1024), which may compensate for the padding waste. This is the key
+  comparison — does the better convergence of full-dim SORF outweigh the 25%
+  wasted sign bits?
+- **MSE-only** (no QJL): measure inner product bias to establish a baseline.
+
+Key metrics:
+
+- Mean signed relative error on inner products (bias)
+- Variance of inner product estimates
+- ANN recall@k on standard benchmarks (e.g., SIFT, GloVe)
+
+Per-block Gaussian QJL has d/B times more variance than full-dim QJL (6× at
+d=768, B=128). Per-block SORF QJL has additional approximation error on top of
+that. Full-dim padded SORF QJL has lower variance but wastes bits on padding.
+The experiment should determine which effect dominates in practice.
+
+### Straggler handling
+
+- Compare padded straggler (zero-pad to B, use SORF) vs. dense rotation at the
+  actual straggler dimension with its own centroids
+- Measure whether dense straggler handling enables increasing B (e.g., B=256
+  with d=768 = 3 × 256, no straggler at all)

## Phasing

**Stage 1** (block decomposition): Modify the existing TurboQuant encoding to
-use 64-dim blocks with per-block rotation, externalized norms, and per-block
-QJL. Row-major code layout. Full encode/decode, slice/take, serde, scheme
-integration. This is a self-contained improvement that eliminates padding waste
-and QJL variance inflation for all dimensions.
+use B-dim blocks with per-block SORF rotation, externalized norms, and a
+configurable QJL strategy (per-block Gaussian, per-block SORF, full-dim padded
+SORF, or MSE-only). Row-major code layout.

-**Stage 2** (PDX layout): Add the dimension-major code transposition within
-64-vector chunks. PDX-native distance computation kernels. This requires
-resolving the open design questions around chunk alignment, slice/take
-semantics, and interaction with the compressor pipeline.
+**Stage 2** (PDX layout): Add dimension-major code transposition within
+64-vector chunks. PDX-native distance computation kernels. Requires resolving
+open design questions around chunk alignment and compressor interaction.

## Test migration

-The current TurboQuant test suite validates specific behaviors that will change:
-
-- **Slot counts** (MSE=4, QJL=7): Will change to reflect new slot structure
-  with per-block rotation signs, straggler fields, and externalized norms.
-- **Serde roundtrips**: Must be rewritten for new metadata (num_blocks,
-  has_straggler) and new child structure.
-- **MSE bound tests**: Can be adapted — same bound, different dimensions tested.
-  Must add d=64 specifically.
-- **QJL bias tests**: Must add d=64 bias tests. Existing d=768 tests should
-  show improved bias (no padding).
-- **Quantized cosine similarity**: Must be updated for per-block norm weighting.
-- **Slice/take**: Same semantics (per-row data sliced, shared data cloned), but
-  more children to manage.
-
-New tests needed:
-
-- SRHT quality at d=64: coordinate distribution vs. analytical Beta at 3, 4, 5
-  rounds
-- Practical MSE comparison: d=64 blocks vs. d=768 single-rotation at same bit
-  width
-- Straggler block handling: dense rotation, separate centroids
+The [current TurboQuant test suite][current-impl] will need rewriting:
+
+- **Slot counts**: New structure with per-block SORF signs, QJL Gaussian
+  matrices, externalized norms.
+- **Serde roundtrips**: New metadata (block_size, num_blocks).
+- **MSE bound tests**: Add tests at B ∈ {64, 128, 256}.
+- **QJL tests**: Compare all four strategies (per-block Gaussian, per-block
+  SORF, full-dim padded SORF, MSE-only).
+- **Quantized cosine similarity**: Update for per-block norm weighting.
+- **Slice/take**: More children to manage, same semantics.
+
+New tests:
+
+- SORF quality at B=64, 128, 256: distribution vs. Beta at 3, 4, 5 rounds
+- Practical MSE comparison: block vs. single-rotation at same bit width
+- QJL comparison: per-block Gaussian vs. per-block SORF vs. full-dim padded SORF
- Zero-norm sub-vector handling
-- Per-block norm consistency with input vectors
-- Encode/decode benchmarks (see Performance analysis)
+- Straggler handling (padded initially, dense rotation as follow-up)
+- Encode/decode benchmarks

## Future work: GPU decode and fused distance computation

@@ -598,12 +613,11 @@ New tests needed:

For ANN search workloads, the dominant operation is computing distances between
a query vector and millions of database vectors. On GPU, the goal is to perform
this computation directly on the compressed representation, avoiding the cost
-of materializing full decompressed vectors in HBM. BlockTurboQuant's 64-dim
-block structure maps naturally to GPU tile sizes and tensor core operations.
+of materializing full decompressed vectors in HBM.
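To make "distances directly on the compressed representation" concrete, here is
a minimal NumPy sketch of the per-block table-gather distance computation as a
CPU reference. All shapes and array names are illustrative assumptions (4-bit
MSE codes, random stand-in codebook and norms), not the implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
B, k, n = 128, 6, 64       # block size, blocks (d=768), vectors per PDX group

# Stand-in 4-bit codebook (the real one is Max-Lloyd-trained), and the
# 16x16 table of centroid products that the scan kernel stages on-chip.
centroids = np.sort(rng.standard_normal(16))
dist_table = np.outer(centroids, centroids)

q_codes = rng.integers(0, 16, size=(k, B))       # query MSE codes
d_codes = rng.integers(0, 16, size=(k, n, B))    # database MSE codes
q_norms = rng.random(k) + 0.5                    # query block norms
d_norms = rng.random((k, n)) + 0.5               # database block norms

# Gather-reduce per block over codes only, then norm-weight and accumulate
# across blocks; the quantized vectors are never decoded.
dots = np.zeros(n)
for blk in range(k):
    unit_dot = dist_table[q_codes[blk][None, :], d_codes[blk]].sum(axis=1)
    dots += q_norms[blk] * d_norms[blk] * unit_dot
```

Because `dist_table[i][j] = centroids[i] × centroids[j]` and norms factor out
per block, this matches decoding both sides and taking the inner product
exactly, up to floating-point summation order.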
### Decode as GEMM

-The decode path for a single 64-dim block is:
+The decode path for a single B-dim block is:

```
decoded_block = norm × R⁻¹ × codebook_lookup(codes)
```

For a batch of N vectors sharing the same block's rotation matrix R⁻¹:

```
decoded_batch = diag(norms) × R⁻¹ × codebook_lookup_batch(codes)
-                                    ↑ 64×N matrix
-                             ↑ 64×64 × 64×N = GEMM
+                                    ↑ B×N matrix
+                             ↑ B×B × B×N = GEMM
```

-The codebook lookup produces a 64×N matrix (one column per vector, each entry
-is `centroids[code]`), and the inverse rotation is a 64×64 matrix multiply —
-a GEMM that maps directly to tensor cores.
+The codebook lookup produces a B×N matrix, and the inverse rotation is a B×B
+matrix multiply — a GEMM that maps directly to tensor cores.

**Partial decompression pipeline on GPU:**

1. **Decompress rotation matrix** (once per block, shared across all vectors):
-   - If stored as bitpacked SRHT signs: FastLanes SIMD unpack on CUDA cores,
-     then expand to 64×64 matrix in shared memory
-   - If stored as dense 64×64 matrix: direct load to shared memory (16 KB)
+   - If stored as bitpacked SORF signs: FastLanes unpack on CUDA cores, then
+     expand to B×B matrix in shared memory
+   - If stored as dense B×B matrix: direct load to shared memory

2. **Decompress norms** (per vector, per block):
-   - If cascade-compressed with ALP or Pco: decompress via FastLanes on CUDA
-     cores into a register tile
+   - If cascade-compressed with ALP or Pco: decompress on CUDA cores
   - If uncompressed: direct load

-3. **Codebook gather** (per vector, per block):
-   - Stage the codebook in shared memory (16 entries × 4 bytes = 64 bytes at
-     b=4 — trivially small)
-   - Gather: stream code bytes from HBM, look up centroid values in shared
-     memory, assemble 64×N tile
-
-4. **Fused GEMM + scale**:
-   - R⁻¹ × gathered tile (64×64 × 64×N) on tensor cores
-   - Column-wise multiply by norms (element-wise scale)
-
-Steps 3-4 can be fused into a single kernel, following the double-buffered
-streaming pattern from Flash-KMeans [5]: prefetch the next batch of code bytes
-from HBM while computing the current batch's GEMM on tensor cores. This avoids
-materializing the full decompressed vectors in HBM — the decoded output is
-either consumed immediately by a distance computation or written once to the
-output buffer.
+3. **Codebook gather + fused GEMM + scale**:
+   - Stage codebook in shared memory (trivially small)
+   - Stream code bytes from HBM, look up centroids, assemble B×N tile
+   - R⁻¹ × tile on tensor cores, scale by norms
+   - Follows the double-buffered streaming pattern from Flash-KMeans [6]

### Fused distance computation (no decode)

-For distance computation without full decompression, the operation per block is:
-
-```
-dot_contribution_k = ‖query_k‖ × ‖data_k‖ × Σ_j dist_table[q_code[j]][d_code[j]]
-```
-
-On GPU, this becomes:
-
-1. **Stage distance table in shared memory**: `dist_table[i][j] = centroids[i] ×
-   centroids[j]`, 16×16 = 1 KB at b=4. Fits trivially in shared memory.
-
-2. **Stream code bytes from HBM**: For each 64-vector × 64-dim tile (matching
-   the PDX layout), gather from the distance table and accumulate in registers.
-   This is a gather-reduce pattern — no GEMM, just table lookups and FP adds.
-
-3. **Norm weighting**: After accumulating the unit-norm dot product for all
-   64 dimensions in a block, multiply by the query and data block norms.
-   Norms for 64 vectors fit in a single register tile.
-
-4. **Cross-block accumulation**: Sum the weighted dot products across all k
-   blocks to get the final distance estimate.
-
-The memory access pattern follows Flash-KMeans [5]: stream data tiles from HBM
-with double-buffered prefetch, accumulate on-chip, write only the final result.
-The key difference is that Flash-KMeans streams full float vectors while we
-stream quantized code bytes — 4-8× less HBM bandwidth per vector.
+Stage the distance table in shared memory (1 KB at b=4), stream code bytes from
+HBM, gather-reduce on-chip, write only final distances. The memory access
+pattern follows Flash-KMeans [6]: double-buffered prefetch, on-chip
+accumulation. We stream quantized code bytes — 4-8× less HBM bandwidth than
+full float vectors.

### Int8 tensor core path (b=9)

-At b=9, the MSE component uses 8-bit codes. These are indices into a 256-entry
-codebook, not raw int8 values — so direct int8 tensor core GEMM does not apply
-without transformation. However, if the codebook is approximately linear
-(centroids roughly evenly spaced), the codes could be treated as approximate
-int8 values with a linear rescaling, enabling direct int8 GEMM for the inner
-product computation. This sacrifices some quantization optimality (linear
-quantization vs. Max-Lloyd optimal) but enables tensor core throughput.
-
-Whether this tradeoff is worthwhile depends on the application: for ANN ranking
-(where relative ordering matters more than absolute accuracy), linear int8 may
-be sufficient. For reconstruction (where MSE matters), Max-Lloyd centroids are
-preferred and the gather-from-codebook path should be used.
+At b=9, MSE codes are 8-bit indices. Direct int8 tensor core GEMM requires
+approximately linear centroids (sacrificing Max-Lloyd optimality). Viable for
+ANN ranking; prefer codebook gather for reconstruction.

### Interaction with Vortex file format

-The GPU decode pipeline reads compressed data from Vortex files:
-
-1. **File reader** loads compressed segments from storage (S3, local SSD)
-2. **Host-side cascade decompression** (BitPacked → codes, ALP → norms) or
-   direct GPU transfer of already-decompressed segments
-3. **GPU kernel** performs fused decode or fused distance computation
-
-The BlockTurboQuant encoding's child arrays (codes, norms, rotation signs) are
-individually compressed by the cascading compressor. For GPU decode, we need
-either:
-
-- Host-side decompression of the cascade, then GPU transfer of the raw children
-- Direct GPU decompression of FastLanes/ALP (if GPU decompression kernels exist)
-
-The 64-dim block structure ensures that rotation matrices (64×64 dense or 192
-bits SRHT signs) fit comfortably in GPU shared memory, enabling the fused
-decode kernel without spilling to HBM.
+The B-dim block structure ensures rotation matrices fit in GPU shared memory
+(B×B×4 bytes). Child arrays are individually compressed by the cascading
+compressor; GPU decode requires either host-side decompression + GPU transfer,
+or direct GPU decompression of FastLanes/ALP.

## References

@@ -729,5 +687,9 @@ Transform." Advances in Adaptive Data Analysis, 3(1-2):115-126, 2011.

[4] Kuffo, L., Krippner, E. and Boncz, P. "PDX: A Data Layout for Vector
Similarity Search." Proceedings of SIGMOD '25. arXiv:2503.04422, March 2025.

-[5] Yang, S. et al. "Flash-KMeans: Fast and Memory-Efficient Exact K-Means."
+[5] Yu, F.X., Suresh, A.T., Choromanski, K., Holtmann-Rice, D. and Kumar, S.
+"Orthogonal Random Features." Advances in Neural Information Processing
+Systems 29 (NeurIPS), 2016. arXiv:1610.09072.
+
+[6] Yang, S. et al. "Flash-KMeans: Fast and Memory-Efficient Exact K-Means."
arXiv:2603.09229, March 2026.