
feat: Distance trait + SIMD Hamming/cosine wiring + PaletteDistanceTable + Dockerfile docs #269

Merged
AdaWorldAPI merged 5 commits into main from claude/distance-trait-and-simd-hamming
Apr 26, 2026



@AdaWorldAPI
Owner

Summary

Five commits, follow-on to merged PR #268:

SIMD Hamming wiring (was scalar)

  • cognitive-shader-driver/src/driver.rs:178 — shader content pre-pass now calls ndarray::hpc::bitwise::hamming_distance_raw() instead of scalar iter().zip().map(xor.count_ones()).sum() over 256 u64 words. ~8-16× speedup (AVX-512 VPOPCNTDQ).
  • lance-graph/src/datafusion_planner/vector_ops.rs:213 — DataFusion UDF hamming_distance delegates to ndarray.
  • lance-graph/src/graph/fingerprint.rs:82 — graph fingerprint Hamming delegates to ndarray.
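For reference, the scalar form that these bullets describe replacing can be sketched as below. The function name `hamming_scalar` is illustrative, not taken from the repository, and the sketch does not call `hamming_distance_raw()` because its signature is not shown in this PR.

```rust
/// Scalar reference: Hamming distance between two 16,384-bit fingerprints
/// stored as 256 u64 words. XOR marks the differing bits and `count_ones`
/// tallies them per word.
/// (Illustrative name; not the repository's function.)
pub fn hamming_scalar(a: &[u64; 256], b: &[u64; 256]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x ^ y).count_ones())
        .sum()
}
```

A scalar version like this is also a convenient oracle when testing a SIMD kernel: both should return the same count for any pair of inputs.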

CI / Docker AVX2 default

  • All 4 CI workflows (build.yml, rust-test.yml, style.yml, rust-publish.yml) had RUSTFLAGS: "-C debuginfo=1" which overrode .cargo/config.toml's target-cpu=x86-64-v4 and compiled at baseline x86-64 (no AVX at all).
  • Now: RUSTFLAGS: "-C debuginfo=1 -C target-cpu=x86-64-v3" so CI gets AVX2.
  • Dockerfile: ENV RUSTFLAGS="-C target-cpu=x86-64-v3" so default Docker image runs on AVX2+ hardware. Dockerfile.avx512 unchanged.
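One way to confirm which target features a build actually received (for example, to catch a `RUSTFLAGS` value overriding `.cargo/config.toml`) is to report `cfg!(target_feature = ...)` from inside the binary. This is a minimal sketch; the function name is hypothetical and the feature list is only a sample.

```rust
/// Lists which of a few x86 SIMD features were enabled at compile time.
/// With baseline x86-64 this typically reports only "sse2"; with
/// `-C target-cpu=x86-64-v3` it should also report "avx2".
/// (Hypothetical helper, not part of the repository.)
pub fn compiled_simd_features() -> Vec<&'static str> {
    let mut features = Vec::new();
    if cfg!(target_feature = "sse2") {
        features.push("sse2");
    }
    if cfg!(target_feature = "avx2") {
        features.push("avx2");
    }
    if cfg!(target_feature = "avx512f") {
        features.push("avx512f");
    }
    features
}
```

Printing this list in a CI smoke test would make a silently downgraded build visible.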

Dockerfile.md documentation

  • New file: 118 LOC covering three-tier build strategy, lance-graph's ndarray SIMD usage, RUSTFLAGS vs .cargo/config.toml override behavior, AMX, NEON, decision flowchart.

Distance trait (contract::distance) — TD-DIST-1

  • Distance trait with distance(), similarity(), similarity_z() (FisherZ transform).
  • Type-intrinsic dispatch: fp_a.distance(&fp_b) monomorphizes at compile time — zero crate boundary tax. No dyn, no enum match.
  • Scalar baseline impls for [u64; 256] (Binary16K → Hamming), [u8; 6] (CamPq → L1), [u8; 3] (PaletteEdge → L1).
  • fisher_z_inverse() + mean_similarity_fisher() for safe averaging across SoA columns.
  • 11 tests.
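The trait's actual signatures are not shown in this PR, so the following is only a sketch of what type-intrinsic dispatch could look like. The associated constant `MAX_DISTANCE`, the return types, the normalization in `similarity()`, and the `nearest` helper are all assumptions made for illustration.

```rust
/// Hypothetical shape of a type-intrinsic distance trait.
pub trait Distance {
    /// Largest value `distance` can return for this type (assumed constant).
    const MAX_DISTANCE: u32;

    fn distance(&self, other: &Self) -> u32;

    /// Maps distance to [0, 1]; 1.0 means identical (assumed normalization).
    fn similarity(&self, other: &Self) -> f32 {
        1.0 - self.distance(other) as f32 / Self::MAX_DISTANCE as f32
    }

    /// Fisher z-transform of the similarity, clamped so `atanh` stays finite.
    fn similarity_z(&self, other: &Self) -> f32 {
        self.similarity(other).clamp(-0.999_999, 0.999_999).atanh()
    }
}

/// 16,384-bit fingerprint: Hamming distance.
impl Distance for [u64; 256] {
    const MAX_DISTANCE: u32 = 256 * 64;
    fn distance(&self, other: &Self) -> u32 {
        self.iter().zip(other.iter()).map(|(a, b)| (a ^ b).count_ones()).sum()
    }
}

/// Three palette indices: L1 distance.
impl Distance for [u8; 3] {
    const MAX_DISTANCE: u32 = 3 * 255;
    fn distance(&self, other: &Self) -> u32 {
        self.iter().zip(other.iter()).map(|(a, b)| a.abs_diff(*b) as u32).sum()
    }
}

/// Generic sweep: the compiler emits one specialized copy per carrier type,
/// so there is no `dyn` call and no enum match in the loop.
pub fn nearest<T: Distance>(query: &T, items: &[T]) -> Option<usize> {
    items
        .iter()
        .enumerate()
        .min_by_key(|&(_, item)| query.distance(item))
        .map(|(i, _)| i)
}
```

Because `nearest` is generic over `T: Distance`, calling it with `[u8; 3]` and with `[u64; 256]` produces two separately optimized functions.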

SIMD cosine/dot wiring — TD-DIST-2

  • vector_ops.rs cosine/dot replaced 4 scalar loops with ndarray::hpc::heel_f64x8::{cosine_f32_to_f64_simd, dot_f64_simd}. Estimated 8-12× speedup on DataFusion UDF path.

bgz17 PaletteDistanceTable — TD-DIST-3

  • Palette::build_distance_table() → 256×256 u16 table (128 KB, L2-resident).
  • PaletteDistanceTable::distance(a, b) → O(1) array index.
  • edge_distance(a, b) → S+P+O table lookups.
  • Used by cascade skip (HHTL), renderer force-directed layout.
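The table layout described above can be sketched as follows. `Base17` here is a hypothetical stand-in with 17 `i16` dimensions, the `build` constructor takes a slice instead of living on `Palette`, and the conversion to `u16` saturates rather than casts; all three are assumptions for illustration, not the bgz17 code.

```rust
/// Hypothetical stand-in for a palette centroid: 17 signed 16-bit dimensions.
#[derive(Clone, Copy)]
pub struct Base17(pub [i16; 17]);

impl Base17 {
    /// L1 (Manhattan) distance, accumulated in u32.
    pub fn l1(&self, other: &Base17) -> u32 {
        self.0.iter().zip(other.0.iter()).map(|(a, b)| a.abs_diff(*b) as u32).sum()
    }
}

/// 256 x 256 table of u16 distances: 65,536 entries x 2 bytes = 128 KB.
pub struct PaletteDistanceTable {
    table: Vec<u16>,
}

impl PaletteDistanceTable {
    /// Fills the upper triangle and mirrors it, since L1 is symmetric.
    pub fn build(entries: &[Base17]) -> Self {
        assert!(entries.len() <= 256, "palette indices must fit in a u8");
        let mut table = vec![0u16; 256 * 256];
        for i in 0..entries.len() {
            for j in i..entries.len() {
                // Assumption: saturate instead of wrapping on overflow.
                let d = u16::try_from(entries[i].l1(&entries[j])).unwrap_or(u16::MAX);
                table[i * 256 + j] = d;
                table[j * 256 + i] = d;
            }
        }
        Self { table }
    }

    /// O(1) centroid-to-centroid lookup.
    pub fn distance(&self, a: u8, b: u8) -> u16 {
        self.table[a as usize * 256 + b as usize]
    }

    /// Sum of the three per-slot lookups (S, P, O), widened to u32.
    pub fn edge_distance(&self, a: [u8; 3], b: [u8; 3]) -> u32 {
        (0..3).map(|k| self.distance(a[k], b[k]) as u32).sum()
    }
}
```

Indexing with `u8` values means the lookup can never go out of bounds on a full 256 x 256 table, which keeps `distance` a single bounds-checked load.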

Board hygiene

  • EPIPHANIES.md: Distance dispatch FINDING (type-intrinsic dispatch, FisherZ note, full type→distance mapping table).
  • TECH_DEBT.md: TD-DIST-1/2/3 opened, then marked PAID in same session.

Test plan

  • 260 contract tests pass (including 11 new distance tests)
  • 126 bgz17 tests pass (PaletteDistanceTable compiles)
  • cargo check workspace clean
  • All 4 CI workflows include -C target-cpu=x86-64-v3
  • Dockerfile + Dockerfile.md align

Commits

  1. 2d7e6b3 fix: wire ndarray SIMD Hamming into all scalar hot paths + CI/Docker AVX2 default
  2. ca4eb8b docs: Dockerfile.md — CPU detection & SIMD dispatch documentation
  3. 3a983b4 docs: distance dispatch epiphany + 3 tech debt entries (TD-DIST-1/2/3)
  4. 277232b feat: Distance trait + SIMD cosine/dot + PaletteDistanceTable (TD-DIST-1/2/3)
  5. 6899390 chore(board): mark TD-DIST-1/2/3 paid in commit 8603148

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh


Generated by Claude Code

claude added 5 commits April 26, 2026 07:13
…AVX2 default

Hamming SIMD wiring (ndarray is mandatory, no reason for scalar):
- driver.rs:178 — shader content pre-pass now calls
  ndarray::hpc::bitwise::hamming_distance_raw() instead of
  scalar iter().zip().map(xor.count_ones()).sum() over 256 u64 words
- vector_ops.rs:213 — DataFusion UDF hamming_distance delegates to
  ndarray::hpc::bitwise::hamming_distance_raw()
- fingerprint.rs:82 — graph fingerprint Hamming delegates to ndarray

CI/Docker fix — x86-64-v3 (AVX2) as the default everywhere:
- .github/workflows/{build,rust-test,style,rust-publish}.yml all had
  RUSTFLAGS without target-cpu, which overrode .cargo/config.toml's
  x86-64-v4 and compiled at BASELINE x86-64 (no AVX at all).
  Now: RUSTFLAGS includes -C target-cpu=x86-64-v3 so CI gets AVX2.
- Dockerfile: added ENV RUSTFLAGS="-C target-cpu=x86-64-v3" so the
  default Docker image runs on AVX2+ hardware (GitHub CI, most servers).
  Dockerfile.avx512 still pins x86-64-v4 for deployment.

The split:
  LOCAL (.cargo/config.toml) → x86-64-v4 (AVX-512, developer machines)
  CI / Docker default         → x86-64-v3 (AVX2, GitHub runners)
  Dockerfile.avx512           → x86-64-v4 (AVX-512, production deploy)

ndarray's simd.rs polyfill detects AVX-512 at runtime regardless of
compile target, so the AVX2 binary still dispatches to AVX-512 kernels
on capable hardware.

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh

Comprehensive doc covering the three-tier build strategy (AVX2 default /
AVX-512 pinned / local dev), two-layer dispatch model (compile-time
cfg(target_feature) + runtime LazyLock<Tier>), AMX detection, NEON/ARM,
RUSTFLAGS vs .cargo/config.toml override behavior, and which lance-graph
locations call ndarray SIMD.
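A minimal sketch of the two-layer model, assuming a three-variant `Tier` enum and a `detect_tier` helper (both hypothetical; ndarray's real tier type and detection logic are not shown in this PR). The compile-time layer is a `cfg!` constant; the runtime layer is a `LazyLock` that probes the CPU once on first use.

```rust
use std::sync::LazyLock;

/// Hypothetical dispatch tiers, ordered from least to most capable.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Tier {
    Scalar,
    Avx2,
    Avx512,
}

/// Compile-time layer: true when the build itself targeted AVX2.
pub const COMPILED_WITH_AVX2: bool = cfg!(target_feature = "avx2");

/// Runtime layer: asks the CPU what it supports, independent of the
/// target-cpu the binary was compiled for.
fn detect_tier() -> Tier {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx512f") {
            return Tier::Avx512;
        }
        if is_x86_feature_detected!("avx2") {
            return Tier::Avx2;
        }
    }
    Tier::Scalar
}

/// Detected once, then read on every call.
pub static TIER: LazyLock<Tier> = LazyLock::new(detect_tier);

/// Placeholder for kernel selection; a real implementation would branch
/// to tier-specific kernels here instead of returning a label.
pub fn kernel_name() -> &'static str {
    match *TIER {
        Tier::Avx512 => "avx512",
        Tier::Avx2 => "avx2",
        Tier::Scalar => "scalar",
    }
}
```

Under this model a binary compiled for x86-64-v3 can still report `Tier::Avx512` at runtime on capable hardware, which is the behavior the first commit message describes.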

Also: Dockerfile + Dockerfile.avx512 headers now reference Dockerfile.md.

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh

EPIPHANY: Distance dispatch must be type-intrinsic, not crate-boundary-
crossing. The `Distance` trait on carrier types monomorphizes at compile
time — zero dynamic dispatch, zero crate boundary tax. Contract defines
interface, ndarray provides SIMD kernels. Includes FisherZ note for
cosine similarity averaging across SoA columns. Full type→distance
mapping table (Binary16K→Hamming, Vsa16kF32→cosine/FisherZ, CamPq→ADC,
PaletteEdge→L1 table, Base17→nearest, HighHeelBGZ→cascade).
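The FisherZ note refers to the standard transform z = atanh(r) = ½·ln((1+r)/(1−r)): similarities are mapped to z, averaged there, and mapped back with tanh. A sketch under assumed signatures (the names `fisher_z_inverse` and `mean_similarity_fisher` appear in this PR, but their parameter and return types, the `fisher_z` helper, and the clamp bound are assumptions):

```rust
/// Forward transform; input is clamped so values of exactly +/-1 stay finite.
pub fn fisher_z(r: f64) -> f64 {
    r.clamp(-0.999_999, 0.999_999).atanh()
}

/// Inverse transform back to a similarity in (-1, 1).
pub fn fisher_z_inverse(z: f64) -> f64 {
    z.tanh()
}

/// Averages similarities in z-space, then maps the mean back.
/// Returns None for an empty slice.
pub fn mean_similarity_fisher(sims: &[f64]) -> Option<f64> {
    if sims.is_empty() {
        return None;
    }
    let mean_z = sims.iter().map(|&r| fisher_z(r)).sum::<f64>() / sims.len() as f64;
    Some(fisher_z_inverse(mean_z))
}
```

For unequal inputs the result differs from the arithmetic mean: averaging 0.0 and 0.9 this way gives roughly 0.63 rather than 0.45, because values near ±1 are stretched in z-space.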

TECH_DEBT:
  TD-DIST-1: Distance trait missing from contract (blocks generic SoA sweeps)
  TD-DIST-2: vector_ops.rs 4 scalar dot/norm/cosine loops (8-12× speedup available)
  TD-DIST-3: bgz17 Palette::nearest() brute-force 256×17 (100× speedup via table)

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh

…T-1/2/3)

TD-DIST-1: contract::distance module — type-intrinsic dispatch trait.
  - Distance trait with distance(), similarity(), similarity_z() (FisherZ)
  - impl for [u64; 256] (Binary16K → Hamming), [u8; 6] (CamPq → L1),
    [u8; 3] (PaletteEdge → L1)
  - fisher_z_inverse(), mean_similarity_fisher() for safe averaging
  - Scalar baseline impls (work in WASM/embedded; ndarray shadows with SIMD)
  - 11 tests

TD-DIST-2: vector_ops.rs cosine/dot → ndarray SIMD.
  - cosine_distance/similarity → ndarray::hpc::heel_f64x8::cosine_f32_to_f64_simd
  - dot_product_distance/similarity → ndarray::hpc::heel_f64x8::dot_f64_simd
  - Estimated 8-12× speedup on DataFusion UDF path

TD-DIST-3: bgz17 PaletteDistanceTable — O(1) inter-centroid lookup.
  - Palette::build_distance_table() → 256×256 u16 table (128 KB, L2-resident)
  - PaletteDistanceTable::distance(a, b) → single array index
  - edge_distance(a, b) → sum of S+P+O table lookups
  - Palette::nearest() unchanged (still brute-force for query→centroid);
    table is for centroid↔centroid (cascade skip, renderer force layout)

All three TD-DIST items from TECH_DEBT.md addressed. 260 contract tests pass,
126 bgz17 tests pass, workspace compiles clean.

https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 68993903ad



let similarity = dot / (norm_a * norm_b);
1.0 - similarity
ndarray::hpc::heel_f64x8::cosine_f32_to_f64_simd(a, b) as f32

P1: Restore cosine distance semantics in vector_ops

cosine_distance now returns cosine_f32_to_f64_simd directly, but that helper is a cosine similarity value (used as such throughout the repo), while this API is documented and consumed as distance (1 - similarity) in compute_vector_distances. With the current code, cosine ranking is wrong (e.g., identical vectors produce ~1 instead of 0, orthogonal vectors produce 0 instead of 1), so nearest-neighbor ordering for DistanceMetric::Cosine is inverted/corrupted.
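The semantics the review asks for can be sketched as a thin wrapper that converts similarity to distance. `cosine_similarity_f64` below is a scalar stand-in written for this example, not ndarray's `cosine_f32_to_f64_simd`, and returning similarity 0 for a zero-norm input is an assumption.

```rust
/// Scalar stand-in for a cosine *similarity* kernel: f32 inputs,
/// f64 accumulation. Zero-norm inputs yield 0.0 (assumed convention).
pub fn cosine_similarity_f64(a: &[f32], b: &[f32]) -> f64 {
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b.iter()) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    let denom = (norm_a * norm_b).sqrt();
    if denom == 0.0 {
        0.0
    } else {
        dot / denom
    }
}

/// Distance = 1 - similarity, so identical vectors give 0 and
/// orthogonal vectors give 1.
pub fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    (1.0 - cosine_similarity_f64(a, b)) as f32
}
```

Keeping the subtraction in the wrapper leaves the similarity kernel reusable for callers that want similarity, while distance-ranked queries sort in the expected direction.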


let mut table = vec![0u16; 256 * 256];
for i in 0..k {
    for j in i..k {
        let d = self.entries[i].l1(&self.entries[j]) as u16;

P1: Avoid truncating palette L1 distances to u16

Base17::l1 returns u32 and can exceed 65,535 (17 dimensions of i16 differences), but build_distance_table narrows each value to u16. This silently wraps large distances before they are used by PaletteDistanceTable::distance/edge_distance, causing incorrect palette-edge scores and wrong decisions in any path relying on the precomputed table.
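The arithmetic behind this concern: each of 17 `i16` dimensions can differ by up to 65,535, so the L1 sum can reach 17 × 65,535 = 1,114,095, well beyond `u16::MAX`. A small sketch contrasting the wrapping cast with a saturating conversion (the helper names are illustrative, and saturation is only one possible fix; widening the table to `u32` at 256 KB is another):

```rust
/// Upper bound of an L1 distance over 17 i16 dimensions.
pub fn max_l1_17() -> u32 {
    17 * (i16::MAX as i32 - i16::MIN as i32) as u32
}

/// What `as u16` does: keeps only the low 16 bits, so large values wrap.
pub fn narrow_wrapping(d: u32) -> u16 {
    d as u16
}

/// Saturating alternative: anything above u16::MAX is pinned to u16::MAX,
/// which preserves "far" as "far" even though exact ordering is lost.
pub fn narrow_saturating(d: u32) -> u16 {
    u16::try_from(d).unwrap_or(u16::MAX)
}
```

With the wrapping cast a distance of 65,536 becomes 0, making two distant centroids look identical; the saturating form reports 65,535 instead.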


@AdaWorldAPI AdaWorldAPI merged commit 6311b52 into main Apr 26, 2026
1 of 6 checks passed

2 participants