87 changes: 87 additions & 0 deletions .claude/board/EPIPHANIES.md
@@ -2973,3 +2973,90 @@ The architecture's five consumer perspectives are not layers — they're project
**SoA vs Functional is not a choice — it's a WHERE.** BindSpace is SoA (columnar storage for SIMD). The algebra on it is Functional (methods on carriers). The SoA carries the state; the Functional methods transform it. Both exist simultaneously on the same data. The "struct of arrays vs object thinks for itself" tension resolves as: the ARRAY is the SoA, the ELEMENT (row, trajectory, fingerprint) thinks for itself via methods.

Cross-ref: CLAUDE.md §The Stance (AGI-as-glove, SoA columns ARE the AGI surface), lab-vs-canonical-surface.md (I1-I11 invariants), ExternalMembrane (contract::external_membrane), BindSpace (cognitive-shader-driver::bindspace).

## 2026-04-26 — FINDING: distance dispatch must be type-intrinsic, not crate-boundary-crossing

**Status:** FINDING
**Owner scope:** @family-codec-smith, @truth-architect, @host-glove-designer

The struct-of-arrays (BindSpace, RenderFrame, Arrow columns) carries heterogeneous
fingerprint types that each need a DIFFERENT distance function:

| Type | Distance | Where it lives | Notes |
|---|---|---|---|
| `Binary16K = [u64; 256]` | Hamming (popcount of XOR) | `ndarray::hpc::bitwise::hamming_distance_raw` | 16384-bit, SIMD VPOPCNTDQ |
| `Vsa16kF32 = [f32; 16_384]` | Cosine → FisherZ transform | `ndarray::hpc::heel_f64x8::cosine_f64_simd` | f32 dot/norm via F32x16 FMA |
| `CamPqCode = [u8; 6]` | ADC (asymmetric distance computation) | `ndarray::hpc::cam_pq::adc_distance` | Precomputed distance tables, O(1) |
| `PaletteEdge = [u8; 3]` | Palette L1 (lookup table) | `ndarray::hpc::palette_distance::SpoDistanceMatrices::distance` | bgz17 256×256 table, 1.8 ns |
| `Base17 = [u8; 17]` | Palette nearest (codebook search) | `bgz17::Palette::nearest` | 256 centroids, should use precomputed table |
| `HighHeelBGZ` container | Cascade (HHTL skip → palette → ADC fallback) | `ndarray::hpc::cascade` + `bgz-tensor::hhtl_cache` | Multi-level, route by `RouteAction` |

**The problem:** When a SoA column contains mixed types (e.g., one column is Binary16K,
another is CamPqCode), the distance dispatch currently happens at the call site — the
caller must know which distance function to use. This works inside a single crate, but
when the SoA lives in crate A (e.g., `cognitive-shader-driver::BindSpace`) and the
distance kernel lives in crate B (e.g., `ndarray::hpc::bitwise`), every call crosses
a crate boundary. That boundary is zero-cost when the kernel is `#[inline]` or generic,
because the compiler can inline it into the caller, but it is NOT zero-cost when the
call goes through a trait object (`dyn DistanceFn`) or any other form of dynamic
dispatch.

**The solution — type-intrinsic dispatch, not dynamic dispatch:**

The distance function should be a method ON the carrier type, not a free function
called FROM the SoA consumer. This follows the "object speaks for itself" doctrine
(CLAUDE.md §The Click):

```rust
// WRONG — caller must know the distance type:
let d = hamming_distance_raw(fp_a.as_bytes(), fp_b.as_bytes()); // crate boundary

// RIGHT — the type carries its own distance:
let d = fp_a.distance(&fp_b); // monomorphized, inlined, zero boundary tax
```

The contract already has `CodecRoute: Passthrough | CamPq` which names the regime.
What's missing is a `Distance` trait that each carrier implements:

```rust
pub trait Distance: Sized {
    fn distance(&self, other: &Self) -> u32;
    fn similarity(&self, other: &Self) -> f32 {
        1.0 - (self.distance(other) as f32 / Self::MAX_DISTANCE as f32)
    }
    const MAX_DISTANCE: u32;
}
```

Implementations:
- `impl Distance for [u64; 256]` → `hamming_distance_raw` (inline, SIMD)
- `impl Distance for CamPqCode` → ADC lookup (precomputed table ref)
- `impl Distance for PaletteEdge` → palette L1 table lookup
- `impl Distance for Vsa16kF32` → cosine → FisherZ (F32x16 FMA)

The trait monomorphizes at compile time, so there is no dynamic dispatch and no
crate boundary tax. The SoA consumer pairs rows of two columns, e.g.
`col_a.iter().zip(col_b).map(|(a, b)| a.distance(b))`, and the correct distance
function is selected by TYPE, not by runtime enum match.
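
A minimal sketch of how the trait and a generic sweep could fit together. The newtypes `Bits16K` and `Code3` and the helper `sweep` are hypothetical stand-ins invented for illustration, and the scalar `count_ones` loop replaces the SIMD kernel that the real implementation would call.

```rust
/// Sketch of the proposed trait (same shape as the definition above).
pub trait Distance: Sized {
    const MAX_DISTANCE: u32;
    fn distance(&self, other: &Self) -> u32;
    fn similarity(&self, other: &Self) -> f32 {
        1.0 - (self.distance(other) as f32 / Self::MAX_DISTANCE as f32)
    }
}

/// Hypothetical newtype standing in for `Binary16K`.
pub struct Bits16K(pub [u64; 256]);

impl Distance for Bits16K {
    const MAX_DISTANCE: u32 = 256 * 64; // 16,384 bits
    #[inline]
    fn distance(&self, other: &Self) -> u32 {
        // Scalar Hamming distance: popcount of XOR, word by word.
        self.0.iter().zip(other.0.iter()).map(|(a, b)| (a ^ b).count_ones()).sum()
    }
}

/// Hypothetical 3-byte code compared with a per-byte absolute difference.
pub struct Code3(pub [u8; 3]);

impl Distance for Code3 {
    const MAX_DISTANCE: u32 = 3 * 255;
    #[inline]
    fn distance(&self, other: &Self) -> u32 {
        self.0.iter().zip(other.0.iter()).map(|(a, b)| u32::from(a.abs_diff(*b))).sum()
    }
}

/// Generic sweep of one query against a column; the kernel is chosen by `T`.
pub fn sweep<T: Distance>(query: &T, column: &[T]) -> Vec<u32> {
    column.iter().map(|row| query.distance(row)).collect()
}

fn main() {
    let zero = Bits16K([0; 256]);
    let full = Bits16K([u64::MAX; 256]);
    println!("{:?}", sweep(&zero, &[full]));
    println!("{:?}", sweep(&Code3([0, 0, 0]), &[Code3([1, 2, 3])]));
}
```

Because `sweep` is generic rather than taking `&dyn Distance`, each instantiation compiles to a direct call into the matching `distance` body.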

**Where this trait should live:** `lance-graph-contract` (zero deps). The
implementations live in ndarray (for SIMD kernels) or in the carrier crate
(for precomputed tables). The contract defines the interface; ndarray provides
the hardware acceleration; the SoA consumer never needs to know which distance
kernel runs.

**Hard-coded dispatch within the same crate is fine** — when `BindSpace` calls
`hamming_distance_raw` on its `content` column, that's a direct function call
into ndarray, monomorphized and inlined. The problem only arises if we try to
make the SoA generic over distance type via `dyn` trait objects. Don't do that.
Keep the dispatch compile-time via generics or type-specific methods. The SoA
pays zero boundary tax because Rust's monomorphization erases the crate boundary.

**FisherZ note:** Cosine similarity ∈ [-1, 1] should not be averaged directly,
because its scale is compressed near ±1. The FisherZ transform `z = atanh(r)` maps
it to an approximately normally distributed variable that can be averaged, then
`r = tanh(z)` maps back. This matters when the SoA accumulates similarities across
columns (e.g., weighted multi-column distance). The `Distance` trait should expose
`fn similarity_z(&self, other: &Self) -> f32` for the FisherZ-transformed variant,
defaulting to `atanh(similarity())`. Note that `atanh(±1)` is infinite, so the
default needs to clamp its input strictly inside (-1, 1).
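
The averaging step can be sketched as follows. The function names echo `fisher_z_inverse` and `mean_similarity_fisher` mentioned in this PR, but the signatures, the helper `fisher_z`, and the clamp limit are assumptions made for illustration, not the contract's actual API.

```rust
/// Fisher z-transform, clamped so that `atanh` stays finite at r = ±1.
/// The clamp limit is an arbitrary choice for this sketch.
pub fn fisher_z(r: f32) -> f32 {
    const LIMIT: f32 = 0.999_999;
    r.clamp(-LIMIT, LIMIT).atanh()
}

/// Inverse transform: maps an averaged z value back to a similarity.
pub fn fisher_z_inverse(z: f32) -> f32 {
    z.tanh()
}

/// Weighted mean of similarities, computed in z-space.
/// Returns `None` for mismatched lengths or a non-positive weight sum.
pub fn mean_similarity_fisher(sims: &[f32], weights: &[f32]) -> Option<f32> {
    if sims.len() != weights.len() {
        return None;
    }
    let total: f32 = weights.iter().sum();
    if total <= 0.0 {
        return None;
    }
    let z_sum: f32 = sims
        .iter()
        .zip(weights)
        .map(|(&r, &w)| w * fisher_z(r))
        .sum();
    Some(fisher_z_inverse(z_sum / total))
}

fn main() {
    // The z-space mean is pulled toward the stronger similarity,
    // so it lands above the arithmetic mean of 0.45.
    let m = mean_similarity_fisher(&[0.0, 0.9], &[1.0, 1.0]);
    println!("{:?}", m);
}
```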

Cross-ref: CLAUDE.md §The Click ("object speaks for itself"), I1 Codec Regime
Split (`CodecRoute`), `contract::cam::DistanceTableProvider` (existing trait for
ADC), `ndarray::hpc::bitwise::hamming_distance_raw`, `ndarray::hpc::palette_distance`.
53 changes: 53 additions & 0 deletions .claude/board/TECH_DEBT.md
@@ -1071,3 +1071,56 @@ Cross-ref: `container_bs/dn_redis.rs`; `callcenter-membrane-v1.md` §§595–803
| Diagnostic | TD-INT-11 |

All 14 items are additive (add call site). Zero items require type creation or code deletion.

## 2026-04-26 — TD-DIST-1: Distance trait missing from contract (type-intrinsic dispatch)

**Status:** Open
**Severity:** Medium (no runtime cost today — hard-coded dispatch works — but blocks
generic SoA distance sweeps)

The contract has `CodecRoute` (Passthrough | CamPq) naming the regime and
`DistanceTableProvider` for ADC, but no unified `Distance` trait that each
carrier type implements. Today each call site hard-codes which distance
function to use (`hamming_distance_raw` for Binary16K, `adc_distance` for
CamPq, `cosine_f64_simd` for Vsa16kF32). This works but prevents writing
generic distance sweeps over mixed SoA columns.

**Fix:** Add `pub trait Distance` to `contract::cam` (or a new `contract::distance`
module). Implement for `[u64; 256]`, `CamPqCode`, `PaletteEdge`, `Vsa16kF32`.
Include `similarity_z()` for FisherZ-transformed cosine averaging.
See EPIPHANIES.md 2026-04-26 distance-dispatch entry for full design.

**Blocked by:** nothing — pure additive.
**Unblocks:** generic SoA distance accumulation, multi-column weighted distance,
render-frame similarity for force-directed layout (CAM-PQ pruning + HHTL cascade).

## 2026-04-26 — TD-DIST-2: vector_ops.rs still has scalar dot/norm/cosine (4 loops)

**Status:** Open
**Severity:** High (hot path in DataFusion UDF — L2/cosine queries)

`vector_ops.rs` lines 140, 160, 179, 189 have 4 independent scalar
`.iter().map().sum()` loops for dot product, norm², cosine similarity.
Should swap for `ndarray::hpc::heel_f64x8::{dot_f64_simd, cosine_f64_simd}`.
Estimated 8-12× speedup (chunked F64x8 FMA vs scalar).
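
To illustrate the shape of that change, here is a sketch that contrasts the scalar loop with a chunked version using eight independent f64 accumulators. It is plain Rust that the compiler may auto-vectorize; it is not the `ndarray::hpc::heel_f64x8` kernel, and the function names are made up for this example.

```rust
/// Scalar baseline: one accumulator, one multiply-add per iteration.
pub fn dot_scalar(a: &[f32], b: &[f32]) -> f64 {
    a.iter().zip(b).map(|(&x, &y)| f64::from(x) * f64::from(y)).sum()
}

/// Chunked variant: eight independent lanes, widened to f64, plus a scalar tail.
pub fn dot_chunked(a: &[f32], b: &[f32]) -> f64 {
    let n = a.len().min(b.len());
    let (a, b) = (&a[..n], &b[..n]);
    let mut acc = [0.0f64; 8];
    let mut ca = a.chunks_exact(8);
    let mut cb = b.chunks_exact(8);
    for (xa, xb) in ca.by_ref().zip(cb.by_ref()) {
        for lane in 0..8 {
            acc[lane] += f64::from(xa[lane]) * f64::from(xb[lane]);
        }
    }
    // Elements left over after the last full chunk of eight.
    let tail: f64 = ca
        .remainder()
        .iter()
        .zip(cb.remainder())
        .map(|(&x, &y)| f64::from(x) * f64::from(y))
        .sum();
    acc.iter().sum::<f64>() + tail
}

/// Cosine similarity built from three chunked dot products.
pub fn cosine(a: &[f32], b: &[f32]) -> f64 {
    let denom = (dot_chunked(a, a) * dot_chunked(b, b)).sqrt();
    if denom == 0.0 { 0.0 } else { dot_chunked(a, b) / denom }
}

fn main() {
    let a: Vec<f32> = (1..=10).map(|v| v as f32).collect();
    let b = vec![2.0f32; 10];
    println!("{} {}", dot_scalar(&a, &b), dot_chunked(&a, &b));
    println!("{}", cosine(&a, &a));
}
```

The two versions sum in a different order, so results can differ in the last bits for non-integer inputs.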

## 2026-04-26 — TD-DIST-3: bgz17 Palette::nearest() uses brute-force 256×17 L1

**Status:** Open
**Severity:** Medium (build-time hot path for palette construction)

`bgz17/palette.rs` lines 56-65 iterate all 256 centroids per query.
Should use precomputed distance table from `ndarray::hpc::palette_distance`.
Estimated 100× speedup for encoding (O(1) table lookup vs O(256) L1 per query).

## 2026-04-26 — Paid Debt: TD-DIST-1/2/3 all shipped in commit 8603148

- **TD-DIST-1** (Distance trait): `contract::distance` module with `Distance` trait,
`fisher_z_inverse`, `mean_similarity_fisher`. Impls for `[u64; 256]`, `[u8; 6]`, `[u8; 3]`.
11 tests. Status: **PAID**.
- **TD-DIST-2** (vector_ops scalar→SIMD): `cosine_distance`, `cosine_similarity`,
`dot_product_distance`, `dot_product_similarity` all now delegate to
`ndarray::hpc::heel_f64x8::cosine_f32_to_f64_simd` / `dot_f64_simd`. Status: **PAID**.
- **TD-DIST-3** (Palette distance table): `Palette::build_distance_table()` →
`PaletteDistanceTable` with O(1) `distance(a, b)` and `edge_distance(a, b)`.
128 KB table, L2-resident. Status: **PAID**.
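
The review comment on `build_distance_table` in this PR notes that an L1 sum over 17 `i16` dimensions can exceed `u16::MAX`. One possible mitigation that keeps the 128 KB footprint is to saturate instead of casting. The sketch below uses a hypothetical `Centroid` type and `SaturatingTable` name, not the real `Base17` or `PaletteDistanceTable`.

```rust
/// Hypothetical stand-in for a 17-dimensional centroid (not the real `Base17`).
#[derive(Clone, Copy)]
pub struct Centroid(pub [i16; 17]);

impl Centroid {
    /// L1 distance, widened to u32 so the sum itself cannot overflow.
    pub fn l1(&self, other: &Self) -> u32 {
        self.0.iter().zip(other.0.iter()).map(|(a, b)| u32::from(a.abs_diff(*b))).sum()
    }
}

/// 256×256 u16 table whose entries saturate at `u16::MAX` instead of wrapping.
pub struct SaturatingTable {
    table: Vec<u16>,
}

impl SaturatingTable {
    pub fn build(entries: &[Centroid]) -> Self {
        assert!(entries.len() <= 256, "palette indices are u8");
        let mut table = vec![0u16; 256 * 256];
        for i in 0..entries.len() {
            for j in i..entries.len() {
                // try_from fails above 65,535; fall back to the maximum.
                let d = u16::try_from(entries[i].l1(&entries[j])).unwrap_or(u16::MAX);
                table[i * 256 + j] = d;
                table[j * 256 + i] = d;
            }
        }
        Self { table }
    }

    #[inline]
    pub fn distance(&self, a: u8, b: u8) -> u16 {
        self.table[usize::from(a) * 256 + usize::from(b)]
    }
}

fn main() {
    let zero = Centroid([0; 17]);
    let mut far = Centroid([0; 17]);
    far.0[0] = i16::MAX;
    far.0[1] = i16::MAX;
    far.0[2] = 2; // exact L1 distance to `zero` is 65,536
    let exact = zero.l1(&far);
    println!("exact = {exact}, cast = {}", exact as u16);
    let table = SaturatingTable::build(&[zero, far]);
    println!("saturated = {}", table.distance(0, 1));
}
```

Saturation preserves ordering below the cap but collapses all larger distances to one value. Storing `u32` entries would keep them exact at the cost of a 256 KB table.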
2 changes: 1 addition & 1 deletion .github/workflows/build.yml
@@ -16,7 +16,7 @@ concurrency:

env:
CARGO_TERM_COLOR: always
RUSTFLAGS: "-C debuginfo=1"
RUSTFLAGS: "-C debuginfo=1 -C target-cpu=x86-64-v3"
RUST_BACKTRACE: "1"
CARGO_INCREMENTAL: "0"

2 changes: 1 addition & 1 deletion .github/workflows/rust-publish.yml
@@ -20,7 +20,7 @@ on:

env:
CARGO_TERM_COLOR: always
RUSTFLAGS: "-C debuginfo=1"
RUSTFLAGS: "-C debuginfo=1 -C target-cpu=x86-64-v3"
RUST_BACKTRACE: "1"
CARGO_INCREMENTAL: "0"
CARGO_BUILD_JOBS: "1"
2 changes: 1 addition & 1 deletion .github/workflows/rust-test.yml
@@ -16,7 +16,7 @@ concurrency:

env:
CARGO_TERM_COLOR: always
RUSTFLAGS: "-C debuginfo=1"
RUSTFLAGS: "-C debuginfo=1 -C target-cpu=x86-64-v3"
RUST_BACKTRACE: "1"
CARGO_INCREMENTAL: "0"

2 changes: 1 addition & 1 deletion .github/workflows/style.yml
@@ -16,7 +16,7 @@ concurrency:

env:
CARGO_TERM_COLOR: always
RUSTFLAGS: "-C debuginfo=1"
RUSTFLAGS: "-C debuginfo=1 -C target-cpu=x86-64-v3"

jobs:
format:
10 changes: 9 additions & 1 deletion Dockerfile
@@ -1,7 +1,10 @@
# lance-graph — Railway compile-test image
# lance-graph — Railway compile-test image (AVX2 default)
# Verifies the workspace builds cleanly (core + bgz17 + planner + contract)
# Requires Rust 1.94.0 (LazyLock, modern std APIs)
#
# CPU detection & SIMD dispatch documentation: see Dockerfile.md
# AVX-512 pinned variant: see Dockerfile.avx512
#
# Build: docker build -t lance-graph-test .
# Run: docker run --rm lance-graph-test

@@ -38,6 +41,11 @@ COPY crates/bgz17/Cargo.toml crates/bgz17/Cargo.toml
# Copy source
COPY crates/ crates/

# Default target: x86-64-v3 (AVX2) — runs on GitHub CI and most servers.
# Use Dockerfile.avx512 for x86-64-v4 (AVX-512) on Skylake-X / Ice Lake / Sapphire Rapids.
# The .cargo/config.toml pins x86-64-v4 for LOCAL builds; override here for portability.
ENV RUSTFLAGS="-C target-cpu=x86-64-v3"

# Build bgz17 standalone (zero deps, fast check)
RUN cargo build --release --manifest-path crates/bgz17/Cargo.toml 2>&1 \
&& echo "=== BGZ17 BUILD OK ==="
3 changes: 3 additions & 0 deletions Dockerfile.avx512
@@ -4,6 +4,9 @@
#
# ONLY deploy on AVX-512 hardware.
#
# CPU detection & SIMD dispatch documentation: see Dockerfile.md
# Portable (AVX2) variant: see Dockerfile
#
# Build: docker build -f Dockerfile.avx512 -t lance-graph-avx512 .
# Run: docker run --rm lance-graph-avx512

111 changes: 111 additions & 0 deletions Dockerfile.md
@@ -0,0 +1,111 @@
# lance-graph Docker CPU Detection & SIMD Dispatch

## Build Strategy (Four Targets)

| Target | Dockerfile | RUSTFLAGS | Use case |
|---|---|---|---|
| **Portable (AVX2)** | `Dockerfile` | `-C target-cpu=x86-64-v3` | GitHub CI, general servers |
| **AVX-512 pinned** | `Dockerfile.avx512` | `-C target-cpu=x86-64-v4` | Production (Skylake-X+) |
| **HHTL-D TTS** | `Dockerfile.hhtld` | (inherits) | TTS inference container |
| **Local dev** | `.cargo/config.toml` | `-C target-cpu=x86-64-v4` | Developer machines |

## How lance-graph Uses SIMD

lance-graph delegates all SIMD work to **ndarray** (mandatory dependency).
ndarray's `src/simd.rs` polyfill provides the dispatch:

```
Consumer code (lance-graph):
ndarray::hpc::bitwise::hamming_distance_raw(a, b)
ndarray::simd::F32x16::mul_add(b, c)
ndarray::hpc::renderer::integrate_simd(pos, vel, dt, damp)

Polyfill (ndarray simd.rs):
┌─────────────────────────┐
│ compile-time target_cpu │
├─────────┬───────────────┤
│ v4 │ v3 / lower │
├─────────┼───────────────┤
│ __m512 │ 2× __m256 or │
│ native │ scalar loop │
└─────────┴───────────────┘
+
┌──────────────────────────────┐
│ runtime LazyLock<Tier> │
│ is_x86_feature_detected!() │
│ → per-function AVX-512 even │
│ when compiled at v3 │
└──────────────────────────────┘
```
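
The runtime half of that diagram can be sketched with the standard library alone. This is an illustration of the pattern, not ndarray's `simd.rs`: the `Tier` enum, `detect`, and `hamming` are hypothetical names, and every tier falls through to the same scalar loop where a real polyfill would call `#[target_feature]` kernels.

```rust
use std::sync::LazyLock;

/// Hypothetical capability tiers, ordered from least to most capable.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Tier {
    Scalar,
    Avx2,
    Avx512,
}

/// Probe the running CPU once; non-x86_64 targets report `Scalar` here.
fn detect() -> Tier {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx512f") {
            return Tier::Avx512;
        }
        if std::is_x86_feature_detected!("avx2") {
            return Tier::Avx2;
        }
    }
    Tier::Scalar
}

/// Detection result, computed on first use and cached for the process lifetime.
pub static TIER: LazyLock<Tier> = LazyLock::new(detect);

/// Hamming distance with a per-call tier check.
pub fn hamming(a: &[u64], b: &[u64]) -> u32 {
    match *TIER {
        // Assumption for this sketch: all tiers share one scalar body.
        Tier::Avx512 | Tier::Avx2 | Tier::Scalar => {
            a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
        }
    }
}

fn main() {
    println!("runtime tier: {:?}", *TIER);
    println!("hamming: {}", hamming(&[0b1011, 0], &[0, u64::MAX]));
}
```

Caching the probe in a `LazyLock` keeps the feature check out of the hot loop; only the `match` on the cached value remains per call.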

### What lance-graph calls from ndarray SIMD

| lance-graph location | ndarray function | What it does |
|---|---|---|
| `driver.rs` (shader hot loop) | `bitwise::hamming_distance_raw` | Content-plane Hamming pre-pass (16K-bit fingerprints) |
| `vector_ops.rs` (DataFusion UDF) | `bitwise::hamming_distance_raw` | SQL `hamming_distance()` function |
| `fingerprint.rs` (graph) | `bitwise::hamming_distance_raw` | Graph fingerprint similarity |
| `blasgraph/types.rs` | Own AVX-512/AVX2 Hamming | Hand-rolled (predates ndarray integration) |

### `.cargo/config.toml` vs CI RUSTFLAGS

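**Important:** The `RUSTFLAGS` env var **replaces** (it does not append to) the
`rustflags` array in `.cargo/config.toml`. Cargo reads rustflags from the first
source that is set, and the environment variable takes precedence over
`build.rustflags`. This is a Cargo design decision.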

lance-graph's `.cargo/config.toml` sets `target-cpu=x86-64-v4` for local dev.
CI workflows set `RUSTFLAGS="-C debuginfo=1 -C target-cpu=x86-64-v3"` which
**overrides** config.toml entirely. The CI binary targets AVX2.

This is intentional:
- Local dev: maximum SIMD (AVX-512, everything inlined)
- CI: portable (AVX2, runtime detection for anything higher)
- Production Docker: choose `Dockerfile` (AVX2) or `Dockerfile.avx512`
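
Because the override is silent, it can help to confirm which features a given binary was actually compiled with. A small sketch using only `cfg!`, which is evaluated at compile time; the function name `compiled_features` is made up for this example.

```rust
/// Lists a few x86 target features that were enabled at compile time.
/// `x86-64-v3` enables avx2 and fma; `x86-64-v4` additionally enables avx512f.
pub fn compiled_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    if cfg!(target_feature = "avx2") {
        enabled.push("avx2");
    }
    if cfg!(target_feature = "fma") {
        enabled.push("fma");
    }
    if cfg!(target_feature = "avx512f") {
        enabled.push("avx512f");
    }
    enabled
}

fn main() {
    // The list depends on the RUSTFLAGS in effect when this was built.
    println!("compiled with: {:?}", compiled_features());
}
```

Running this in CI and locally should show the difference between the two configurations described above.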

## AMX Detection

Intel AMX (Sapphire Rapids+) is detected at runtime by ndarray:
`ndarray::hpc::amx_matmul::amx_available()` checks CPUID + OS XSAVE support.
AMX kernels are always compiled in and gated at call sites. No Dockerfile
or RUSTFLAGS change needed — it works with any `target-cpu`.

## NEON (ARM / aarch64 / Raspberry Pi)

ndarray detects NEON automatically on aarch64 (it's mandatory). The `dotprod`
extension (Pi 5 / A76+) is runtime-detected for 4× int8 throughput.
lance-graph inherits this via ndarray; no ARM-specific configuration needed.

## Choosing the Right Dockerfile

```
GitHub CI / PR checks → Dockerfile (AVX2, -C target-cpu=x86-64-v3)
Railway / production → Dockerfile.avx512 (-C target-cpu=x86-64-v4)
TTS inference → Dockerfile.hhtld (downloads codebooks + runs decoder)
Raspberry Pi / ARM → Dockerfile (NEON auto-detected at runtime)
Maximum compatibility → docker build --build-arg RUSTFLAGS="-C target-cpu=x86-64"
```

## Verifying CPU Features

```bash
# Inside the container:
grep -oP 'avx512\w+' /proc/cpuinfo | sort -u

# From Rust (ndarray), inside a program rather than the shell:
#   use ndarray::hpc::simd_caps::simd_caps;
#   println!("{:?}", simd_caps()); // CpuCaps { avx512: true, avx2: true, fma: true, ... }
```

## Build Examples

```bash
# Default (AVX2) — safe everywhere
docker build -t lance-graph-test .

# AVX-512 pinned — production servers
docker build -f Dockerfile.avx512 -t lance-graph-avx512 .

# TTS inference
docker build -f Dockerfile.hhtld \
--build-arg RELEASE_TAG=v0.1.0 \
-t lance-graph-tts:v0.1.0 .
```
53 changes: 53 additions & 0 deletions crates/bgz17/src/palette.rs
@@ -65,6 +65,24 @@ impl Palette {
        best_idx
    }

    /// Build a precomputed distance table for O(1) inter-centroid distance.
    ///
    /// Returns a 256×256 u16 table where `table[i][j]` = L1 distance between
    /// `entries[i]` and `entries[j]`. Used by the renderer and cascade skip
    /// for fast palette-edge distance without recomputing L1 per query.
    pub fn build_distance_table(&self) -> PaletteDistanceTable {
        let k = self.entries.len();
        let mut table = vec![0u16; 256 * 256];
        for i in 0..k {
            for j in i..k {
                let d = self.entries[i].l1(&self.entries[j]) as u16;

Review comment (P1): Avoid truncating palette L1 distances to u16

`Base17::l1` returns `u32` and can exceed 65,535 (17 dimensions of `i16` differences), but `build_distance_table` narrows each value to `u16`. This silently wraps large distances before they are used by `PaletteDistanceTable::distance`/`edge_distance`, causing incorrect palette-edge scores and wrong decisions in any path relying on the precomputed table.

                table[i * 256 + j] = d;
                table[j * 256 + i] = d;
            }
        }
        PaletteDistanceTable { table, size: k }
    }

    /// Encode an SpoBase17 edge to palette indices.
    pub fn encode_edge(&self, edge: &SpoBase17) -> PaletteEdge {
        PaletteEdge {
@@ -226,6 +244,41 @@ impl Palette {
    }
}

/// Precomputed 256×256 L1 distance table for O(1) inter-centroid lookup.
///
/// Built once from a `Palette` via `palette.build_distance_table()`.
/// Used by the cascade skip (HHTL), renderer force-directed layout, and
/// any path that needs repeated palette-edge distance without recomputing L1.
///
/// Memory: 256×256×2 = 128 KB (fits L2 cache). Build cost: O(k²×17).
#[derive(Clone)]
pub struct PaletteDistanceTable {
    table: Vec<u16>,
    size: usize,
}

impl PaletteDistanceTable {
    /// O(1) distance between two palette indices.
    #[inline]
    pub fn distance(&self, a: u8, b: u8) -> u16 {
        self.table[a as usize * 256 + b as usize]
    }

    /// Number of active entries (≤ 256).
    pub fn size(&self) -> usize { self.size }

    /// Distance between two PaletteEdges (sum of S + P + O distances).
    #[inline]
    pub fn edge_distance(&self, a: PaletteEdge, b: PaletteEdge) -> u32 {
        self.distance(a.s_idx, b.s_idx) as u32
            + self.distance(a.p_idx, b.p_idx) as u32
            + self.distance(a.o_idx, b.o_idx) as u32
    }

    /// Memory footprint in bytes.
    pub fn byte_size(&self) -> usize { self.table.len() * 2 }
}

/// Palette resolution: trade compression vs accuracy.
///
/// Edge count determines optimal palette size: