|
| 1 | +# KNOWLEDGE: Vertical SIMD — W1a Consumer Contract |
| 2 | + |
| 3 | +## READ BY: |
| 4 | +- `savant-architect` agent — before designing any new public `pub fn` in `src/simd_*.rs` |
| 5 | +- `sentinel-qa` agent — when auditing the saturating / bounds-aware / scalar-fallback discipline on a SIMD addition |
| 6 | +- Any contributor opening a PR that adds an `impl` block on `F32x16` / `I8x16` / `U8x32` / `U64x8` etc. |
| 7 | +- Any contributor adding a new public function under `src/simd_ops.rs` or `src/simd_int_ops.rs` |
| 8 | + |
| 9 | +## P0 TRIGGERS: |
| 10 | +- About to file a PR adding `pub fn` to `src/simd_*.rs` → read this first |
| 11 | +- About to claim "X SIMD instruction saturates by ISA" → read §"VPABSB correction" first |
| 12 | +- Five `TD-NDARRAY-SIMD-*` issues are about to be filed against this repo from the `AdaWorldAPI/lance-graph` consumer contract → those are the W1a queue described below |
| 13 | + |
| 14 | +--- |
| 15 | + |
| 16 | +## Why this doc exists |
| 17 | + |
| 18 | +`AdaWorldAPI/lance-graph` (the obligatory spine for the Ada architecture) carries a hard architectural invariant: **all SIMD must come from `ndarray::simd` via the polyfill — `simd.rs` + `simd_ops.rs` > `simd_{type}.rs` per-arch. Raw intrinsics outside `ndarray/src/simd_*.rs` are a violation**, enforced by the `simd-savant` agent at `lance-graph:.claude/agents/simd-savant.md`. |
| 19 | + |
| 20 | +A PRE-MERGE audit of `lance-graph` main on 2026-05-16 surfaced **158 raw-intrinsic violations across 5 consumer crates** plus **3 missing primitives** in `ndarray::simd` that block clean remediation. The lance-graph side is staged to migrate (in 5 sequential consumer PRs); the missing primitives must land in ndarray FIRST. This doc is the contract for what those primitives must do, with implementation details called out where consumer-side correctness depends on getting the semantics right. |
| 21 | + |
| 22 | +The architectural shape this doc serves is captured in detail at: |
| 23 | +- `AdaWorldAPI/lance-graph:.claude/knowledge/ndarray-vertical-simd-alien-magic.md` — the canonical reference, "alien magic" framing |
| 24 | +- `AdaWorldAPI/lance-graph:.claude/agents/simd-savant.md` — the consumer-side enforcement card |
| 25 | +- `AdaWorldAPI/lance-graph:.claude/board/EPIPHANIES.md` § `E-SIMD-SWEEP-1` (2026-05-16) — the 158-violation finding |
| 26 | + |
| 27 | +--- |
| 28 | + |
| 29 | +## The pattern (one paragraph) |
| 30 | + |
| 31 | +ndarray's SIMD surface is shaped to fit exactly what the Ada stack vertically needs — not as a generic library that consumers wrap, but as **struct methods on typed wrappers** (`I8x16`, `U8x32`, `F32x16`, `U64x8`, …) plus **closure-parameterized batch primitives** that absorb the consumer's domain semantics. Consumers see zero raw intrinsics, zero `cfg(target_arch)`, zero runtime feature-detect — they call `I8x16::from_i4_packed_u64(...)`, `I8x16::saturating_abs(...)`, `batch_packed_i4_16(..., |lanes, aux| { ... })`. The polyfill owns the runtime feature dispatch, lane chunking, tail handling, and scalar fallback. Per-arch code lives in `simd_avx512.rs` / `simd_neon.rs` / `simd_wasm.rs`; nothing arch-specific leaks above the `src/simd*.rs` namespace. |
| 32 | + |
| 33 | +--- |
| 34 | + |
| 35 | +## VPABSB correction (P0 — read before implementing saturating_abs) |
| 36 | + |
| 37 | +**`_mm512_abs_epi8` (VPABSB) does NOT saturate `i8::MIN`.** The Intel intrinsic returns the same bit pattern for `0x80` — i.e., `abs(i8::MIN) = i8::MIN` because `+128` does not fit in `i8`. An earlier draft of the consumer contract (2026-05-16 morning) claimed the instruction saturated `i8::MIN → 127` by ISA. Codex caught this on `lance-graph` PR #400; the correction is binding. |
| 38 | + |
| 39 | +**Correct AVX-512 implementation of `I8x16::saturating_abs`:** |
| 40 | + |
| 41 | +```rust |
| 42 | +// AVX-512 path |
| 43 | +let raw_abs = unsafe { _mm512_abs_epi8(self.0) }; |
| 44 | +let clamped = unsafe { |
| 45 | + _mm512_min_epu8(raw_abs, _mm512_set1_epi8(0x7f)) |
| 46 | +}; |
| 47 | +I8x16(clamped) |
| 48 | +``` |
| 49 | + |
| 50 | +The mechanic: |
| 51 | +1. **VPABSB** computes the bit-pattern absolute value lane-wise. For `0x80` it returns `0x80` (the bit pattern of `+128` interpreted as unsigned). For everything else, `abs(x) < 0x80`, so the result fits in `i8` correctly. |
| 52 | +2. **VPMINUB** (unsigned-byte min) then clamps `0x80` (=128 unsigned) down to `0x7f` (=127). All lanes with `abs(x) < 0x80` are unaffected because `min_epu8(x, 0x7f) = x` for `x ≤ 0x7f` and `min_epu8(0x80, 0x7f) = 0x7f`. |
| 53 | + |
| 54 | +Equivalent NEON: |
| 55 | +```rust |
| 56 | +// vqabsq_s8 is hardware-saturating (the `q` suffix means saturating) |
| 57 | +I8x16(unsafe { vqabsq_s8(self.0) }) |
| 58 | +// Returns 127 for i8::MIN, identical to the AVX-512 + clamp result |
| 59 | +``` |
| 60 | + |
| 61 | +Scalar fused-loop: |
| 62 | +```rust |
| 63 | +for lane in 0..16 { |
| 64 | + out[lane] = input[lane].saturating_abs(); // stdlib, well-defined |
| 65 | +} |
| 66 | +``` |
| 67 | + |
| 68 | +**Mandatory test** (binding for the PR): |
| 69 | +```rust |
| 70 | +#[test] |
| 71 | +fn saturating_abs_i8_min_matches_across_backends() { |
| 72 | + let input = I8x16::splat(i8::MIN); |
| 73 | + let result = input.saturating_abs(); |
| 74 | + assert_eq!(result.lane_i8::<0>(), i8::MAX); |
| 75 | + // ... and assert all 16 lanes equal i8::MAX |
| 76 | +} |
| 77 | +``` |
| 78 | + |
| 79 | +Any saturating-abs primitive in ndarray that does NOT produce `i8::MAX` for `i8::MIN` input is broken. The widen-then-negate trick (i8 → i64, then negate, then compare against threshold) used in `lance-graph` PR #398's mul.rs is a different mechanism and **not a substitute** — the new `I8x16::saturating_abs` must produce the saturating result in the same byte-wide register without widening, because downstream consumers will rely on byte-wide semantics for tight i4/i8 packed loops. |
| 80 | + |
| 81 | +--- |
| 82 | + |
| 83 | +## W1a queue — 5 primitives ndarray must ship |
| 84 | + |
| 85 | +Each is a tight-scope PR. Recommended: one branch per primitive, parallel review. |
| 86 | + |
| 87 | +### W1a-#1 — `TD-NDARRAY-SIMD-UNPACK-I4-16D` |
| 88 | + |
| 89 | +**Purpose:** unpack a `u64` of 16 packed signed nibbles (i4) into an `I8x16` with sign extension. Plus the closure-batch entry that the consumer's `mul::i4_eval::batch` dispatch calls. |
| 90 | + |
| 91 | +**API surface:** |
| 92 | +```rust |
| 93 | +impl I8x16 { |
| 94 | + /// Unpack 16 signed i4 nibbles from a u64 into 16 i8 lanes |
| 95 | + /// (sign-extended). Nibble layout: lane[i] = sign_extend_4((packed >> (4*i)) & 0xf, i8). |
| 96 | + pub fn from_i4_packed_u64(packed: u64) -> Self; |
| 97 | + |
| 98 | + /// Const-folded lane extract. |
| 99 | + pub fn lane_i8<const N: usize>(self) -> i8; |
| 100 | +} |
| 101 | + |
| 102 | +/// Closure-parameterized batch: run `f` over each (unpacked_i8x16, aux[i]) pair. |
| 103 | +/// Bounds-aware tail handling; scalar fallback on unsupported arch. |
| 104 | +pub fn batch_packed_i4_16<E, F>( |
| 105 | + packed: &[u64], |
| 106 | + aux: &[i8], |
| 107 | + out: &mut [E], |
| 108 | + f: F, |
| 109 | +) |
| 110 | +where |
| 111 | + F: Fn(I8x16, i8) -> E + Sync + Send, |
| 112 | + E: Copy; |
| 113 | +``` |
| 114 | + |
| 115 | +**Per-arch implementation hints:** |
| 116 | +- **AVX-512:** load 16 × i8 from u64 via `_mm_cvtsi64_si128` + extend with `_mm512_cvtepi8_epi16` + nibble shuffle (PEXTRB or VPSHUFB with a mask LUT), then sign-extend by `_mm_cvtepi8_epi16`. Bench against alternative: PDEP (`_pdep_u64` × 2) into two u64 halves, then load + `vpmovsxbw` for sign-extend. Pick whichever benches faster on Zen4 + Sapphire Rapids. |
| 117 | +- **NEON:** `vld1_u8` 8 bytes into `uint8x8_t`, then nibble-split via `vshl_n_s8(v, 4)` and `vshr_n_s8(v, 4)`. Sign-extension is automatic from `vshr_n_s8`. |
| 118 | +- **Scalar:** fused loop reading 16 nibbles via `((packed >> (4*i)) & 0xf) as i8` with manual sign-extend (`if x > 7 { x - 16 } else { x }`). |
| 119 | + |
| 120 | +**Consumer call site:** `lance-graph:crates/lance-graph-contract/src/mul.rs::i4_eval::batch` (5 batch fns over `QualiaI4_16D(u64)`). The closure-batch absorbs the 5 fns into closures + classifier names. |
| 121 | + |
| 122 | +**PR #398 codex P1 (NEON OOB at `len==2`) is closed by this primitive** because the batch entry owns tail handling; consumers no longer reach for raw `vld1q_u64(&qualia[i+1].0 as *const u64)`. |
| 123 | + |
| 124 | +--- |
| 125 | + |
| 126 | +### W1a-#2 — `TD-NDARRAY-SIMD-SATURATING-ABS-I8` |
| 127 | + |
| 128 | +**Purpose:** byte-wide saturating absolute value. Closes codex P2 i8::MIN divergence on `lance-graph` PR #398 by giving consumers a single source-of-truth. |
| 129 | + |
| 130 | +**API surface:** |
| 131 | +```rust |
| 132 | +impl I8x16 { |
| 133 | + /// Lane-wise saturating absolute value. saturating_abs(i8::MIN) == i8::MAX. |
| 134 | + /// All lanes are independently saturated. |
| 135 | + pub fn saturating_abs(self) -> Self; |
| 136 | +} |
| 137 | + |
| 138 | +impl I8x32 { |
| 139 | + pub fn saturating_abs(self) -> Self; // parity |
| 140 | +} |
| 141 | +``` |
| 142 | + |
| 143 | +**Per-arch implementation:** see § "VPABSB correction" above. The AVX-512 path is `_mm512_min_epu8(_mm512_abs_epi8(x), _mm512_set1_epi8(0x7f))`; NEON is `vqabsq_s8`; scalar is `i8::saturating_abs`. |
| 144 | + |
| 145 | +**Consumer:** `lance-graph:crates/lance-graph-contract/src/mul.rs` (Direction-B fix from PP-16 preflight-drift-auditor 2026-05-16). Spec line 233 of `lance-graph:.claude/specs/pr-sprint-13-simd-i4.md`: `|signed_mantissa| ≤ 1 → ValleyOfDespair` represents weak rule signal, NOT sign-extreme; `i8::MIN` must classify as `Slope/Plateau`, not `ValleyOfDespair`. Scalar in PR #398 is buggy (uses `unsigned_abs() as i8` which wraps `i8::MIN → -128`); the new primitive lets the fix be a one-liner: `lanes.saturating_abs().lane_i8::<0>()` ≤ 1. |
| 146 | + |
| 147 | +--- |
| 148 | + |
| 149 | +### W1a-#3 — `TD-NDARRAY-SIMD-GATHER` |
| 150 | + |
| 151 | +**Purpose:** SIMD gather for palette / lookup-table consumers. Currently `bgz17/src/simd.rs:88` inlines `_mm256_i32gather_epi32` (AP-SIMD-1 violation). |
| 152 | + |
| 153 | +**API surface:** |
| 154 | +```rust |
| 155 | +impl U16x8 { |
| 156 | + /// Gather 8 u16 values from `table` at the given indices. |
| 157 | + /// indices[i] >= table.len() => panic in debug, scalar-fallback safe in release. |
| 158 | + pub fn gather_u16(indices: U16x8, table: &[u16]) -> Self; |
| 159 | +} |
| 160 | + |
| 161 | +/// Convenience: lookup 8 bytes from a u8 LUT by u16 indices. |
| 162 | +pub fn palette_lookup_u8x8(idx_v: U16x8, lut: &[u8]) -> U8x8; |
| 163 | +``` |
| 164 | + |
| 165 | +**Per-arch implementation:** |
| 166 | +- **AVX2/AVX-512:** `_mm256_i32gather_epi32` with index widening + downcast (caveat: `_mm256_i32gather_epi32` reads 32 bits per index; for u16 values pack two indices per gather slot, or downcast post-gather). |
| 167 | +- **NEON:** no native gather instruction. Scalar loop is fine for 8 lanes — `(0..8).map(|i| table[indices.lane(i) as usize])`. |
| 168 | +- **Scalar:** identical to the NEON fallback. |
| 169 | + |
| 170 | +**Bounds:** `gather_u16` MUST validate `max(indices) < table.len()` before the SIMD gather call (debug panic; in release, fall through to scalar with `.get()` for safety). |
| 171 | + |
| 172 | +--- |
| 173 | + |
| 174 | +### W1a-#4 — `TD-NDARRAY-SIMD-PREFETCH` |
| 175 | + |
| 176 | +**Purpose:** cross-arch prefetch hint. Currently `bgz17/src/prefetch.rs:96,100` inlines `_mm_prefetch` and `_prefetch` directly. |
| 177 | + |
| 178 | +**API surface:** |
| 179 | +```rust |
| 180 | +/// Hint that `ptr` will be read soon; load into L1 (T0) cache. |
| 181 | +pub fn prefetch_read_t0(ptr: *const u8); |
| 182 | + |
| 183 | +/// Hint to load into L2 (T1) cache. |
| 184 | +pub fn prefetch_read_t1(ptr: *const u8); |
| 185 | + |
| 186 | +/// Hint to load into L3 (T2) cache. |
| 187 | +pub fn prefetch_read_t2(ptr: *const u8); |
| 188 | +``` |
| 189 | + |
| 190 | +**Per-arch implementation:** |
| 191 | +- **x86_64:** `_mm_prefetch(ptr as *const i8, _MM_HINT_T0)` / `_T1` / `_T2`. |
| 192 | +- **aarch64:** `__pld(ptr)` via inline asm `prfm pldl1keep, [ptr]` (T0), `pldl2keep` (T1), `pldl3keep` (T2). Or wrap `core::intrinsics::prefetch_read_data` if/when stable. |
| 193 | +- **Other arches:** no-op (the prefetch contract is a hint, not a guarantee — silent no-op is correct). |
| 194 | + |
| 195 | +**Safety:** `ptr` is allowed to be invalid (prefetch on an unmapped page is a hint that the CPU silently drops on x86). No `assert!` needed. |
| 196 | + |
| 197 | +--- |
| 198 | + |
| 199 | +### W1a-#5 — `TD-NDARRAY-SIMD-POPCOUNT-U64` |
| 200 | + |
| 201 | +**Purpose:** lane-wise popcount of u64 vectors. Currently `holograph/hamming.rs` and `lance-graph:crates/lance-graph/src/graph/blasgraph/types.rs` use `_mm512_popcnt_epi64` directly for Hamming-distance reduction. |
| 202 | + |
| 203 | +**API surface:** |
| 204 | +```rust |
| 205 | +impl U64x8 { |
| 206 | + /// Lane-wise population count. Each lane returns its u64 bit-count (0..=64). |
| 207 | + pub fn popcnt(self) -> Self; |
| 208 | + |
| 209 | + /// XOR + lane-wise popcount + horizontal sum across 8 lanes. |
| 210 | + /// Optimized for Hamming-distance reductions. |
| 211 | + pub fn xor_popcount(self, other: Self) -> u64; |
| 212 | +} |
| 213 | + |
| 214 | +impl U64x4 { |
| 215 | + pub fn popcnt(self) -> Self; // AVX2 parity |
| 216 | +} |
| 217 | +``` |
| 218 | + |
| 219 | +**Per-arch implementation:** |
| 220 | +- **AVX-512 VPOPCNTDQ:** `_mm512_popcnt_epi64` directly. Feature flag `avx512vpopcntdq`. |
| 221 | +- **AVX-512 without VPOPCNTDQ:** fallback via `_mm512_sad_epu8` on a per-byte popcount LUT (Mula's algorithm using VPSHUFB). |
| 222 | +- **NEON:** `vcntq_u8` for byte popcount, then horizontal sum within each u64 via `vaddvq_u8` or `vpaddlq_u8` cascade. |
| 223 | +- **Scalar:** `u64::count_ones` fused loop. |
| 224 | + |
| 225 | +**Note:** the existing `ndarray::hpc::bitwise::popcount_raw` and `hamming_distance_raw` cover the slice case but DO NOT expose a lane-wise method. The new `U64x8::popcnt` fills that gap so consumers can compose Hamming-distance pipelines without dropping back to slice ops. |
| 226 | + |
| 227 | +--- |
| 228 | + |
| 229 | +## W1.5 — DEFERRED primitives (gated on `lance-graph:crates/sigker` certification) |
| 230 | + |
| 231 | +Three more primitives are queued behind a certification gate. `crates/sigker` is `lance-graph`'s path-signature codec — it's pure-scalar Rust today (zero raw intrinsics, zero ndarray dep), and is positioned as the **Index-regime third encoding lane** alongside palette-distance (bgz17) and NSM tiling (deepnsm). It explicitly bypasses the `I-NOISE-FLOOR-JIRAK` iron rule (Jirak 2016 Berry-Esseen for weak-dependence data) via Hambly-Lyons 2010 path-signature uniqueness. |
| 232 | + |
| 233 | +When `jc Pillar 11` (Hambly-Lyons signature uniqueness on lance-graph paths) activates and sigker is benchmarked at production carrier widths, the W1.5 queue lights up: |
| 234 | + |
| 235 | +### W1.5-#6 — `TD-NDARRAY-SIMD-SIGNATURE-PDE-SWEEP` |
| 236 | + |
| 237 | +**Purpose:** signature kernel `〈S(X), S(Y)〉` via Goursat PDE — depth-∞ in O(T₁·T₂) flops, no signature materialization. |
| 238 | + |
| 239 | +**API surface (sketch):** |
| 240 | +```rust |
| 241 | +pub fn signature_pde_sweep<F>( |
| 242 | + x: &[F32x16], |
| 243 | + y: &[F32x16], |
| 244 | + kernel_fn: F, |
| 245 | +) -> f32 |
| 246 | +where |
| 247 | + F: Fn(F32x16, F32x16) -> F32x16; |
| 248 | +``` |
| 249 | + |
| 250 | +2D banded grid sweep; closure-parameterized kernel evaluator per step. |
| 251 | + |
| 252 | +### W1.5-#7 — `TD-NDARRAY-SIMD-RANDOMIZED-PROJECTION` |
| 253 | + |
| 254 | +Cuchiero-Schmocker-Teichmann (2021) randomized signatures: Gaussian random-matrix-vector update with `F32x16` state. Same closure-batch shape as W1a-#1, different lane type. |
| 255 | + |
| 256 | +### W1.5-#8 — `TD-NDARRAY-SIMD-LYNDON-PACK` |
| 257 | + |
| 258 | +Log-signature compression in the Lyndon basis of the free Lie algebra (7-13× compression, lossless). Pack/unpack primitives on `I16x16` state with combinatorial-index awareness. |
| 259 | + |
| 260 | +**No code needed today for W1.5.** Mentioned here so W1a additions are designed broad enough to compose with these later (in particular: the closure-batch shape introduced in W1a-#1 is the foundation for W1.5-#7). |
| 261 | + |
| 262 | +--- |
| 263 | + |
| 264 | +## Acceptance criteria for each W1a PR |
| 265 | + |
| 266 | +Every PR adding a primitive from this queue MUST: |
| 267 | + |
| 268 | +1. **Implement all three backends** (AVX-512/AVX2/SSE, NEON, scalar). Missing scalar fallback is a P0 reject — the scalar path is the correctness anchor. |
| 269 | +2. **Document the saturating / overflow / signedness semantics** in the doc-comment. State explicitly what happens at edge cases (`i8::MIN`, `u8::MAX`, empty slices, indices out-of-range). |
| 270 | +3. **Mandatory parity test** asserting all three backends produce identical output on a fixed-seed randomized corpus that includes edge cases (`i8::MIN`, `0`, `i8::MAX`, mantissa = -128, etc.). Use `proptest` or `quickcheck` if available; otherwise hand-roll 50+ test inputs. |
| 271 | +4. **Bench against scalar** — record AVX-512 / NEON speedup ratios in the PR body. No SHIP/LAND gate required for the primitive PR itself (the consumer-side migration PRs will benchmark end-to-end), but a 0.5× anti-speedup ratio is a reject. |
| 272 | +5. **`// SAFETY:` comments on every `unsafe` block** per ndarray's existing discipline (`CLAUDE.md` § Hard Rules). |
| 273 | +6. **No new `is_*_feature_detected!` calls outside `src/hpc/simd_caps.rs`** — dispatch through the existing `simd_caps()` singleton. |
| 274 | +7. **PR description must include the consumer site** (`lance-graph:crates/lance-graph-contract/src/mul.rs:NNN`, etc.) so the post-merge consumer-PR has a known target. |
| 275 | + |
| 276 | +The `simd-savant` agent on the `lance-graph` side runs PRE-MERGE against every W1a PR to verify compliance. |
| 277 | + |
| 278 | +--- |
| 279 | + |
| 280 | +## Cross-references |
| 281 | + |
| 282 | +**ndarray-side (this repo):** |
| 283 | +- `src/simd.rs` — the public re-export hub. New primitives surface here. |
| 284 | +- `src/simd_avx512.rs` — AVX-512 typed wrappers (`I64x8`, `U64x8`, `I8x32`, `F32x16`, `F64x8`, …). |
| 285 | +- `src/simd_avx2.rs` — AVX2 typed wrappers (`U8x32`). |
| 286 | +- `src/simd_neon.rs` — NEON typed wrappers. |
| 287 | +- `src/simd_ops.rs` — high-level vector→vector ops (`add_f32`, `mul_f32`, …). |
| 288 | +- `src/simd_int_ops.rs` — integer batch ops (`add_i8`, `dot_i8`, `min_i8`, …). |
| 289 | +- `src/hpc/simd_caps.rs` — runtime feature-detect singleton. |
| 290 | +- `src/hpc/bitwise.rs` — already-exposed `hamming_distance_raw` + `popcount_raw` (slice case). |
| 291 | + |
| 292 | +**lance-graph-side (the consumer driving this contract):** |
| 293 | +- `AdaWorldAPI/lance-graph:.claude/knowledge/ndarray-vertical-simd-alien-magic.md` — full architectural doc + per-workload table |
| 294 | +- `AdaWorldAPI/lance-graph:.claude/agents/simd-savant.md` — PRE-MERGE audit gate |
| 295 | +- `AdaWorldAPI/lance-graph:.claude/board/EPIPHANIES.md` § `E-SIMD-SWEEP-1` — the 158-violation finding |
| 296 | +- `AdaWorldAPI/lance-graph:.claude/board/TECH_DEBT.md` § `TD-NDARRAY-SIMD-*` and § `TD-SIMD-SWEEP-W*` — full debt ledger |
| 297 | +- `AdaWorldAPI/lance-graph:.claude/specs/pr-sprint-13-simd-i4.md` — D-CSV-13b spec (the consumer workload spec) |
| 298 | +- PR #398 (sprint-13 W-I1 retry) — the codex P1 (NEON OOB) + P2 (i8::MIN divergence) origin |
| 299 | +- PR #399 (`simd-savant` card + autoattended-pattern doc) — invariant declaration |
| 300 | +- PR #400 (architectural capture commit) — the canonical reference + tech-debt entries |
| 301 | + |
| 302 | +**External references:** |
| 303 | +- Intel Intrinsics Guide — `_mm512_abs_epi8` (VPABSB; does NOT saturate `i8::MIN`) |
| 304 | +- Intel Intrinsics Guide — `_mm512_min_epu8` (VPMINUB; unsigned-byte minimum, used to clamp the VPABSB result) |
| 305 | +- Intel Intrinsics Guide — `_mm512_popcnt_epi64` (VPOPCNTDQ; AVX-512 feature `avx512vpopcntdq`) |
| 306 | +- Intel Intrinsics Guide — `_mm256_i32gather_epi32` (VPGATHERDD AVX2) |
| 307 | +- ARM Architecture Reference — VQABS (`vqabsq_s8`, hardware-saturating) |
| 308 | +- ARM Architecture Reference — VCNT (`vcntq_u8`, byte-wise popcount) |
| 309 | +- Hambly & Lyons (2010), "Uniqueness for the signature of a path of bounded variation and the reduced path group" |
| 310 | +- Cuchiero, Schmocker & Teichmann (2021), "Random feature neural networks learn Black-Scholes type PDEs without curse of dimensionality" |
| 311 | +- Jirak (2016), "Berry-Esseen theorems under weak dependence" — the iron rule sigker bypasses |
| 312 | + |
| 313 | +## Litmus tests (for any contributor proposing an addition to this queue) |
| 314 | + |
| 315 | +> **Does the new primitive go on a typed-wrapper struct, or as a free function?** |
| 316 | +> Free function = reject; the surface fragments. Struct method = accept. |
| 317 | +
|
| 318 | +> **Does the doc-comment state the edge-case behavior (saturating? wrapping? UB? scalar-fallback?)?** |
| 319 | +> Missing = reject. The consumer needs to know without reading the code. |
| 320 | +
|
| 321 | +> **Are all three backends implemented (AVX*, NEON, scalar)?** |
| 322 | +> Missing scalar = reject. Scalar is the correctness anchor. |
| 323 | +
|
| 324 | +> **Is there a parity test asserting all three backends produce identical output on a fixed-seed randomized corpus including edge cases?** |
| 325 | +> Missing = reject. The codex P2 i8::MIN divergence on `lance-graph` PR #398 happened because no such test existed. |
| 326 | +
|
| 327 | +> **Is the consumer site cited in the PR description?** |
| 328 | +> Missing = reject. We're shipping primitives for known workloads, not speculative ones. |
0 commit comments