|
| 1 | +# SIMD Tier Technical Debt Audit |
| 2 | + |
| 3 | +> **Design principle:** `crate::simd::*` exposes the maximum hardware performance available on the current silicon via runtime-detected polyfill. Every CPU trick is applied at its tier. Consumers must never get scalar code when the hardware offers a SIMD path. The polyfill **is** the dispatch layer, not the fallback. |
| 4 | +
|
| 5 | +## Audit scope (2026-05-20) |
| 6 | + |
| 7 | +Every finding below was verified by reading the file at the cited line range. Files read end-to-end or in dispatch-relevant sections: |
| 8 | + |
| 9 | +| File | LoC read | Why | |
| 10 | +|---|---|---| |
| 11 | +| `src/simd.rs` | 1–720 (full) | Top-level dispatch | |
| 12 | +| `src/simd_amx.rs` | 1–421 (full) | AMX detection + VNNI dispatch | |
| 13 | +| `src/hpc/amx_matmul.rs` | 1–671 (full) | Public ndarray-typed matmul API | |
| 14 | +| `src/hpc/bf16_tile_gemm.rs` | 1–205 (full) | AMX tile kernel | |
| 15 | +| `src/hpc/simd_caps.rs` | 1–514 (full) | Capability singleton | |
| 16 | +| `src/hpc/simd_dispatch.rs` | 1–361 (full) | Frozen dispatch table | |
| 17 | +| `src/backend/native.rs` | 1–763 (full) | Backend BLAS-1 + GEMM dispatch | |
| 18 | +| `src/backend/kernels_avx512.rs` | 1–100 + grep | AVX-512 BLAS-1 kernels | |
| 19 | +| `src/simd_neon_bf16.rs:130–204` | stub section | BF16 NEON stubs | |
| 20 | +| `src/simd_neon_dotprod.rs:96–157` | stub section | F16 NEON stub | |
| 21 | +| `src/simd_avx512.rs:680–720, 2360–2420` | VBMI + BF16 conv | VBMI permute, BF16 batch | |
| 22 | +| `src/hpc/bgz17_bridge.rs:35–135` | dispatch sites | bgz17 L1 kernels | |
| 23 | +| `src/hpc/nibble.rs:1–270` | dispatch sites | Nibble ops | |
| 24 | +| `src/hpc/quantized.rs:444–630` | GEMM kernels | bf16/int8 GEMM | |
| 25 | +| `src/hpc/vnni_gemm.rs:1–130` | VNNI INT8 GEMM | VNNI dispatch | |
| 26 | + |
| 27 | +Files NOT yet read for this audit (next sweep): |
| 28 | + |
| 29 | +- `src/simd_avx512.rs` remainder (~3700 LoC unread) |
| 30 | +- `src/simd_avx2.rs` (2805 LoC unread) |
| 31 | +- `src/simd_neon.rs` (1917 LoC unread) |
| 32 | +- `src/simd_scalar.rs` (1308 LoC unread) |
| 33 | +- `src/simd_half.rs` (762 LoC unread) |
| 34 | +- `src/simd_nightly/*` |
| 35 | +- HPC modules: `vml.rs`, `activations.rs`, `reductions.rs`, `kernels.rs`, `fft.rs`, `statistics.rs`, `lapack.rs`, `blas_level{1,2,3}.rs`, `cam_pq.rs`, `palette_distance.rs`, `aabb.rs`, `distance.rs`, `bitwise.rs`, `p64_bridge.rs`, `spatial_hash.rs`, `jitson_cranelift/detect.rs`, all of `src/hpc/styles/*` |
| 36 | + |
| 37 | +## Microscopic silicon tier matrix |
| 38 | + |
| 39 | +| CPU | AVX-512F | VNNI | VBMI | BF16 | FP16 | AMX-INT8 | AMX-BF16 | AVX-VNNI-INT8 | |
| 40 | +|---|---|---|---|---|---|---|---|---| |
| 41 | +| Skylake-X / SP / W (2017) | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | |
| 42 | +| Cascade Lake (2019) | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | |
| 43 | +| Cooper Lake (2020) | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | |
| 44 | +| Ice Lake-SP / Tiger Lake (2021) | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | |
| 45 | +| Sapphire Rapids (2023) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | |
| 46 | +| Granite Rapids (2024) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | (+ AMX-FP16, AMX-COMPLEX) | |
| 47 | +| Zen 4 (Genoa, Ryzen 7000, 2022) | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | |
| 48 | +| Zen 5 (2024) | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | |
| 49 | +| Arrow Lake / Lunar Lake (2024) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | |
| 50 | +| Pi 5 / Orange Pi 5 (A76, ARMv8.2) | (NEON) | (dotprod) | – | (bf16+) | (fp16) | – | – | – | |
| 51 | +| Pi 4 (A72, ARMv8.0) | (NEON) | – | – | – | – | – | – | – | |
| 52 | + |
| 53 | +--- |
| 54 | + |
| 55 | +## Findings — CRITICAL |
| 56 | + |
| 57 | +### TD-T1 · `src/hpc/amx_matmul.rs:319-327` · stub fast path |
| 58 | + |
| 59 | +`matmul_bf16_to_f32`: `if amx_available() { bf16_gemm_f32(...) } else { bf16_gemm_f32(...) }` — both arms identical. Comment 320-323 admits "Future: AMX-tiled fast path. Today we route through the same f32 reference kernel; correctness is identical regardless of hardware. The `amx_available()` branch is preserved so callers can be sure the AMX detection runs." |
| 60 | + |
| 61 | +Working AMX kernel exists at `src/hpc/bf16_tile_gemm.rs::bf16_tile_gemm_16x16` (lines 39-87) — full TDPBF16PS dispatch, tested at lines 151-204. |
| 62 | + |
| 63 | +**Hit on Sapphire Rapids / Granite Rapids:** scalar instead of 256-mul-add/instr tile op. |
| 64 | + |
| 65 | +### TD-T2 · `src/hpc/amx_matmul.rs:351-356` · stub fast path |
| 66 | + |
| 67 | +`matmul_f32` AMX branch: converts f32 → BF16 and calls `bf16_gemm_f32` (scalar). Same shape as TD-T1. |
| 68 | + |
| 69 | +### TD-T3 · `src/hpc/amx_matmul.rs:395-412` · stub fast path + wrong fallback |
| 70 | + |
| 71 | +`matmul_i8_to_i32` AMX branch: shifts LHS i8 → u8 (+128) and calls `int8_gemm_i32` (scalar reference). |
| 72 | + |
| 73 | +Two debts here: |
| 74 | +1. AMX path never reaches `tile_dpbusd` (the working primitive at `amx_matmul.rs:146-150`). |
| 75 | +2. The fallback when AMX is absent should be `int8_gemm_vnni` (at `src/hpc/vnni_gemm.rs:46`), which dispatches AVX-512 VNNI `VPDPBUSD` (64 MACs/instr) — but it calls the scalar `int8_gemm_i32` directly. |
| 76 | + |
| 77 | +**Hit on Sapphire Rapids:** ~256× slower than AMX TDPBUSD. |
| 78 | +**Hit on Cascade Lake / Ice Lake-SP / Zen 4 (AVX-512 + VNNI but no AMX):** ~64× slower than VNNI. |
| 79 | + |
| 80 | +### TD-T4 · `src/hpc/quantized.rs:444-481` · scalar kernel labeled GEMM |
| 81 | + |
| 82 | +`bf16_gemm_f32` is a triple-nested scalar loop with per-element `.to_f32()` upcast. No `crate::simd::*` types, no F32x16 mul_add, no FMA. This is the function `matmul_bf16_to_f32` falls back to — so the entire BF16 GEMM public surface bottoms out in scalar. |
| 83 | + |
| 84 | +**Hit on every CPU:** even AVX-512F-only Skylake-X loses the F32x16 mul_add (16-wide FMA per instr) that would lift this 16×. |
| 85 | + |
| 86 | +### TD-T5 · `src/hpc/quantized.rs:618-630` · scalar kernel labeled GEMM |
| 87 | + |
| 88 | +`int8_gemm_i32` is a triple-nested scalar loop. The VNNI dispatch path `int8_gemm_vnni` (lines 46-61 of `vnni_gemm.rs`) exists and is correct (uses `simd_caps().has_avx512_vnni()` and calls `int8_gemm_vnni_avx512`), but it's a separate symbol — nothing routes the public `int8_gemm_i32` callers through it. |
| 89 | + |
| 90 | +### TD-T6 · `src/backend/native.rs:544-561` · scalar-only "avx2" implementations |
| 91 | + |
| 92 | +The `avx2` module's `scal_f32`, `scal_f64`, `nrm2_f32`, `nrm2_f64`, `asum_f32`, `asum_f64` all unconditionally delegate to `super::scalar::*`. The dispatch macro thinks it's dispatching to AVX2 but the body is scalar. |
| 93 | + |
| 94 | +```rust |
| 95 | +pub fn scal_f32(alpha: f32, x: &mut [f32]) { |
| 96 | + super::scalar::scal_f32(alpha, x); // ← line 545 |
| 97 | +} |
| 98 | +``` |
| 99 | + |
| 100 | +Effect: on Haswell–Coffee Lake / Zen 1-3 (AVX2 + FMA but no AVX-512), all of `scal_*`, `nrm2_*`, `asum_*` run scalar. The dispatch macro at lines 92-165 routes through `avx2::name()` which is itself scalar. |
| 101 | + |
| 102 | +### TD-T7 · `src/backend/native.rs:271-278` · GEMV scalar everywhere |
| 103 | + |
| 104 | +`gemv_f32` and `gemv_f64` skip the `dispatch!` macro entirely and call `scalar::gemv_*` unconditionally. No AVX-512, no AVX2, no NEON. Every consumer of the backend GEMV path runs the scalar nested loop on every CPU. |
| 105 | + |
| 106 | +```rust |
| 107 | +pub fn gemv_f32(...) { |
| 108 | + scalar::gemv_f32(...); // ← line 272 |
| 109 | +} |
| 110 | +``` |
| 111 | + |
| 112 | +--- |
| 113 | + |
| 114 | +## Findings — HIGH |
| 115 | + |
| 116 | +### TD-T8 · `src/hpc/simd_dispatch.rs:150-163` · aarch64 dispatch = scalar |
| 117 | + |
| 118 | +```rust |
| 119 | +#[cfg(target_arch = "aarch64")] |
| 120 | +fn detect() -> Self { |
| 121 | + let caps = simd_caps(); |
| 122 | + let tier = if caps.asimd_dotprod { SimdTier::NeonDotProd } else { SimdTier::Neon }; |
| 123 | + // NEON uses the same scalar wrapper signatures — NEON intrinsics |
| 124 | + // will be wired when simd_neon.rs types are activated. For now, |
| 125 | + // dispatch to scalar which auto-vectorizes well on aarch64 with |
| 126 | + // `-C target-feature=+neon` (mandatory on aarch64). |
| 127 | + Self { tier, ..Self::scalar() } |
| 128 | +} |
| 129 | +``` |
| 130 | + |
| 131 | +The frozen dispatch table reports `NeonDotProd` or `Neon` tier to consumers but every function pointer in the struct is the scalar wrapper. Pi 5 / Pi 4 / M2 get the scalar implementations for `byte_find_all`, `byte_count`, `squared_distances_f32`, `nibble_unpack`, `nibble_above_threshold`, `batch_sq_dist`. |
| 132 | + |
| 133 | +### TD-T9 · `src/hpc/simd_dispatch.rs:128-134` · AVX-512 dispatch falls to AVX2 wrappers |
| 134 | + |
| 135 | +Even when `caps.avx512bw` is true, the AVX-512 tier branch fills in 4 of 6 function pointers with AVX2 wrappers: |
| 136 | + |
| 137 | +```rust |
| 138 | +if caps.avx512bw { |
| 139 | + Self { |
| 140 | + tier: SimdTier::Avx512, |
| 141 | + byte_find_all: byte_find_all_avx512_wrapper, // ← real |
| 142 | + byte_count: byte_count_avx512_wrapper, // ← real |
| 143 | + squared_distances_f32: squared_distances_avx2_wrapper, // ← AVX2! |
| 144 | + nibble_unpack: nibble_unpack_avx2_wrapper, // ← AVX2! |
| 145 | + nibble_above_threshold: nibble_above_threshold_avx2_wrapper, // ← AVX2! |
| 146 | + batch_sq_dist: batch_sq_dist_avx2_wrapper, // ← AVX2! |
| 147 | + } |
| 148 | +} |
| 149 | +``` |
| 150 | + |
| 151 | +Comment at line 130 admits `// no avx512 variant for 3D dist`. For `nibble_*`, the variant is missing per TD-T17. |
| 152 | + |
| 153 | +### TD-T10 · `src/simd_neon_bf16.rs:149-177` · stub structs that panic |
| 154 | + |
| 155 | +`BF16x8Stub` (line 149) and `BF16x16Stub` (line 156) are placeholder structs whose only method is `unimplemented()` panicking with the message documenting the BFMMLA / BFDOT asm-byte encoding still to wire up: `BFMMLA = 0x6e40_ec00 | (Vm << 16) | (Vn << 5) | Vd`, `BFDOT = 0x4e40_ec00 | (Vm << 16) | (Vn << 5) | Vd`. Module docs at lines 187-204 spell out the implementation plan; nothing is wired. |
| 156 | + |
| 157 | +**Hit on Pi 5 A76, Apple M2+, Snapdragon 8 Gen 2+:** consumers reaching for BF16 NEON ops panic or fall through to scalar `simd_half::BF16x16`. |
| 158 | + |
| 159 | +### TD-T11 · `src/simd_neon_dotprod.rs:115-148` · F16x16 stub |
| 160 | + |
| 161 | +`F16x16Stub` (line 136) is a placeholder; `unimplemented()` panics (line 141-147). Module docs at lines 96-113 give the full intrinsic map (`vfmaq_f16`, `vaddvq_f16`, `vsqrtq_f16`, `vcgtq_f16`) and the stable-Rust asm-byte encoding `0x0e40_cc20` for `fmla v0.8h, v1.8h, v2.8h`. |
| 162 | + |
| 163 | +**Hit on Pi 5 A76, Apple M-series:** consumers reaching `crate::simd::F16x16` get `simd_avx2::F16Scaler` scalar polyfill (line 134 comment) or `simd_nightly::F16x16`. |
| 164 | + |
| 165 | +### TD-T12 · `src/simd.rs:18-26` + `:49-88` · top-level Tier enum collapses |
| 166 | + |
| 167 | +```rust |
| 168 | +enum Tier { |
| 169 | + Avx512 = 1, |
| 170 | + Avx2 = 2, |
| 171 | + NeonDotProd = 3, |
| 172 | + Neon = 4, |
| 173 | + Scalar = 5, |
| 174 | +} |
| 175 | + |
| 176 | +fn detect_tier() -> Tier { |
| 177 | + if is_x86_feature_detected!("avx512f") { return Tier::Avx512; } |
| 178 | + if is_x86_feature_detected!("avx2") { return Tier::Avx2; } |
| 179 | + ... |
| 180 | +} |
| 181 | +``` |
| 182 | + |
| 183 | +Skylake-X (no VNNI / VBMI / BF16 / FP16 / AMX) and Granite Rapids (all of them) both → `Tier::Avx512`. Arrow Lake (`avxvnniint8`, no AVX-512F) → `Tier::Avx2`. Every caller of `tier()` (line 97) gets a coarse answer. |
| 184 | + |
| 185 | +Mitigation: `simd_caps()` at `src/hpc/simd_caps.rs:98` exists with 20 per-feature bits — but it's a separate dispatch channel, and consumers who use `tier()` don't see the sub-features. |
| 186 | + |
| 187 | +### TD-T13 · `src/backend/native.rs:22-26` · second Tier enum, same collapse |
| 188 | + |
| 189 | +Backend defines its own `Tier { Avx512, Avx2, Scalar }` enum (line 21-26), independent of the one in `simd.rs:18`. Same 3-bucket collapse. Same lack of VNNI / VBMI / BF16 / FP16 / AMX awareness. |
| 190 | + |
| 191 | +### TD-T14 · `src/hpc/simd_dispatch.rs:30-49` · third Tier enum, same collapse |
| 192 | + |
| 193 | +`SimdTier { Avx512, Avx2, Sse2, NeonDotProd, Neon, Scalar, WasmSimd128 }` — 7 variants, but `detect()` at lines 121-148 only branches on `caps.avx512bw` and `caps.avx2`. SSE2 never selected. No AVX-512-VNNI / VBMI / BF16 / FP16 / AMX paths. |
| 194 | + |
| 195 | +Three independent Tier enums total (TD-T12, TD-T13, TD-T14). |
| 196 | + |
| 197 | +### TD-T15 · `src/simd.rs:291-292 + 531-532` · BF16x16 polyfill-not-max under default config |
| 198 | + |
| 199 | +```rust |
| 200 | +// 291: hardware-native, ONLY if compile-time avx512bf16 is on |
| 201 | +#[cfg(all(target_arch = "x86_64", target_feature = "avx512bf16", not(feature = "nightly-simd")))] |
| 202 | +pub use crate::simd_avx512::{BF16x16, BF16x8}; |
| 203 | + |
| 204 | +// 531: scalar polyfill, the default |
| 205 | +#[cfg(all(feature = "std", not(all(target_arch = "x86_64", target_feature = "avx512bf16"))))] |
| 206 | +pub use crate::simd_half::BF16x16; |
| 207 | +``` |
| 208 | + |
| 209 | +The cargo default is `x86-64-v3` (per `.cargo/config.toml:25`), which is AVX2 only — no AVX-512F, definitely no avx512bf16. So even on Sapphire Rapids / Zen 4 silicon under default cargo, `crate::simd::BF16x16` resolves to scalar `simd_half::BF16x16`. |
| 210 | + |
| 211 | +Compile-time gate where runtime dispatch would lift the entire AVX-512 + BF16 install base out of the scalar polyfill. |
| 212 | + |
| 213 | +--- |
| 214 | + |
| 215 | +## Findings — MEDIUM |
| 216 | + |
| 217 | +### TD-T16 · `src/hpc/nibble.rs:23-41, 227-237` · nibble ops cap at AVX2 |
| 218 | + |
| 219 | +`nibble_unpack` (line 23) and `nibble_above_threshold` (line 227) check `caps.avx2` only — no AVX-512 path. Sapphire Rapids / Ice Lake / Zen 4 process 32 nibbles per AVX2 iteration when 64 per AVX-512BW iteration would be possible. |
| 220 | + |
| 221 | +### TD-T17 · `src/hpc/nibble.rs:59-94, 169-189, 257-278` · "AVX2" funcs are scalar loops |
| 222 | + |
| 223 | +`nibble_unpack_avx2` (line 59), `nibble_sub_clamp_avx2` (line 170), `nibble_above_threshold_avx2` (line 258) all carry `#[target_feature(enable = "avx2")]` but their bodies are plain scalar loops: |
| 224 | + |
| 225 | +```rust |
| 226 | +#[target_feature(enable = "avx2")] |
| 227 | +pub(crate) unsafe fn nibble_unpack_avx2(packed: &[u8], count: usize, out: &mut Vec<u8>) { |
| 228 | + // ... |
| 229 | + for j in 0..16 { |
| 230 | + lo[j] = data[j] & 0x0F; // ← scalar loop |
| 231 | + hi[j] = (data[j] >> 4) & 0x0F; |
| 232 | + } |
| 233 | + // ... |
| 234 | +} |
| 235 | +``` |
| 236 | + |
| 237 | +The autovectorizer may emit reasonable code, but this is not true `_mm256_*` intrinsics. `nibble_sub_clamp_avx512` at line 197 IS real (uses `U8x64::saturating_sub`). So nibble has one real SIMD path and two pretend-SIMD paths. |
| 238 | + |
| 239 | +### TD-T18 · `src/simd.rs:479-486` · simd_ln_f32 is a scalar loop |
| 240 | + |
| 241 | +```rust |
| 242 | +pub fn simd_ln_f32(x: F32x16) -> F32x16 { |
| 243 | + let arr = x.to_array(); |
| 244 | + let mut out = [0.0f32; 16]; |
| 245 | + for i in 0..16 { |
| 246 | + out[i] = arr[i].ln(); // ← scalar per-lane |
| 247 | + } |
| 248 | + F32x16::from_array(out) |
| 249 | +} |
| 250 | +``` |
| 251 | + |
| 252 | +`simd_exp_f32` at lines 419-450 is a real Remez polynomial with FMA via `mul_add` chain. `simd_ln_f32` is its asymmetric scalar twin. A consumer thinking they're getting 16-wide log gets 16× scalar `ln`. |
| 253 | + |
| 254 | +### TD-T19 · `src/hpc/distance.rs:101` · single tier, no AVX-512 |
| 255 | + |
| 256 | +The 3D `squared_distances` function checks `caps.avx2` only — line 101: `if super::simd_caps::simd_caps().avx2`. No AVX-512F variant. Sapphire Rapids etc. fall to AVX2 8-wide instead of AVX-512 16-wide. |
| 257 | + |
| 258 | +### TD-T20 · `src/hpc/spatial_hash.rs:273` · same as TD-T19 |
| 259 | + |
| 260 | +`batch_sq_dist` checks `caps.avx2` only. No AVX-512F variant. |
| 261 | + |
| 262 | +### TD-T21 · `src/simd.rs:351-354` · aarch64 integers come from scalar |
| 263 | + |
| 264 | +```rust |
| 265 | +#[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))] |
| 266 | +pub use scalar::{ |
| 267 | + f32x8, f64x4, i32x16, i32x8, i64x4, i64x8, u16x16, u32x16, u32x8, u64x4, u64x8, u8x64, |
| 268 | + F32x8, F64x4, I32x16, I32x8, I64x4, I64x8, U16x16, U16x32, U32x16, U32x8, U64x4, U64x8, U8x64, |
| 269 | +}; |
| 270 | +``` |
| 271 | + |
| 272 | +On aarch64, the only types from `simd_neon::aarch64_simd` (line 349) are `f32x16, f64x8, F32Mask16, F32x16, F64Mask8, F64x8`. Every integer width — `I32x16`, `I8x32`, `U8x64`, `U16x32`, etc. — comes from `scalar::*`. Pi 5 / M2 get scalar integer SIMD even though NEON has `int32x4_t`, `uint8x16_t`, etc. |
| 273 | + |
| 274 | +### TD-T22 · `src/simd.rs:310, 318-321` · 256-bit int types in AVX2 build come from `simd_avx512` |
| 275 | + |
| 276 | +```rust |
| 277 | +// 310: AVX2-baseline arm uses simd_avx512 for the 256-bit shapes |
| 278 | +pub use crate::simd_avx512::{f32x8, f64x4, i16x16, i8x32, F32x8, F64x4, I16x16, I8x32}; |
| 279 | +``` |
| 280 | + |
| 281 | +Inverted naming: `I32x8` / `U32x8` / `I64x4` / `U64x4` (the natural AVX2 widths) come from `simd_avx2.rs` (which polyfills them as scalar storage with `[u32; 8]` arrays per the comment in the AMX matmul work), not from native `__m256i`. The polyfill IS the AVX2 module on AVX2 builds — verify whether the AVX2 module's polyfills wrap real `_mm256_*` intrinsics or scalar arrays. (Audit pending — requires reading `src/simd_avx2.rs`.) |
| 282 | + |
| 283 | +--- |
| 284 | + |
| 285 | +## Verified — code is correct (rejected agent claims) |
| 286 | + |
| 287 | +### `src/simd_amx.rs:282-301` · AVX-VNNI-INT8 dispatch IS done |
| 288 | + |
| 289 | +`matvec_dispatch` correctly routes `is_x86_feature_detected!("avxvnniint8")` to `vnni2_matvec` (256-bit VPDPBUSD path) when avx512vnni is absent. No debt. |
| 290 | + |
| 291 | +### `src/simd_avx512.rs:695-710` · VBMI dispatch IS done |
| 292 | + |
| 293 | +`permute_bytes` checks `if simd_caps().avx512vbmi { permute_bytes_vbmi(...) } else { scalar }`. Native `_mm512_permutexvar_epi8` reaches Ice Lake / SPR / Zen 4. The scalar branch is correct fallback for Skylake-X / Cascade Lake / Cooper Lake. No debt. |
| 294 | + |
| 295 | +### `src/hpc/bgz17_bridge.rs:43-86` · multi-versioning is correct |
| 296 | + |
| 297 | +5 dispatch sites at lines 78, 142, 197, 250, 349 each route `avx512f → avx2 → scalar` with proper `#[target_feature]` annotations on the inner functions. The `is_x86_feature_detected!("avx512f")` vs `avx2` granularity is appropriate for L1 absolute-difference kernels — VNNI / VBMI / BF16 don't help on `abs(a-b)` reductions. No tier-collapse debt here. |
| 298 | + |
| 299 | +### `src/hpc/vnni_gemm.rs:46-61` · VNNI dispatch is correct |
| 300 | + |
| 301 | +`int8_gemm_vnni` checks `simd_caps().has_avx512_vnni()` and calls `int8_gemm_vnni_avx512`. The only debt is that other paths (TD-T3, TD-T5) don't route through this function. |
| 302 | + |
| 303 | +### `src/hpc/p64_bridge.rs:109` · VPOPCNTDQ dispatch is correct |
| 304 | + |
| 305 | +`simd_caps().avx512vpopcntdq` runtime-detected. No debt at this site. |
| 306 | + |
| 307 | +### `src/hpc/cam_pq.rs:202, 215` · AVX-512F dispatch is correct |
| 308 | + |
| 309 | +`simd_caps().avx512f` runtime-detected. No debt at this site. |
| 310 | + |
| 311 | +### `src/hpc/aabb.rs:284, 440` · AVX-512F + SSE2 dispatch present |
| 312 | + |
| 313 | +Dispatches `avx512f` then falls to `sse2`. Missing intermediate AVX2 — `aabb` uses AVX-512F at one tier and SSE2 at the other. Probably acceptable for AABB (BV-shape ops), but a sub-finding to investigate on next sweep. |
| 314 | + |
| 315 | +--- |
| 316 | + |
| 317 | +## Prioritized action list |
| 318 | + |
| 319 | +| ID | Severity | Effort | Description | |
| 320 | +|---|---|---|---| |
| 321 | +| TD-T1 | CRIT | 1h | Wire `matmul_bf16_to_f32` to `bf16_tile_gemm_16x16` | |
| 322 | +| TD-T2 | CRIT | 30m (after T1) | Same for `matmul_f32` | |
| 323 | +| TD-T3 | CRIT | 1.5h | Wire `matmul_i8_to_i32` to AMX tile / VNNI fallback | |
| 324 | +| TD-T5 | CRIT | 30m | Route `int8_gemm_i32` callers through `int8_gemm_vnni` | |
| 325 | +| TD-T4 | CRIT | 3-4h | Rewrite `bf16_gemm_f32` with F32x16 mul_add + tiling | |
| 326 | +| TD-T6 | CRIT | 2h | Implement `avx2::{scal,nrm2,asum}_*` with real AVX2 intrinsics | |
| 327 | +| TD-T7 | CRIT | 2h | Implement `gemv_f32`/`gemv_f64` with tier dispatch | |
| 328 | +| TD-T8 | HIGH | 4-6h | Wire `simd_dispatch.rs` aarch64 tier to real NEON impls | |
| 329 | +| TD-T9 | HIGH | 2-3h | Add AVX-512 variants for `squared_distances`, `nibble_*`, `batch_sq_dist` | |
| 330 | +| TD-T10 | HIGH | 3-4h | Implement `BF16x8/16` NEON via asm-byte BFMMLA/BFDOT | |
| 331 | +| TD-T11 | HIGH | 3-4h | Implement `F16x16` NEON via asm-byte fmla v.8h | |
| 332 | +| TD-T15 | HIGH | 4-6h | Convert `BF16x16` from compile-time `target_feature` gate to runtime dispatch | |
| 333 | +| TD-T16 | MED | 1.5h | Add AVX-512BW variants for `nibble_unpack` / `nibble_above_threshold` | |
| 334 | +| TD-T17 | MED | 2h | Replace scalar-loop "avx2" funcs in nibble with `_mm256_*` intrinsics | |
| 335 | +| TD-T18 | MED | 2h | Rewrite `simd_ln_f32` as real Remez polynomial like `simd_exp_f32` | |
| 336 | +| TD-T19 | MED | 1h | Add AVX-512F path to `distance::squared_distances_f32` | |
| 337 | +| TD-T20 | MED | 1h | Same for `spatial_hash::batch_sq_dist` | |
| 338 | +| TD-T21 | HIGH | 8-12h | Replace aarch64 scalar integer types in `simd.rs` with NEON impls | |
| 339 | +| TD-T22 | – | – | Investigation only — needs `simd_avx2.rs` read first | |
| 340 | +| TD-T12/T13/T14 | HIGH | (audit-wide) | Consolidate three Tier enums OR route all callers through `simd_caps()` for sub-feature dispatch | |
| 341 | + |
| 342 | +## Next-sweep targets (unread) |
| 343 | + |
| 344 | +These files are listed in the dispatch site grep but not yet read for this audit. Findings in them are unverified: |
| 345 | + |
| 346 | +- Full `src/simd_avx512.rs`, `simd_avx2.rs`, `simd_neon.rs`, `simd_scalar.rs`, `simd_half.rs` |
| 347 | +- HPC SIMD-consuming: `vml.rs`, `activations.rs`, `reductions.rs`, `kernels.rs`, `fft.rs` |
| 348 | +- HPC suspected scalar: `statistics.rs`, `lapack.rs`, `blas_level{1,2,3}.rs` |
| 349 | +- HPC dispatch sites with `is_x86_feature_detected!`: `cam_pq.rs`, `palette_distance.rs`, `aabb.rs`, `distance.rs`, `bitwise.rs`, `p64_bridge.rs`, `spatial_hash.rs`, `jitson_cranelift/detect.rs` |
| 350 | +- All 34 `src/hpc/styles/*` primitives |
| 351 | + |
| 352 | +The most likely-debt-rich unread targets: |
| 353 | + |
| 354 | +1. `src/hpc/blas_level{1,2,3}.rs` — grep showed NO use of `crate::simd::*` types. The flagship BLAS public API may be entirely scalar (separate audit needed). |
| 355 | +2. `src/hpc/statistics.rs`, `lapack.rs` — same, no `crate::simd::*` use. |
| 356 | +3. `src/simd_avx2.rs` — the 256-bit polyfills for 512-bit types. TD-T22 needs this read to know whether the polyfills are real `__m256i` intrinsics or scalar arrays under `#[target_feature]`. |
| 357 | + |
0 commit comments