Skip to content

Commit ca82bb7

Browse files
committed
docs(simd): tier audit — find debt where polyfill ≠ max-per-silicon
22 verified findings across 7 files, in `.claude/knowledge/td-simd-tier-audit.md`. Read end-to-end for this pass: src/simd.rs (720 LoC), simd_amx.rs (421), simd_dispatch.rs (361), hpc/amx_matmul.rs (671), hpc/bf16_tile_gemm.rs (205), hpc/simd_caps.rs (514), backend/native.rs (763), plus relevant sections of simd_avx512.rs, simd_neon_bf16.rs, simd_neon_dotprod.rs, hpc/bgz17_bridge.rs, hpc/nibble.rs, hpc/quantized.rs, hpc/vnni_gemm.rs. Headline findings (CRITICAL): TD-T1/T2/T3 — `matmul_bf16_to_f32`, `matmul_f32`, `matmul_i8_to_i32` in `hpc/amx_matmul.rs:319-412` all have `if amx_available()` branches whose AMX arm calls the scalar reference kernel. `bf16_tile_gemm_16x16` (the real TDPBF16PS dispatch) and `int8_gemm_vnni` (real VPDPBUSD dispatch) exist and are tested — they're just not wired through. TD-T4 — `quantized.rs::bf16_gemm_f32` (line 444) is a triple-nested scalar loop with per-element `.to_f32()`. No `crate::simd::*`, no F32x16 mul_add. This IS the fallback that everything bottoms out at. TD-T5 — `quantized.rs::int8_gemm_i32` (line 618) is the same shape. `int8_gemm_vnni` exists at `vnni_gemm.rs:46` but no caller routes through it. TD-T6 — `backend/native.rs::avx2::{scal,nrm2,asum}_*` (line 544-561) all `super::scalar::*`-delegate. Dispatch macro thinks AVX2; body is scalar. TD-T7 — `backend/native.rs::gemv_f32/f64` (line 271-278) skip the dispatch macro entirely. Scalar on every CPU. Headline findings (HIGH): TD-T8 — `simd_dispatch.rs:150-163` aarch64 tier reports `NeonDotProd`/`Neon` but fills every fn pointer from `Self::scalar()`. Pi 5 / M2 / Apple silicon = scalar. TD-T9 — `simd_dispatch.rs:128-134` AVX-512 tier uses AVX2 wrappers for 4 of 6 ops (squared_distances_f32, nibble_unpack, nibble_above_threshold, batch_sq_dist). TD-T10/T11 — `simd_neon_bf16.rs:149-177` and `simd_neon_dotprod.rs:115-148` have `*Stub` types with `unimplemented!()` panic bodies. Module docs spell out the asm-byte BFMMLA/BFDOT/fmla.8h encodings; nothing wired. TD-T12/T13/T14 — three independent `Tier` enums (simd.rs, backend/native.rs, hpc/simd_dispatch.rs) all collapse Skylake-X through Granite Rapids into one `Avx512` bucket. TD-T15 — `simd.rs:291-292` gates `BF16x16` re-export on compile-time `target_feature = "avx512bf16"`. Default cargo is v3; even on SPR/Zen 4 silicon, `crate::simd::BF16x16` is the scalar `simd_half::BF16x16`. Headline findings (MEDIUM): TD-T16/T17 — `hpc/nibble.rs` ops cap at AVX2 (no AVX-512BW for nibble_unpack / nibble_above_threshold), and the "avx2" funcs are scalar loops under `#[target_feature(enable = "avx2")]`. TD-T18 — `simd_ln_f32` (simd.rs:479) is `arr[i].ln()` per lane, asymmetric with the Remez-polynomial `simd_exp_f32`. TD-T19/T20 — `distance::squared_distances` and `spatial_hash::batch_sq_dist` check only `avx2`; no AVX-512F variant. TD-T21 — `simd.rs:351-354` on aarch64, every integer type (`I32x16`, `U8x64`, `U16x32`, ...) comes from `scalar::*`, not `simd_neon`. Pi 5 / M2 get scalar integer SIMD despite NEON int32x4_t / uint8x16_t being available. Two agent-claimed findings verified false (`simd_amx.rs:291`, `simd_avx512.rs:695`) — recorded as such in the audit so they don't get re-flagged on the next sweep. Next-sweep targets listed at the bottom — `blas_level{1,2,3}.rs`, `statistics.rs`, `lapack.rs`, `simd_avx2.rs`, `simd_neon.rs` unread. Grep showed no `crate::simd::*` use in the BLAS / LAPACK / statistics files; flagship public API may be entirely scalar. Verification requires actually reading them, not grep — same mistake the agent made. No code changes in this commit. Documentation only.
1 parent 2ef97c0 commit ca82bb7

1 file changed

Lines changed: 357 additions & 0 deletions

File tree

Lines changed: 357 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,357 @@
1+
# SIMD Tier Technical Debt Audit
2+
3+
> **Design principle:** `crate::simd::*` exposes the maximum hardware performance available on the current silicon via runtime-detected polyfill. Every CPU trick is applied at its tier. Consumers must never get scalar code when the hardware offers a SIMD path. The polyfill **is** the dispatch layer, not the fallback.
4+
5+
## Audit scope (2026-05-20)
6+
7+
Every finding below was verified by reading the file at the cited line range. Files read end-to-end or in dispatch-relevant sections:
8+
9+
| File | LoC read | Why |
10+
|---|---|---|
11+
| `src/simd.rs` | 1–720 (full) | Top-level dispatch |
12+
| `src/simd_amx.rs` | 1–421 (full) | AMX detection + VNNI dispatch |
13+
| `src/hpc/amx_matmul.rs` | 1–671 (full) | Public ndarray-typed matmul API |
14+
| `src/hpc/bf16_tile_gemm.rs` | 1–205 (full) | AMX tile kernel |
15+
| `src/hpc/simd_caps.rs` | 1–514 (full) | Capability singleton |
16+
| `src/hpc/simd_dispatch.rs` | 1–361 (full) | Frozen dispatch table |
17+
| `src/backend/native.rs` | 1–763 (full) | Backend BLAS-1 + GEMM dispatch |
18+
| `src/backend/kernels_avx512.rs` | 1–100 + grep | AVX-512 BLAS-1 kernels |
19+
| `src/simd_neon_bf16.rs:130–204` | stub section | BF16 NEON stubs |
20+
| `src/simd_neon_dotprod.rs:96–157` | stub section | F16 NEON stub |
21+
| `src/simd_avx512.rs:680–720, 2360–2420` | VBMI + BF16 conv | VBMI permute, BF16 batch |
22+
| `src/hpc/bgz17_bridge.rs:35–135` | dispatch sites | bgz17 L1 kernels |
23+
| `src/hpc/nibble.rs:1–270` | dispatch sites | Nibble ops |
24+
| `src/hpc/quantized.rs:444–630` | GEMM kernels | bf16/int8 GEMM |
25+
| `src/hpc/vnni_gemm.rs:1–130` | VNNI INT8 GEMM | VNNI dispatch |
26+
27+
Files NOT yet read for this audit (next sweep):
28+
29+
- `src/simd_avx512.rs` remainder (~3700 LoC unread)
30+
- `src/simd_avx2.rs` (2805 LoC unread)
31+
- `src/simd_neon.rs` (1917 LoC unread)
32+
- `src/simd_scalar.rs` (1308 LoC unread)
33+
- `src/simd_half.rs` (762 LoC unread)
34+
- `src/simd_nightly/*`
35+
- HPC modules: `vml.rs`, `activations.rs`, `reductions.rs`, `kernels.rs`, `fft.rs`, `statistics.rs`, `lapack.rs`, `blas_level{1,2,3}.rs`, `cam_pq.rs`, `palette_distance.rs`, `aabb.rs`, `distance.rs`, `bitwise.rs`, `p64_bridge.rs`, `spatial_hash.rs`, `jitson_cranelift/detect.rs`, all of `src/hpc/styles/*`
36+
37+
## Microscopic silicon tier matrix
38+
39+
| CPU | AVX-512F | VNNI | VBMI | BF16 | FP16 | AMX-INT8 | AMX-BF16 | AVX-VNNI-INT8 |
40+
|---|---|---|---|---|---|---|---|---|
41+
| Skylake-X / SP / W (2017) |||||||||
42+
| Cascade Lake (2019) |||||||||
43+
| Cooper Lake (2020) |||||||||
44+
| Ice Lake-SP / Tiger Lake (2021) |||||||||
45+
| Sapphire Rapids (2023) |||||||||
46+
| Granite Rapids (2024) |||||||| (+ AMX-FP16, AMX-COMPLEX) |
47+
| Zen 4 (Genoa, Ryzen 7000, 2022) |||||||||
48+
| Zen 5 (2024) |||||||||
49+
| Arrow Lake / Lunar Lake (2024) |||||||||
50+
| Pi 5 / Orange Pi 5 (A76, ARMv8.2) | (NEON) | (dotprod) || (bf16+) | (fp16) ||||
51+
| Pi 4 (A72, ARMv8.0) | (NEON) ||||||||
52+
53+
---
54+
55+
## Findings — CRITICAL
56+
57+
### TD-T1 · `src/hpc/amx_matmul.rs:319-327` · stub fast path
58+
59+
`matmul_bf16_to_f32`: `if amx_available() { bf16_gemm_f32(...) } else { bf16_gemm_f32(...) }` — both arms identical. Comment 320-323 admits "Future: AMX-tiled fast path. Today we route through the same f32 reference kernel; correctness is identical regardless of hardware. The `amx_available()` branch is preserved so callers can be sure the AMX detection runs."
60+
61+
Working AMX kernel exists at `src/hpc/bf16_tile_gemm.rs::bf16_tile_gemm_16x16` (lines 39-87) — full TDPBF16PS dispatch, tested at lines 151-204.
62+
63+
**Hit on Sapphire Rapids / Granite Rapids:** scalar instead of 256-mul-add/instr tile op.
64+
65+
### TD-T2 · `src/hpc/amx_matmul.rs:351-356` · stub fast path
66+
67+
`matmul_f32` AMX branch: converts f32 → BF16 and calls `bf16_gemm_f32` (scalar). Same shape as TD-T1.
68+
69+
### TD-T3 · `src/hpc/amx_matmul.rs:395-412` · stub fast path + wrong fallback
70+
71+
`matmul_i8_to_i32` AMX branch: shifts LHS i8 → u8 (+128) and calls `int8_gemm_i32` (scalar reference).
72+
73+
Two debts here:
74+
1. AMX path never reaches `tile_dpbusd` (the working primitive at `amx_matmul.rs:146-150`).
75+
2. The fallback when AMX is absent should be `int8_gemm_vnni` (at `src/hpc/vnni_gemm.rs:46`), which dispatches AVX-512 VNNI `VPDPBUSD` (64 MACs/instr) — but it calls the scalar `int8_gemm_i32` directly.
76+
77+
**Hit on Sapphire Rapids:** ~256× slower than AMX TDPBUSD.
78+
**Hit on Cascade Lake / Ice Lake-SP / Zen 4 (AVX-512 + VNNI but no AMX):** ~64× slower than VNNI.
79+
80+
### TD-T4 · `src/hpc/quantized.rs:444-481` · scalar kernel labeled GEMM
81+
82+
`bf16_gemm_f32` is a triple-nested scalar loop with per-element `.to_f32()` upcast. No `crate::simd::*` types, no F32x16 mul_add, no FMA. This is the function `matmul_bf16_to_f32` falls back to — so the entire BF16 GEMM public surface bottoms out in scalar.
83+
84+
**Hit on every CPU:** even AVX-512F-only Skylake-X loses the F32x16 mul_add (16-wide FMA per instr) that would lift this 16×.
85+
86+
### TD-T5 · `src/hpc/quantized.rs:618-630` · scalar kernel labeled GEMM
87+
88+
`int8_gemm_i32` is a triple-nested scalar loop. The VNNI dispatch path `int8_gemm_vnni` (lines 46-61 of `vnni_gemm.rs`) exists and is correct (uses `simd_caps().has_avx512_vnni()` and calls `int8_gemm_vnni_avx512`), but it's a separate symbol — nothing routes the public `int8_gemm_i32` callers through it.
89+
90+
### TD-T6 · `src/backend/native.rs:544-561` · scalar-only "avx2" implementations
91+
92+
The `avx2` module's `scal_f32`, `scal_f64`, `nrm2_f32`, `nrm2_f64`, `asum_f32`, `asum_f64` all unconditionally delegate to `super::scalar::*`. The dispatch macro thinks it's dispatching to AVX2 but the body is scalar.
93+
94+
```rust
95+
pub fn scal_f32(alpha: f32, x: &mut [f32]) {
96+
super::scalar::scal_f32(alpha, x); // ← line 545
97+
}
98+
```
99+
100+
Effect: on Haswell–Coffee Lake / Zen 1-3 (AVX2 + FMA but no AVX-512), all of `scal_*`, `nrm2_*`, `asum_*` run scalar. The dispatch macro at lines 92-165 routes through `avx2::name()` which is itself scalar.
101+
102+
### TD-T7 · `src/backend/native.rs:271-278` · GEMV scalar everywhere
103+
104+
`gemv_f32` and `gemv_f64` skip the `dispatch!` macro entirely and call `scalar::gemv_*` unconditionally. No AVX-512, no AVX2, no NEON. Every consumer of the backend GEMV path runs the scalar nested loop on every CPU.
105+
106+
```rust
107+
pub fn gemv_f32(...) {
108+
scalar::gemv_f32(...); // ← line 272
109+
}
110+
```
111+
112+
---
113+
114+
## Findings — HIGH
115+
116+
### TD-T8 · `src/hpc/simd_dispatch.rs:150-163` · aarch64 dispatch = scalar
117+
118+
```rust
119+
#[cfg(target_arch = "aarch64")]
120+
fn detect() -> Self {
121+
let caps = simd_caps();
122+
let tier = if caps.asimd_dotprod { SimdTier::NeonDotProd } else { SimdTier::Neon };
123+
// NEON uses the same scalar wrapper signatures — NEON intrinsics
124+
// will be wired when simd_neon.rs types are activated. For now,
125+
// dispatch to scalar which auto-vectorizes well on aarch64 with
126+
// `-C target-feature=+neon` (mandatory on aarch64).
127+
Self { tier, ..Self::scalar() }
128+
}
129+
```
130+
131+
The frozen dispatch table reports `NeonDotProd` or `Neon` tier to consumers but every function pointer in the struct is the scalar wrapper. Pi 5 / Pi 4 / M2 get the scalar implementations for `byte_find_all`, `byte_count`, `squared_distances_f32`, `nibble_unpack`, `nibble_above_threshold`, `batch_sq_dist`.
132+
133+
### TD-T9 · `src/hpc/simd_dispatch.rs:128-134` · AVX-512 dispatch falls to AVX2 wrappers
134+
135+
Even when `caps.avx512bw` is true, the AVX-512 tier branch fills in 4 of 6 function pointers with AVX2 wrappers:
136+
137+
```rust
138+
if caps.avx512bw {
139+
Self {
140+
tier: SimdTier::Avx512,
141+
byte_find_all: byte_find_all_avx512_wrapper, // ← real
142+
byte_count: byte_count_avx512_wrapper, // ← real
143+
squared_distances_f32: squared_distances_avx2_wrapper, // ← AVX2!
144+
nibble_unpack: nibble_unpack_avx2_wrapper, // ← AVX2!
145+
nibble_above_threshold: nibble_above_threshold_avx2_wrapper, // ← AVX2!
146+
batch_sq_dist: batch_sq_dist_avx2_wrapper, // ← AVX2!
147+
}
148+
}
149+
```
150+
151+
Comment at line 130 admits `// no avx512 variant for 3D dist`. For `nibble_*`, the variant is missing per TD-T17.
152+
153+
### TD-T10 · `src/simd_neon_bf16.rs:149-177` · stub structs that panic
154+
155+
`BF16x8Stub` (line 149) and `BF16x16Stub` (line 156) are placeholder structs whose only method is `unimplemented()` panicking with the message documenting the BFMMLA / BFDOT asm-byte encoding still to wire up: `BFMMLA = 0x6e40_ec00 | (Vm << 16) | (Vn << 5) | Vd`, `BFDOT = 0x4e40_ec00 | (Vm << 16) | (Vn << 5) | Vd`. Module docs at lines 187-204 spell out the implementation plan; nothing is wired.
156+
157+
**Hit on Pi 5 A76, Apple M2+, Snapdragon 8 Gen 2+:** consumers reaching for BF16 NEON ops panic or fall through to scalar `simd_half::BF16x16`.
158+
159+
### TD-T11 · `src/simd_neon_dotprod.rs:115-148` · F16x16 stub
160+
161+
`F16x16Stub` (line 136) is a placeholder; `unimplemented()` panics (line 141-147). Module docs at lines 96-113 give the full intrinsic map (`vfmaq_f16`, `vaddvq_f16`, `vsqrtq_f16`, `vcgtq_f16`) and the stable-Rust asm-byte encoding `0x0e40_cc20` for `fmla v0.8h, v1.8h, v2.8h`.
162+
163+
**Hit on Pi 5 A76, Apple M-series:** consumers reaching `crate::simd::F16x16` get `simd_avx2::F16Scaler` scalar polyfill (line 134 comment) or `simd_nightly::F16x16`.
164+
165+
### TD-T12 · `src/simd.rs:18-26` + `:49-88` · top-level Tier enum collapses
166+
167+
```rust
168+
enum Tier {
169+
Avx512 = 1,
170+
Avx2 = 2,
171+
NeonDotProd = 3,
172+
Neon = 4,
173+
Scalar = 5,
174+
}
175+
176+
fn detect_tier() -> Tier {
177+
if is_x86_feature_detected!("avx512f") { return Tier::Avx512; }
178+
if is_x86_feature_detected!("avx2") { return Tier::Avx2; }
179+
...
180+
}
181+
```
182+
183+
Skylake-X (no VNNI / VBMI / BF16 / FP16 / AMX) and Granite Rapids (all of them) both → `Tier::Avx512`. Arrow Lake (`avxvnniint8`, no AVX-512F) → `Tier::Avx2`. Every caller of `tier()` (line 97) gets a coarse answer.
184+
185+
Mitigation: `simd_caps()` at `src/hpc/simd_caps.rs:98` exists with 20 per-feature bits — but it's a separate dispatch channel, and consumers who use `tier()` don't see the sub-features.
186+
187+
### TD-T13 · `src/backend/native.rs:22-26` · second Tier enum, same collapse
188+
189+
Backend defines its own `Tier { Avx512, Avx2, Scalar }` enum (line 21-26), independent of the one in `simd.rs:18`. Same 3-bucket collapse. Same lack of VNNI / VBMI / BF16 / FP16 / AMX awareness.
190+
191+
### TD-T14 · `src/hpc/simd_dispatch.rs:30-49` · third Tier enum, same collapse
192+
193+
`SimdTier { Avx512, Avx2, Sse2, NeonDotProd, Neon, Scalar, WasmSimd128 }` — 7 variants, but `detect()` at lines 121-148 only branches on `caps.avx512bw` and `caps.avx2`. SSE2 never selected. No AVX-512-VNNI / VBMI / BF16 / FP16 / AMX paths.
194+
195+
Three independent Tier enums total (TD-T12, TD-T13, TD-T14).
196+
197+
### TD-T15 · `src/simd.rs:291-292 + 531-532` · BF16x16 polyfill-not-max under default config
198+
199+
```rust
200+
// 291: hardware-native, ONLY if compile-time avx512bf16 is on
201+
#[cfg(all(target_arch = "x86_64", target_feature = "avx512bf16", not(feature = "nightly-simd")))]
202+
pub use crate::simd_avx512::{BF16x16, BF16x8};
203+
204+
// 531: scalar polyfill, the default
205+
#[cfg(all(feature = "std", not(all(target_arch = "x86_64", target_feature = "avx512bf16"))))]
206+
pub use crate::simd_half::BF16x16;
207+
```
208+
209+
The cargo default is `x86-64-v3` (per `.cargo/config.toml:25`), which is AVX2 only — no AVX-512F, definitely no avx512bf16. So even on Sapphire Rapids / Zen 4 silicon under default cargo, `crate::simd::BF16x16` resolves to scalar `simd_half::BF16x16`.
210+
211+
Compile-time gate where runtime dispatch would lift the entire AVX-512 + BF16 install base out of the scalar polyfill.
212+
213+
---
214+
215+
## Findings — MEDIUM
216+
217+
### TD-T16 · `src/hpc/nibble.rs:23-41, 227-237` · nibble ops cap at AVX2
218+
219+
`nibble_unpack` (line 23) and `nibble_above_threshold` (line 227) check `caps.avx2` only — no AVX-512 path. Sapphire Rapids / Ice Lake / Zen 4 process 32 nibbles per AVX2 iteration when 64 per AVX-512BW iteration would be possible.
220+
221+
### TD-T17 · `src/hpc/nibble.rs:59-94, 169-189, 257-278` · "AVX2" funcs are scalar loops
222+
223+
`nibble_unpack_avx2` (line 59), `nibble_sub_clamp_avx2` (line 170), `nibble_above_threshold_avx2` (line 258) all carry `#[target_feature(enable = "avx2")]` but their bodies are plain scalar loops:
224+
225+
```rust
226+
#[target_feature(enable = "avx2")]
227+
pub(crate) unsafe fn nibble_unpack_avx2(packed: &[u8], count: usize, out: &mut Vec<u8>) {
228+
// ...
229+
for j in 0..16 {
230+
lo[j] = data[j] & 0x0F; // ← scalar loop
231+
hi[j] = (data[j] >> 4) & 0x0F;
232+
}
233+
// ...
234+
}
235+
```
236+
237+
The autovectorizer may emit reasonable code, but this is not true `_mm256_*` intrinsics. `nibble_sub_clamp_avx512` at line 197 IS real (uses `U8x64::saturating_sub`). So nibble has one real SIMD path and two pretend-SIMD paths.
238+
239+
### TD-T18 · `src/simd.rs:479-486` · simd_ln_f32 is a scalar loop
240+
241+
```rust
242+
pub fn simd_ln_f32(x: F32x16) -> F32x16 {
243+
let arr = x.to_array();
244+
let mut out = [0.0f32; 16];
245+
for i in 0..16 {
246+
out[i] = arr[i].ln(); // ← scalar per-lane
247+
}
248+
F32x16::from_array(out)
249+
}
250+
```
251+
252+
`simd_exp_f32` at lines 419-450 is a real Remez polynomial with FMA via `mul_add` chain. `simd_ln_f32` is its asymmetric scalar twin. A consumer thinking they're getting 16-wide log gets 16× scalar `ln`.
253+
254+
### TD-T19 · `src/hpc/distance.rs:101` · single tier, no AVX-512
255+
256+
The 3D `squared_distances` function checks `caps.avx2` only — line 101: `if super::simd_caps::simd_caps().avx2`. No AVX-512F variant. Sapphire Rapids etc. fall to AVX2 8-wide instead of AVX-512 16-wide.
257+
258+
### TD-T20 · `src/hpc/spatial_hash.rs:273` · same as TD-T19
259+
260+
`batch_sq_dist` checks `caps.avx2` only. No AVX-512F variant.
261+
262+
### TD-T21 · `src/simd.rs:351-354` · aarch64 integers come from scalar
263+
264+
```rust
265+
#[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))]
266+
pub use scalar::{
267+
f32x8, f64x4, i32x16, i32x8, i64x4, i64x8, u16x16, u32x16, u32x8, u64x4, u64x8, u8x64,
268+
F32x8, F64x4, I32x16, I32x8, I64x4, I64x8, U16x16, U16x32, U32x16, U32x8, U64x4, U64x8, U8x64,
269+
};
270+
```
271+
272+
On aarch64, the only types from `simd_neon::aarch64_simd` (line 349) are `f32x16, f64x8, F32Mask16, F32x16, F64Mask8, F64x8`. Every integer width — `I32x16`, `I8x32`, `U8x64`, `U16x32`, etc. — comes from `scalar::*`. Pi 5 / M2 get scalar integer SIMD even though NEON has `int32x4_t`, `uint8x16_t`, etc.
273+
274+
### TD-T22 · `src/simd.rs:310, 318-321` · 256-bit int types in AVX2 build come from `simd_avx512`
275+
276+
```rust
277+
// 310: AVX2-baseline arm uses simd_avx512 for the 256-bit shapes
278+
pub use crate::simd_avx512::{f32x8, f64x4, i16x16, i8x32, F32x8, F64x4, I16x16, I8x32};
279+
```
280+
281+
Inverted naming: `I32x8` / `U32x8` / `I64x4` / `U64x4` (the natural AVX2 widths) come from `simd_avx2.rs` (which polyfills them as scalar storage with `[u32; 8]` arrays per the comment in the AMX matmul work), not from native `__m256i`. The polyfill IS the AVX2 module on AVX2 builds — verify whether the AVX2 module's polyfills wrap real `_mm256_*` intrinsics or scalar arrays. (Audit pending — requires reading `src/simd_avx2.rs`.)
282+
283+
---
284+
285+
## Verified — code is correct (rejected agent claims)
286+
287+
### `src/simd_amx.rs:282-301` · AVX-VNNI-INT8 dispatch IS done
288+
289+
`matvec_dispatch` correctly routes `is_x86_feature_detected!("avxvnniint8")` to `vnni2_matvec` (256-bit VPDPBUSD path) when avx512vnni is absent. No debt.
290+
291+
### `src/simd_avx512.rs:695-710` · VBMI dispatch IS done
292+
293+
`permute_bytes` checks `if simd_caps().avx512vbmi { permute_bytes_vbmi(...) } else { scalar }`. Native `_mm512_permutexvar_epi8` reaches Ice Lake / SPR / Zen 4. The scalar branch is correct fallback for Skylake-X / Cascade Lake / Cooper Lake. No debt.
294+
295+
### `src/hpc/bgz17_bridge.rs:43-86` · multi-versioning is correct
296+
297+
5 dispatch sites at lines 78, 142, 197, 250, 349 each route `avx512f → avx2 → scalar` with proper `#[target_feature]` annotations on the inner functions. The `is_x86_feature_detected!("avx512f")` vs `avx2` granularity is appropriate for L1 absolute-difference kernels — VNNI / VBMI / BF16 don't help on `abs(a-b)` reductions. No tier-collapse debt here.
298+
299+
### `src/hpc/vnni_gemm.rs:46-61` · VNNI dispatch is correct
300+
301+
`int8_gemm_vnni` checks `simd_caps().has_avx512_vnni()` and calls `int8_gemm_vnni_avx512`. The only debt is that other paths (TD-T3, TD-T5) don't route through this function.
302+
303+
### `src/hpc/p64_bridge.rs:109` · VPOPCNTDQ dispatch is correct
304+
305+
`simd_caps().avx512vpopcntdq` runtime-detected. No debt at this site.
306+
307+
### `src/hpc/cam_pq.rs:202, 215` · AVX-512F dispatch is correct
308+
309+
`simd_caps().avx512f` runtime-detected. No debt at this site.
310+
311+
### `src/hpc/aabb.rs:284, 440` · AVX-512F + SSE2 dispatch present
312+
313+
Dispatches `avx512f` then falls to `sse2`. Missing intermediate AVX2 — `aabb` uses AVX-512F at one tier and SSE2 at the other. Probably acceptable for AABB (BV-shape ops), but a sub-finding to investigate on next sweep.
314+
315+
---
316+
317+
## Prioritized action list
318+
319+
| ID | Severity | Effort | Description |
320+
|---|---|---|---|
321+
| TD-T1 | CRIT | 1h | Wire `matmul_bf16_to_f32` to `bf16_tile_gemm_16x16` |
322+
| TD-T2 | CRIT | 30m (after T1) | Same for `matmul_f32` |
323+
| TD-T3 | CRIT | 1.5h | Wire `matmul_i8_to_i32` to AMX tile / VNNI fallback |
324+
| TD-T5 | CRIT | 30m | Route `int8_gemm_i32` callers through `int8_gemm_vnni` |
325+
| TD-T4 | CRIT | 3-4h | Rewrite `bf16_gemm_f32` with F32x16 mul_add + tiling |
326+
| TD-T6 | CRIT | 2h | Implement `avx2::{scal,nrm2,asum}_*` with real AVX2 intrinsics |
327+
| TD-T7 | CRIT | 2h | Implement `gemv_f32`/`gemv_f64` with tier dispatch |
328+
| TD-T8 | HIGH | 4-6h | Wire `simd_dispatch.rs` aarch64 tier to real NEON impls |
329+
| TD-T9 | HIGH | 2-3h | Add AVX-512 variants for `squared_distances`, `nibble_*`, `batch_sq_dist` |
330+
| TD-T10 | HIGH | 3-4h | Implement `BF16x8/16` NEON via asm-byte BFMMLA/BFDOT |
331+
| TD-T11 | HIGH | 3-4h | Implement `F16x16` NEON via asm-byte fmla v.8h |
332+
| TD-T15 | HIGH | 4-6h | Convert `BF16x16` from compile-time `target_feature` gate to runtime dispatch |
333+
| TD-T16 | MED | 1.5h | Add AVX-512BW variants for `nibble_unpack` / `nibble_above_threshold` |
334+
| TD-T17 | MED | 2h | Replace scalar-loop "avx2" funcs in nibble with `_mm256_*` intrinsics |
335+
| TD-T18 | MED | 2h | Rewrite `simd_ln_f32` as real Remez polynomial like `simd_exp_f32` |
336+
| TD-T19 | MED | 1h | Add AVX-512F path to `distance::squared_distances_f32` |
337+
| TD-T20 | MED | 1h | Same for `spatial_hash::batch_sq_dist` |
338+
| TD-T21 | HIGH | 8-12h | Replace aarch64 scalar integer types in `simd.rs` with NEON impls |
339+
| TD-T22 ||| Investigation only — needs `simd_avx2.rs` read first |
340+
| TD-T12/T13/T14 | HIGH | (audit-wide) | Consolidate three Tier enums OR route all callers through `simd_caps()` for sub-feature dispatch |
341+
342+
## Next-sweep targets (unread)
343+
344+
These files are listed in the dispatch site grep but not yet read for this audit. Findings in them are unverified:
345+
346+
- Full `src/simd_avx512.rs`, `simd_avx2.rs`, `simd_neon.rs`, `simd_scalar.rs`, `simd_half.rs`
347+
- HPC SIMD-consuming: `vml.rs`, `activations.rs`, `reductions.rs`, `kernels.rs`, `fft.rs`
348+
- HPC suspected scalar: `statistics.rs`, `lapack.rs`, `blas_level{1,2,3}.rs`
349+
- HPC dispatch sites with `is_x86_feature_detected!`: `cam_pq.rs`, `palette_distance.rs`, `aabb.rs`, `distance.rs`, `bitwise.rs`, `p64_bridge.rs`, `spatial_hash.rs`, `jitson_cranelift/detect.rs`
350+
- All 34 `src/hpc/styles/*` primitives
351+
352+
The most likely-debt-rich unread targets:
353+
354+
1. `src/hpc/blas_level{1,2,3}.rs` — grep showed NO use of `crate::simd::*` types. The flagship BLAS public API may be entirely scalar (separate audit needed).
355+
2. `src/hpc/statistics.rs`, `lapack.rs` — same, no `crate::simd::*` use.
356+
3. `src/simd_avx2.rs` — the 256-bit polyfills for 512-bit types. TD-T22 needs this read to know whether the polyfills are real `__m256i` intrinsics or scalar arrays under `#[target_feature]`.
357+

0 commit comments

Comments
 (0)