Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
62 changes: 52 additions & 10 deletions .claude/knowledge/simd-dispatch-architecture.md
Original file line number Diff line number Diff line change
Expand Up @@ -144,31 +144,73 @@ tracked as TD-SIMD-3.)
| Lane type | `simd_avx512` (v4) | `simd_avx2` (v3) | `simd_neon` (aarch64) | `simd_nightly` | `scalar` |
|---|---|---|---|---|---|
| `F32x16` | ✅ `__m512` | 🟡 `(f32x8, f32x8)` | 🟡 `[float32x4_t; 4]` | 🔵 `core::simd::f32x16` | ✅ `[f32; 16]` |
| `F32x8` | ✅ `__m256` | | ⛔ | 🔵 | ✅ |
| `F32x8` | ✅ `__m256` | ✅ `__m256` (in `simd_avx512`) | ⛔ | 🔵 | ✅ |
| `F64x8` | ✅ `__m512d` | 🟡 `(f64x4, f64x4)` | 🟡 `[float64x2_t; 4]` | 🔵 | ✅ |
| `F64x4` | ✅ `__m256d` | | ⛔ | 🔵 | ✅ |
| `F64x4` | ✅ `__m256d` | ✅ `__m256d` (in `simd_avx512`) | ⛔ | 🔵 | ✅ |
| `U8x64` | ✅ `__m512i` | 🟠 `[u8; 64]` polyfill | ❌ | 🔵 | ✅ |
| `U8x32` | ✅ `__m256i` | ✅ `__m256i` | ❌ | 🔵 | ✅ |
| `U16x32` | ✅ `__m512i` | 🟠 `[u16; 32]` polyfill | ❌ | 🔵 | ✅ |
| `U16x16` | ❌ | ❌ | ❌ | ❌ | ❌ |
| `U32x16` | ✅ `__m512i` | 🟠 `[u32; 16]` polyfill | ❌ | 🔵 | ✅ |
| `U32x8` | ❌ | ❌ | ❌ | 🔵 `core::simd::u32x8` | ❌ |
| `U64x8` | ✅ `__m512i` | 🟠 `[u64; 8]` polyfill | ❌ | 🔵 | ✅ |
| `U64x4` | ❌ | ❌ | ❌ | 🔵 `core::simd::u64x4` | ❌ |
| `I8x32` | ✅ `__m256i` | ✅ `__m256i` (in `simd_avx512`) | ❌ | 🔵 | ✅ |
| `I8x64` | ✅ `__m512i` | 🟠 `[i8; 64]` polyfill | ❌ | 🔵 | ✅ |
| `I16x16` | ✅ `__m256i` | ✅ `__m256i` (in `simd_avx512`) | ❌ | 🔵 | ✅ |
| `I16x32` | ✅ `__m512i` | 🟠 `[i16; 32]` polyfill | ❌ | 🔵 | ✅ |
| `I32x16` | ✅ `__m512i` | 🟠 `[i32; 16]` polyfill | ❌ | 🔵 | ✅ |
| `I32x8` | ❌ | ❌ | ❌ | ❌ | ❌ |
| `I64x8` | ✅ `__m512i` | 🟠 `[i64; 8]` polyfill | ❌ | 🔵 | ✅ |
| `I64x4` | ❌ | ❌ | ❌ | ❌ | ❌ |
| `BF16x8` | ✅ `__m128bh` | ❌ | ❌ | 🔵 | ✅ |
| `BF16x16` | ✅ `__m256bh` | ❌ | ❌ | 🔵 | ✅ |
| `F16x16` | ❌ | 🟡 `F16Scaler` (scalar) | | 🔵 | ✅ |
| `F16x16` | ❌ | 🟠 `[u16; 16]` polyfill (`simd_half`) | 🟠 `[u16; 16]` polyfill (`simd_half`) | 🔵 | ✅ |
| `F32Mask16` | ✅ `__mmask16` | ✅ `u16` bitmask | ✅ `u16` bitmask | 🔵 | ✅ |
| `F32Mask8` | ❌ exposed (avail via `__mmask8`) | ❌ exposed (avail in `simd_scalar` as `F32Mask8Scalar`) | ❌ exposed | 🔵 | ✅ via `F32Mask8Scalar` |
| `F64Mask8` | ✅ `__mmask8` | ✅ `u8` bitmask | ✅ `u8` bitmask | 🔵 | ✅ |

**Aarch64-native narrower types** (only useful directly when the
consumer wants 128-bit shapes): `I8x16`, `I16x8`, `U8x16`, `U16x8`,
`U32x4`, `U64x2`, `I32x4`, `I64x2`. These are not in the cross-arch
parity surface — consumers requesting 256-bit / 512-bit shapes go
through the composed wrappers.
| `F64Mask4` | ❌ exposed (avail via `__mmask8`) | ❌ exposed (avail in `simd_scalar` as `F64Mask4Scalar`) | ❌ exposed | 🔵 | ✅ via `F64Mask4Scalar` |

### Sub-byte lanes (not a SIMD wrapper anywhere)

**`I4` / `U4`** — 4-bit (nibble) lanes used by INT4 quantized inference
(GGUF Q4_0 / Q4_K, GPTQ, AWQ). No first-class wrapper exists or is
planned. Consumers pack 2× nibbles per byte and operate through
`U8x64` with `shr_epi16` + `& 0x0F` masks; the same trick gives them
the 128- and 64-byte shapes via the existing AVX-512 / AVX2 / NEON
paths. If a first-class `I4x128` were ever wanted, AVX-512 VBMI2's
`VPCOMPRESSB` + `VPEXPANDB` and AVX-512 IFMA's `VPMADD52` give the
hardware story; on aarch64 there's no native nibble support and the
shr+mask trick stays. Tracked as TD-SIMD-11 if a consumer files for it.

### Aarch64-native narrower types

Only useful directly when the consumer wants 128-bit shapes:
`I8x16`, `I16x8`, `U8x16`, `U16x8`, `U32x4`, `U64x2`, `I32x4`, `I64x2`.
These are not in the cross-arch parity surface — consumers requesting
256-bit / 512-bit shapes go through the composed wrappers.

### Gaps surfaced 2026-05-20

- **`F32x8` / `F64x4` are universal on x86**, even on the v3 / AVX2 path
— they share the `__m256` / `__m256d` declarations exposed by
`simd_avx512.rs` (AVX, not AVX-512; works on every host with AVX
support, i.e. Sandy Bridge+). The previous matrix marked them `❌`
in the v3 column — corrected above.
- **`U32x8` / `U64x4`** exist only in `simd_nightly` (via `core::simd`).
No native or polyfill wrapper on x86 or aarch64. Add to `simd_avx512`
+ `simd_scalar` if a consumer needs them at 256-bit width.
- **`I32x8` / `I64x4` / `U16x16`** missing across every backend (incl.
nightly). Theoretical 256-bit shapes that no consumer has reached for
yet; add to backlog if needed.
- **`F32Mask8` / `F64Mask4`** are declared in `simd_scalar` as
`F32Mask8Scalar` / `F64Mask4Scalar` (the rename came from a duplicate-
decl conflict on i686 — see `src/simd_scalar.rs:340-345`). Not
surfaced through `crate::simd::*`. If consumers want these mask
widths, expose them and unify the name (drop the `Scalar` suffix on
AVX-512 where `__mmask8` natively maps to F64Mask8 already; the
256-bit f64 lane width needs a 4-bit mask which `__mmask8` can hold
but isn't yet typed as `F64Mask4`).

### Read of the matrix

Expand Down Expand Up @@ -199,7 +241,7 @@ Ranked by P0 (blocks current CI / consumers) → P3 (nice-to-have).
| **TD-SIMD-5** | **P1** | Scalar fallback inline in `simd.rs` (`pub(crate) mod scalar`) makes symmetry hard — every other backend is its own file. | inspection | Promote to `src/simd_scalar.rs`; `simd.rs` becomes pure dispatch. ~mechanical refactor. |
| **TD-SIMD-6** | **P2** | No `runtime-dispatch` feature / `simd_runtime` module exists yet. Release-binary distribution to heterogeneous silicon requires recompile per target today. | `grep -r "LazyLock<CpuCaps>"` only matches reporting code in `simd.rs:52-55` | New module wiring per-op trampolines from the compiled-in backends. ~300 LoC + one new cargo feature. |
| **TD-SIMD-7** | **P2** | Compile-time arms in `simd.rs:153-194` are duplicated four times (one per type group: F32x16, F64x8, U8x32, BF16x16). Adding a new lane requires copy-pasting four `#[cfg(...)]` arms. | inspection | Single source-of-truth macro emitting the arms. ~one macro_rules!, 50 LoC. |
| **TD-SIMD-8** | **P2** | `F16Scaler` in `simd_avx2.rs:2566` is a scalar implementation masquerading as a SIMD type. Consumers using `F16x16` on v3 get scalar perf without warning. | grep `F16Scaler` | Either gate `F16x16` behind `target_feature = "f16c"` or rename / document the scalar nature. ~20 LoC + docs. |
| **TD-SIMD-8** | **P2** | `F16x16` in `src/simd_half.rs:123` is a scalar `[u16; 16]` polyfill — every arithmetic op upcasts to f32, computes, downcasts. Consumers using `crate::simd::F16x16` get scalar perf even on AVX-512 hardware with `vcvtph2ps` / `vcvtps2ph`. (`F16Scaler` in `simd_avx2.rs:2566` is unrelated — it's a *scaling context* for range-normalizing values before f16 encoding, not the F16x16 SIMD type.) | inspection of `src/simd_half.rs:115-150` | (a) Replace the `[u16; 16]` storage with `__m256i` + `_mm256_cvtph_ps` / `_mm256_cvtps_ph` under `target_feature = "f16c"` (Sapphire Rapids+, all Skylake AVX-512). (b) Add an `F16x16Scalar` alias and route consumers explicitly. (c) Add a doc-warning at the type level pointing at the architecture doc. ~80 LoC. |
| **TD-SIMD-9** | **P3** | No CI matrix entry for the `nightly-simd` polyfill path. | `.github/workflows/ci.yaml` | Add a `nightly-simd-polyfill` job that builds with `--features nightly-simd` on nightly rustc. ~20 LoC YAML. |
| **TD-SIMD-10** | **P3** | No CI matrix entry for `.cargo/config-avx512.toml`. AVX-512 deployment path silently bit-rots between PRs. | `.github/workflows/ci.yaml` | Add an `avx-512-explicit` job using a runner with AVX-512 silicon. ~20 LoC YAML; runner availability TBD. |

Expand Down
11 changes: 11 additions & 0 deletions src/simd_avx2.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2562,6 +2562,17 @@ pub fn f16_kahan_dot(a: &[u16], b: &[u16]) -> f32 {
///
/// Analyzes the input range, computes scale that maps |max| → 1.0,
/// then uses that scale for all encode/decode operations.
///
/// # NOT a SIMD type
///
/// This is a *scaling utility* — it normalizes value ranges before
/// f32 → f16 conversion so the dynamic range maps cleanly into f16's
/// `[-65504, 65504]` window. The SIMD f16 wrapper is `simd_half::F16x16`
/// (also a scalar polyfill on stable — see TD-SIMD-8 in
/// `.claude/knowledge/simd-dispatch-architecture.md`). Earlier versions
/// of the architecture doc's parity matrix mistakenly listed
/// `F16Scaler` in the `F16x16` row's AVX2 column; the two are
/// unrelated.
#[derive(Debug, Clone, Copy)]
pub struct F16Scaler {
/// Multiply by this before f32→f16 (shifts into sweet spot)
Expand Down
23 changes: 22 additions & 1 deletion src/simd_half.rs
Original file line number Diff line number Diff line change
Expand Up @@ -118,7 +118,28 @@ impl BF16x16 {

/// 16 × F16 (IEEE 754 binary16) packed into a scalar array.
///
/// All arithmetic operates via f32 upcast → op → F16 downcast (round-to-nearest-even).
/// # Scalar-perf disclosure (TD-SIMD-8)
///
/// **This is a scalar polyfill, not a SIMD type.** Storage is plain
/// `[u16; 16]` (no `__m256i` / `__m256bh` / `float16x8_t`). Every
/// arithmetic op upcasts to f32, computes lane-by-lane, downcasts back
/// to f16 with round-to-nearest-even — same path on every backend
/// (AVX-512, AVX2, NEON, scalar). Consumers in hot loops should NOT
/// reach for `crate::simd::F16x16` expecting SIMD throughput.
///
/// The hardware-native paths exist on x86 via `_mm256_cvtph_ps` /
/// `_mm256_cvtps_ph` (F16C; Ivy Bridge+) and on aarch64 via
/// `vfmaq_f16` (ARMv8.2-A `+fp16`; Pi 5, Apple, modern Snapdragons).
/// Wiring those into `F16x16` is tracked as TD-SIMD-8 in
/// `.claude/knowledge/simd-dispatch-architecture.md`. Until then, hot
/// loops on f16 should use `core::simd::f16x16` under the `nightly-simd`
/// feature (real `core::simd::*` codegen) or stay in f32 and convert
/// at storage boundaries.
///
/// Not to be confused with `simd_avx2::F16Scaler` — that's a *scaling
/// context* for range-normalizing values before f16 encoding (so the
/// dynamic range maps to f16's `[-65504, 65504]` window without
/// clipping), not a SIMD lane type.
#[derive(Clone, Copy, Debug)]
pub struct F16x16([u16; 16]);

Expand Down
Loading