Commit 2f096d3

Merge pull request #178 from AdaWorldAPI/claude/pr-x-td8-f16-honesty
docs(simd): TD-SIMD-8 — F16 honesty + matrix audit for missing lanes
2 parents a08173b + 63f91df commit 2f096d3

3 files changed

Lines changed: 85 additions & 11 deletions

.claude/knowledge/simd-dispatch-architecture.md

Lines changed: 52 additions & 10 deletions
@@ -144,31 +144,73 @@ tracked as TD-SIMD-3.)
 | Lane type | `simd_avx512` (v4) | `simd_avx2` (v3) | `simd_neon` (aarch64) | `simd_nightly` | `scalar` |
 |---|---|---|---|---|---|
 | `F32x16` |`__m512` | 🟡 `(f32x8, f32x8)` | 🟡 `[float32x4_t; 4]` | 🔵 `core::simd::f32x16` |`[f32; 16]` |
-| `F32x8` |`__m256` | || 🔵 ||
+| `F32x8` |`__m256` | `__m256` (in `simd_avx512`) || 🔵 ||
 | `F64x8` |`__m512d` | 🟡 `(f64x4, f64x4)` | 🟡 `[float64x2_t; 4]` | 🔵 ||
-| `F64x4` |`__m256d` | || 🔵 ||
+| `F64x4` |`__m256d` | `__m256d` (in `simd_avx512`) || 🔵 ||
 | `U8x64` |`__m512i` | 🟠 `[u8; 64]` polyfill || 🔵 ||
 | `U8x32` |`__m256i` |`__m256i` || 🔵 ||
 | `U16x32` |`__m512i` | 🟠 `[u16; 32]` polyfill || 🔵 ||
+| `U16x16` ||||||
 | `U32x16` |`__m512i` | 🟠 `[u32; 16]` polyfill || 🔵 ||
+| `U32x8` |||| 🔵 `core::simd::u32x8` ||
 | `U64x8` |`__m512i` | 🟠 `[u64; 8]` polyfill || 🔵 ||
+| `U64x4` |||| 🔵 `core::simd::u64x4` ||
 | `I8x32` |`__m256i` |`__m256i` (in `simd_avx512`) || 🔵 ||
 | `I8x64` |`__m512i` | 🟠 `[i8; 64]` polyfill || 🔵 ||
 | `I16x16` |`__m256i` |`__m256i` (in `simd_avx512`) || 🔵 ||
 | `I16x32` |`__m512i` | 🟠 `[i16; 32]` polyfill || 🔵 ||
 | `I32x16` |`__m512i` | 🟠 `[i32; 16]` polyfill || 🔵 ||
+| `I32x8` ||||||
 | `I64x8` |`__m512i` | 🟠 `[i64; 8]` polyfill || 🔵 ||
+| `I64x4` ||||||
 | `BF16x8` |`__m128bh` ||| 🔵 ||
 | `BF16x16` |`__m256bh` ||| 🔵 ||
-| `F16x16` || 🟡 `F16Scaler` (scalar) | | 🔵 ||
+| `F16x16` || 🟠 `[u16; 16]` polyfill (`simd_half`) | 🟠 `[u16; 16]` polyfill (`simd_half`) | 🔵 ||
 | `F32Mask16` |`__mmask16` |`u16` bitmask |`u16` bitmask | 🔵 ||
+| `F32Mask8` | ❌ exposed (avail via `__mmask8`) | ❌ exposed (avail in `simd_scalar` as `F32Mask8Scalar`) | ❌ exposed | 🔵 | ✅ via `F32Mask8Scalar` |
 | `F64Mask8` |`__mmask8` |`u8` bitmask |`u8` bitmask | 🔵 ||
-
-**Aarch64-native narrower types** (only useful directly when the
-consumer wants 128-bit shapes): `I8x16`, `I16x8`, `U8x16`, `U16x8`,
-`U32x4`, `U64x2`, `I32x4`, `I64x2`. These are not in the cross-arch
-parity surface — consumers requesting 256-bit / 512-bit shapes go
-through the composed wrappers.
+| `F64Mask4` | ❌ exposed (avail via `__mmask8`) | ❌ exposed (avail in `simd_scalar` as `F64Mask4Scalar`) | ❌ exposed | 🔵 | ✅ via `F64Mask4Scalar` |
+
+### Sub-byte lanes (not a SIMD wrapper anywhere)
+
+**`I4` / `U4`** — 4-bit (nibble) lanes used by INT4 quantized inference
+(GGUF Q4_0 / Q4_K, GPTQ, AWQ). No first-class wrapper exists or is
+planned. Consumers pack 2× nibbles per byte and operate through
+`U8x64` with `shr_epi16` + `& 0x0F` masks; the same trick gives them
+the 128- and 64-byte shapes via the existing AVX-512 / AVX2 / NEON
+paths. If a first-class `I4x128` were ever wanted, AVX-512 VBMI2's
+`VPCOMPRESSB` + `VPEXPANDB` and AVX-512 IFMA's `VPMADD52` give the
+hardware story; on aarch64 there's no native nibble support and the
+shr+mask trick stays. Tracked as TD-SIMD-11 if a consumer files for it.
+
+### Aarch64-native narrower types
+
+Only useful directly when the consumer wants 128-bit shapes:
+`I8x16`, `I16x8`, `U8x16`, `U16x8`, `U32x4`, `U64x2`, `I32x4`, `I64x2`.
+These are not in the cross-arch parity surface — consumers requesting
+256-bit / 512-bit shapes go through the composed wrappers.
+
+### Gaps surfaced 2026-05-20
+
+- **`F32x8` / `F64x4` are universal on x86**, even on the v3 / AVX2 path
+— they share the `__m256` / `__m256d` declarations exposed by
+`simd_avx512.rs` (AVX, not AVX-512; works on every host with AVX
+support, i.e. Sandy Bridge+). The previous matrix marked them ``
+in the v3 column — corrected above.
+- **`U32x8` / `U64x4`** exist only in `simd_nightly` (via `core::simd`).
+No native or polyfill wrapper on x86 or aarch64. Add to `simd_avx512`
++ `simd_scalar` if a consumer needs them at 256-bit width.
+- **`I32x8` / `I64x4` / `U16x16`** missing across every backend (incl.
+nightly). Theoretical 256-bit shapes that no consumer has reached for
+yet; add to backlog if needed.
+- **`F32Mask8` / `F64Mask4`** are declared in `simd_scalar` as
+`F32Mask8Scalar` / `F64Mask4Scalar` (the rename came from a duplicate-
+decl conflict on i686 — see `src/simd_scalar.rs:340-345`). Not
+surfaced through `crate::simd::*`. If consumers want these mask
+widths, expose them and unify the name (drop the `Scalar` suffix on
+AVX-512 where `__mmask8` natively maps to F64Mask8 already; the
+256-bit f64 lane width needs a 4-bit mask which `__mmask8` can hold
+but isn't yet typed as `F64Mask4`).

 ### Read of the matrix

@@ -199,7 +241,7 @@ Ranked by P0 (blocks current CI / consumers) → P3 (nice-to-have).
 | **TD-SIMD-5** | **P1** | Scalar fallback inline in `simd.rs` (`pub(crate) mod scalar`) makes symmetry hard — every other backend is its own file. | inspection | Promote to `src/simd_scalar.rs`; `simd.rs` becomes pure dispatch. ~mechanical refactor. |
 | **TD-SIMD-6** | **P2** | No `runtime-dispatch` feature / `simd_runtime` module exists yet. Release-binary distribution to heterogeneous silicon requires recompile per target today. | `grep -r "LazyLock<CpuCaps>"` only matches reporting code in `simd.rs:52-55` | New module wiring per-op trampolines from the compiled-in backends. ~300 LoC + one new cargo feature. |
 | **TD-SIMD-7** | **P2** | Compile-time arms in `simd.rs:153-194` are duplicated four times (one per type group: F32x16, F64x8, U8x32, BF16x16). Adding a new lane requires copy-pasting four `#[cfg(...)]` arms. | inspection | Single source-of-truth macro emitting the arms. ~one macro_rules!, 50 LoC. |
-| **TD-SIMD-8** | **P2** | `F16Scaler` in `simd_avx2.rs:2566` is a scalar implementation masquerading as a SIMD type. Consumers using `F16x16` on v3 get scalar perf without warning. | grep `F16Scaler` | Either gate `F16x16` behind `target_feature = "f16c"` or rename / document the scalar nature. ~20 LoC + docs. |
+| **TD-SIMD-8** | **P2** | `F16x16` in `src/simd_half.rs:123` is a scalar `[u16; 16]` polyfill — every arithmetic op upcasts to f32, computes, downcasts. Consumers using `crate::simd::F16x16` get scalar perf even on AVX-512 hardware with `vcvtph2ps` / `vcvtps2ph`. (`F16Scaler` in `simd_avx2.rs:2566` is unrelated — it's a *scaling context* for range-normalizing values before f16 encoding, not the F16x16 SIMD type.) | inspection of `src/simd_half.rs:115-150` | (a) Replace the `[u16; 16]` storage with `__m256i` + `_mm256_cvtph_ps` / `_mm256_cvtps_ph` under `target_feature = "f16c"` (F16C; Ivy Bridge+). (b) Add an `F16x16Scalar` alias and route consumers explicitly. (c) Add a doc-warning at the type level pointing at the architecture doc. ~80 LoC. |
 | **TD-SIMD-9** | **P3** | No CI matrix entry for the `nightly-simd` polyfill path. | `.github/workflows/ci.yaml` | Add a `nightly-simd-polyfill` job that builds with `--features nightly-simd` on nightly rustc. ~20 LoC YAML. |
 | **TD-SIMD-10** | **P3** | No CI matrix entry for `.cargo/config-avx512.toml`. AVX-512 deployment path silently bit-rots between PRs. | `.github/workflows/ci.yaml` | Add an `avx-512-explicit` job using a runner with AVX-512 silicon. ~20 LoC YAML; runner availability TBD. |

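Reviewer sketch for the "Sub-byte lanes" paragraph above: a scalar illustration of the shift-and-mask nibble unpack that the doc says consumers run through `U8x64`. The function names are hypothetical and not part of the crate, and treating the low nibble as the first element of each byte is an assumption (real formats such as GGUF define their own ordering).

```rust
/// Split each packed byte into its low and high 4-bit lanes.
/// Hypothetical helper; the low-nibble-first order is an assumption.
pub fn unpack_nibbles(packed: &[u8]) -> (Vec<u8>, Vec<u8>) {
    // A SIMD version has no 8-bit shift, so it shifts 16-bit lanes right
    // by 4 and masks with 0x0F; per byte the effect is the same as here.
    let lo = packed.iter().map(|b| b & 0x0F).collect();
    let hi = packed.iter().map(|b| (b >> 4) & 0x0F).collect();
    (lo, hi)
}

/// Interpret a 4-bit value (0..=15) as a signed two's-complement I4 lane.
pub fn sign_extend_i4(nibble: u8) -> i8 {
    // Move the nibble into the top four bits, then arithmetic-shift back
    // down so bit 3 is replicated into the upper bits.
    ((nibble << 4) as i8) >> 4
}
```

The same two operations map onto one shift and one AND per vector on each backend, which is why the doc argues no dedicated `I4` / `U4` wrapper is needed.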
src/simd_avx2.rs

Lines changed: 11 additions & 0 deletions
@@ -2562,6 +2562,17 @@ pub fn f16_kahan_dot(a: &[u16], b: &[u16]) -> f32 {
 ///
 /// Analyzes the input range, computes scale that maps |max| → 1.0,
 /// then uses that scale for all encode/decode operations.
+///
+/// # NOT a SIMD type
+///
+/// This is a *scaling utility* — it normalizes value ranges before
+/// f32 → f16 conversion so the dynamic range maps cleanly into f16's
+/// `[-65504, 65504]` window. The SIMD f16 wrapper is `simd_half::F16x16`
+/// (also a scalar polyfill on stable — see TD-SIMD-8 in
+/// `.claude/knowledge/simd-dispatch-architecture.md`). Earlier versions
+/// of the architecture doc's parity matrix mistakenly listed
+/// `F16Scaler` in the `F16x16` row's AVX2 column; the two are
+/// unrelated.
 #[derive(Debug, Clone, Copy)]
 pub struct F16Scaler {
     /// Multiply by this before f32→f16 (shifts into sweet spot)

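Reviewer sketch of the "scaling context" idea in the doc comment above: find the largest magnitude in the input, derive a scale that maps it to 1.0, and reuse that scale on both sides of the f16 round trip. `RangeScaler` and its methods are hypothetical stand-ins, not the crate's `F16Scaler`, whose fields and API are only partly visible in this diff.

```rust
/// Hypothetical range-normalizing context; not the crate's `F16Scaler`.
#[derive(Debug, Clone, Copy)]
pub struct RangeScaler {
    /// Multiply by this before the f32 -> f16 conversion.
    pub scale: f32,
    /// Multiply by this after the f16 -> f32 conversion.
    pub inv_scale: f32,
}

impl RangeScaler {
    /// Analyze the input range and map |max| to 1.0.
    pub fn from_values(values: &[f32]) -> Self {
        let max_abs = values
            .iter()
            .filter(|v| v.is_finite())
            .fold(0.0f32, |m, v| m.max(v.abs()));
        if max_abs == 0.0 {
            // Empty or all-zero input: fall back to the identity scale.
            return Self { scale: 1.0, inv_scale: 1.0 };
        }
        Self { scale: 1.0 / max_abs, inv_scale: max_abs }
    }

    /// Value to hand to the f16 encoder.
    pub fn normalize(&self, v: f32) -> f32 {
        v * self.scale
    }

    /// Undo the scaling after decoding from f16.
    pub fn restore(&self, v: f32) -> f32 {
        v * self.inv_scale
    }
}
```

Nothing here touches vector registers, which is the point the new doc comment makes: a type like this prepares values for encoding and is not a lane type.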
src/simd_half.rs

Lines changed: 22 additions & 1 deletion
@@ -118,7 +118,28 @@ impl BF16x16 {

 /// 16 × F16 (IEEE 754 binary16) packed into a scalar array.
 ///
-/// All arithmetic operates via f32 upcast → op → F16 downcast (round-to-nearest-even).
+/// # Scalar-perf disclosure (TD-SIMD-8)
+///
+/// **This is a scalar polyfill, not a SIMD type.** Storage is plain
+/// `[u16; 16]` (no `__m256i` / `__m256bh` / `float16x8_t`). Every
+/// arithmetic op upcasts to f32, computes lane-by-lane, downcasts back
+/// to f16 with round-to-nearest-even — same path on every backend
+/// (AVX-512, AVX2, NEON, scalar). Consumers in hot loops should NOT
+/// reach for `crate::simd::F16x16` expecting SIMD throughput.
+///
+/// The hardware-native paths exist on x86 via `_mm256_cvtph_ps` /
+/// `_mm256_cvtps_ph` (F16C; Ivy Bridge+) and on aarch64 via
+/// `vfmaq_f16` (ARMv8.2-A `+fp16`; Pi 5, Apple, modern Snapdragons).
+/// Wiring those into `F16x16` is tracked as TD-SIMD-8 in
+/// `.claude/knowledge/simd-dispatch-architecture.md`. Until then, hot
+/// loops on f16 should use `core::simd::f16x16` under the `nightly-simd`
+/// feature (real `core::simd::*` codegen) or stay in f32 and convert
+/// at storage boundaries.
+///
+/// Not to be confused with `simd_avx2::F16Scaler` — that's a *scaling
+/// context* for range-normalizing values before f16 encoding (so the
+/// dynamic range maps to f16's `[-65504, 65504]` window without
+/// clipping), not a SIMD lane type.
 #[derive(Clone, Copy, Debug)]
 pub struct F16x16([u16; 16]);

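Reviewer sketch of the "upcast to f32, compute lane-by-lane, downcast with round-to-nearest-even" path that the new doc comment describes. Everything below is hypothetical: `F16x16Sketch` and the two conversion helpers are hand-written stand-ins, and the crate's real `F16x16` methods are not shown in this diff.

```rust
/// Decode IEEE 754 binary16 bits to f32 (exact for every input).
pub fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h >> 15) as u32) << 31;
    let exp = ((h >> 10) & 0x1F) as u32;
    let frac = (h & 0x3FF) as u32;
    match (exp, frac) {
        (0, 0) => f32::from_bits(sign),
        (0, _) => {
            // Subnormal: value is frac * 2^-24.
            let v = frac as f32 * (1.0 / 16_777_216.0);
            if sign != 0 { -v } else { v }
        }
        (0x1F, _) => f32::from_bits(sign | 0x7F80_0000 | (frac << 13)),
        // Normal: rebias the exponent from 15 to 127.
        _ => f32::from_bits(sign | ((exp + 112) << 23) | (frac << 13)),
    }
}

/// Encode f32 to binary16 bits with round-to-nearest-even.
pub fn f32_to_f16_bits(value: f32) -> u16 {
    let x = value.to_bits();
    let half_sign = (x & 0x8000_0000) >> 16;
    let exp = x & 0x7F80_0000;
    let man = x & 0x007F_FFFF;
    if exp == 0x7F80_0000 {
        // Infinity or NaN; force a mantissa bit so NaN stays NaN.
        let nan_bit = if man == 0 { 0 } else { 0x0200 };
        return (half_sign | 0x7C00 | nan_bit | (man >> 13)) as u16;
    }
    let half_exp = ((exp >> 23) as i32) - 127 + 15;
    if half_exp >= 0x1F {
        return (half_sign | 0x7C00) as u16; // overflow to infinity
    }
    if half_exp <= 0 {
        if 14 - half_exp > 24 {
            return half_sign as u16; // underflow to signed zero
        }
        // Subnormal result: shift in the implicit leading one.
        let man = man | 0x0080_0000;
        let mut half_man = man >> (14 - half_exp);
        let round_bit = 1u32 << (13 - half_exp);
        if (man & round_bit) != 0 && (man & (3 * round_bit - 1)) != 0 {
            half_man += 1;
        }
        return (half_sign | half_man) as u16;
    }
    let bits = half_sign | ((half_exp as u32) << 10) | (man >> 13);
    let round_bit = 0x0000_1000u32;
    // Round up when the dropped part exceeds half, or equals half and
    // the kept mantissa is odd (ties to even).
    if (man & round_bit) != 0 && (man & (3 * round_bit - 1)) != 0 {
        (bits + 1) as u16
    } else {
        bits as u16
    }
}

/// Hypothetical stand-in for a `[u16; 16]` scalar polyfill lane type.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F16x16Sketch(pub [u16; 16]);

impl F16x16Sketch {
    pub fn splat(v: f32) -> Self {
        Self([f32_to_f16_bits(v); 16])
    }

    /// Lane-wise add: 16 scalar upcast/add/downcast round trips.
    pub fn add(self, rhs: Self) -> Self {
        let mut out = [0u16; 16];
        for i in 0..16 {
            let sum = f16_bits_to_f32(self.0[i]) + f16_bits_to_f32(rhs.0[i]);
            out[i] = f32_to_f16_bits(sum);
        }
        Self(out)
    }
}
```

Each `add` performs sixteen scalar conversions in both directions, which illustrates why the doc comment steers hot loops toward f32 with conversion at storage boundaries until a hardware path is wired in.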