# F16 honesty (TD-SIMD-8)
`src/simd_half.rs` F16x16: the docstring now explicitly discloses the
scalar storage and directs hot loops to `core::simd::f16x16` (under
`nightly-simd`) or to fp32 with conversion at the boundaries. It also
disambiguates F16x16 from `simd_avx2::F16Scaler`, which is a scaling
CONTEXT for range-normalizing values before f16 encoding, not the F16x16
SIMD type. Both files cross-reference each other so a future reader
doesn't repeat the confusion.
`src/simd_avx2.rs` F16Scaler: docstring strengthened with the same
disambiguation note.
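
To make the disclosed behaviour concrete, here is a minimal sketch of a scalar-storage f16 lane type of the kind the docstring describes: the bits live in a `[u16; 16]`, and each op upcasts to f32, computes, and downcasts. The names `F16x16Sketch`, `f16_bits_to_f32`, and `f32_to_f16_bits` are hypothetical and are not taken from `src/simd_half.rs`. The conversion helpers are hand-written IEEE 754 binary16 routines (round-to-nearest-even), used here only so the block has no dependencies.

```rust
/// Hypothetical helper: widen IEEE 754 binary16 bits to f32 (always exact).
fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h >> 10) & 0x1f) as u32;
    let man = (h & 0x03ff) as u32;
    match exp {
        0 => {
            // Zero or subnormal: magnitude is man * 2^-24, exact in f32.
            let mag = man as f32 * (1.0 / 16_777_216.0);
            if sign != 0 { -mag } else { mag }
        }
        // Infinity or NaN: keep the payload bits.
        0x1f => f32::from_bits(sign | 0x7f80_0000 | (man << 13)),
        // Normal: rebias the exponent from 15 to 127.
        _ => f32::from_bits(sign | ((exp + 112) << 23) | (man << 13)),
    }
}

/// Hypothetical helper: narrow f32 to binary16 bits, round-to-nearest-even.
fn f32_to_f16_bits(x: f32) -> u16 {
    let b = x.to_bits();
    let sign = ((b >> 16) & 0x8000) as u16;
    let exp = ((b >> 23) & 0xff) as i32;
    let man = b & 0x007f_ffff;
    if exp == 0xff {
        // Infinity stays infinity; any NaN becomes a quiet NaN.
        return sign | 0x7c00 | if man != 0 { 0x0200 } else { 0 };
    }
    let e = exp - 127 + 15; // rebias to the f16 exponent
    if e >= 0x1f {
        return sign | 0x7c00; // overflow to infinity
    }
    if e <= 0 {
        if e < -10 {
            return sign; // below half the smallest subnormal: signed zero
        }
        // Subnormal result: restore the implicit bit, then shift right.
        let m = man | 0x0080_0000;
        let shift = (14 - e) as u32; // 14..=24
        let half = m >> shift;
        let rem = m & ((1u32 << shift) - 1);
        let halfway = 1u32 << (shift - 1);
        let round_up = rem > halfway || (rem == halfway && (half & 1) == 1);
        return sign | (half + round_up as u32) as u16;
    }
    let half = ((e as u32) << 10) | (man >> 13);
    let rem = man & 0x1fff;
    let round_up = rem > 0x1000 || (rem == 0x1000 && (half & 1) == 1);
    // A rounding carry may bump the exponent, which correctly reaches infinity.
    sign | (half + round_up as u32) as u16
}

/// Hypothetical stand-in for a scalar-storage F16x16: 16 lanes of raw f16 bits.
#[derive(Clone, Copy, Debug, PartialEq)]
struct F16x16Sketch([u16; 16]);

impl F16x16Sketch {
    fn from_f32(v: [f32; 16]) -> Self {
        Self(v.map(f32_to_f16_bits))
    }

    fn to_f32(self) -> [f32; 16] {
        self.0.map(f16_bits_to_f32)
    }

    /// Lane-wise add: upcast, compute in f32, downcast. One lane at a time,
    /// which is why this shape gets scalar performance.
    fn add(self, rhs: Self) -> Self {
        let mut out = [0u16; 16];
        for i in 0..16 {
            let sum = f16_bits_to_f32(self.0[i]) + f16_bits_to_f32(rhs.0[i]);
            out[i] = f32_to_f16_bits(sum);
        }
        Self(out)
    }
}
```

Every `add` call here performs 32 widenings and 16 narrowings in a scalar loop, which illustrates why the docstring steers hot loops toward fp32 with conversion only at the boundaries.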
# Matrix audit (user request)
Cross-referenced every `pub struct *x*` in simd_avx512.rs, simd_avx2.rs,
simd_neon.rs, simd_nightly/mod.rs against the parity matrix in the
architecture doc. Corrections:
- **F32x8 / F64x4 v3 column: ❌ → ✅ `__m256`/`__m256d` (in `simd_avx512`)**.
The dispatch at `src/simd.rs:294` already imports these from
simd_avx512 on the v3 / AVX2 path. They're AVX (not AVX-512), so they
work on every Sandy Bridge+ host. The matrix was stale.
- **U32x8, U64x4 rows added** — nightly-only currently; ❌ on x86 +
aarch64 + scalar. core::simd has them via `simd_nightly`.
- **U16x16, I32x8, I64x4 rows added** — missing across EVERY backend
including nightly. Theoretical 256-bit shapes no consumer has reached
for yet.
- **F32Mask8 / F64Mask4 rows added** — declared in simd_scalar as
`F32Mask8Scalar` / `F64Mask4Scalar` (rename came from a duplicate-
decl conflict on i686); not surfaced through `crate::simd::*`. AVX-512
has them natively via `__mmask8` but they're not typed.
- **Sub-byte lanes section added** — I4 / U4 lanes used by INT4
quantized inference (Q4_0, Q4_K, GPTQ, AWQ). No first-class wrapper;
consumers pack 2× nibbles per byte and operate through U8x64 + shr/
mask. Documents the hardware story (AVX-512 VBMI2, VPCOMPRESSB on
x86; shr+mask trick on aarch64). Tracked as TD-SIMD-11 if a consumer
files for it.
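
The nibble-packing approach mentioned in the sub-byte item can be sketched in scalar form as follows. This is an illustration only: the function names are hypothetical, and the low-nibble-first layout is an assumption (Q4_0, Q4_K, GPTQ, and AWQ each define their own block layouts). A vectorized consumer would apply the same shift and mask per byte through `U8x64`.

```rust
/// Pack 4-bit values two per byte, low nibble first (assumed layout).
/// An odd trailing value is padded with a zero high nibble.
fn pack_u4(nibbles: &[u8]) -> Vec<u8> {
    nibbles
        .chunks(2)
        .map(|pair| {
            let lo = pair[0] & 0x0f;
            let hi = pair.get(1).copied().unwrap_or(0) & 0x0f;
            lo | (hi << 4)
        })
        .collect()
}

/// Unpack each byte into two 4-bit values using mask and shift right.
fn unpack_u4(packed: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for &byte in packed {
        out.push(byte & 0x0f); // low nibble: mask
        out.push(byte >> 4); // high nibble: shr
    }
    out
}

/// Reinterpret an unsigned nibble as a signed I4 lane (range -8..=7)
/// by moving it to the top of the byte and shifting back arithmetically.
fn u4_to_i4(nibble: u8) -> i8 {
    (((nibble & 0x0f) << 4) as i8) >> 4
}
```

For example, `pack_u4(&[1, 2, 3, 15])` yields the bytes `0x21, 0xf3`, and `unpack_u4` recovers the original four values.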
TD-SIMD-8 description updated in §5 to point at `simd_half.rs:123` (the
actual F16x16 polyfill) rather than `simd_avx2.rs:2566` (the unrelated
F16Scaler scaling utility).
These are not in the cross-arch parity surface; consumers requesting
+256-bit / 512-bit shapes go through the composed wrappers.
+
+### Gaps surfaced 2026-05-20
+
+- **`F32x8` / `F64x4` are universal on x86**, even on the v3 / AVX2 path:
+  they share the `__m256` / `__m256d` declarations exposed by
+  `simd_avx512.rs` (AVX, not AVX-512; works on every host with AVX
+  support, i.e. Sandy Bridge+). The previous matrix marked them `❌`
+  in the v3 column; corrected above.
+- **`U32x8` / `U64x4`** exist only in `simd_nightly` (via `core::simd`).
+  No native or polyfill wrapper on x86 or aarch64. Add to `simd_avx512`
+  + `simd_scalar` if a consumer needs them at 256-bit width.
+- **`I32x8` / `I64x4` / `U16x16`** missing across every backend (incl.
+  nightly). Theoretical 256-bit shapes that no consumer has reached for
+  yet; add to backlog if needed.
+- **`F32Mask8` / `F64Mask4`** are declared in `simd_scalar` as
+  `F32Mask8Scalar` / `F64Mask4Scalar` (the rename came from a duplicate-
+  decl conflict on i686; see `src/simd_scalar.rs:340-345`). Not
+  surfaced through `crate::simd::*`. If consumers want these mask
+  widths, expose them and unify the name (drop the `Scalar` suffix on
+  AVX-512 where `__mmask8` natively maps to F64Mask8 already; the
+  256-bit f64 lane width needs a 4-bit mask which `__mmask8` can hold
+  but isn't yet typed as `F64Mask4`).

### Read of the matrix

@@ -199,7 +241,7 @@ Ranked by P0 (blocks current CI / consumers) → P3 (nice-to-have).
|**TD-SIMD-5**|**P1**| Scalar fallback inline in `simd.rs` (`pub(crate) mod scalar`) makes symmetry hard — every other backend is its own file. | inspection | Promote to `src/simd_scalar.rs`; `simd.rs` becomes pure dispatch. ~mechanical refactor. |
|**TD-SIMD-6**|**P2**| No `runtime-dispatch` feature / `simd_runtime` module exists yet. Release-binary distribution to heterogeneous silicon requires recompile per target today. |`grep -r "LazyLock<CpuCaps>"` only matches reporting code in `simd.rs:52-55`| New module wiring per-op trampolines from the compiled-in backends. ~300 LoC + one new cargo feature. |
|**TD-SIMD-7**|**P2**| Compile-time arms in `simd.rs:153-194` are duplicated four times (one per type group: F32x16, F64x8, U8x32, BF16x16). Adding a new lane requires copy-pasting four `#[cfg(...)]` arms. | inspection | Single source-of-truth macro emitting the arms. ~one macro_rules!, 50 LoC. |
-|**TD-SIMD-8**|**P2**| `F16Scaler` in `simd_avx2.rs:2566` is a scalar implementation masquerading as a SIMD type. Consumers using `F16x16` on v3 get scalar perf without warning. | grep `F16Scaler` | Either gate `F16x16` behind `target_feature = "f16c"` or rename / document the scalar nature. ~20 LoC + docs. |
+|**TD-SIMD-8**|**P2**| `F16x16` in `src/simd_half.rs:123` is a scalar `[u16; 16]` polyfill: every arithmetic op upcasts to f32, computes, downcasts. Consumers using `crate::simd::F16x16` get scalar perf even on AVX-512 hardware with `vcvtph2ps` / `vcvtps2ph`. (`F16Scaler` in `simd_avx2.rs:2566` is unrelated; it's a *scaling context* for range-normalizing values before f16 encoding, not the F16x16 SIMD type.) | inspection of `src/simd_half.rs:115-150` | (a) Replace the `[u16; 16]` storage with `__m256i` + `_mm256_cvtph_ps` / `_mm256_cvtps_ph` under `target_feature = "f16c"` (Sapphire Rapids+, all Skylake AVX-512). (b) Add an `F16x16Scalar` alias and route consumers explicitly. (c) Add a doc-warning at the type level pointing at the architecture doc. ~80 LoC. |
|**TD-SIMD-9**|**P3**| No CI matrix entry for the `nightly-simd` polyfill path. |`.github/workflows/ci.yaml`| Add a `nightly-simd-polyfill` job that builds with `--features nightly-simd` on nightly rustc. ~20 LoC YAML. |
|**TD-SIMD-10**|**P3**| No CI matrix entry for `.cargo/config-avx512.toml`. AVX-512 deployment path silently bit-rots between PRs. |`.github/workflows/ci.yaml`| Add an `avx-512-explicit` job using a runner with AVX-512 silicon. ~20 LoC YAML; runner availability TBD. |
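
As a rough illustration of the "per-op trampolines" fix proposed for TD-SIMD-6, the sketch below detects CPU features once and caches a function pointer per operation. Everything here is an assumption for illustration: the `CpuCaps` fields, the `sum` operation, and the `*_stand_in` backends (which simply call the scalar path) are hypothetical and do not reflect the crate's actual `CpuCaps` or backends. Only `std::sync::LazyLock` and `std::arch::is_x86_feature_detected!` are real standard-library items.

```rust
use std::sync::LazyLock;

/// Hypothetical capability record; the real `CpuCaps` may look different.
#[derive(Debug, Clone, Copy)]
struct CpuCaps {
    avx2: bool,
    avx512f: bool,
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn detect_caps() -> CpuCaps {
    CpuCaps {
        avx2: std::arch::is_x86_feature_detected!("avx2"),
        avx512f: std::arch::is_x86_feature_detected!("avx512f"),
    }
}

#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn detect_caps() -> CpuCaps {
    // Non-x86 targets fall through to the scalar path in this sketch.
    CpuCaps { avx2: false, avx512f: false }
}

/// Signature shared by every backend of the example operation.
type SumFn = fn(&[f32]) -> f32;

fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

// Stand-ins only: a real backend would be a `#[target_feature]` function
// compiled into the binary. Here they reuse the scalar code so the sketch
// runs on any host.
fn sum_avx2_stand_in(xs: &[f32]) -> f32 {
    sum_scalar(xs)
}

fn sum_avx512_stand_in(xs: &[f32]) -> f32 {
    sum_scalar(xs)
}

/// Feature detection runs once, on first use.
static CAPS: LazyLock<CpuCaps> = LazyLock::new(detect_caps);

/// Choose the widest backend the host supports.
fn pick_sum() -> SumFn {
    if CAPS.avx512f {
        sum_avx512_stand_in
    } else if CAPS.avx2 {
        sum_avx2_stand_in
    } else {
        sum_scalar
    }
}

/// The trampoline target, resolved once and then reused.
static SUM_IMPL: LazyLock<SumFn> = LazyLock::new(pick_sum);

/// Public entry point: one indirect call per invocation after the first.
pub fn sum(xs: &[f32]) -> f32 {
    (*SUM_IMPL)(xs)
}
```

The design choice to cache a function pointer rather than branch on `CAPS` in every call keeps the per-call cost to a single indirect jump, at the price of preventing inlining across the dispatch boundary.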