
Commit d50caaf

feat(simd_runtime): CpuOps DTO — third dispatch pattern + GCC-scraped CPU table
Two additions, scoped together because they share one idea: using scraped CPU metadata to drive runtime dispatch.

# Piece A: matrix doc § M (GCC-grounded aarch64 enumeration)

The matrix had three aarch64 columns (A53 / A72 / A76) covering *dispatch tiers* (multiple physical cores share each tier's SIMD primitive set). The authoritative per-core feature membership lives in GCC's `gcc/config/aarch64/aarch64-cores.def`, scraped 2026-05-21 and recorded as a new § M table covering 30 cores:

* V8.0-A baseline (A53, A72)
* V8.2-A dotprod+fp16 (A76, A78, X1, Neoverse-N1)
* V8.5-A baseline (Apple M1 specifically: V8.5 includes V8.2's fp16+dotprod but NOT bf16+i8mm, so M1 lands in the A76 tier; corrects a wrong "+bf16" claim on the existing A76 row of the column legend)
* V8.6-A baseline incl. bf16+i8mm (Apple M2/M3, Oryon-1 / Snapdragon X Elite, Ampere1/Ampere1A)
* V8.7-A (Apple M4, Ampere1B)
* V9.0-A SVE2 baseline + explicit bf16+i8mm flags (Cortex-A510/A710/A715, X2/X3, Neoverse-N2/V2, Grace)
* V8.4-A SVE tier (Neoverse-V1 / Graviton 3, the only V8.4 core with explicit SVE+bf16+i8mm)
* V9.2-A (Cortex-A520/A720/A725, X4, X925, Neoverse-N3/V3)

Each entry is verbatim from the GCC FEATURE_STRING column. Cross-referencing with the V8.X-A baseline rules (V8.6+ includes bf16+i8mm implicitly; V9.0 includes SVE2 implicitly) gives the canonical "which silicon has what" table. The note flags that a new dispatch column for the V8.6+/V9-bf16-i8mm tier needs to land alongside the NEON BFMMLA / BFDOT asm-byte arm in Phase 3b.

The A76 column legend (line 26 of the matrix) was corrected: removed the wrong "+bf16" (A76 itself is V8.2-A, NO bf16; bf16 came in V8.6-A).

# Piece B: CpuOps DTO, the third dispatch pattern

Adds `src/simd_runtime/cpu_ops.rs` exposing a per-CPU operations DTO distinct from the existing patterns:

Pattern 1 (`crate::simd::*`): compile-time `#[cfg(target_feature)]` cascade. Direct monomorphized calls.
Pattern 2 (`crate::simd_runtime::vnni_dot_u8_i8` etc., from #185): per-op LazyLock<fn ptr>. One CPUID + atomic-load per op the first time called.
Pattern 3 (THIS COMMIT): per-CPU `&'static CpuOps` selected once at first access. Every op is a fn-ptr field on the struct.

Why the third pattern?

* Per-op LazyLock: N ops touched = N atomic-load setup costs over the process lifetime.
* CpuOps DTO: ONE atomic-load total at first `cpu_ops()` call; every subsequent op is a direct fn-ptr deref through the cached `&'static CpuOps`. This is the OpenBLAS / MKL dispatch model, and it wins for dense-op consumers (linear-algebra pipelines touching every BLAS-1/2/3 kernel).
* All three coexist. Consumers pick by import path.

Six SIMD tiers plus a universal scalar fallback are baked as static const `CpuOps` instances:

  x86_64:    amx_int8, avx512vnni, avx512f, avxvnni, avx2_fma
  aarch64:   neon
  universal: scalar

Each instance points at the existing trampolines in `crate::simd_runtime::{vnni_dot, add_mul}`. There is no kernel duplication; this module is pure dispatch glue.

Backend ops referenced:

  vnni_dot_u8_i8 (3 backends: avx512+tail / avxvnni / scalar)
  add_mul_f32    (4 backends: avx512 / avx2+fma / neon / scalar)
  add_mul_f64    (4 backends: avx512 / avx2+fma / neon / scalar)

# The naughty data-driven part

`cpu_ops_for_cpu(name: &str) -> Option<&'static CpuOps>` maps GCC CPU codenames to the dispatch tier they land in, sourced from § M's scrape. Spot-checks (each verified by the test suite):

  sapphirerapids / graniterapids / emeraldrapids → amx_int8
  cascadelake / cooperlake / icelake-* / tigerlake / rocketlake / znver4 / znver5 → avx512vnni
  alderlake / raptorlake / meteorlake / arrowlake / arrowlake-s / lunarlake / pantherlake / sierraforest → avxvnni
  haswell / broadwell / skylake / znver1-3 → avx2_fma
  apple-m1..m4 / oryon-1 / cortex-a76..a725 / cortex-x1..x925 / neoverse-n1..v3 / grace / ampere1..1b → neon

Returns `None` for unknown CPUs; the caller can fall back to `cpu_ops_for_tier("scalar")` if a "best-effort" answer is needed.
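
The Pattern 3 description above can be sketched roughly as follows. This is a minimal illustration, not the contents of `cpu_ops.rs`: the single-field `CpuOps`, the two-tier detection, and the `add_mul_f32_simd_placeholder` kernel are all assumptions made to keep the example short, and the real module dispatches more ops across more tiers.

```rust
use std::sync::LazyLock;

/// Shared signature for every `add_mul_f32` backend: acc[i] += a[i] * b[i].
type AddMulF32 = fn(&mut [f32], &[f32], &[f32]);

/// Hypothetical, trimmed-down DTO: one fn-pointer field per op.
pub struct CpuOps {
    pub tier: &'static str,
    pub add_mul_f32: AddMulF32,
}

/// Portable reference kernel.
fn add_mul_f32_scalar(acc: &mut [f32], a: &[f32], b: &[f32]) {
    for ((dst, x), y) in acc.iter_mut().zip(a).zip(b) {
        *dst += x * y;
    }
}

/// Placeholder standing in for a SIMD kernel (assumption: a real tier
/// would call intrinsics here instead of the scalar loop).
fn add_mul_f32_simd_placeholder(acc: &mut [f32], a: &[f32], b: &[f32]) {
    add_mul_f32_scalar(acc, a, b)
}

static SCALAR: CpuOps = CpuOps { tier: "scalar", add_mul_f32: add_mul_f32_scalar };
#[cfg(target_arch = "x86_64")]
static AVX2_FMA: CpuOps = CpuOps { tier: "avx2_fma", add_mul_f32: add_mul_f32_simd_placeholder };
#[cfg(target_arch = "aarch64")]
static NEON: CpuOps = CpuOps { tier: "neon", add_mul_f32: add_mul_f32_simd_placeholder };

/// Runs once: probe the host and pick the best static instance.
fn detect() -> &'static CpuOps {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2")
            && std::arch::is_x86_feature_detected!("fma")
        {
            return &AVX2_FMA;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return &NEON;
        }
    }
    &SCALAR
}

static OPS: LazyLock<&'static CpuOps> = LazyLock::new(detect);

/// One lazy initialization for the whole table; later calls only copy
/// the cached reference.
pub fn cpu_ops() -> &'static CpuOps {
    *OPS
}
```

A caller would hold on to the returned reference and invoke ops through it, for example `(cpu_ops().add_mul_f32)(&mut acc, &a, &b)`, so that detection cost is paid once no matter how many ops the struct grows.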
Use cases for `cpu_ops_for_cpu`:

* "What would $CPU pick?" introspection without running on $CPU.
* Cross-compilation reports + deployment-planning tools.
* Integration tests asserting tier selection for named targets.
* Explicit-tier-pinning ("force AVX2 even though AMX is available, to measure overhead").

Future: code-gen the table from a `build.rs` that fetches GCC's latest core list. Today the table is hand-rolled from the scrape recorded in matrix doc § M.

# Verification

* `cargo test --lib --features runtime-dispatch`: 2147 tests pass (was 2105: +5 new cpu_ops tests + 37 carried over from prior feature-gated tests now compiled in too).
* 5 new cpu_ops tests:
    cpu_ops_resolves_on_this_host
    cpu_ops_stable_across_calls (LazyLock fires once)
    cpu_ops_for_tier_known_names
    cpu_ops_for_cpu_data_driven_lookup (spot-checks the GCC scrape)
    cpu_ops_call_through_dto (full indirect-call exercise)
* `cargo clippy --lib --tests --features rayon,native,runtime-dispatch -- -D warnings` clean.
* `cargo fmt --all --check` clean.
* Default build (no feature) unchanged: zero impact on existing paths, since the entire `simd_runtime` module is gated out.

# Backward-compat for the existing per-op LazyLock surface

The pub(super) wrappers in `vnni_dot.rs` and `add_mul.rs` (`*_safe` / `*_safe_wrapper` / `*_scalar_wrapper`) are new but purely additive: every existing public function in `simd_runtime` keeps its prior signature and dispatch behavior.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
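
The name-to-tier lookup behind the use cases listed in the commit message might look like the sketch below. It is illustrative only: the `Tier` enum stands in for the real `&'static CpuOps` return value, `tier_for_cpu` and `tier_or_scalar` are hypothetical names, and only a subset of the codenames from the spot-check list is included.

```rust
/// Hypothetical tier tag standing in for `&'static CpuOps`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    AmxInt8,
    Avx512Vnni,
    AvxVnni,
    Avx2Fma,
    Neon,
    Scalar,
}

/// Maps a GCC CPU codename to its dispatch tier; `None` if unknown.
/// Only a subset of the commit's spot-check names is listed here.
pub fn tier_for_cpu(name: &str) -> Option<Tier> {
    let tier = match name {
        "sapphirerapids" | "graniterapids" | "emeraldrapids" => Tier::AmxInt8,
        "cascadelake" | "cooperlake" | "tigerlake" | "rocketlake" | "znver4" | "znver5" => {
            Tier::Avx512Vnni
        }
        // The commit lists `icelake-*`, so match the prefix.
        n if n.starts_with("icelake-") => Tier::Avx512Vnni,
        "alderlake" | "raptorlake" | "meteorlake" | "arrowlake" | "arrowlake-s"
        | "lunarlake" | "pantherlake" | "sierraforest" => Tier::AvxVnni,
        "haswell" | "broadwell" | "skylake" | "znver1" | "znver2" | "znver3" => Tier::Avx2Fma,
        "apple-m1" | "apple-m2" | "apple-m3" | "apple-m4" | "oryon-1" | "cortex-a76"
        | "neoverse-n1" | "neoverse-v1" | "grace" | "ampere1" => Tier::Neon,
        _ => return None,
    };
    Some(tier)
}

/// Best-effort variant: unknown names fall back to the scalar tier.
pub fn tier_or_scalar(name: &str) -> Tier {
    tier_for_cpu(name).unwrap_or(Tier::Scalar)
}
```

Keeping the `Option` in the primary function lets introspection tools distinguish "unknown CPU" from "known CPU that really is scalar-only"; the fallback is a separate, explicit choice by the caller.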
1 parent 8739d90 commit d50caaf

5 files changed

Lines changed: 519 additions & 1 deletion


.claude/knowledge/agnostic-surface-cpu-matrix.md

Lines changed: 71 additions & 1 deletion
@@ -23,7 +23,7 @@ Same set as `td-simd-cpu-dispatch-matrix.md` § "Master matrix — x86_64" and
 | Z5 | `znver5` / `Zen4Avx512` (same dispatch) | AMD 2024 | same as Z4 + minor uarch |
 | ARL | `arrowlake` / `ArrowLake` | Intel 2024 | AVX2+FMA + AVX-VNNI+VNNI-INT8 |
 | HSW | `x86-64-v3` / `HaswellAvx2` | Intel 2013→2021 | AVX2+FMA (no VNNI/AVX-512) |
-| A76 | `cortex-a76` / `A76DotProd` | ARMv8.2 (Pi 5, M1) | NEON+dotprod+bf16+fp16 |
+| A76 | `cortex-a76` / `A76DotProd` | ARMv8.2 (Pi 5) | NEON+dotprod+fp16 (no bf16 / i8mm; those are V8.6+, see § M) |
 | A72 | `cortex-a72` / `A72Fast` | ARMv8.0 (Pi 4) | NEON only (no dotprod) |
 | A53 | `cortex-a53` / `A53Baseline` | ARMv8.0 (Pi 3/Z2W) | NEON, lower IPC |
 | SCA | scalar fallback | wasm32/riscv/i686 | no SIMD |
@@ -530,6 +530,76 @@ verifies that no per-CPU regression has crept in vs the historical baseline:
 `crate::simd::*`, this table must grow a row. Reviewers should reject
 PRs that add a public symbol without a corresponding matrix entry.
 
+## M. AArch64 ground-truth core enumeration (GCC source)
+
+The matrix above uses three aarch64 columns (A53 / A72 / A76) that
+each cover a *dispatch tier*: multiple physical cores share the same
+SIMD primitive set. The authoritative per-core feature membership is
+in GCC's `gcc/config/aarch64/aarch64-cores.def`, scraped 2026-05-21:
+
+| Core | GCC arch | Explicit feature flags |
+|---|---|---|
+| **A53/A72/A76 tier** (baseline NEON, optional dotprod+fp16, NO bf16) | | |
+| `cortex-a53` | V8-A | `(CRC)` |
+| `cortex-a72` | V8-A | `(CRC)` |
+| `cortex-a76` | V8.2-A | `F16, RCPC, DOTPROD` |
+| `cortex-a78` | V8.2-A | `F16, RCPC, DOTPROD, SSBS, PROFILE` |
+| `cortex-x1` | V8.2-A | `F16, RCPC, DOTPROD, SSBS, PROFILE` |
+| `neoverse-n1`| V8.2-A | `F16, RCPC, DOTPROD, PROFILE` |
+| `apple-m1` | V8.5-A | `()`; V8.5 baseline includes F16+dotprod, NO bf16/i8mm |
+| **V8.6-A tier** (BF16 + I8MM via baseline) | | |
+| `apple-m2` | V8.6-A | `()`; V8.6 baseline: bf16, i8mm, sve, sve2 |
+| `apple-m3` | V8.6-A | same |
+| `oryon-1` | V8.6-A | `CRYPTO, SM4, SHA3, F16` (Snapdragon X Elite/Plus) |
+| `ampere1` | V8.6-A | `F16, RNG, AES, SHA3` |
+| `ampere1a` | V8.6-A | `F16, RNG, AES, SHA3, SM4, MEMTAG` |
+| **V8.7-A tier** (baseline + LS64 + MOPS) | | |
+| `apple-m4` | V8.7-A | `()` |
+| `ampere1b` | V8.7-A | `F16, RNG, AES, SHA3, SM4, MEMTAG, CSSC` |
+| **V9.0-A tier** (SVE2 baseline + explicit bf16/i8mm) | | |
+| `cortex-a510`| V9-A | `SVE2_BITPERM, MEMTAG, I8MM, BF16` |
+| `cortex-a710`| V9-A | `SVE2_BITPERM, MEMTAG, I8MM, BF16` |
+| `cortex-a715`| V9-A | `SVE2_BITPERM, MEMTAG, I8MM, BF16` |
+| `cortex-x2` | V9-A | `SVE2_BITPERM, MEMTAG, I8MM, BF16` |
+| `cortex-x3` | V9-A | `SVE2_BITPERM, MEMTAG, I8MM, BF16` |
+| `neoverse-n2`| V9-A | `I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE` |
+| `neoverse-v2`| V9-A | `I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE` (Graviton 4) |
+| `grace` | V9-A | `I8MM, BF16, SVE2_BITPERM, SVE2_AES, SVE2_SHA3, SVE2_SM4, PROFILE` |
+| **V8.4-A SVE tier** (Graviton 3's odd one) | | |
+| `neoverse-v1`| V8.4-A | `SVE, I8MM, BF16, PROFILE, SSBS, RNG` |
+| **V9.2-A tier** (V9 + V8.7 features) | | |
+| `cortex-a520`| V9.2-A | `SVE2_BITPERM, MEMTAG` |
+| `cortex-a720`| V9.2-A | `SVE2_BITPERM, MEMTAG, PROFILE` |
+| `cortex-a725`| V9.2-A | `SVE2_BITPERM, MEMTAG, PROFILE` |
+| `cortex-x4` | V9.2-A | `SVE2_BITPERM, MEMTAG, PROFILE` |
+| `cortex-x925`| V9.2-A | `SVE2_BITPERM, MEMTAG, PROFILE` |
+| `neoverse-n3`| V9.2-A | `SVE2_BITPERM, RNG, MEMTAG, PROFILE` |
+| `neoverse-v3`| V9.2-A | `SVE2_BITPERM, RNG, LS64, MEMTAG, PROFILE` |
+
+**Dispatch tier mapping (which matrix column each core lands in):**
+
+| Tier (matrix col.) | Cores |
+|---|---|
+| A53 | `cortex-a53`, older V8.0-A |
+| A72 | `cortex-a72`, V8.0-A + CRC |
+| A76 (V8.2 with dotprod+fp16, NO bf16/i8mm) | `cortex-a76`, `cortex-a78`, `cortex-x1`, `neoverse-n1`, `apple-m1` |
+| **(new tier: V8.6+/V9 with bf16+i8mm)** | `apple-m2`+, `oryon-1` (Snapdragon X), `cortex-a510`+, `neoverse-n2`/`v2`/`grace`, `ampere1`+ |
+| **(new tier: V8.4-A + SVE + bf16+i8mm)** | `neoverse-v1` (Graviton 3; only V8.4-A core with explicit SVE+bf16+i8mm) |
+
+The matrix's three aarch64 columns cover the bottom of the dispatch
+ladder. The bf16/i8mm tier (which would carry NEON BFMMLA / BFDOT /
+USDOT / FMLA.8h) needs its own column in a future revision; when the
+NEON BF16 asm-byte arm lands (Phase 3b in § J), every V8.6+ core
+listed above gets covered by the same dispatch arm.
+
+**Source provenance:** scraped from
+`https://raw.githubusercontent.com/gcc-mirror/gcc/master/gcc/config/aarch64/aarch64-cores.def`
+(GCC trunk, 2026-05-21). The `AARCH64_CORE(...)` macro emits the
+canonical name → arch → feature-string mapping; GCC's
+`(define_insn ...)` patterns in `aarch64-simd.md` give the bit
+encodings for the asm-byte rule (`.inst 0xXXXXXXXX`) that Phase 3b
+will use for BFMMLA / BFDOT / FMLA.8h / USDOT.
+
 ## L. Provenance
 
 - CPU feature presence: sourced from `td-simd-cpu-dispatch-matrix.md`.
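
The cross-referencing rule that § M relies on (a feature is present if the architecture baseline implies it or the core's flag string lists it) can be sketched as below. This is an illustration under stated assumptions: the `Arch` enum and `has_bf16_i8mm` are hypothetical names, and the set of baselines treated as implying bf16+i8mm (V8.6-A, V8.7-A, and V9.2-A as "V9 + V8.7 features") follows the tier headings in the table rather than anything parsed from GCC.

```rust
/// Hypothetical architecture-level tag, one per `GCC arch` value in § M.
#[allow(non_camel_case_types, dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Arch {
    V8_0,
    V8_2,
    V8_4,
    V8_5,
    V8_6,
    V8_7,
    V9_0,
    V9_2,
}

/// True when a core has both BF16 and I8MM, either implied by its
/// architecture baseline or listed explicitly in its feature flags.
fn has_bf16_i8mm(arch: Arch, explicit_flags: &[&str]) -> bool {
    // Assumption taken from the § M tier headings: these baselines
    // carry bf16+i8mm without needing an explicit flag.
    let implied = matches!(arch, Arch::V8_6 | Arch::V8_7 | Arch::V9_2);
    let explicit = explicit_flags.contains(&"BF16") && explicit_flags.contains(&"I8MM");
    implied || explicit
}
```

With the rows from the table, `cortex-a76` (V8.2-A, `F16, RCPC, DOTPROD`) and `apple-m1` (V8.5-A, no flags) come out false, while `apple-m2` (V8.6-A, no flags), `cortex-a510` (V9-A with explicit `I8MM, BF16`), and `neoverse-v1` (V8.4-A with explicit flags) come out true, which matches the dispatch tier mapping.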

src/simd_runtime/add_mul.rs

Lines changed: 45 additions & 0 deletions
@@ -228,6 +228,51 @@ unsafe fn add_mul_f64_scalar(acc: &mut [f64], a: &[f64], b: &[f64]) {
     }
 }
 
+// ────────────────────────────────────────────────────────────────────────
+// CpuOps DTO entry points: pub(super) wrappers for cpu_ops.rs to
+// reference the tier-specific kernels by name in static const decls.
+// Each one has the safety invariant guaranteed by the cpu_ops()
+// LazyLock that installed the parent &'static CpuOps.
+// ────────────────────────────────────────────────────────────────────────
+
+#[cfg(target_arch = "x86_64")]
+pub(super) unsafe fn add_mul_f32_avx512_safe(acc: &mut [f32], a: &[f32], b: &[f32]) {
+    add_mul_f32_avx512(acc, a, b)
+}
+
+#[cfg(target_arch = "x86_64")]
+pub(super) unsafe fn add_mul_f64_avx512_safe(acc: &mut [f64], a: &[f64], b: &[f64]) {
+    add_mul_f64_avx512(acc, a, b)
+}
+
+#[cfg(target_arch = "x86_64")]
+pub(super) unsafe fn add_mul_f32_avx2_fma_safe(acc: &mut [f32], a: &[f32], b: &[f32]) {
+    add_mul_f32_avx2_fma(acc, a, b)
+}
+
+#[cfg(target_arch = "x86_64")]
+pub(super) unsafe fn add_mul_f64_avx2_fma_safe(acc: &mut [f64], a: &[f64], b: &[f64]) {
+    add_mul_f64_avx2_fma(acc, a, b)
+}
+
+#[cfg(target_arch = "aarch64")]
+pub(super) unsafe fn add_mul_f32_neon_safe(acc: &mut [f32], a: &[f32], b: &[f32]) {
+    add_mul_f32_neon(acc, a, b)
+}
+
+#[cfg(target_arch = "aarch64")]
+pub(super) unsafe fn add_mul_f64_neon_safe(acc: &mut [f64], a: &[f64], b: &[f64]) {
+    add_mul_f64_neon(acc, a, b)
+}
+
+pub(super) unsafe fn add_mul_f32_scalar_safe(acc: &mut [f32], a: &[f32], b: &[f32]) {
+    add_mul_f32_scalar(acc, a, b)
+}
+
+pub(super) unsafe fn add_mul_f64_scalar_safe(acc: &mut [f64], a: &[f64], b: &[f64]) {
+    add_mul_f64_scalar(acc, a, b)
+}
+
 #[cfg(test)]
 mod tests {
     use super::*;
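
The safety invariant named in the comment block of this hunk (a tier's kernel is only reachable after its CPU features were detected) could be enforced at a safe call boundary along these lines. This is a sketch, not the module's code: `select` and the public `add_mul_f32` facade are hypothetical, and the AVX2 kernel body is a plain loop standing in for real intrinsics.

```rust
/// Kernel pointer type: `unsafe` because some backends need CPU features
/// that the caller must have verified.
type AddMulF32 = unsafe fn(&mut [f32], &[f32], &[f32]);

/// Scalar backend, declared `unsafe fn` like the kernels in the diff.
unsafe fn add_mul_f32_scalar(acc: &mut [f32], a: &[f32], b: &[f32]) {
    for i in 0..acc.len() {
        acc[i] += a[i] * b[i];
    }
}

/// Stand-in AVX2+FMA backend: a plain loop compiled with the features
/// enabled (assumption: the real kernel uses intrinsics).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn add_mul_f32_avx2_fma(acc: &mut [f32], a: &[f32], b: &[f32]) {
    for i in 0..acc.len() {
        acc[i] += a[i] * b[i];
    }
}

/// Returns a kernel whose feature requirements hold on this host.
fn select() -> AddMulF32 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2")
            && std::arch::is_x86_feature_detected!("fma")
        {
            return add_mul_f32_avx2_fma;
        }
    }
    add_mul_f32_scalar
}

/// Safe facade: checks lengths, then calls through the fn pointer.
pub fn add_mul_f32(acc: &mut [f32], a: &[f32], b: &[f32]) {
    assert!(a.len() == acc.len() && b.len() == acc.len(), "length mismatch");
    let kernel = select();
    // SAFETY: `select` only returns a kernel after detecting the CPU
    // features it requires, and the slices have equal length.
    unsafe { kernel(acc, a, b) }
}
```

The point of the design is that the `unsafe` obligation is discharged in exactly one place (the selector plus the facade), so the fn pointers themselves can stay `unsafe fn` without every consumer repeating the feature check.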
