Commit d50caaf
feat(simd_runtime): CpuOps DTO — third dispatch pattern + GCC-scraped CPU table
Two additions, scoped together because they're the same idea — using
scraped CPU metadata to drive runtime dispatch:
# Piece A: matrix doc § M (GCC-grounded aarch64 enumeration)
The matrix had three aarch64 columns (A53 / A72 / A76) covering
*dispatch tiers* (multiple physical cores share each tier's SIMD
primitive set). The authoritative per-core feature membership lives
in GCC's `gcc/config/aarch64/aarch64-cores.def` — scraped 2026-05-21
and recorded as a new § M table covering 28 cores:
* V8.0-A baseline (A53, A72)
* V8.2-A dotprod+fp16 (A76, A78, X1, Neoverse-N1, Apple M1)
* V8.5-A baseline (Apple M1 specifically — V8.5 includes V8.2's
fp16+dotprod but NOT bf16+i8mm; corrects a wrong "+bf16" claim
on the existing A76 row of the column legend)
* V8.6-A baseline incl. bf16+i8mm (Apple M2/M3, Oryon-1 / Snapdragon
X Elite, Ampere1+, Cortex-A510/A710/A715, X2/X3, Neoverse-N2/V2)
* V8.7-A (Apple M4, Ampere1B)
* V9.0-A SVE2 baseline + explicit bf16+i8mm flags (Cortex-A510/A710/A715,
X2/X3, Neoverse-N2/V2, Grace)
* V8.4-A SVE tier (Neoverse-V1 / Graviton 3 — only V8.4 core with
explicit SVE+bf16+i8mm)
* V9.2-A (Cortex-A520/A720/A725, X4, X925, Neoverse-N3/V3)
Each entry is taken verbatim from the GCC FEATURE_STRING column.
Cross-referencing with the V8.X-A baseline rules (V8.6+ includes
bf16+i8mm implicitly; V9.0 includes SVE2 implicitly) gives the
canonical "which silicon has what" table. The note flags that a new
dispatch column for the V8.6+/V9-bf16-i8mm tier needs to land
alongside the NEON BFMMLA / BFDOT asm-byte arm in Phase 3b.
The A76 column legend (line 26 of the matrix) was corrected: removed
the wrong "+bf16" (A76 itself is V8.2-A, NO bf16 — bf16 came in
V8.6-A).
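
The baseline rules above can be sketched as a small lookup. This is a minimal illustration, not the § M table itself: `Baseline`, `CoreEntry`, `SAMPLE`, and `find` are hypothetical names, the four rows are a hand-picked subset of the cores listed above, and V9.2-A is left out because the commit message does not state its implied flags.

```rust
/// Architecture baselines named in the list above (illustrative subset).
#[allow(dead_code)] // not every baseline appears in the sample rows
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Baseline {
    V8_0,
    V8_2,
    V8_4,
    V8_5,
    V8_6,
    V8_7,
    V9_0,
}

/// One hypothetical table row: the baseline plus explicitly listed flags.
struct CoreEntry {
    name: &'static str,
    baseline: Baseline,
    explicit_bf16_i8mm: bool,
}

impl CoreEntry {
    /// bf16+i8mm is either listed explicitly or implied by a V8.6+ baseline.
    fn has_bf16_i8mm(&self) -> bool {
        self.explicit_bf16_i8mm || matches!(self.baseline, Baseline::V8_6 | Baseline::V8_7)
    }

    /// SVE2 is implied by the V9.0 baseline.
    fn has_sve2(&self) -> bool {
        self.baseline == Baseline::V9_0
    }
}

/// Sample rows only; the real table is described as covering 28 cores.
static SAMPLE: [CoreEntry; 4] = [
    CoreEntry { name: "cortex-a76", baseline: Baseline::V8_2, explicit_bf16_i8mm: false },
    CoreEntry { name: "apple-m2", baseline: Baseline::V8_6, explicit_bf16_i8mm: false },
    CoreEntry { name: "neoverse-v1", baseline: Baseline::V8_4, explicit_bf16_i8mm: true },
    CoreEntry { name: "neoverse-n2", baseline: Baseline::V9_0, explicit_bf16_i8mm: true },
];

fn find(name: &str) -> Option<&'static CoreEntry> {
    SAMPLE.iter().find(|core| core.name == name)
}
```

With these rows, `cortex-a76` reports no bf16 (matching the legend fix), while `apple-m2` picks it up from its baseline and `neoverse-v1` from its explicit flags.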
# Piece B: CpuOps DTO — third dispatch pattern
Adds `src/simd_runtime/cpu_ops.rs` exposing a per-CPU operations DTO
distinct from the existing patterns:
* Pattern 1 (`crate::simd::*`): compile-time `#[cfg(target_feature)]`
cascade. Direct monomorphized calls.
* Pattern 2 (`crate::simd_runtime::vnni_dot_u8_i8` etc., from #185):
per-op `LazyLock<fn ptr>`. One CPUID + atomic-load per op, paid the
first time that op is called.
* Pattern 3 (THIS COMMIT): per-CPU `&'static CpuOps` selected once at
first access. Every op is a fn-ptr field on the struct.
Why the third pattern?
* Per-op LazyLock: N ops touched = N atomic-load setup costs over
the process lifetime.
* CpuOps DTO: ONE atomic-load total at first `cpu_ops()` call;
every subsequent op is a direct fn-ptr deref through the cached
`&'static CpuOps`. The OpenBLAS / MKL dispatch model — wins for
dense-op consumers (linear-algebra pipelines touching every
BLAS-1/2/3 kernel).
* All three coexist. Consumers pick by import path.
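
The contrast between Pattern 2 and Pattern 3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the code in `cpu_ops.rs`: the op signature `DotU8I8`, the single-field struct, and the `detect` function are hypothetical, and every tier here points at a scalar stand-in kernel so the example stays self-contained.

```rust
use std::sync::LazyLock;

/// Hypothetical op signature; the real trampolines may differ.
type DotU8I8 = fn(&[u8], &[i8]) -> i32;

/// Scalar stand-in kernel used by every tier in this sketch.
fn dot_u8_i8_scalar(a: &[u8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| i32::from(x) * i32::from(y)).sum()
}

// Pattern 2: each op owns a lazily resolved fn pointer.
fn resolve_dot() -> DotU8I8 {
    dot_u8_i8_scalar // a real resolver would probe CPU features here
}
static DOT_U8_I8: LazyLock<DotU8I8> = LazyLock::new(resolve_dot);

pub fn dot_u8_i8(a: &[u8], b: &[i8]) -> i32 {
    (*DOT_U8_I8)(a, b)
}

// Pattern 3: one struct of fn pointers per tier, selected once.
pub struct CpuOps {
    pub tier: &'static str,
    pub dot_u8_i8: DotU8I8,
}

static SCALAR: CpuOps = CpuOps { tier: "scalar", dot_u8_i8: dot_u8_i8_scalar };
#[cfg(target_arch = "x86_64")]
static AVX2_FMA: CpuOps = CpuOps { tier: "avx2_fma", dot_u8_i8: dot_u8_i8_scalar };
#[cfg(target_arch = "aarch64")]
static NEON: CpuOps = CpuOps { tier: "neon", dot_u8_i8: dot_u8_i8_scalar };

/// Picks a tier from runtime feature detection (only two tiers sketched).
fn detect() -> &'static CpuOps {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2")
            && std::arch::is_x86_feature_detected!("fma")
        {
            return &AVX2_FMA;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return &NEON;
        }
    }
    &SCALAR
}

static CPU_OPS: LazyLock<&'static CpuOps> = LazyLock::new(detect);

/// Detection runs on the first call; later calls return the cached reference.
pub fn cpu_ops() -> &'static CpuOps {
    *CPU_OPS
}
```

A dense-op consumer would call `cpu_ops()` once, hold the returned reference, and then invoke fields such as `(ops.dot_u8_i8)(a, b)` without touching any further lazy state.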
Seven tiers baked as static const `CpuOps` instances:
x86_64: amx_int8, avx512vnni, avx512f, avxvnni, avx2_fma
aarch64: neon
universal: scalar
Each instance points at the existing trampolines in
`crate::simd_runtime::{vnni_dot, add_mul}` — no kernel duplication;
this module is pure dispatch glue. Backend ops referenced:
vnni_dot_u8_i8 (3 backends: avx512+tail / avxvnni / scalar)
add_mul_f32 (4 backends: avx512 / avx2+fma / neon / scalar)
add_mul_f64 (4 backends: avx512 / avx2+fma / neon / scalar)
# The naughty data-driven part
`cpu_ops_for_cpu(name: &str) -> Option<&'static CpuOps>` maps GCC
CPU codenames to the dispatch tier they land in, sourced from § M's
scrape. Spot-checks (each verified by the test suite):
sapphirerapids / graniterapids / emeraldrapids → amx_int8
cascadelake / cooperlake / icelake-* / tigerlake / rocketlake
/ znver4 / znver5 → avx512vnni
alderlake / raptorlake / meteorlake / arrowlake / arrowlake-s
/ lunarlake / pantherlake / sierraforest → avxvnni
haswell / broadwell / skylake / znver1-3 → avx2_fma
apple-m1..m4 / oryon-1 / cortex-a76..a725
/ cortex-x1..x925 / neoverse-n1..v3 / grace
/ ampere1..1b → neon
Returns `None` for unknown CPUs — caller can fall back to
`cpu_ops_for_tier("scalar")` if a "best-effort" answer is needed.
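
A name-to-tier lookup of this kind might look like the sketch below. It is an assumption-laden illustration: it returns a hypothetical `Tier` enum rather than `&'static CpuOps`, `tier_for_cpu` and `tier_or_scalar` are made-up names, and only some of the spot-check names are included (the `icelake-*` entries are omitted because the commit message does not spell them out).

```rust
/// Hypothetical tier identifiers mirroring the tier names used above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    AmxInt8,
    Avx512Vnni,
    AvxVnni,
    Avx2Fma,
    Neon,
    Scalar,
}

/// Maps a GCC CPU codename to a dispatch tier; unknown names yield `None`.
pub fn tier_for_cpu(name: &str) -> Option<Tier> {
    let tier = match name {
        "sapphirerapids" | "graniterapids" | "emeraldrapids" => Tier::AmxInt8,
        "cascadelake" | "cooperlake" | "tigerlake" | "rocketlake" | "znver4" | "znver5" => {
            Tier::Avx512Vnni
        }
        "alderlake" | "raptorlake" | "meteorlake" | "arrowlake" | "arrowlake-s"
        | "lunarlake" | "pantherlake" | "sierraforest" => Tier::AvxVnni,
        "haswell" | "broadwell" | "skylake" | "znver1" | "znver2" | "znver3" => Tier::Avx2Fma,
        "apple-m1" | "oryon-1" | "cortex-a76" | "neoverse-n1" | "grace" | "ampere1" => {
            Tier::Neon
        }
        _ => return None,
    };
    Some(tier)
}

/// Best-effort variant: unknown CPUs fall back to the scalar tier.
pub fn tier_or_scalar(name: &str) -> Tier {
    tier_for_cpu(name).unwrap_or(Tier::Scalar)
}
```

Keeping the `None` case separate from the scalar fallback lets introspection tools distinguish "unknown CPU" from "known CPU that really is scalar-only".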
Use cases for `cpu_ops_for_cpu`:
* "What would $CPU pick?" introspection without running on $CPU.
* Cross-compilation reports + deployment-planning tools.
* Integration tests asserting tier selection for named targets.
* Explicit-tier-pinning ("force AVX2 even though AMX is available,
to measure overhead").
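
Explicit tier pinning could be sketched like this. The return type of `cpu_ops_for_tier` is assumed to be `Option<&'static CpuOps>` (the commit message only shows the call), the single-op `CpuOps` shape and `pinned_dot` helper are hypothetical, and both tiers point at a scalar stand-in kernel.

```rust
/// Hypothetical DTO shape with a single op, for illustration only.
pub struct CpuOps {
    pub tier: &'static str,
    pub dot_u8_i8: fn(&[u8], &[i8]) -> i32,
}

/// Scalar stand-in kernel shared by both sample tiers.
fn dot_u8_i8_scalar(a: &[u8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| i32::from(x) * i32::from(y)).sum()
}

static TIERS: [CpuOps; 2] = [
    CpuOps { tier: "avx2_fma", dot_u8_i8: dot_u8_i8_scalar },
    CpuOps { tier: "scalar", dot_u8_i8: dot_u8_i8_scalar },
];

/// Looks a tier up by name, bypassing host detection entirely.
pub fn cpu_ops_for_tier(name: &str) -> Option<&'static CpuOps> {
    TIERS.iter().find(|ops| ops.tier == name)
}

/// Runs the op through a pinned tier; `None` if the tier name is unknown.
pub fn pinned_dot(tier: &str, a: &[u8], b: &[i8]) -> Option<i32> {
    cpu_ops_for_tier(tier).map(|ops| (ops.dot_u8_i8)(a, b))
}
```

With real kernels behind the fn pointers, a caller should pin only to tiers the host actually supports, since a pinned tier skips the feature check that normally guards the SIMD path.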
Future: code-gen the table from a `build.rs` that fetches GCC's
latest core list. Today the table is hand-rolled from the scrape
recorded in matrix doc § M.
# Verification
* `cargo test --lib --features runtime-dispatch`: 2147 tests pass
(was 2105 — +5 new cpu_ops tests + 37 carried over from prior
feature-gated tests now compiled-in too).
* 5 new cpu_ops tests:
cpu_ops_resolves_on_this_host
cpu_ops_stable_across_calls (LazyLock fires once)
cpu_ops_for_tier_known_names
cpu_ops_for_cpu_data_driven_lookup (spot-checks the GCC scrape)
cpu_ops_call_through_dto (full indirect-call exercise)
* `cargo clippy --lib --tests --features rayon,native,runtime-dispatch
-- -D warnings`: clean.
* `cargo fmt --all --check`: clean.
* Default build (no feature) unchanged: zero impact on existing
paths — the entire `simd_runtime` module is gated out.
# Backward-compat for the existing per-op LazyLock surface
The `pub(super)` wrappers in `vnni_dot.rs` and `add_mul.rs`
(`*_safe` / `*_safe_wrapper` / `*_scalar_wrapper`) are new but
purely additive: every existing public function in `simd_runtime`
keeps its prior signature and dispatch behavior.
https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u1
parent 8739d90
5 files changed
Lines changed: 519 additions & 1 deletion
File tree
- .claude/knowledge
- src/simd_runtime