Commit bc307ec

Merge pull request #187 from AdaWorldAPI/claude/continue-ndarray-x0Oaw
simd_runtime: CpuOps DTO (third dispatch pattern) + GCC-scraped CPU table
2 parents 8739d90 + d50caaf commit bc307ec

5 files changed

Lines changed: 519 additions & 1 deletion

.claude/knowledge/agnostic-surface-cpu-matrix.md

Lines changed: 71 additions & 1 deletion
@@ -23,7 +23,7 @@ Same set as `td-simd-cpu-dispatch-matrix.md` § "Master matrix — x86_64" and
 | Z5 | `znver5` / `Zen4Avx512` (same dispatch) | AMD 2024 | same as Z4 + minor uarch |
 | ARL | `arrowlake` / `ArrowLake` | Intel 2024 | AVX2+FMA + AVX-VNNI+VNNI-INT8 |
 | HSW | `x86-64-v3` / `HaswellAvx2` | Intel 2013→2021 | AVX2+FMA (no VNNI/AVX-512) |
-| A76 | `cortex-a76` / `A76DotProd` | ARMv8.2 (Pi 5, M1) | NEON+dotprod+bf16+fp16 |
+| A76 | `cortex-a76` / `A76DotProd` | ARMv8.2 (Pi 5) | NEON+dotprod+fp16 (no bf16 / i8mm — those are V8.6+, see § M) |
 | A72 | `cortex-a72` / `A72Fast` | ARMv8.0 (Pi 4) | NEON only (no dotprod) |
 | A53 | `cortex-a53` / `A53Baseline` | ARMv8.0 (Pi 3/Z2W) | NEON, lower IPC |
 | SCA | scalar fallback | wasm32/riscv/i686 | no SIMD |
@@ -530,6 +530,76 @@ verifies that no per-CPU regression has crept in vs the historical baseline:
 `crate::simd::*`, this table must grow a row. Reviewers should reject
 PRs that add a public symbol without a corresponding matrix entry.
 
+## M. AArch64 ground-truth core enumeration (GCC source)
+
+The matrix above uses three aarch64 columns (A53 / A72 / A76) that
+each cover a *dispatch tier*: multiple physical cores share the same
+SIMD primitive set. The authoritative per-core feature membership is
+in GCC's `gcc/config/aarch64/aarch64-cores.def`, scraped 2026-05-21:
+
+| Core | GCC arch | Explicit feature flags |
+|---|---|---|
+| **A53/A72/A76 tier** (baseline NEON, optional dotprod+fp16, NO bf16) | | |
+| `cortex-a53` | V8-A | `(CRC)` |
+| `cortex-a72` | V8-A | `(CRC)` |
+| `cortex-a76` | V8.2-A | `F16, RCPC, DOTPROD` |
+| `cortex-a78` | V8.2-A | `F16, RCPC, DOTPROD, SSBS, PROFILE` |
+| `cortex-x1` | V8.2-A | `F16, RCPC, DOTPROD, SSBS, PROFILE` |
+| `neoverse-n1`| V8.2-A | `F16, RCPC, DOTPROD, PROFILE` |
+| `apple-m1` | V8.5-A | `()` — V8.5 baseline includes F16+dotprod, NO bf16/i8mm |
+| **V8.6-A tier** (BF16 + I8MM via baseline) | | |
+| `apple-m2` | V8.6-A | `()` — V8.6 baseline: bf16, i8mm, sve, sve2 |
+| `apple-m3` | V8.6-A | same |
+| `oryon-1` | V8.6-A | `CRYPTO, SM4, SHA3, F16` (Snapdragon X Elite/Plus) |
+| `ampere1` | V8.6-A | `F16, RNG, AES, SHA3` |
+| `ampere1a` | V8.6-A | `F16, RNG, AES, SHA3, SM4, MEMTAG` |
+| **V8.7-A tier** (baseline + LS64 + MOPS) | | |
+| `apple-m4` | V8.7-A | `()` |
+| `ampere1b` | V8.7-A | `F16, RNG, AES, SHA3, SM4, MEMTAG, CSSC` |
+| **V9.0-A tier** (SVE2 baseline + explicit bf16/i8mm) | | |
+| `cortex-a510`| V9-A | `SVE2_BITPERM, MEMTAG, I8MM, BF16` |
+| `cortex-a710`| V9-A | `SVE2_BITPERM, MEMTAG, I8MM, BF16` |
+| `cortex-a715`| V9-A | `SVE2_BITPERM, MEMTAG, I8MM, BF16` |
+| `cortex-x2` | V9-A | `SVE2_BITPERM, MEMTAG, I8MM, BF16` |
+| `cortex-x3` | V9-A | `SVE2_BITPERM, MEMTAG, I8MM, BF16` |
+| `neoverse-n2`| V9-A | `I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE` |
+| `neoverse-v2`| V9-A | `I8MM, BF16, SVE2_BITPERM, RNG, MEMTAG, PROFILE` (Graviton 4) |
+| `grace` | V9-A | `I8MM, BF16, SVE2_BITPERM, SVE2_AES, SVE2_SHA3, SVE2_SM4, PROFILE` |
+| **V8.4-A SVE tier** (Graviton 3's odd one) | | |
+| `neoverse-v1`| V8.4-A | `SVE, I8MM, BF16, PROFILE, SSBS, RNG` |
+| **V9.2-A tier** (V9 + V8.7 features) | | |
+| `cortex-a520`| V9.2-A | `SVE2_BITPERM, MEMTAG` |
+| `cortex-a720`| V9.2-A | `SVE2_BITPERM, MEMTAG, PROFILE` |
+| `cortex-a725`| V9.2-A | `SVE2_BITPERM, MEMTAG, PROFILE` |
+| `cortex-x4` | V9.2-A | `SVE2_BITPERM, MEMTAG, PROFILE` |
+| `cortex-x925`| V9.2-A | `SVE2_BITPERM, MEMTAG, PROFILE` |
+| `neoverse-n3`| V9.2-A | `SVE2_BITPERM, RNG, MEMTAG, PROFILE` |
+| `neoverse-v3`| V9.2-A | `SVE2_BITPERM, RNG, LS64, MEMTAG, PROFILE` |
+
+**Dispatch tier mapping (which matrix column each core lands in):**
+
+| Tier (matrix col.) | Cores |
+|---|---|
+| A53 | `cortex-a53`, older V8.0-A |
+| A72 | `cortex-a72`, V8.0-A + CRC |
+| A76 (V8.2 with dotprod+fp16, NO bf16/i8mm) | `cortex-a76`, `cortex-a78`, `cortex-x1`, `neoverse-n1`, `apple-m1` |
+| **(new tier: V8.6+/V9 with bf16+i8mm)** | `apple-m2`+, `oryon-1` (Snapdragon X), `cortex-a510`+, `neoverse-n2`/`v2`/`grace`, `ampere1`+ |
+| **(new tier: V8.4-A + SVE + bf16+i8mm)** | `neoverse-v1` (Graviton 3; only V8.4-A core with explicit SVE+bf16+i8mm) |
+
+The matrix's three aarch64 columns cover the bottom of the dispatch
+ladder. The bf16/i8mm tier (which would carry NEON BFMMLA / BFDOT /
+USDOT / FMLA.8h) needs its own column in a future revision. When the
+NEON BF16 asm-byte arm lands (Phase 3b in § J), every V8.6+ core
+listed above gets covered by the same dispatch arm.
+
+**Source provenance:** scraped from
+`https://raw.githubusercontent.com/gcc-mirror/gcc/master/gcc/config/aarch64/aarch64-cores.def`
+(GCC trunk, 2026-05-21). The `AARCH64_CORE(...)` macro emits the
+canonical name → arch → feature-string mapping; GCC's
+`(define_insn ...)` patterns in `aarch64-simd.md` give the bit
+encodings for the asm-byte rule (`.inst 0xXXXXXXXX`) that Phase 3b
+will use for BFMMLA / BFDOT / FMLA.8h / USDOT.
+
 ## L. Provenance
 
 - CPU feature presence: sourced from `td-simd-cpu-dispatch-matrix.md`.
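
The dispatch tier mapping in § M can be read as a small decision function over feature flags. The sketch below is one possible reading, not code from this commit: the `Aarch64Features` struct, the `Tier` enum, and `classify` are hypothetical names. It assumes that A53 and A72 cannot be told apart by flags (both list only `CRC` in the table), so it folds them into a single NEON-only result, and it uses "SVE without SVE2" to single out the `neoverse-v1` tier.

```rust
/// Hypothetical feature snapshot; the field names are illustrative only.
#[derive(Clone, Copy, Debug, Default)]
struct Aarch64Features {
    dotprod: bool,
    fp16: bool,
    bf16: bool,
    i8mm: bool,
    sve: bool,
    sve2: bool,
}

/// Hypothetical tier names mirroring the mapping table in § M.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Tier {
    /// A53 / A72: NEON only. The two differ in IPC, not in feature flags.
    NeonBaseline,
    /// A76 column: dotprod + fp16, no bf16 / i8mm.
    A76DotProd,
    /// Proposed tier: V8.6+ / V9 cores with bf16 + i8mm.
    Bf16I8mm,
    /// Proposed tier: V8.4-A core with SVE (no SVE2) + bf16 + i8mm.
    SveBf16I8mm,
}

fn classify(f: Aarch64Features) -> Tier {
    if f.bf16 && f.i8mm {
        // SVE without SVE2 is what sets neoverse-v1 apart from the V9 cores.
        if f.sve && !f.sve2 {
            Tier::SveBf16I8mm
        } else {
            Tier::Bf16I8mm
        }
    } else if f.dotprod && f.fp16 {
        Tier::A76DotProd
    } else {
        Tier::NeonBaseline
    }
}

fn main() {
    let a72 = Aarch64Features::default();
    let a76 = Aarch64Features { dotprod: true, fp16: true, ..Default::default() };
    let m2 = Aarch64Features { bf16: true, i8mm: true, ..a76 };
    let v1 = Aarch64Features { sve: true, ..m2 };
    for (name, f) in [("cortex-a72", a72), ("cortex-a76", a76), ("apple-m2", m2), ("neoverse-v1", v1)] {
        println!("{name}: {:?}", classify(f));
    }
}
```

On a real aarch64 target the struct could plausibly be filled from `std::arch::is_aarch64_feature_detected!`; that wiring is left out here so the sketch compiles on any host.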

src/simd_runtime/add_mul.rs

Lines changed: 45 additions & 0 deletions
@@ -228,6 +228,51 @@ unsafe fn add_mul_f64_scalar(acc: &mut [f64], a: &[f64], b: &[f64]) {
     }
 }
 
+// ────────────────────────────────────────────────────────────────────────
+// CpuOps DTO entry points — pub(super) wrappers for cpu_ops.rs to
+// reference the tier-specific kernels by name in static const decls.
+// Each one has the safety invariant guaranteed by the cpu_ops()
+// LazyLock that installed the parent &'static CpuOps.
+// ────────────────────────────────────────────────────────────────────────
+
+#[cfg(target_arch = "x86_64")]
+pub(super) unsafe fn add_mul_f32_avx512_safe(acc: &mut [f32], a: &[f32], b: &[f32]) {
+    add_mul_f32_avx512(acc, a, b)
+}
+
+#[cfg(target_arch = "x86_64")]
+pub(super) unsafe fn add_mul_f64_avx512_safe(acc: &mut [f64], a: &[f64], b: &[f64]) {
+    add_mul_f64_avx512(acc, a, b)
+}
+
+#[cfg(target_arch = "x86_64")]
+pub(super) unsafe fn add_mul_f32_avx2_fma_safe(acc: &mut [f32], a: &[f32], b: &[f32]) {
+    add_mul_f32_avx2_fma(acc, a, b)
+}
+
+#[cfg(target_arch = "x86_64")]
+pub(super) unsafe fn add_mul_f64_avx2_fma_safe(acc: &mut [f64], a: &[f64], b: &[f64]) {
+    add_mul_f64_avx2_fma(acc, a, b)
+}
+
+#[cfg(target_arch = "aarch64")]
+pub(super) unsafe fn add_mul_f32_neon_safe(acc: &mut [f32], a: &[f32], b: &[f32]) {
+    add_mul_f32_neon(acc, a, b)
+}
+
+#[cfg(target_arch = "aarch64")]
+pub(super) unsafe fn add_mul_f64_neon_safe(acc: &mut [f64], a: &[f64], b: &[f64]) {
+    add_mul_f64_neon(acc, a, b)
+}
+
+pub(super) unsafe fn add_mul_f32_scalar_safe(acc: &mut [f32], a: &[f32], b: &[f32]) {
+    add_mul_f32_scalar(acc, a, b)
+}
+
+pub(super) unsafe fn add_mul_f64_scalar_safe(acc: &mut [f64], a: &[f64], b: &[f64]) {
+    add_mul_f64_scalar(acc, a, b)
+}
+
 #[cfg(test)]
 mod tests {
     use super::*;
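
The comment block in this hunk describes the pattern without showing the consuming side (`cpu_ops.rs` is not part of this excerpt). As a rough illustration only, a table of function pointers selected once behind a `LazyLock` could look like the sketch below. Everything here is an assumption: the `CpuOps` field layout, the `detect` helper, and the simplified kernel bodies are hypothetical stand-ins, not the crate's actual definitions.

```rust
use std::sync::LazyLock;

/// Hypothetical kernel signature, matching the wrappers in the hunk above.
type AddMulF32 = unsafe fn(&mut [f32], &[f32], &[f32]);

/// Hypothetical DTO: one function pointer per operation, one static per tier.
struct CpuOps {
    name: &'static str,
    add_mul_f32: AddMulF32,
}

/// Stand-in scalar kernel: acc[i] += a[i] * b[i].
unsafe fn add_mul_f32_scalar(acc: &mut [f32], a: &[f32], b: &[f32]) {
    for ((o, x), y) in acc.iter_mut().zip(a).zip(b) {
        *o += x * y;
    }
}

/// Stand-in AVX2+FMA kernel; calling it is only sound once both features are detected.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn add_mul_f32_avx2_fma(acc: &mut [f32], a: &[f32], b: &[f32]) {
    for ((o, x), y) in acc.iter_mut().zip(a).zip(b) {
        *o = x.mul_add(*y, *o);
    }
}

static SCALAR_OPS: CpuOps = CpuOps { name: "scalar", add_mul_f32: add_mul_f32_scalar };

#[cfg(target_arch = "x86_64")]
static AVX2_FMA_OPS: CpuOps = CpuOps { name: "avx2+fma", add_mul_f32: add_mul_f32_avx2_fma };

/// Picks the best table the running CPU supports.
#[cfg(target_arch = "x86_64")]
fn detect() -> &'static CpuOps {
    if std::arch::is_x86_feature_detected!("avx2") && std::arch::is_x86_feature_detected!("fma") {
        &AVX2_FMA_OPS
    } else {
        &SCALAR_OPS
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn detect() -> &'static CpuOps {
    &SCALAR_OPS
}

/// Detection runs once; later calls return the cached reference.
static CPU_OPS: LazyLock<&'static CpuOps> = LazyLock::new(detect);

fn cpu_ops() -> &'static CpuOps {
    *CPU_OPS
}

/// Safe entry point: checks lengths, then calls through the installed table.
fn add_mul_f32(acc: &mut [f32], a: &[f32], b: &[f32]) {
    assert!(a.len() == acc.len() && b.len() == acc.len());
    // SAFETY: detect() only installs a table whose required CPU features are present.
    unsafe { (cpu_ops().add_mul_f32)(acc, a, b) }
}

fn main() {
    let mut acc = [1.0f32, 1.0, 1.0];
    add_mul_f32(&mut acc, &[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0]);
    println!("{}: {:?}", cpu_ops().name, acc);
}
```

The point of the indirection is that the safety argument lives in one place: each `unsafe fn` pointer is only reachable through a table that `detect` hands out after the feature check, which appears to be the invariant the hunk's comment attributes to the `cpu_ops()` `LazyLock`.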
