Commit da66c21
docs(simd): integration plan — SimdProfile + static dispatch tables
Companion to td-simd-tier-audit.md. Architecture decision for
intra-bucket dispatch (SPR vs ICX vs CLX within Tier::Avx512,
A76 vs A72 within Tier::Neon).
Core proposal: SimdProfile enum (SapphireRapids, GraniteRapids,
IceLakeSp, CooperLake, CascadeLake, SkylakeX, Zen4Avx512,
ArrowLake, HaswellAvx2, A76DotProd, A72Fast, A53Baseline,
Scalar) with one static *Dispatch table per profile. Detect
once via LazyLock<SimdProfile>, then every dispatch is one
pointer deref + one indirect call. Branch predictor warms
once.
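
A minimal sketch of the detect-once, table-per-profile shape described
above. The `KernelDispatch` struct, the `dot_f32` kernel, the table
names, and the detection rules are assumptions made for illustration;
only `SimdProfile`, its variant names, and `simd_profile()` come from
the plan. Every table points at the same scalar stand-in so the sketch
runs on any host.

```rust
use std::sync::LazyLock;

/// Subset of the profiles named in the plan; the rest are elided.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SimdProfile {
    SkylakeX,
    HaswellAvx2,
    Scalar,
}

/// Hypothetical table layout: one function pointer per kernel.
pub struct KernelDispatch {
    pub name: &'static str,
    pub dot_f32: fn(&[f32], &[f32]) -> f32,
}

/// Portable reference kernel. Real tables would point at per-silicon
/// kernels instead of sharing this one.
fn dot_f32_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

static SKYLAKE_X: KernelDispatch = KernelDispatch { name: "skylake-x", dot_f32: dot_f32_scalar };
static HASWELL_AVX2: KernelDispatch = KernelDispatch { name: "haswell-avx2", dot_f32: dot_f32_scalar };
static SCALAR: KernelDispatch = KernelDispatch { name: "scalar", dot_f32: dot_f32_scalar };

impl SimdProfile {
    /// One static table per profile.
    pub fn table(self) -> &'static KernelDispatch {
        match self {
            SimdProfile::SkylakeX => &SKYLAKE_X,
            SimdProfile::HaswellAvx2 => &HASWELL_AVX2,
            SimdProfile::Scalar => &SCALAR,
        }
    }
}

/// Deliberately coarse detection (assumption): real intra-bucket
/// detection would need more CPUID bits than these two.
#[cfg(target_arch = "x86_64")]
fn detect_profile() -> SimdProfile {
    if is_x86_feature_detected!("avx512f") {
        SimdProfile::SkylakeX
    } else if is_x86_feature_detected!("avx2") {
        SimdProfile::HaswellAvx2
    } else {
        SimdProfile::Scalar
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn detect_profile() -> SimdProfile {
    SimdProfile::Scalar
}

// Detection runs once; later calls only read the cached values.
static PROFILE: LazyLock<SimdProfile> = LazyLock::new(detect_profile);
static ACTIVE: LazyLock<&'static KernelDispatch> = LazyLock::new(|| PROFILE.table());

pub fn simd_profile() -> SimdProfile {
    *PROFILE
}

/// Public entry point: load the cached table, then one indirect call.
pub fn dot_f32(a: &[f32], b: &[f32]) -> f32 {
    (ACTIVE.dot_f32)(a, b)
}
```

Callers only ever see `dot_f32`; which table serves the call is decided
the first time `ACTIVE` is touched.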
Compile-time pinning via cargo features (cpu-spr / cpu-icx /
cpu-zen4 / etc.) shortcuts the LazyLock to a const &'static
*Dispatch reference; rustc folds the entire dispatch out at
the callsite. Same source, two ergonomics.
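
One way the feature pinning could look, as a sketch under stated
assumptions: the `Dispatch` struct, the `sum_f32` kernel, the table
names, and the `active()` accessor are hypothetical, and only two of
the `cpu-*` features are shown. The plan describes a const reference;
an always-inlined accessor returning a static reference is used here
as a close stand-in. Whether rustc actually folds the indirect call
depends on optimization level and is not guaranteed by this sketch.

```rust
use std::sync::LazyLock;

/// Hypothetical table layout with a single kernel slot.
pub struct Dispatch {
    pub name: &'static str,
    pub sum_f32: fn(&[f32]) -> f32,
}

/// Scalar stand-in shared by every table in this sketch.
fn sum_f32_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

pub static SPR: Dispatch = Dispatch { name: "spr", sum_f32: sum_f32_scalar };
pub static ICX: Dispatch = Dispatch { name: "icx", sum_f32: sum_f32_scalar };
pub static SCALAR: Dispatch = Dispatch { name: "scalar", sum_f32: sum_f32_scalar };

// Pinning two CPUs at once is a build error, not a silent pick.
#[cfg(all(feature = "cpu-spr", feature = "cpu-icx"))]
compile_error!("cpu-spr and cpu-icx are mutually exclusive");

/// Pinned build: the table is known at compile time.
#[cfg(feature = "cpu-spr")]
#[inline(always)]
pub fn active() -> &'static Dispatch {
    &SPR
}

#[cfg(feature = "cpu-icx")]
#[inline(always)]
pub fn active() -> &'static Dispatch {
    &ICX
}

/// Portable build: fall back to a table chosen once at runtime.
/// Detection is stubbed to the scalar table here (assumption).
#[cfg(not(any(feature = "cpu-spr", feature = "cpu-icx")))]
pub fn active() -> &'static Dispatch {
    static RUNTIME: LazyLock<&'static Dispatch> = LazyLock::new(|| &SCALAR);
    *RUNTIME
}

/// Call sites are identical in both builds.
pub fn sum_f32(xs: &[f32]) -> f32 {
    (active().sum_f32)(xs)
}
```

A binary built with a `cpu-*` feature assumes that silicon and should
not be shipped to other machines, which is why the plan wants the
features documented as non-portable.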
Four phases:
Phase 1 (~10-12h, P0): Wire existing kernels through existing
dispatch sites. Closes 7 CRITICAL audit findings (TD-T1..T7).
Pure routing, no new abstractions.
Phase 2 (~22h, P1): Real NEON BF16/F16/integer types via
asm-byte BFMMLA/BFDOT/fmla.8h. Closes TD-T8, T10, T11, T21.
Needs Pi 5 / M2 hardware verification.
Phase 3 (~21h, P1): SimdProfile architecture. Closes the three
Tier enum collapses (TD-T12, T13, T14). Migration is additive
— simd_profile() ships alongside tier(); the three local
enums get deleted last.
Phase 4 (30-50h, P2, parallelizable): Intra-bucket fills.
13 named tasks listed with profile-unlock and target file.
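
The additive migration in Phase 3 could be sketched as a projection
from the fine-grained profile onto the existing coarse tier, so that
tier() keeps working while call sites move over. `Tier::Avx512` and
`Tier::Neon` are named in the plan; the `Avx2` and `Scalar` tier
variants and the `tier()` method placement are assumptions.

```rust
/// Coarse buckets. Avx512 and Neon come from the plan; Avx2 and
/// Scalar are assumed names.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    Avx512,
    Avx2,
    Neon,
    Scalar,
}

/// The thirteen profiles listed in the plan.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SimdProfile {
    SapphireRapids,
    GraniteRapids,
    IceLakeSp,
    CooperLake,
    CascadeLake,
    SkylakeX,
    Zen4Avx512,
    ArrowLake,
    HaswellAvx2,
    A76DotProd,
    A72Fast,
    A53Baseline,
    Scalar,
}

impl SimdProfile {
    /// Collapse a profile to its bucket so legacy tier() callers
    /// keep getting the same answer during the migration.
    pub fn tier(self) -> Tier {
        use SimdProfile::*;
        match self {
            SapphireRapids | GraniteRapids | IceLakeSp | CooperLake
            | CascadeLake | SkylakeX | Zen4Avx512 => Tier::Avx512,
            ArrowLake | HaswellAvx2 => Tier::Avx2,
            A76DotProd | A72Fast | A53Baseline => Tier::Neon,
            Scalar => Tier::Scalar,
        }
    }
}
```

With this in place, the old tier() can be implemented as
`simd_profile().tier()` until the last caller is migrated.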
Quick reference at the bottom shows, for bf16_gemm_f32,
int8_gemm_i32, gemv_f32, BF16x16, F16x16, the silicon-by-silicon
route after all 4 phases land. This is the "what to dispatch
where" the user asked for.
7 risks / open questions at the end. Most important:
- cpu-* features must be mutually exclusive (compile_error!)
AND documented as non-portable.
- caps.amx_fp16 doesn't exist yet; add it before Phase 3
(CPUID leaf 7, subleaf 1, EAX bit 21; needed for
GraniteRapids dispatch).
- A grep of BLAS-1/2/3, LAPACK, statistics, and quantized.rs
showed NO crate::simd::* use. Flagship public surfaces are
scalar. Phase 4 must decide: route through dispatch tables,
or rewrite in place.
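
For the missing caps.amx_fp16 bit in the risk list above, a minimal
sketch of the CPUID read. The function names are hypothetical, and the
check covers only the CPUID bit: it does not verify that the OS has
enabled AMX tile state, which a real caps probe would also need.

```rust
/// AMX-FP16 is reported in CPUID leaf 7, subleaf 1, EAX bit 21.
/// Kept as a pure function so the bit test is checkable anywhere.
pub fn has_amx_fp16(leaf7_sub1_eax: u32) -> bool {
    leaf7_sub1_eax & (1 << 21) != 0
}

/// Hypothetical probe. Guards against CPUs that do not implement
/// leaf 7 or its subleaf 1 before reading the bit.
#[cfg(target_arch = "x86_64")]
#[allow(unused_unsafe)]
pub fn detect_amx_fp16() -> bool {
    use std::arch::x86_64::{__cpuid, __cpuid_count};
    unsafe {
        // Leaf 0 EAX holds the highest supported basic leaf.
        if __cpuid(0).eax < 7 {
            return false;
        }
        // Leaf 7 subleaf 0 EAX holds the highest supported subleaf.
        if __cpuid_count(7, 0).eax < 1 {
            return false;
        }
        has_amx_fp16(__cpuid_count(7, 1).eax)
    }
}

/// Non-x86 targets never report AMX.
#[cfg(not(target_arch = "x86_64"))]
pub fn detect_amx_fp16() -> bool {
    false
}
```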
No code changes in this commit. Documentation only.
1 parent ca82bb7 commit da66c21
1 file changed
Lines changed: 479 additions & 0 deletions