Commit da66c21

docs(simd): integration plan — SimdProfile + static dispatch tables

Companion to td-simd-tier-audit.md. Architecture decision for intra-bucket dispatch (SPR vs ICX vs CLX within Tier::Avx512, A76 vs A72 within Tier::Neon).

Core proposal: SimdProfile enum (SapphireRapids, GraniteRapids, IceLakeSp, CooperLake, CascadeLake, SkylakeX, Zen4Avx512, ArrowLake, HaswellAvx2, A76DotProd, A72Fast, A53Baseline, Scalar) with one static *Dispatch table per profile. Detect once via LazyLock<SimdProfile>, then every dispatch is one pointer deref + one indirect call. Branch predictor warms once.

Compile-time pinning via cargo features (cpu-spr / cpu-icx / cpu-zen4 / etc.) shortcuts the LazyLock to a const &'static *Dispatch reference; rustc folds the entire dispatch out at the callsite. Same source, two ergonomics.

Four phases:

Phase 1 (~10-12h, P0): Wire existing kernels through existing dispatch sites. Closes 7 CRITICAL audit findings (TD-T1..T7). Pure routing, no new abstractions.

Phase 2 (~22h, P1): Real NEON BF16/F16/integer types via asm-byte BFMMLA/BFDOT/fmla.8h. Closes TD-T8, T10, T11, T21. Needs Pi 5 / M2 hardware verification.

Phase 3 (~21h, P1): SimdProfile architecture. Closes the three Tier enum collapses (TD-T12, T13, T14). Migration is additive: simd_profile() ships alongside tier(); the three local enums get deleted last.

Phase 4 (30-50h, P2, parallelizable): Intra-bucket fills. 13 named tasks listed with profile-unlock and target file.

Quick reference at the bottom shows, for bf16_gemm_f32, int8_gemm_i32, gemv_f32, BF16x16, F16x16, the silicon-by-silicon route after all 4 phases land. This is the "what to dispatch where" the user asked for.

7 risks / open questions at the end. Most important:
- cpu-* features must be mutually exclusive (compile_error!) AND documented as non-portable.
- caps.amx_fp16 doesn't exist yet; add it before Phase 3 (CPUID leaf 7,1 EAX bit 21 for GraniteRapids dispatch).
- BLAS-1/2/3, LAPACK, statistics, quantized.rs grep showed NO crate::simd::* use. Flagship public surfaces are scalar. Phase 4 must decide: route through dispatch tables, or rewrite in place.

No code changes in this commit. Documentation only.
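
For orientation only, here is a minimal sketch of the detect-once pattern the plan describes. It is not taken from the plan document. The table type GemvDispatch, its dot_f32 field, the table_for and gemv_dispatch helpers, and the three-variant subset of SimdProfile are all hypothetical names chosen for illustration. The detection logic is a coarse placeholder: it keys on feature flags alone, which is exactly what cannot tell SPR from ICX, so a real detector would need more (for example CPUID family/model). All three tables point at the same scalar stand-in kernel.

```rust
use std::sync::LazyLock;

// Hypothetical subset of the profiles named in the commit message.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SimdProfile {
    SapphireRapids,
    HaswellAvx2,
    Scalar,
}

// Hypothetical table shape: one function pointer per kernel entry point.
pub struct GemvDispatch {
    pub profile: SimdProfile,
    pub dot_f32: fn(&[f32], &[f32]) -> f32,
}

// Scalar stand-in kernel. Real tables would point at
// #[target_feature]-gated implementations instead.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// One static table per profile.
static SPR: GemvDispatch =
    GemvDispatch { profile: SimdProfile::SapphireRapids, dot_f32: dot_scalar };
static HASWELL: GemvDispatch =
    GemvDispatch { profile: SimdProfile::HaswellAvx2, dot_f32: dot_scalar };
static SCALAR: GemvDispatch =
    GemvDispatch { profile: SimdProfile::Scalar, dot_f32: dot_scalar };

// Placeholder detection (assumption): feature flags only.
pub fn detect() -> SimdProfile {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx512f") {
            return SimdProfile::SapphireRapids;
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            return SimdProfile::HaswellAvx2;
        }
    }
    SimdProfile::Scalar
}

// Runs detect() at most once, on first access.
pub static PROFILE: LazyLock<SimdProfile> = LazyLock::new(detect);

pub fn table_for(profile: SimdProfile) -> &'static GemvDispatch {
    match profile {
        SimdProfile::SapphireRapids => &SPR,
        SimdProfile::HaswellAvx2 => &HASWELL,
        SimdProfile::Scalar => &SCALAR,
    }
}

// The cpu-* features must be mutually exclusive.
#[cfg(all(feature = "cpu-spr", feature = "cpu-icx"))]
compile_error!("cpu-* features are mutually exclusive");

// Compile-time pin: the table is a constant reference, no LazyLock involved.
#[cfg(feature = "cpu-spr")]
pub fn gemv_dispatch() -> &'static GemvDispatch {
    &SPR
}

// Default build: runtime detection, cached after the first call.
#[cfg(not(feature = "cpu-spr"))]
pub fn gemv_dispatch() -> &'static GemvDispatch {
    table_for(*PROFILE)
}
```

A caller would write (gemv_dispatch().dot_f32)(a, b). In the default build that is a cached profile read followed by an indirect call; with the hypothetical cpu-spr feature enabled, the table is known at compile time and the optimizer has the chance to resolve the call statically.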

1 parent ca82bb7 commit da66c21

1 file changed

.claude/knowledge
- td-simd-integration-plan.md
