Commit bede3d2

docs(simd): flip MX-T1a cells + lock in asm-byte rule for AMX/F16
Two updates to the agnostic-surface CPU matrix following the MX-T1a landing (b5bca4e) and the user directive on instruction encoding strategy:

1. Matrix § C cells flipped from ⚠️ scalar → ✅ for add_i8 / sub_i8 / add_i16 across every CPU column. The path per backend is documented inline (zmm _mm512_add_epi8 on AVX-512-BW, 2× ymm _mm256_add_epi8 on AVX2 via I8x64 polyfill, vaddq_s8 on NEON, scalar wrapping_add elsewhere).

2. § J Phase 0 grows an entry for MX-T1a, and gains a NEW "Design rule for AMX / F16 / FP16 paths" subsection that codifies the asm-byte encoding requirement for Phases 1b (AMX-INT8 arm of gemm_u8_i8), 3b (AVX-512-FP16 native F16x16 ops), 3c (NEON BF16+FP16), and 4d (AMX-FP16 on GNR).

The rule:

* AMX intrinsics are nightly-only on Rust 1.95 (issue #126622) → use asm!(".byte 0xc4, 0xe2, 0x69, 0x5e, 0xc1") style per the existing simd_amx.rs pattern.
* AVX-512-FP16 intrinsics have stabilization churn → same asm-byte encoding sidesteps the Rust release dance.
* NEON FP16 (FMLA v.8h, BFDOT, BFMMLA, USDOT) has historically been nightly-gated; use .inst 0x4e420c20-style encoding for AArch64 (same idea, different assembler directive).
* Each newly encoded instruction lands with an objdump -d verification check in the doc-comment ("verified working", same convention as simd_amx.rs:16-19).
* Does NOT apply to instructions WITH stable intrinsics on Rust 1.95: _mm512_dpbusd_epi32 (avx512vnni), F16C _mm256_cvtph_ps, _mm512_cvtne2ps_pbh (avx512bf16), etc. Those continue using direct intrinsics per existing simd_avx512.rs patterns.

The rule prevents a future regression where a session reaches for nightly avx512fp16 intrinsics, fails to compile on the project's stable toolchain, and then drops back to a scalar polyfill. That is the same shape of regression that removed array_windows/add_mul in the prior session and was recovered in 0a46e7f.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
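
The per-backend paths listed in item 1 all implement the same contract: lane-wise wrapping addition over `i8`. The following is a minimal sketch of that contract, not the project's code. The slice-level signature `add_i8(a, b, out)` is hypothetical, and the 64-lane chunk loop only stands in for the `I8x64` polyfill (a real backend would replace the inner loop with `_mm512_add_epi8`, two `_mm256_add_epi8` calls, or `vaddq_s8`).

```rust
/// Lane-wise wrapping add over i8 slices.
/// Hypothetical signature for illustration; the real `add_i8` may differ.
pub fn add_i8(a: &[i8], b: &[i8], out: &mut [i8]) {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    assert_eq!(a.len(), out.len(), "output must match input length");

    // Width of one I8x64 vector; assumed here to mirror the polyfill type.
    const LANES: usize = 64;
    let full = a.len() / LANES * LANES;

    // Full 64-lane chunks: the part a SIMD backend would vectorize.
    for base in (0..full).step_by(LANES) {
        for i in base..base + LANES {
            out[i] = a[i].wrapping_add(b[i]);
        }
    }

    // Scalar tail for the remaining 0..63 elements.
    for i in full..a.len() {
        out[i] = a[i].wrapping_add(b[i]);
    }
}
```

Whatever the backend, the result must match the scalar `wrapping_add` path bit for bit, including overflow (`127 + 1` wraps to `-128`), which makes the scalar loop a convenient reference in tests.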
1 parent b5bca4e commit bede3d2

1 file changed

Lines changed: 73 additions & 3 deletions

.claude/knowledge/agnostic-surface-cpu-matrix.md

@@ -163,9 +163,9 @@ op on `I32x16`, see § J integration plan Phase 4).

 | Function | SKX | CLX | CPL | ICX | SPR | GNR | Z4 | Z5 | ARL | HSW | A76 | A72 | A53 | SCA |
 |--------------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|------------|-----|
-| `add_i8` | ✗ scalar 🚨 || | | | | | || || | | scalar |
-| `sub_i8` | ✗ scalar 🚨 | | | | | | | | | | | | | |
-| `add_i16` | ✗ scalar 🚨 | | | | | | || || | || |
+| `add_i8` ✅ MX-T1a |`_mm512_add_epi8` via `I8x64` ||||||||`_mm256_add_epi8` ×2 via `I8x64` polyfill ||`vaddq_s8` via `I8x16` ||| scalar wrapping_add |
+| `sub_i8` ✅ MX-T1a |`_mm512_sub_epi8` ||||||||`_mm256_sub_epi8` ×2 ||`vsubq_s8` ||| ✅ scalar wrapping_sub |
+| `add_i16` ✅ MX-T1a |`_mm512_add_epi16` via `I16x32` ||||||||`_mm256_add_epi16` via `I16x32` polyfill ||`vaddq_s16` via `I16x8` ||| ✅ scalar wrapping_add |
 | `dot_i8` | ✗ scalar 🚨 ||||||||||||||
 | `dot_i16` | ✗ scalar 🚨 ||||||||||||||
 | `min_i8` |`vpminsb zmm` via I8x64 ||||||||`vpminsb ymm` via I8x32 polyfill of I8x64 ||`vminq_s8` via I8x16 ||| ✗ scalar loop |
@@ -328,6 +328,76 @@ Filling the matrix in deliberate phases. Each item is one PR-sized unit.
 - `simd_ops::add_mul_f32` + `add_mul_f64` (slice-level FMA, polyfill-routed).
- ✅ "Foundation primitives — do not remove" doc-callout in `simd_ops.rs`.
 - ✅ Bench harness (`bench_gemm_u8_i8_vs_scalar`, `#[ignore]`'d).
+- ✅ MX-T1a: `add_i8` / `sub_i8` / `add_i16` lifted from scalar to polyfilled
+  `I8x64` / `I8x16` / `I16x32` / `I16x8` (matrix § C cells flipped).
+
+### Design rule for AMX / F16 / FP16 paths: inline asm-byte encoding
+
+> **Hard constraint for Phases 1b (AMX-INT8), 3b (AVX-512-FP16),
+> 3c (NEON BF16+FP16), 4d (AMX-FP16):** every instruction that lacks
+> stable Rust intrinsics on the project's pinned 1.95 stable toolchain
+> MUST be emitted as raw-encoded inline asm (`.byte` on x86, `.inst` on
+> AArch64), matching the pattern already proven in `src/simd_amx.rs`
+> (lines 16-19 of its module docs). Rationale:
+>
+> 1. **AMX intrinsics are nightly-only** (Rust issue #126622). The
+>    project pins Rust 1.95 stable per `CLAUDE.md` line 9. The
+>    existing `simd_amx.rs` lifts AMX onto stable today via
+>    `asm!(".byte 0xc4, 0xe2, 0x7b, 0x49, 0xc0", options(nostack, nomem))`
+>    for TILEZERO and equivalent encodings for TDPBUSD / TDPBF16PS.
+> 2. **AVX-512-FP16 intrinsics** (`_mm512_add_ph`, `_mm512_fmadd_ph`,
+>    `vcvtph2ps`/`vcvtps2ph` zmm forms) have a history of
+>    stabilization churn. Asm-byte encoding skips the version dance.
+> 3. **NEON FP16** (FMLA `v.8h`, BFDOT, BFMMLA, USDOT) has likewise been
+>    nightly-gated for several Rust releases. The existing
+>    `simd_neon_bf16.rs` and `simd_neon_dotprod.rs` stub files (TD-T10
+>    / TD-T11) are placeholders meant to be filled with asm-byte
+>    encodings per the same pattern.
+>
+> Concrete recipe:
+>
+> ```rust
+> #[cfg(target_arch = "x86_64")]
+> #[target_feature(enable = "amx-tile,amx-int8")]
+> unsafe fn tdpbusd_t0_t1_t2() {
+>     // TDPBUSD tmm0, tmm1, tmm2: VEX.128.66.0F38.W0 5E /r -> C4 E2 69 5E C1
+>     //   E2 = inverted R/X/B all set (no register extension), opcode map 0F38
+>     //   69 = W=0, vvvv=1101 (inverted tmm2, second source), L=0,
+>     //        pp=01 (66 prefix, the unsigned-by-signed selector)
+>     //   5E = TDPB* opcode
+>     //   C1 = ModR/M 11 000 001 (reg = tmm0 dest, r/m = tmm1 first source)
+>     // Encoding form as documented in Intel SDM Vol. 2 § TDPBUSD; verify with
+>     // `objdump -d` of a gas-assembled stub the first time it lands.
+>     core::arch::asm!(
+>         ".byte 0xc4, 0xe2, 0x69, 0x5e, 0xc1",
+>         options(nostack, nomem)
+>     );
+> }
+> ```
+>
+> Same pattern for NEON F16:
+>
+> ```rust
+> #[cfg(target_arch = "aarch64")]
+> #[target_feature(enable = "neon,fp16")]
+> unsafe fn fmla_v8h(_acc: &mut float16x8_t, _a: float16x8_t, _b: float16x8_t) {
+>     // FMLA Vd.8h, Vn.8h, Vm.8h: encoding 0x4e40_0c00 | (Rd << 0) | (Rn << 5) | (Rm << 16)
+>     // Same byte-encoded pattern as simd_amx.rs uses for AMX on x86.
+>     // Sketch only: the registers are hard-coded to v0/v1/v2, so the unused
+>     // parameters still have to be bound to those registers before real use.
+>     core::arch::asm!(
+>         ".inst 0x4e420c20", // FMLA v0.8h, v1.8h, v2.8h
+>         options(nostack, nomem)
+>     );
+> }
+> ```
+>
+> **Verification harness:** each newly encoded instruction lands with an
+> `objdump -d` check in the doc-comment showing that the disassembly
+> matches the intended mnemonic. The first such verification in this
+> project is recorded in `simd_amx.rs:16-19` ("verified working" line).
+>
+> **What this rule does NOT apply to:** instructions with already-stable
+> intrinsics on Rust 1.95, such as `_mm512_dpbusd_epi32` (avx512vnni),
+> `_mm256_dpbusd_avx_epi32` (avxvnni), `_mm256_cvtph_ps` (F16C), and
+> `_mm512_cvtne2ps_pbh` (avx512bf16). Those continue to use the
+> intrinsics directly per the existing `simd_avx512.rs` patterns.
### Phase 1 — Wire what already exists (highest ROI per audit)
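
Because the design rule in this diff relies on hand-written byte sequences, it can help to derive them from the encoding fields and then compare against the literal in the `asm!` string. The following is a sketch of that arithmetic, not part of the commit. The helper names (`encode_tdpbusd`, `encode_fmla_8h`, `byte_directive`) are hypothetical. The field layouts are assumptions taken from a reading of the Intel SDM (TDPBUSD as `VEX.128.66.0F38.W0 5E /r`, with the second source in `VEX.vvvv`) and the Arm ARM (half-precision vector FMLA), and should be confirmed against those manuals and `objdump -d`.

```rust
/// Three-byte VEX encoding of `TDPBUSD dst, src1, src2` for tmm0..tmm7.
/// Assumed form: VEX.128.66.0F38.W0 5E /r, with dst in ModR/M.reg,
/// src1 in ModR/M.r/m and src2 (inverted) in VEX.vvvv.
pub fn encode_tdpbusd(dst: u8, src1: u8, src2: u8) -> [u8; 5] {
    assert!(dst < 8 && src1 < 8 && src2 < 8, "tile registers are tmm0..tmm7");
    let pp = 0b01; // 66 prefix selects the unsigned-by-signed variant
    let byte2 = ((!src2 & 0x0f) << 3) | pp; // W=0, vvvv=!src2, L=0, pp
    let modrm = 0xc0 | (dst << 3) | src1; // mod=11, reg=dst, r/m=src1
    // 0xc4 = three-byte VEX escape; 0xe2 = no R/X/B extension, map 0F38.
    [0xc4, 0xe2, byte2, 0x5e, modrm]
}

/// AArch64 `FMLA Vd.8H, Vn.8H, Vm.8H` (half-precision vector form).
/// Assumed base opcode 0x4e40_0c00 with Rm at bit 16, Rn at bit 5, Rd at bit 0.
pub fn encode_fmla_8h(rd: u32, rn: u32, rm: u32) -> u32 {
    assert!(rd < 32 && rn < 32 && rm < 32, "vector registers are v0..v31");
    0x4e40_0c00 | (rm << 16) | (rn << 5) | rd
}

/// Formats bytes as the operand string of a `.byte` directive.
pub fn byte_directive(bytes: &[u8]) -> String {
    let body: Vec<String> = bytes.iter().map(|b| format!("{b:#04x}")).collect();
    format!(".byte {}", body.join(", "))
}
```

Under these assumptions, `encode_tdpbusd(0, 1, 2)` gives `c4 e2 69 5e c1` and `encode_fmla_8h(0, 1, 2)` gives `0x4e420c20`. A unit test that compares such derived values with the literal in each `asm!` call would catch a mistyped byte before the `objdump -d` step, though it cannot replace that step because it shares the same reading of the manuals.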
