Commit a13d96e

feat(simd_ops): LazyLock-dispatched add_mul + vnni_dot + is_amx_available
Rebased onto master post-#181, #182, #183. Replaces the polyfill-based add_mul_f32/f64 with LazyLock-cached function pointers picking real hardware FMA per silicon, and adds two more LazyLock-cached primitives the consumer needs: is_amx_available() and vnni_dot_u8_i8.

WHY: F32x16::mul_add on AVX2 builds drops to per-lane scalar f32::mul_add (simd_avx2.rs:586). The polyfill abstracts lane width but cannot pick between _mm256_fmadd_ps and _mm512_fmadd_ps: that is an instruction-family choice, not a lane-width one. LazyLock amortises a one-time simd_caps() read into a frozen fn pointer; every subsequent call is a single indirect jump with zero is_x86_feature_detected! overhead. No SimdProfile is exposed at the consumer surface, so the agnostic contract is preserved.

add_mul_f32(acc, a, b): acc[i] += a[i]*b[i]
  AVX-512F+FMA → _mm512_fmadd_ps 16-wide + 8-wide tail + scalar tail
  AVX2+FMA     → _mm256_fmadd_ps 8-wide + scalar tail
  NEON         → vfmaq_f32 4-wide + scalar tail
  scalar       → f32::mul_add per lane
  no_std build → preserves the polyfill F32x16::mul_add path (LazyLock requires std)

add_mul_f64(acc, a, b): f64 sibling, same shape with 8/4/2 lanes.

is_amx_available(): wraps simd_amx::amx_available() (CPUID + OSXSAVE + XCR0[17,18] + Linux arch_prctl(XCOMP_PERM)) in LazyLock<bool>. The 4-step gate, including the syscall, fires exactly once per process. Always false on non-x86_64.

vnni_dot_u8_i8(a, b): i32 dot of u8 × i8 slices
  AVX-512 VNNI → delegates to simd_amx::vnni_dot_u8_i8 wrapped with scalar tail handling (the existing kernel processes only n - (n%64) since its cognitive-shader caller pre-aligns rows; general-purpose callers need the tail)
  AVX-VNNI 256 → delegates to simd_amx::vnni2_dot_u8_i8 directly (that one already handles its scalar tail)
  scalar       → simd_amx::vnni_dot_u8_i8_scalar

No intrinsic code is duplicated. The dispatcher composes existing simd_amx::* kernels (which #182/#184 also build on) into a safe LazyLock-cached consumer-facing wrapper.
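
The dispatch pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: it covers only an AVX2+FMA backend and a scalar fallback, it calls is_x86_feature_detected! directly inside the initializer where the commit reads simd_caps(), and every name other than add_mul_f32 (AddMulF32, ADD_MUL_F32, add_mul_f32_scalar, add_mul_f32_avx2) is hypothetical.

```rust
use std::sync::LazyLock;

/// Signature shared by every backend: acc[i] += a[i] * b[i].
type AddMulF32 = fn(&mut [f32], &[f32], &[f32]);

/// Portable fallback: one fused multiply-add per lane.
fn add_mul_f32_scalar(acc: &mut [f32], a: &[f32], b: &[f32]) {
    for ((c, &x), &y) in acc.iter_mut().zip(a).zip(b) {
        *c = x.mul_add(y, *c);
    }
}

/// Hypothetical AVX2+FMA kernel: 8-wide main loop, then a scalar tail.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn add_mul_f32_avx2_inner(acc: &mut [f32], a: &[f32], b: &[f32]) {
    use std::arch::x86_64::*;
    let n = acc.len().min(a.len()).min(b.len());
    let mut i = 0;
    unsafe {
        while i + 8 <= n {
            let va = _mm256_loadu_ps(a.as_ptr().add(i));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i));
            let vc = _mm256_loadu_ps(acc.as_ptr().add(i));
            // Single-rounded va * vb + vc on all eight lanes.
            _mm256_storeu_ps(acc.as_mut_ptr().add(i), _mm256_fmadd_ps(va, vb, vc));
            i += 8;
        }
    }
    add_mul_f32_scalar(&mut acc[i..n], &a[i..n], &b[i..n]);
}

/// Safe shim so the kernel fits the plain `fn` pointer type.
#[cfg(target_arch = "x86_64")]
fn add_mul_f32_avx2(acc: &mut [f32], a: &[f32], b: &[f32]) {
    // SAFETY: the dispatcher installs this pointer only after runtime
    // detection of both avx2 and fma.
    unsafe { add_mul_f32_avx2_inner(acc, a, b) }
}

/// Resolved once on first use; afterwards a call is one indirect jump.
static ADD_MUL_F32: LazyLock<AddMulF32> = LazyLock::new(|| -> AddMulF32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            return add_mul_f32_avx2;
        }
    }
    add_mul_f32_scalar
});

/// Consumer-facing entry point; no CPU-feature type leaks into the signature.
pub fn add_mul_f32(acc: &mut [f32], a: &[f32], b: &[f32]) {
    (*ADD_MUL_F32)(acc, a, b)
}
```

A caller simply writes add_mul_f32(&mut acc, &a, &b). In this sketch, slices of unequal length are processed up to the shortest one; whether the real wrapper does the same or asserts equal lengths is not stated in the commit message.
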
simd_amx::matvec_dispatch runs the same selection logic but uses is_x86_feature_detected! per call; this wrapper amortises that to once at startup.

PARITY CONTRACT:
- add_mul_f32 / add_mul_f64: bit-identical to f32::mul_add / f64::mul_add per lane via to_bits() assertion. All vector backends emit single-rounded IEEE-754 FMA.
- vnni_dot_u8_i8: bit-identical i32 to scalar widen-and-multiply. VPDPBUSD does not saturate the accumulator (intermediate u8*i8 products lie in [-32640, 32385], four-element sums in [-130560, 129540]).

Tests: 2101/2101 lib pass (7 new lazylock_dispatch_tests over 12 problem sizes / tail lengths). cargo clippy --lib clean under default and --features cpu-spr.

On a Sapphire Rapids host the LazyLock resolved to AVX-512+FMA for add_mul and AVX-512 VNNI for vnni_dot; is_amx_available returns false (hypervisor masks XCR0[17,18]), which matches the Risk #3 demotion from 61b4563.

This commit was rebased atop master after the parallel session shipped PR #182 (BF16 AMX tile kernels), #183 (F16C cast batch), and prepared #184 (TDPBUSD int8 tile + matmul_i8_to_i32 wiring). The earlier 469ecc7 (coarse + SimdTier), 77e3971 (mul_add_f32_into + walkback), and be65595 (is_amx_available + vnni_dot duplicating intrinsics) are subsumed by this single clean commit: no public SimdProfile / SimdTier re-export, no duplicated intrinsic code, no mul_add_f32_into (master's add_mul_f32 shape is the right primitive).
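
The integer side of the parity contract can be illustrated with a small scalar reference and the range arithmetic for one group of four products. This is a sketch only: dot_u8_i8_reference and group_of_four_bounds are hypothetical names, not functions from this commit, and the use of wrapping_add (to mirror a non-saturating accumulator) is an assumption about how the real scalar kernel accumulates.

```rust
/// Hypothetical scalar reference: widen each u8 and i8 to i32, multiply,
/// and accumulate without saturation (wrapping on i32 overflow).
pub fn dot_u8_i8_reference(a: &[u8], b: &[i8]) -> i32 {
    a.iter()
        .zip(b)
        .fold(0i32, |acc, (&x, &y)| acc.wrapping_add(x as i32 * y as i32))
}

/// Smallest and largest possible sum of four u8*i8 products.
pub fn group_of_four_bounds() -> (i32, i32) {
    let max_product = u8::MAX as i32 * i8::MAX as i32; // 255 * 127
    let min_product = u8::MAX as i32 * i8::MIN as i32; // 255 * -128
    (4 * min_product, 4 * max_product)
}
```

Both extremes of a four-element group sum are several orders of magnitude away from the i32 limits, so the grouping step itself cannot overflow; only the running accumulator over a very long slice could.
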
1 parent 61b4563 commit a13d96e

1 file changed

Lines changed: 516 additions & 14 deletions
