Commit a13d96e
committed
feat(simd_ops): LazyLock-dispatched add_mul + vnni_dot + is_amx_available
Rebased onto master post-#181, #182, #183. Replaces the polyfill-based
add_mul_f32/f64 with LazyLock-cached function pointers picking real
hardware FMA per silicon, and adds two more LazyLock-cached
primitives the consumer needs: is_amx_available() and vnni_dot_u8_i8.
WHY: F32x16::mul_add on AVX2 builds drops to per-lane scalar
f32::mul_add (simd_avx2.rs:586). The polyfill abstracts lane width
but cannot pick between _mm256_fmadd_ps and _mm512_fmadd_ps — that
is an instruction-family choice, not a lane-width one. LazyLock
amortises a one-time simd_caps() read into a frozen fn pointer;
every subsequent call is a single indirect jump with zero
is_x86_feature_detected! overhead. No SimdProfile exposed at the
consumer surface — agnostic contract preserved.
add_mul_f32(acc, a, b) — acc[i] += a[i]*b[i]
AVX-512F+FMA → _mm512_fmadd_ps 16-wide + 8-wide tail + scalar tail
AVX2+FMA → _mm256_fmadd_ps 8-wide + scalar tail
NEON → vfmaq_f32 4-wide + scalar tail
scalar → f32::mul_add per lane
no_std build → preserves the polyfill F32x16::mul_add path
(LazyLock requires std)
add_mul_f64(acc, a, b) — f64 sibling, same shape with 8/4/2 lanes.
is_amx_available() — wraps simd_amx::amx_available() (CPUID +
OSXSAVE + XCR0[17,18] + Linux arch_prctl(XCOMP_PERM)) in
LazyLock<bool>. The 4-step gate, including the syscall, fires
exactly once per process. Always false on non-x86_64.
vnni_dot_u8_i8(a, b) — i32 dot of u8 × i8 slices:
AVX-512 VNNI → delegates to simd_amx::vnni_dot_u8_i8 wrapped with
scalar tail handling (the existing kernel processes
only n - (n%64) since its cognitive-shader caller
pre-aligns rows; general-purpose callers need the
tail)
AVX-VNNI 256 → delegates to simd_amx::vnni2_dot_u8_i8 directly
(that one already handles its scalar tail)
scalar → simd_amx::vnni_dot_u8_i8_scalar
No intrinsic code is duplicated. The dispatcher composes existing
simd_amx::* kernels (which #182/#184 also build on) into a safe
LazyLock-cached consumer-facing wrapper. simd_amx::matvec_dispatch
runs the same selection logic but uses is_x86_feature_detected! per
call; this wrapper amortises that to once at startup.
PARITY CONTRACT:
- add_mul_f32 / add_mul_f64: bit-identical to f32::mul_add /
f64::mul_add per lane via to_bits() assertion. All vector
backends emit single-rounded IEEE-754 FMA.
- vnni_dot_u8_i8: bit-identical i32 to scalar widen-and-multiply.
VPDPBUSD does not saturate the accumulator (intermediate u8*i8
products bounded by 32385, four-element sums by 129540).
Tests: 2101/2101 lib pass (7 new lazylock_dispatch_tests over 12
problem sizes / tail lengths). cargo clippy --lib clean under
default and --features cpu-spr. On Sapphire Rapids host the
LazyLock resolved to AVX-512+FMA for add_mul, AVX-512 VNNI for
vnni_dot; AMX is_amx_available returns false (hypervisor masks
XCR0[17,18]) — matches the Risk #3 demotion from 61b4563.
This commit was rebased atop master after the parallel session
shipped PR #182 (BF16 AMX tile kernels), #183 (F16C cast batch), and
prepared #184 (TDPBUSD int8 tile + matmul_i8_to_i32 wiring). The
earlier 469ecc7 (coarse + SimdTier) and 77e3971 (mul_add_f32_into +
walkback) and be65595 (is_amx_available + vnni_dot duplicating
intrinsics) are subsumed by this single clean commit: no public
SimdProfile / SimdTier re-export, no duplicated intrinsic code, no
mul_add_f32_into (master's add_mul_f32 shape is the right primitive).1 parent 61b4563 commit a13d96e
1 file changed
Lines changed: 516 additions & 14 deletions
0 commit comments