Commit 38d4800
feat(hpc): VPDPBUSD-ymm AVX-VNNI tier for matmul_i8_to_i32
Completes the per-CPU dispatch chain for `matmul_i8_to_i32` by
adding the AVX-VNNI ymm tier for Arrow Lake, Meteor Lake U, and
Alder Lake silicon, which has AVX-VNNI but dropped AVX-512. Mirrors
the shape of the avx512vnni-zmm arm shipped in PR #184, but with a
narrower 8-wide kernel.
New kernel `hpc::int8_tile_gemm::int8_gemm_vpdpbusd_ymm`:
* One `_mm256_dpbusd_avx_epi32` instruction: 8 i32 accumulator
  lanes, each receiving 4 u8×i8 products = 32 MACs per
  instruction. Half the throughput-per-instruction of the
  `_mm512_dpbusd_epi32` zmm version.
* Same B-pre-pack scheme (quad-interleaved per 8-wide j-block),
  same K-tail / N-tail handling. Just narrower.
* Stable intrinsic under `target_feature = "avxvnni,avx2"`, so no
  asm-byte needed.
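
For readers without the intrinsic to hand, the lane arithmetic described above can be modelled in plain scalar Rust. This is only a sketch of the per-instruction semantics (8 i32 lanes, 4 u8×i8 products each, non-saturating accumulate); `vpdpbusd_ymm_model` is a hypothetical name and is not part of this commit.

```rust
/// Scalar model of one 256-bit VPDPBUSD (hypothetical helper, not in the crate).
/// `a` holds 32 unsigned bytes, `b` holds 32 signed bytes; lane `l` consumes
/// bytes `4*l .. 4*l + 4` of each operand.
pub fn vpdpbusd_ymm_model(acc: [i32; 8], a: [u8; 32], b: [i8; 32]) -> [i32; 8] {
    let mut out = acc;
    for lane in 0..8 {
        // Four u8 x i8 products per lane; their sum always fits in i32.
        let mut sum = 0i32;
        for q in 0..4 {
            let idx = lane * 4 + q;
            sum += i32::from(a[idx]) * i32::from(b[idx]);
        }
        // The non-saturating form of the instruction wraps on overflow.
        out[lane] = out[lane].wrapping_add(sum);
    }
    out
}
```

8 lanes × 4 products gives the 32 MACs per instruction quoted above; the zmm form doubles the lane count to 16.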
Wiring `matmul_i8_to_i32`'s dispatch as Tier 3:
1. amx_available() + 16/16/64-aligned → AMX TDPBUSD
   (PR #184: int8_gemm_amx_tiled, 16 384 MACs/instr)
2. is_x86_feature_detected!("avx512vnni") → VPDPBUSD-zmm
   (PR #184: int8_gemm_vpdpbusd_zmm, 64 MACs/instr)
3. is_x86_feature_detected!("avxvnni") → VPDPBUSD-ymm
   (THIS COMMIT: int8_gemm_vpdpbusd_ymm, 32 MACs/instr)
4. scalar i8×i8 → i32 reference (was Tier 3)
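
The ordering above can be sketched as a pure function over capability flags. `CpuCaps`, `Tier` and `select_tier` are hypothetical names for illustration; the real code queries the CPU directly. Reading "16/16/64-aligned" as M % 16, N % 16 and K % 64 is an assumption.

```rust
/// Which kernel the dispatcher would pick (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    AmxTdpbusd,
    VpdpbusdZmm,
    VpdpbusdYmm,
    Scalar,
}

/// Stand-in for the runtime feature probes (hypothetical struct).
#[derive(Debug, Clone, Copy, Default)]
pub struct CpuCaps {
    pub amx: bool,
    pub avx512vnni: bool,
    pub avxvnni: bool,
}

/// First matching tier wins, widest hardware first.
pub fn select_tier(caps: CpuCaps, m: usize, n: usize, k: usize) -> Tier {
    // Assumption: the AMX arm needs M and N multiples of 16, K a multiple of 64.
    let amx_aligned = m % 16 == 0 && n % 16 == 0 && k % 64 == 0;
    if caps.amx && amx_aligned {
        Tier::AmxTdpbusd
    } else if caps.avx512vnni {
        Tier::VpdpbusdZmm
    } else if caps.avxvnni {
        Tier::VpdpbusdYmm
    } else {
        Tier::Scalar
    }
}
```

On a host with both avx512vnni and avxvnni, this model picks the zmm tier, which is why the new arm can only be reached through a direct kernel test there.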
All three SIMD tiers share the sign-shift bias trick: shift LHS
i8 → u8 (+128), run the kernel, subtract 128·colsum(B). Same
`subtract_i8_to_u8_bias` helper (factored in PR #184).
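
The bias trick rests on the identity Σ (a+128)·b = Σ a·b + 128·Σ b, applied per output column. A minimal scalar sketch follows, assuming row-major A (m×k) and B (k×n); `matmul_ref` and `matmul_via_u8_bias` are hypothetical names and do not reproduce the crate's `subtract_i8_to_u8_bias` helper.

```rust
/// Scalar i8 x i8 -> i32 reference (hypothetical helper).
pub fn matmul_ref(a: &[i8], b: &[i8], m: usize, k: usize, n: usize) -> Vec<i32> {
    let mut c = vec![0i32; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                c[i * n + j] += i32::from(a[i * k + p]) * i32::from(b[p * n + j]);
            }
        }
    }
    c
}

/// Same product computed the way a u8 x i8 kernel would: shift A by +128,
/// multiply, then subtract 128 * colsum(B) from every output column.
pub fn matmul_via_u8_bias(a: &[i8], b: &[i8], m: usize, k: usize, n: usize) -> Vec<i32> {
    // i8 -> u8 with a +128 shift: -128 maps to 0, 127 maps to 255.
    let a_u8: Vec<u8> = a.iter().map(|&x| (i16::from(x) + 128) as u8).collect();
    // Per-column sums of B, computed once and reused for every row of A.
    let mut colsum = vec![0i32; n];
    for p in 0..k {
        for j in 0..n {
            colsum[j] += i32::from(b[p * n + j]);
        }
    }
    let mut c = vec![0i32; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                c[i * n + j] += i32::from(a_u8[i * k + p]) * i32::from(b[p * n + j]);
            }
            c[i * n + j] -= 128 * colsum[j];
        }
    }
    c
}
```

Because the correction depends only on B, it costs one pass over B regardless of how many rows A has.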
New direct test `vpdpbusd_ymm_matches_scalar` mirrors the zmm
version's test: sweeps shapes spanning 8-aligned, K-tail (k % 4),
N-tail (n % 8), and small shapes, asserts byte-equal output vs
scalar reference.
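
A rough sketch of what such a shape sweep could look like, using a blocked scalar stand-in in place of the real kernel. The shape list, the LCG fill, and every function name here are assumptions for illustration; the actual `vpdpbusd_ymm_matches_scalar` test is not reproduced.

```rust
/// Plain triple-loop u8 x i8 -> i32 reference (hypothetical).
fn reference(a: &[u8], b: &[i8], m: usize, k: usize, n: usize) -> Vec<i32> {
    let mut c = vec![0i32; m * n];
    for i in 0..m {
        for j in 0..n {
            for p in 0..k {
                c[i * n + j] += i32::from(a[i * k + p]) * i32::from(b[p * n + j]);
            }
        }
    }
    c
}

/// Stand-in for the kernel: walks N in 8-wide blocks and K in quads,
/// clamping the last block of each to cover the N-tail and K-tail.
fn blocked(a: &[u8], b: &[i8], m: usize, k: usize, n: usize) -> Vec<i32> {
    let mut c = vec![0i32; m * n];
    for i in 0..m {
        for j0 in (0..n).step_by(8) {
            let jw = (n - j0).min(8); // N-tail when n % 8 != 0
            for k0 in (0..k).step_by(4) {
                let kw = (k - k0).min(4); // K-tail when k % 4 != 0
                for dj in 0..jw {
                    let mut s = 0i32;
                    for dk in 0..kw {
                        s += i32::from(a[i * k + k0 + dk])
                            * i32::from(b[(k0 + dk) * n + j0 + dj]);
                    }
                    c[i * n + j0 + dj] += s;
                }
            }
        }
    }
    c
}

/// Tiny LCG so the sweep is deterministic without external crates.
fn next_byte(seed: &mut u32) -> u8 {
    *seed = seed.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
    (*seed >> 24) as u8
}

/// Runs every (m, k, n) shape and returns how many were checked.
pub fn sweep_shapes() -> usize {
    let shapes: [(usize, usize, usize); 7] = [
        (1, 1, 1), (1, 4, 8), (3, 5, 7), (8, 8, 8),
        (5, 13, 17), (16, 64, 24), (2, 3, 9),
    ];
    let mut seed = 0x1234_5678u32;
    for &(m, k, n) in &shapes {
        let a: Vec<u8> = (0..m * k).map(|_| next_byte(&mut seed)).collect();
        let b: Vec<i8> = (0..k * n).map(|_| next_byte(&mut seed) as i8).collect();
        assert_eq!(
            blocked(&a, &b, m, k, n),
            reference(&a, &b, m, k, n),
            "shape {}x{}x{}", m, k, n
        );
    }
    shapes.len()
}
```

Since the outputs are integers, exact equality is the right comparison; no tolerance is needed.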
Verification:
* Default v3 (this host has avx512vnni, so the new arm doesn't
  fire from matmul_i8_to_i32; Tier 2 catches first): 2096 lib
  tests pass (was 2095; +1 new direct test).
* Direct test exercises int8_gemm_vpdpbusd_ymm on this host
  since avxvnni is present alongside avx512vnni.
* `cargo clippy --lib --tests --features rayon,native -- -D warnings`
  clean.
* `cargo fmt --all --check` clean.
Per-CPU dispatch state after this commit (final on the int8 side):
matmul_i8_to_i32: SPR+ AMX  | CPL/Zen4 zmm | ARL ymm | scalar
                  (PR #184) | (PR #184)    | (THIS)  | (always)
The matmul_i8_to_i32 column of PR #180's dispatch table is now
fully filled. The gemm_u8_i8 slice surface (in PR #185) already
has AVX-VNNI ymm via its existing compile-time cascade — both
i8-related public surfaces now cover every x86_64 tier with a
hardware-accelerated arm.
Out of scope (separate PRs):
* NEON BFMMLA / SDOT on aarch64 via asm-byte (Phase 3b; needs
  aarch64 CI runner verification).
* TD-T6: real _mm256_* for AVX2 BLAS-1 (scal/nrm2/asum).
https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u1
parent f8e9453, commit 38d4800
2 files changed
Lines changed: 142 additions & 2 deletions