Commit 0d00318

feat(simd): Phase 1 — explicit cargo configs + AVX2 dispatch hardening
Implements Phase 1 of the integration plan in `.claude/knowledge/simd-dispatch-architecture.md` (PR #171).

Changes
-------

1. `.cargo/config.toml`: set `target-cpu = "x86-64-v3"` for x86_64. Previously the file declared "no global target-cpu", which compiled binaries to x86-64 generic (SSE2). `simd_avx2::F32x16` and friends wrap `__m256` / `__m256i` intrinsics that the runtime CPU never executes under SSE2, producing the PR #170 SIGILL CI mode (38 tests timing out uniformly at ~19s in `simd_avx2::*` / `simd_ops::*` / `simd_soa::*`).

2. `.cargo/config-avx512.toml` (new): explicit `x86-64-v4` for AVX-512 builds. Triggered by `cargo --config .cargo/config-avx512.toml`.

3. `.cargo/config-native.toml` (new): `target-cpu = "native"` for build-host-tuned binaries (developer machines). Non-portable.

4. `src/simd.rs`: tighten the AVX2 dispatch arm predicate from `not(target_feature = "avx512f")` to `target_feature = "avx2" + not(target_feature = "avx512f")`. Belts-and-braces: under v3 the predicates are equivalent, but the explicit `avx2` requirement means a future "build me without v3" invocation lands on a compile error rather than a SIGILL at run time. Stale "target-cpu=x86-64-v4 → AVX-512" comment refreshed to describe the new three-config dispatch model.

Out of scope for this PR
------------------------

The architecture doc (PR #171) claimed Phase 1 also needed to "add ~10 missing AVX2 two-half wrappers". On survey those wrappers already exist in `src/simd_avx2.rs`:

- `F32x16` / `F64x8`: true two-half AVX wrappers
- `U8x32`: native AVX2 `__m256i`
- `U8x64` / `I8x64` / `I16x32` / `I32x16` / `I64x8` / `U16x32` / `U32x16` / `U64x8`: scalar polyfill via the `avx2_int_type!` macro (storage = `[$elem; $lanes]` align 64).

The matrix in the architecture doc will be corrected as a follow-up. The parity gap that does exist (scalar-polyfill ints are not vectorized under AVX2) is its own piece of tech debt, tracked separately.
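To make the "scalar polyfill" wording above concrete, here is a minimal sketch of what a macro with that storage shape could look like. It is not the repository's `avx2_int_type!`: the macro name `polyfill_int_type!`, the `*Sketch` type names, and the choice of wrapping addition are all assumptions made for illustration. The only detail taken from the commit message is the storage layout, `[$elem; $lanes]` with 64-byte alignment.

```rust
/// Hypothetical stand-in for a scalar-polyfill integer lane type.
/// Storage follows the commit message: `[$elem; $lanes]`, aligned to 64 bytes.
macro_rules! polyfill_int_type {
    ($name:ident, $elem:ty, $lanes:expr) => {
        #[derive(Clone, Copy, Debug, PartialEq)]
        #[repr(C, align(64))]
        pub struct $name(pub [$elem; $lanes]);

        impl $name {
            pub const LANES: usize = $lanes;

            /// Broadcast one value to every lane.
            pub fn splat(v: $elem) -> Self {
                Self([v; $lanes])
            }
        }

        impl ::std::ops::Add for $name {
            type Output = Self;

            /// Lane-wise wrapping add, written as a plain scalar loop.
            /// Nothing here uses `__m256i`; any vectorization is left to
            /// the optimizer (an assumption of this sketch).
            fn add(self, rhs: Self) -> Self {
                let mut out = self.0;
                for (o, r) in out.iter_mut().zip(rhs.0.iter()) {
                    *o = o.wrapping_add(*r);
                }
                Self(out)
            }
        }
    };
}

// Two illustrative instantiations (names are hypothetical).
polyfill_int_type!(I32x16Sketch, i32, 16);
polyfill_int_type!(U8x64Sketch, u8, 64);
```

A type generated this way has the same size and alignment as a 512-bit register, so it can share call sites with a native backend, but each operation is a loop over lanes. That is the parity gap the commit message tracks as separate tech debt.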
1 parent 207fc20 commit 0d00318

4 files changed

Lines changed: 73 additions & 9 deletions

.cargo/config-avx512.toml

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+[build]
+# Explicit AVX-512 config: `x86-64-v4`. Use with:
+# cargo --config .cargo/config-avx512.toml build
+# cargo --config .cargo/config-avx512.toml test
+#
+# Compiles `target_feature = "avx512f"` on, so `src/simd.rs` selects the
+# `simd_avx512` backend with native `__m512` / `__m512d` / `__m512i`
+# storage. Required for the Sapphire Rapids / Granite Rapids hot paths
+# (`f32_to_bf16_batch_rne`, the AVX-512BF16 BF16 lanes, the AMX tiles).
+#
+# Binary produced here will SIGILL on AVX2-only silicon; only use on
+# hosts that report `avx512f` in `/proc/cpuinfo`. For shipping a single
+# release artifact that adapts at process start, see the LazyLock runtime
+# dispatch path in § 7.1 of the architecture doc instead.
+[target.'cfg(target_arch = "x86_64")']
+rustflags = ["-Ctarget-cpu=x86-64-v4"]

.cargo/config-native.toml

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+[build]
+# Native build config: `target-cpu = "native"`. Use with:
+# cargo --config .cargo/config-native.toml build
+# cargo --config .cargo/config-native.toml test
+#
+# rustc resolves the build host's CPUID at invocation and enables every
+# `target_feature` the host CPU advertises. `simd.rs` then picks the
+# matching backend (typically `simd_avx512` on modern dev machines).
+#
+# Produces a binary tuned for the developer's exact silicon. The result
+# is NOT portable: do not distribute artifacts built with this config.
+[target.'cfg(target_arch = "x86_64")']
+rustflags = ["-Ctarget-cpu=native"]
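Because the three configs differ only in which `target_feature` flags rustc turns on, it can help to check what a given build actually enabled. The following is a small sketch, not part of the commit: the function name `compiled_features` and the particular feature list are assumptions chosen to match the features this diff mentions.

```rust
/// Reports which of a few x86 target features were enabled at COMPILE time.
/// `cfg!` is resolved by rustc while building, so this reflects the
/// `-Ctarget-cpu` flag in effect, not the CPU the binary later runs on.
pub fn compiled_features() -> Vec<&'static str> {
    let mut on = Vec::new();
    if cfg!(target_feature = "sse2") {
        on.push("sse2");
    }
    if cfg!(target_feature = "avx") {
        on.push("avx");
    }
    if cfg!(target_feature = "avx2") {
        on.push("avx2");
    }
    if cfg!(target_feature = "fma") {
        on.push("fma");
    }
    if cfg!(target_feature = "avx512f") {
        on.push("avx512f");
    }
    if cfg!(target_feature = "avx512bf16") {
        on.push("avx512bf16");
    }
    on
}
```

Printing the result from a test or a debug binary built with each config is a quick way to confirm that the default build reports `avx2` without `avx512f`, and that the native build reports whatever the build host advertises.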

.cargo/config.toml

Lines changed: 25 additions & 3 deletions
@@ -1,4 +1,26 @@
 [build]
-# No global target-cpu. Each kernel uses #[target_feature(enable = "avx512f")]
-# per-function, with LazyLock runtime detection. One binary, all ISAs.
-# Railway (AVX-512) and GitHub CI (AVX2) use the same binary.
+# Default cargo config: x86-64-v3 (AVX2) baseline. Portable across
+# AVX2-capable x86_64 silicon (Haswell, ~2013, and later). This is what GitHub CI
+# runs against and what `cargo build` produces for general distribution.
+#
+# Why v3 and not "no target-cpu":
+# `src/simd_avx2.rs` composes `F32x16` as two `__m256` halves (AVX
+# intrinsics), and the `simd_avx2_*` op funcs use `__m256i` (AVX2).
+# Without a global v3 baseline, rustc compiles to x86-64 generic (SSE2)
+# and those intrinsics emit instructions the CPU never executes →
+# SIGILL at run time, exactly the PR #170 CI failure mode.
+#
+# AVX-512 builds: use `--config .cargo/config-avx512.toml` (or
+# `CARGO_BUILD_RUSTFLAGS='-Ctarget-cpu=x86-64-v4'`). The simd.rs dispatch
+# arms key off `target_feature = "avx512f"`; under v4 they pick the
+# `simd_avx512` backend (native `__m512` / `__m512d` / `__m512i`).
+#
+# Build-machine-tuned binaries: use `--config .cargo/config-native.toml`
+# (`target-cpu = "native"`); rustc resolves the host CPUID at compile.
+#
+# Runtime LazyLock dispatch (one release binary, heterogeneous deployment
+# silicon) is a fifth opt-in mode; see § 7.1 of
+# .claude/knowledge/simd-dispatch-architecture.md. Reserved for the
+# release-binary distribution path; never the dev / CI default.
+[target.'cfg(target_arch = "x86_64")']
+rustflags = ["-Ctarget-cpu=x86-64-v3"]
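The config comment mentions runtime LazyLock dispatch without showing it, and § 7.1 of the architecture doc is not part of this commit. As a rough illustration of the general pattern only, the sketch below picks an implementation once, on first use, based on what the running CPU reports. Every name in it (`AddFn`, `add_scalar`, `add_avx2`, `ADD_IMPL`, `add_in_place`, `active_backend`) is hypothetical, and the AVX2 variant relies on the compiler to vectorize a plain loop rather than on hand-written intrinsics.

```rust
use std::sync::LazyLock;

/// Signature shared by every implementation in the dispatch table.
type AddFn = fn(&mut [f32], &[f32]);

/// Portable fallback: element-wise `dst[i] += src[i]`.
fn add_scalar(dst: &mut [f32], src: &[f32]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d += *s;
    }
}

/// Same loop, but compiled with AVX2 enabled for this one function,
/// regardless of the crate-wide `target-cpu`.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_avx2_inner(dst: &mut [f32], src: &[f32]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d += *s;
    }
}

/// Safe wrapper so the AVX2 variant fits the `AddFn` pointer type.
#[cfg(target_arch = "x86_64")]
fn add_avx2(dst: &mut [f32], src: &[f32]) {
    // SAFETY: this function is only stored in `ADD_IMPL` after AVX2 has
    // been detected on the running CPU.
    unsafe { add_avx2_inner(dst, src) }
}

/// Resolved once, on first access, from the CPU the process runs on.
static ADD_IMPL: LazyLock<(&'static str, AddFn)> = LazyLock::new(|| {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return ("avx2", add_avx2 as AddFn);
        }
    }
    ("scalar", add_scalar as AddFn)
});

/// Public entry point: one indirect call through the cached pointer.
pub fn add_in_place(dst: &mut [f32], src: &[f32]) {
    (ADD_IMPL.1)(dst, src)
}

/// Name of the implementation chosen for this process.
pub fn active_backend() -> &'static str {
    ADD_IMPL.0
}
```

The trade-off against the compile-time configs in this commit is that a runtime-dispatched binary runs on any x86_64 host, at the cost of an indirect call per operation and of keeping every variant compiled in. That matches the comment's framing of it as a release-distribution mode rather than the dev or CI default.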

src/simd.rs

Lines changed: 19 additions & 6 deletions
@@ -198,10 +198,17 @@ pub const PREFERRED_I16_LANES: usize = 16;
 // x86_64: re-export based on tier
 // ============================================================================

-// Compile-time AVX-512 dispatch via target_feature.
-// With target-cpu=x86-64-v4 (.cargo/config.toml), avx512f is enabled
-// at compile time → all types use native __m512/__m512d/__m512i.
-// The 256-bit types (F32x8, F64x4) also live in simd_avx512 (__m256).
+// Compile-time SIMD dispatch via target_feature. The cargo config
+// chosen at build (.cargo/config.toml = v3 default / config-avx512.toml
+// = v4 / config-native.toml = native) sets the `target_feature` flags
+// that select exactly one arm below.
+// * v3 / GitHub-CI default → `target_feature = "avx2"` only →
+// simd_avx2 backend (F32x16 = two-half (f32x8, f32x8), int wrappers
+// are scalar polyfills via the `avx2_int_type!` macro).
+// * v4 (or native on AVX-512 host) → `target_feature = "avx512f"` →
+// simd_avx512 backend with native __m512 / __m512d / __m512i.
+// * aarch64 → simd_neon backend.
+// * everything else (wasm32, riscv, etc.) → scalar fallback.

 // Note on the `nightly-simd` feature: it adds the `crate::simd_nightly`
 // module (a portable-simd backend wrapping `core::simd`) but does NOT
@@ -272,10 +279,16 @@ pub use crate::simd_avx512::{f32_to_bf16_batch_rne, f32_to_bf16_scalar_rne};
 #[cfg(all(target_arch = "x86_64", target_feature = "avx512bf16"))]
 pub use crate::simd_avx512::{BF16x16, BF16x8};

-#[cfg(all(target_arch = "x86_64", not(target_feature = "avx512f")))]
+// AVX2 baseline arm: selected by the `x86-64-v3` cargo default. Requires
+// `target_feature = "avx2"` explicitly: building x86_64-without-AVX2 (the
+// generic `x86-64` baseline = SSE2) would otherwise pick this arm and
+// then SIGILL on the `__m256` / `__m256i` intrinsics inside the wrappers.
+// Whoever wants no-AVX2 must pick the scalar fallback path (currently
+// non-x86 only; see TD-SIMD-7 in the architecture doc).
+#[cfg(all(target_arch = "x86_64", target_feature = "avx2", not(target_feature = "avx512f")))]
 pub use crate::simd_avx512::{f32x8, f64x4, i16x16, i8x32, F32x8, F64x4, I16x16, I8x32};

-#[cfg(all(target_arch = "x86_64", not(target_feature = "avx512f")))]
+#[cfg(all(target_arch = "x86_64", target_feature = "avx2", not(target_feature = "avx512f")))]
 pub use crate::simd_avx2::{
 f32x16, f64x8, i16x32, i32x16, i64x8, i8x64, u32x16, u64x8, u8x64, F32Mask16, F32x16, F64Mask8, F64x8, I16x32,
 I32x16, I64x8, I8x64, U16x32, U32x16, U64x8, U8x64,
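The effect of the tightened predicate is easier to see with the arms laid out side by side. The sketch below mirrors the `cfg` predicates from this diff but swaps the real re-exports for a string constant; the name `BACKEND` and the placeholder arm for x86_64 without AVX2 are assumptions of the sketch. In the actual `src/simd.rs`, per the commit message, that last case selects no arm and is expected to surface as a compile error.

```rust
// AVX-512 arm: chosen under x86-64-v4, or native on an AVX-512 host.
#[cfg(all(target_arch = "x86_64", target_feature = "avx512f"))]
pub const BACKEND: &str = "simd_avx512";

// AVX2 arm with the tightened predicate from this commit: it now
// requires `avx2` explicitly instead of merely "not avx512f".
#[cfg(all(
    target_arch = "x86_64",
    target_feature = "avx2",
    not(target_feature = "avx512f")
))]
pub const BACKEND: &str = "simd_avx2";

// aarch64 arm.
#[cfg(target_arch = "aarch64")]
pub const BACKEND: &str = "simd_neon";

// Everything that is neither x86_64 nor aarch64 (wasm32, riscv, ...).
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
pub const BACKEND: &str = "scalar";

// x86_64 built WITHOUT AVX2 (generic x86-64 = SSE2). Before the change
// this configuration matched the AVX2 arm. This sketch defines a
// placeholder so that it compiles on any host; the real module defines
// nothing here, which turns the misconfiguration into a build failure.
#[cfg(all(
    target_arch = "x86_64",
    not(target_feature = "avx2"),
    not(target_feature = "avx512f")
))]
pub const BACKEND: &str = "unselected (x86_64 without avx2)";
```

The five predicates are mutually exclusive and cover every target, so exactly one `BACKEND` definition survives in any build. Under the old `not(target_feature = "avx512f")` predicate, the second and fifth cases were the same arm.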
