Commit 8b1d376

docs(pr-x12): purge runtime-dispatch creep from woa-multiarch §3
The previous §3 violated the project's W1a polyfill contract by describing runtime CPU dispatch as if it were the substrate's design:

- §3.1 listed multiple #[target_feature(enable = ...)] fns coexisting in one mod, framed as "compile-time" but read like "available to select from at runtime".
- §3.2 had an HwCaps OnceLock + caps.has_amx / has_vnni / has_sve2 if-else chain with `unsafe { batched_gemm_amx(...) }` runtime branches. This is exactly the pattern CLAUDE.md and the W1a consumer contract forbid.
- §3.3 framed Arch::CURRENT + build.rs as one option among several runtime/host-probe approaches, which kept the "branching is OK" mental model alive even after the host-vs-target fix.

Rewritten to match the actual project pattern (per CLAUDE.md's Repository Structure + W1a contract):

- "All dispatch is polyfill." cfg(target_feature = ...) selects exactly one backend file (simd_avx512.rs / simd_neon.rs / simd_scalar.rs) to compile in. No runtime branching.
- Target CPU is fixed at build time via .cargo/config.toml's target-cpu = x86-64-v4 (AVX-512 mandatory on x86_64) or via the target triple for non-x86 builds. WoA fleet ships per-arch binaries, not a fat binary that probes.
- R-5's per-arch crossover constant is also part of the polyfill: one const per backend file, cfg-selected. build.rs may emit a refined override into OUT_DIR (compile-time, target-config-driven, never host CPU probing) but the selection mechanism is still cfg.
- Added explicit negative-rule sentence at the top of §3 listing forbidden patterns: no HwCaps / CpuCaps runtime branching, no `if has_avx512 else ...` dispatch, no `unsafe { runtime_branch }`.

The previous wording was hallucinated branching that contradicted the substrate's actual design. The substrate ships ONE path per binary; the cfg selects which path at build time.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
1 parent 1bb4561 commit 8b1d376

1 file changed

Lines changed: 35 additions & 67 deletions

.claude/knowledge/pr-x12-woa-multiarch-orchestration.md

@@ -74,97 +74,65 @@ This separation is what makes R-3's LoC envelope (≤1500 LoC codec body) tracta
 
 ---
 
-## 3. Per-arch dispatch as a substrate property
+## 3. Per-arch substrate via compile-time polyfill
 
-The PR-X12 substrate (per merged canon §M:E-G, §M:E-H, R-4, R-5, R-11) implements per-arch dispatch via three mechanisms:
+The PR-X12 substrate follows the project's W1a consumer contract (see `CLAUDE.md` and `.claude/knowledge/vertical-simd-consumer-contract.md`): **all dispatch is polyfill**. Per arch we ship a separate backend file with the same public surface, and `cfg(target_feature = ...)` selects exactly one to compile in. There is **no runtime CPU detection, no `HwCaps`/`CpuCaps` branching, no `if has_avx512 else …` dispatch, and no `unsafe { runtime_branch }` chain.** The target CPU is fixed at build time via `.cargo/config.toml` (`target-cpu=x86-64-v4` makes AVX-512 mandatory on x86_64) or via the target triple for non-x86 builds. One build, one path.
 
-### 3.1 Compile-time `target_feature`
+### 3.1 The polyfill primitive: cfg-selected per-arch files
+
+The pattern is the same one already shipping in `src/simd*.rs` (per `CLAUDE.md` Repository Structure):
 
 ```rust
-// In ndarray::hpc::blas_level2::batched_gemm:
+// src/simd.rs — consumer-facing surface, re-exports a single backend
+#[cfg(target_feature = "avx512f")]
+pub use crate::simd_avx512::*;
 
-#[cfg(target_arch = "x86_64")]
-mod x86_dispatch {
-    #[target_feature(enable = "avx512f,avx512bw,avx512vnni")]
-    pub unsafe fn batched_gemm_vnni(...) { /* VNNI path */ }
+#[cfg(all(not(target_feature = "avx512f"), target_arch = "aarch64"))]
+pub use crate::simd_neon::*;
 
-    #[target_feature(enable = "amx-tile,amx-int8,amx-bf16")]
-    pub unsafe fn batched_gemm_amx(...) { /* AMX path */ }
-}
+#[cfg(not(any(target_feature = "avx512f", target_arch = "aarch64")))]
+pub use crate::simd_scalar::*;
+```
 
-#[cfg(target_arch = "aarch64")]
-mod arm_dispatch {
-    #[target_feature(enable = "sve2")]
-    pub unsafe fn batched_gemm_sve2(...) { /* SVE2 path */ }
+Each backend file (`simd_avx512.rs`, `simd_neon.rs`, `simd_scalar.rs`) implements the same public functions with identical signatures. The W1a contract requires **all three backends + a parity test** before any new primitive lands. The codec body (`ndarray-codec`, see R-3) and downstream consumers (burn / candle / lance-graph / surrealdb / WoA fleet) call `ndarray::simd::*` directly — they never see or reason about which backend is active. The cfg substitutes one file at the use-site; consumer code is identical across architectures.
 
-    #[target_feature(enable = "neon,fp16")]
-    pub unsafe fn batched_gemm_neon_fp16(...) { /* Apple Silicon */ }
-}
-```
+### 3.2 Build-time CPU selection (not runtime detection)
 
-### 3.2 Runtime feature detection (cached at process start)
+Target CPU is decided once, at build time:
 
-```rust
-// In ndarray::hpc::capability:
-pub static CAP: OnceLock<HwCaps> = OnceLock::new();
-
-pub struct HwCaps {
-    pub has_amx: bool,
-    pub has_vnni: bool,
-    pub has_sve2: bool,
-    pub has_neon_fp16: bool,
-    pub l1_cache_size: usize,
-    pub vec_width_bits: u16,
-    // ... more as new features land
-}
+| Mechanism | Source | Effect |
+|---|---|---|
+| `.cargo/config.toml` `target-cpu=x86-64-v4` | repo policy | AVX-512 mandatory on x86_64 (per `CLAUDE.md`) |
+| `--target aarch64-apple-darwin` | CI / fleet build matrix | NEON-fp16 backend compiles in |
+| `--target aarch64-unknown-linux-gnu` + SVE2 target-feature | Graviton build | SVE2 backend compiles in |
 
-pub fn batched_gemm(input: ...) {
-    let caps = CAP.get().unwrap();
-    if caps.has_amx { unsafe { batched_gemm_amx(input) } }
-    else if caps.has_vnni { unsafe { batched_gemm_vnni(input) } }
-    else if caps.has_sve2 { unsafe { batched_gemm_sve2(input) } }
-    // ...
-    else { batched_gemm_scalar(input) }
-}
-```
+The WoA fleet ships **per-arch binaries**, not a fat binary that probes. Q2 distributes the right binary to each node based on the node's already-known architecture (declared at registration time, not detected per request). Cross-arch determinism (§6 below) is enforced because each binary embeds exactly one backend and the W1a parity test gates every primitive at the substrate layer.
 
-### 3.3 Per-arch tunable crossover (R-5 generalised)
+### 3.3 Per-arch tunable crossover (R-5)
 
-Some operations have a "small N: scalar, large N: SIMD" crossover that varies per arch. The snippet below is **pseudocode** — Rust's stable const-eval does not let `match` discriminate over a runtime-detected `Arch::CURRENT` value at `const` context. The real mechanism is a `build.rs` script that resolves the target from compile-time metadata Cargo exposes to build scripts — `CARGO_CFG_TARGET_ARCH`, `CARGO_CFG_TARGET_FEATURE`, the target triple, and any pre-recorded calibration artifact — and emits the chosen integer as a generated `const` in `OUT_DIR`. **Critically, do NOT probe the host CPU inside `build.rs`**: under cross-compilation Cargo runs `build.rs` on the *host* machine, so any runtime feature-detection there reflects the build machine and not the target. Cargo's docs are explicit: use `CARGO_CFG_*` env vars (which correctly reflect the target) rather than `cfg!`/`#[cfg]` checks (which reflect the host the script runs on).
+Some operations (DCT-II vs GEMM, basin-lookup width, etc.) have a "small N: scalar path, large N: SIMD path" crossover whose break-even N varies per backend. The crossover lives in the **same polyfill** as the SIMD primitives: a `cfg(target_feature = ...)`-selected `const`.
 
 ```rust
-// Shape of the per-arch table (lives in a build-script-generated file
-// included via include!(concat!(env!("OUT_DIR"), "/arch_crossovers.rs"))):
+// src/hpc/dct_crossover.rs — one const per backend file, cfg-selected
 //
-// pub const DCT_BATCH_CROSSOVER: usize = 64; // emitted by build.rs
-// // for Sapphire Rapids
-//
-// The build script's decision matrix, driven entirely by Cargo's
-// target-config env vars (no host CPU probing):
-// CARGO_CFG_TARGET_FEATURE contains "amx-bf16" → 64
-// CARGO_CFG_TARGET_FEATURE contains "avx512f" → 32 (skylake-x/ice lake)
-// CARGO_CFG_TARGET_FEATURE contains "avx512f", Zen-tuned target-cpu → 96
-// CARGO_CFG_TARGET_ARCH == "aarch64" + NEON-only → 256
-// CARGO_CFG_TARGET_ARCH == "aarch64" + SVE2 → 128
-// else → usize::MAX
-//
-// Equivalent in-crate fallback shape using cfg! (still target-resolved,
-// since cfg! in normal (non-build-script) code uses target cfgs):
-// const DCT_BATCH_CROSSOVER: usize = if cfg!(target_feature = "amx-bf16") { 64 }
-// else if cfg!(target_feature = "avx512f") { 32 }
-// else if cfg!(target_arch = "aarch64") { 128 }
-// else { usize::MAX };
+// simd_avx512.rs: pub const DCT_BATCH_CROSSOVER: usize = 64;
+// simd_neon.rs (Apple Silicon): pub const DCT_BATCH_CROSSOVER: usize = 256;
+// simd_scalar.rs: pub const DCT_BATCH_CROSSOVER: usize = usize::MAX;
 
 pub fn dct_apply<const N: usize>(input: &[i16], output: &mut [i16]) {
     if N >= DCT_BATCH_CROSSOVER {
-        unsafe { dct_gemm_path(input, output) }
+        dct_gemm_path(input, output) // calls into ndarray::simd::*
     } else {
-        dct_butterfly_path(input, output)
+        dct_butterfly_path(input, output) // also calls into ndarray::simd::*
     }
 }
 ```
 
-R-5 commits these crossovers as **bench-tunable constants** emitted by Plan G's codec-bench calibration sub-target into the per-arch `OUT_DIR` file — not hand-guessed numbers, not a runtime `match` on a synthetic `Arch` enum, and never via host CPU probing under cross-compilation. The build script (driven by `CARGO_CFG_TARGET_*`) is the single source of truth for which integer compiles in.
+The integer `DCT_BATCH_CROSSOVER` comes from one of two places:
+1. **Hand-tuned default**: a known-good number per backend, checked into the backend file.
+2. **Plan G calibration override**: `build.rs` may consult `CARGO_CFG_TARGET_FEATURE` + a pre-recorded calibration artifact from `codec-bench` and emit a refined const into `OUT_DIR`, included by the backend file. This is still compile-time selection — the build script never probes the host CPU, only reads Cargo's target-config env vars.
+
+Either way the constant is **fixed in the compiled binary**. R-5 commits these crossovers as bench-tunable but compile-time-fixed; the `cfg(target_feature)`-selected backend file is the single source of truth.
 
 ---
 
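The cfg-selected re-export and the W1a parity requirement described in the new §3.1 can be illustrated with a single-file miniature. This is only a sketch under stated assumptions: the `sum_i16` primitive, the `BACKEND` constant, and the `parity_holds` helper are hypothetical names invented for illustration, the backends are inlined as modules rather than separate files, and the "SIMD" bodies are plain Rust stand-ins for what would be intrinsics in a real backend.

```rust
// Hypothetical miniature of the polyfill layout: every backend exposes the
// same `sum_i16` signature, and cfg compiles in exactly one of them behind
// the `simd` facade. There is no runtime CPU detection anywhere.

// Scalar backend: always compiled here so it can serve as the parity reference.
#[allow(dead_code)]
pub mod simd_scalar {
    pub const BACKEND: &str = "scalar";
    pub fn sum_i16(xs: &[i16]) -> i32 {
        xs.iter().map(|&x| i32::from(x)).sum()
    }
}

// Stand-in for an AVX-512 backend (assumption: real code would use intrinsics).
#[cfg(target_feature = "avx512f")]
pub mod simd_avx512 {
    pub const BACKEND: &str = "avx512";
    pub fn sum_i16(xs: &[i16]) -> i32 {
        // Process in 32-lane chunks to mimic a 512-bit register of i16.
        xs.chunks(32)
            .map(|c| c.iter().map(|&x| i32::from(x)).sum::<i32>())
            .sum()
    }
}

// Stand-in for a NEON backend (assumption: real code would use intrinsics).
#[cfg(all(not(target_feature = "avx512f"), target_arch = "aarch64"))]
pub mod simd_neon {
    pub const BACKEND: &str = "neon";
    pub fn sum_i16(xs: &[i16]) -> i32 {
        // Process in 8-lane chunks to mimic a 128-bit register of i16.
        xs.chunks(8)
            .map(|c| c.iter().map(|&x| i32::from(x)).sum::<i32>())
            .sum()
    }
}

// Consumer-facing facade: the same three cfg predicates as the diff's simd.rs.
pub mod simd {
    #[cfg(target_feature = "avx512f")]
    pub use super::simd_avx512::*;
    #[cfg(all(not(target_feature = "avx512f"), target_arch = "aarch64"))]
    pub use super::simd_neon::*;
    #[cfg(not(any(target_feature = "avx512f", target_arch = "aarch64")))]
    pub use super::simd_scalar::*;
}

/// Parity check: the compiled-in backend must agree with the scalar reference.
pub fn parity_holds(xs: &[i16]) -> bool {
    simd::sum_i16(xs) == simd_scalar::sum_i16(xs)
}
```

Consumer code only ever names `simd::sum_i16`. Which body it reaches is decided by the three mutually exclusive cfg predicates, so exactly one re-export exists in any given build.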

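The Plan G calibration override in the new §3.3 (a `build.rs` that reads Cargo's target-config env vars and emits a const into `OUT_DIR`) might look roughly like the following. This is a hedged sketch, not the project's build script: `pick_crossover`, `render_const`, and `emit_crossover` are hypothetical helper names, the integers simply mirror the hand-tuned defaults quoted in the diff (64 / 256 / `usize::MAX`), and the calibration-artifact lookup is left out entirely.

```rust
use std::{fs, io, path::Path};

/// Hypothetical helper: choose the crossover from the values Cargo exposes to
/// build scripts as CARGO_CFG_TARGET_ARCH and CARGO_CFG_TARGET_FEATURE (the
/// latter is a comma-separated feature list). Nothing here probes the host CPU.
pub fn pick_crossover(target_arch: &str, target_features: &str) -> usize {
    let has = |feature: &str| target_features.split(',').any(|f| f.trim() == feature);
    if has("avx512f") {
        64 // default quoted for simd_avx512.rs in the diff
    } else if target_arch == "aarch64" {
        256 // default quoted for simd_neon.rs (Apple Silicon) in the diff
    } else {
        usize::MAX // scalar backend: never take the GEMM path
    }
}

/// Render the Rust source line that the backend file would `include!`.
pub fn render_const(value: usize) -> String {
    if value == usize::MAX {
        "pub const DCT_BATCH_CROSSOVER: usize = usize::MAX;\n".to_string()
    } else {
        format!("pub const DCT_BATCH_CROSSOVER: usize = {value};\n")
    }
}

/// Write the generated const into `out_dir` (OUT_DIR in a real build script).
/// The file name `dct_crossover.rs` is an assumption for illustration.
pub fn emit_crossover(out_dir: &Path, target_arch: &str, target_features: &str) -> io::Result<()> {
    let value = pick_crossover(target_arch, target_features);
    fs::write(out_dir.join("dct_crossover.rs"), render_const(value))
}

// In an actual build.rs, `main` would wire these together along the lines of:
//
//   let arch = std::env::var("CARGO_CFG_TARGET_ARCH").unwrap_or_default();
//   let feats = std::env::var("CARGO_CFG_TARGET_FEATURE").unwrap_or_default();
//   let out_dir = std::env::var("OUT_DIR").expect("OUT_DIR is set by Cargo");
//   emit_crossover(Path::new(&out_dir), &arch, &feats).expect("write const");
```

Because the decision depends only on strings Cargo derives from the target configuration, a cross-compile from an x86_64 host to `aarch64-apple-darwin` still emits the NEON value.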