docs(pr-x12): purge runtime-dispatch creep from woa-multiarch §3

The previous §3 violated the project's W1a polyfill contract by
describing runtime CPU dispatch as if it were the substrate's design:

- §3.1 listed multiple #[target_feature(enable = ...)] fns
  coexisting in one mod, framed as "compile-time" but read like
  "available to select from at runtime".
- §3.2 had an HwCaps OnceLock + caps.has_amx / has_vnni / has_sve2
  if-else chain with `unsafe { batched_gemm_amx(...) }` runtime
  branches. This is exactly the pattern CLAUDE.md and the W1a
  consumer contract forbid.
- §3.3 framed Arch::CURRENT + build.rs as one option among several
  runtime/host-probe approaches, which kept the "branching is OK"
  mental model alive even after the host-vs-target fix.

Rewritten to match the actual project pattern (per CLAUDE.md's
Repository Structure + W1a contract):

- "All dispatch is polyfill." cfg(target_feature = ...) selects
  exactly one backend file (simd_avx512.rs / simd_neon.rs /
  simd_scalar.rs) to compile in. No runtime branching.
- Target CPU is fixed at build time via .cargo/config.toml's
  target-cpu = x86-64-v4 (AVX-512 mandatory on x86_64) or via the
  target triple for non-x86 builds. WoA fleet ships per-arch
  binaries, not a fat binary that probes.
- R-5's per-arch crossover constant is also part of the polyfill:
  one const per backend file, cfg-selected. build.rs may emit a
  refined override into OUT_DIR (compile-time, target-config-driven,
  never host CPU probing) but the selection mechanism is still cfg.
- Added explicit negative-rule sentence at the top of §3 listing
  forbidden patterns: no HwCaps / CpuCaps runtime branching, no
  `if has_avx512 else ...` dispatch, no `unsafe { runtime_branch }`.

The previous wording was hallucinated branching that contradicted the
substrate's actual design. The substrate ships ONE path per binary;
the cfg selects which path at build time.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

File: `.claude/knowledge/pr-x12-woa-multiarch-orchestration.md`
Lines changed: 35 additions & 67 deletions
@@ -74,97 +74,65 @@ This separation is what makes R-3's LoC envelope (≤1500 LoC codec body) tracta
---

-## 3. Per-arch dispatch as a substrate property
+## 3. Per-arch substrate via compile-time polyfill

-The PR-X12 substrate (per merged canon §M:E-G, §M:E-H, R-4, R-5, R-11) implements per-arch dispatch via three mechanisms:
+The PR-X12 substrate follows the project's W1a consumer contract (see `CLAUDE.md` and `.claude/knowledge/vertical-simd-consumer-contract.md`): **all dispatch is polyfill**. Per arch we ship a separate backend file with the same public surface, and `cfg(target_feature = ...)` selects exactly one to compile in. There is **no runtime CPU detection, no `HwCaps`/`CpuCaps` branching, no `if has_avx512 else …` dispatch, and no `unsafe { runtime_branch }` chain.** The target CPU is fixed at build time via `.cargo/config.toml` (`target-cpu=x86-64-v4` makes AVX-512 mandatory on x86_64) or via the target triple for non-x86 builds. One build, one path.

-### 3.1 Compile-time `target_feature`
+### 3.1 The polyfill primitive: cfg-selected per-arch files
+
+The pattern is the same one already shipping in `src/simd*.rs` (per `CLAUDE.md` Repository Structure):

```rust
-// In ndarray::hpc::blas_level2::batched_gemm:
+// src/simd.rs: consumer-facing surface, re-exports a single backend
Each backend file (`simd_avx512.rs`, `simd_neon.rs`, `simd_scalar.rs`) implements the same public functions with identical signatures. The W1a contract requires **all three backends + a parity test** before any new primitive lands. The codec body (`ndarray-codec`, see R-3) and downstream consumers (burn / candle / lance-graph / surrealdb / WoA fleet) call `ndarray::simd::*` directly — they never see or reason about which backend is active. The cfg substitutes one file at the use-site; consumer code is identical across architectures.

-#[target_feature(enable = "neon,fp16")]
-pub unsafe fn batched_gemm_neon_fp16(...) { /* Apple Silicon */ }
-}
-
```
+### 3.2 Build-time CPU selection (not runtime detection)

-### 3.2 Runtime feature detection (cached at process start)
The WoA fleet ships **per-arch binaries**, not a fat binary that probes. Q2 distributes the right binary to each node based on the node's already-known architecture (declared at registration time, not detected per request). Cross-arch determinism (§6 below) is enforced because each binary embeds exactly one backend and the W1a parity test gates every primitive at the substrate layer.
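
As a minimal sketch of what "each binary embeds exactly one backend" can look like in code, the block below gates three modules behind mutually exclusive `cfg` predicates and re-exports whichever one compiled. The module layout, the `dot` function, and `BACKEND_NAME` are hypothetical names chosen for illustration, and every body is a plain-Rust stand-in rather than real intrinsics; the predicates used in the project's own `src/simd.rs` may differ.

```rust
// Sketch only: `backend`, `dot`, and BACKEND_NAME are hypothetical names.
// Each body is a plain-Rust stand-in; a real backend would use intrinsics.

#[cfg(all(target_arch = "x86_64", target_feature = "avx512f"))]
mod backend {
    pub const BACKEND_NAME: &str = "avx512";
    pub fn dot(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    }
}

#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
mod backend {
    pub const BACKEND_NAME: &str = "neon";
    pub fn dot(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    }
}

// Fallback: compiled only when neither SIMD predicate above holds.
#[cfg(not(any(
    all(target_arch = "x86_64", target_feature = "avx512f"),
    all(target_arch = "aarch64", target_feature = "neon"),
)))]
mod backend {
    pub const BACKEND_NAME: &str = "scalar";
    pub fn dot(a: &[f32], b: &[f32]) -> f32 {
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    }
}

// Consumer-facing surface: exactly one `backend` module exists in any build,
// so callers never name or test for a specific architecture.
pub use backend::{dot, BACKEND_NAME};

fn main() {
    let a = [1.0_f32, 2.0, 3.0];
    let b = [4.0_f32, 5.0, 6.0];
    println!("{} backend: dot = {}", BACKEND_NAME, dot(&a, &b));
}
```

Because the three predicates are mutually exclusive and exhaustive, a build can never contain two backends or none, which is what lets consumer code stay identical across architectures.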
Some operations have a "small N: scalar, large N: SIMD" crossover that varies per arch. The snippet below is **pseudocode**: Rust's stable const-eval does not let `match` discriminate over a runtime-detected `Arch::CURRENT` value at `const` context. The real mechanism is a `build.rs` script that resolves the target from compile-time metadata Cargo exposes to build scripts (`CARGO_CFG_TARGET_ARCH`, `CARGO_CFG_TARGET_FEATURE`, the target triple, and any pre-recorded calibration artifact) and emits the chosen integer as a generated `const` in `OUT_DIR`. **Critically, do NOT probe the host CPU inside `build.rs`**: under cross-compilation Cargo runs `build.rs` on the *host* machine, so any runtime feature-detection there reflects the build machine and not the target. Cargo's docs are explicit: use `CARGO_CFG_*` env vars (which correctly reflect the target) rather than `cfg!`/`#[cfg]` checks (which reflect the host the script runs on).

+Some operations (DCT-II vs GEMM, basin-lookup width, etc.) have a "small N: scalar path, large N: SIMD path" crossover whose break-even N varies per backend. The crossover lives in the **same polyfill** as the SIMD primitives: a `cfg(target_feature = ...)`-selected `const`.

```rust
-// Shape of the per-arch table (lives in a build-script-generated file
-// included via include!(concat!(env!("OUT_DIR"), "/arch_crossovers.rs"))):
+// src/hpc/dct_crossover.rs: one const per backend file, cfg-selected
        dct_gemm_path(input, output) // calls into ndarray::simd::*
    } else {
-        dct_butterfly_path(input, output)
+        dct_butterfly_path(input, output) // also calls into ndarray::simd::*
    }
}
```
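
Since the excerpt above shows only the tail of the block, here is a self-contained sketch of the same idea under stated assumptions: one `DCT_BATCH_CROSSOVER` const per backend, selected by `cfg`, and a size-based branch that consults it. The integer values, the `DctPath` enum, and `choose_dct_path` are hypothetical, and the direction of the comparison (GEMM at or above the crossover) is an assumption, since the original condition is not visible in the diff.

```rust
// Sketch only: the integers are placeholders, not measured crossovers.
#[cfg(all(target_arch = "x86_64", target_feature = "avx512f"))]
pub const DCT_BATCH_CROSSOVER: usize = 64;

#[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
pub const DCT_BATCH_CROSSOVER: usize = 32;

#[cfg(not(any(
    all(target_arch = "x86_64", target_feature = "avx512f"),
    all(target_arch = "aarch64", target_feature = "neon"),
)))]
pub const DCT_BATCH_CROSSOVER: usize = 16;

/// Which algorithm a batch of a given size is routed to (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DctPath {
    Gemm,
    Butterfly,
}

/// Branches on problem size only, never on detected CPU features.
/// Assumption: the GEMM path is preferred at or above the crossover.
pub fn choose_dct_path(batch: usize) -> DctPath {
    if batch >= DCT_BATCH_CROSSOVER {
        DctPath::Gemm
    } else {
        DctPath::Butterfly
    }
}

fn main() {
    for batch in [1, DCT_BATCH_CROSSOVER, 4 * DCT_BATCH_CROSSOVER] {
        println!("batch {batch}: {:?}", choose_dct_path(batch));
    }
}
```

The `if` here is an algorithmic choice driven by input size against a compile-time constant, so it does not reintroduce the runtime CPU dispatch that §3 forbids.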

-R-5 commits these crossovers as **bench-tunable constants** emitted by Plan G's codec-bench calibration sub-target into the per-arch `OUT_DIR` file: not hand-guessed numbers, not a runtime `match` on a synthetic `Arch` enum, and never via host CPU probing under cross-compilation. The build script (driven by `CARGO_CFG_TARGET_*`) is the single source of truth for which integer compiles in.
+The integer `DCT_BATCH_CROSSOVER` comes from one of two places:
+1. **Hand-tuned default**: a known-good number per backend, checked into the backend file.
+2. **Plan G calibration override**: `build.rs` may consult `CARGO_CFG_TARGET_FEATURE` + a pre-recorded calibration artifact from `codec-bench` and emit a refined const into `OUT_DIR`, included by the backend file. This is still compile-time selection: the build script never probes the host CPU, only reads Cargo's target-config env vars.
+
+Either way the constant is **fixed in the compiled binary**. R-5 commits these crossovers as bench-tunable but compile-time-fixed; the `cfg(target_feature)`-selected backend file is the single source of truth.
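
The calibration override in option 2 could be sketched as a `build.rs` along the following lines. This is an illustration under assumptions, not the project's script: the output file name `dct_crossover.rs`, the default integers, and the `CROSSOVER_OVERRIDE` environment variable (standing in for a calibration artifact) are all hypothetical. Only `CARGO_CFG_TARGET_ARCH`, `CARGO_CFG_TARGET_FEATURE`, and `OUT_DIR` are real variables that Cargo sets for build scripts.

```rust
// build.rs sketch. Hypothetical: the `dct_crossover.rs` file name, the
// default integers, and the CROSSOVER_OVERRIDE env var.
use std::{env, fs, path::PathBuf};

/// Picks a crossover from *target* metadata only (never host CPU probing).
fn pick_crossover(target_arch: &str, target_features: &str, calibrated: Option<usize>) -> usize {
    if let Some(n) = calibrated {
        return n; // a pre-recorded calibration value wins over the defaults
    }
    // CARGO_CFG_TARGET_FEATURE is a comma-separated list of enabled features.
    let has = |feat: &str| target_features.split(',').any(|f| f == feat);
    match target_arch {
        "x86_64" if has("avx512f") => 64,
        "aarch64" if has("neon") => 32,
        _ => 16,
    }
}

/// Renders the Rust source that a backend file could pull in via `include!`.
fn render_const(value: usize) -> String {
    format!("pub const DCT_BATCH_CROSSOVER: usize = {value};\n")
}

fn main() {
    // Cargo sets these for build scripts; they describe the target, not the host.
    let arch = env::var("CARGO_CFG_TARGET_ARCH").unwrap_or_default();
    let features = env::var("CARGO_CFG_TARGET_FEATURE").unwrap_or_default();
    // Stand-in for reading a calibration artifact from disk.
    let calibrated = env::var("CROSSOVER_OVERRIDE")
        .ok()
        .and_then(|s| s.parse::<usize>().ok());

    let source = render_const(pick_crossover(&arch, &features, calibrated));

    match env::var_os("OUT_DIR") {
        // Under Cargo: write the generated file next to other build outputs.
        Some(dir) => fs::write(PathBuf::from(dir).join("dct_crossover.rs"), source)
            .expect("failed to write generated const"),
        // Outside Cargo: just show what would have been generated.
        None => print!("{source}"),
    }
    println!("cargo:rerun-if-env-changed=CROSSOVER_OVERRIDE");
}
```

A backend file would then pull the value in with `include!(concat!(env!("OUT_DIR"), "/dct_crossover.rs"));`. Which backend file does the including is still decided by `cfg`, so the script only refines the number and never selects the path.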