Skip to content

Commit 8d3c1d7

Browse files
authored
Merge pull request #171 from AdaWorldAPI/claude/simd-dispatch-architecture-doc
docs(simd): dispatch architecture + parity matrix + tech debt + integration plan
2 parents 207fc20 + 9016621 commit 8d3c1d7

1 file changed

Lines changed: 305 additions & 0 deletions

File tree

Lines changed: 305 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,305 @@
1+
# SIMD Dispatch Architecture — design, parity, tech debt, integration plan
2+
3+
> Date: 2026-05-20 · Status: design v1 (post PR #170 PR-X12 A1 discussion).
4+
> Companion to: `vertical-simd-consumer-contract.md` (W1a consumer contract),
5+
> `databend-ndarray-simd-prompt.md`, `ndarray-simd-trojan-horse-prompt.md`.
6+
7+
## 1. Why this exists
8+
9+
`ndarray::simd::*` is the single public surface every cognitive-shader,
10+
splat, codec, BLAS, and FFI consumer reaches for. The current dispatch
11+
in `src/simd.rs` is **compile-time-only** with arms keyed off
12+
`target_feature = "avx512f"` / `target_arch = "aarch64"` / scalar
13+
fallback. `.cargo/config.toml` pins `target-cpu = x86-64-v4`, baking
14+
AVX-512 into every compiled artifact.
15+
16+
The consequence surfaced on PR #170 (`tests/1.95.0` CI run
17+
[26151746204/76920666348](https://github.com/AdaWorldAPI/ndarray/actions/runs/26151746204/job/76920666348)):
18+
**38 tests in `simd_avx2`, `simd_amx`, `simd_ops`, `simd_soa` SIGILL** on
19+
a GitHub runner without AVX-512 silicon, all timing out uniformly
20+
~19 s — the symptom of "binary cannot execute" rather than assertion
21+
failure. The same configuration also leaves `simd_nightly/*` (the
22+
portable-SIMD polyfill backend) unreachable because no dispatch arm in
23+
`simd.rs` re-exports from it.
24+
25+
This document pins the target architecture, captures the parity gaps,
26+
ranks the technical debt, and sequences the integration.
27+
28+
## 2. Dispatch model — three build configs, one runtime mode
29+
30+
Each build mode is a **conscious cargo invocation** via a distinct
31+
`.cargo/config*.toml`. No silent fallbacks, no surprise hardware
32+
mismatch. Whoever builds with `v3` / `v4` / `native` / `nightly-simd`
33+
chose it deliberately.
34+
35+
| Config file | `target-cpu` | Dispatch strategy | Default? | Use case |
36+
|---|---|---|---|---|
37+
| `.cargo/config.toml` | `x86-64-v3` (AVX2) | compile-time → `simd_avx2` | ✅ default, GitHub CI | portable artifact across all x86_64 silicon ≥ 2013 |
38+
| `.cargo/config-avx512.toml` | `x86-64-v4` (AVX-512) | compile-time → `simd_avx512` | explicit | benchmarking, AVX-512 deployment |
39+
| `.cargo/config-native.toml` | `native` | compile-time, build-machine CPUID resolved at rustc invocation → whatever arm matches the build host | explicit | developer machine builds |
40+
| `.cargo/config-nightly.toml` (+ `--features nightly-simd`) | `x86-64-v3` (or any) | compile-time → `simd_nightly` (`std::simd::*` polyfill) | explicit | miri / cargo-careful / portable-SIMD experiments |
41+
42+
The aarch64 path is automatic: any `target_arch = "aarch64"` build
43+
selects `simd_neon` regardless of the config above.
44+
45+
**Runtime LazyLock dispatch** is a separate, fifth opt-in mode used
46+
when shipping a single release binary that must adapt at process
47+
start across heterogeneous deployment silicon (one binary running on
48+
AVX-512 + AVX2-only machines from the same artifact). It compiles all
49+
backends in and uses `LazyLock<CpuCaps>` trampolines. Reserved for the
50+
release-binary distribution path; never the dev / CI default.
51+
52+
### Dispatch precedence in `simd.rs`
53+
54+
Compile-time arms read like a cascade, **not** like priority overrides
55+
— each cargo config sets exactly one `target_feature` / `feature` such
56+
that exactly one arm matches. The order below is the source-of-truth
57+
ranking the compiler walks:
58+
59+
```rust
60+
// 1. Explicit portable-SIMD polyfill (nightly + opt-in feature).
61+
// No `target_arch` constraint — `core::simd` is portable, so this
62+
// arm is the one true backend on wasm32 / riscv / any other target
63+
// as soon as `nightly-simd` is on. Keeping it unconditional on
64+
// `feature = "nightly-simd"` is what makes the `not(feature =
65+
// "nightly-simd")` exclusion on every other arm sound.
66+
#[cfg(feature = "nightly-simd")]
67+
pub use crate::simd_nightly::{F32x16, F64x8, U8x32, U8x64, U16x32, U32x16, U64x8, I8x32, I8x64, I16x16, I16x32, I32x16, I64x8, F32Mask16, F64Mask8, BF16x16, BF16x8};
68+
69+
// 2. AVX-512 (target_feature = "avx512f"; set by `v4` and `native` configs on AVX-512 hosts)
70+
#[cfg(all(target_arch = "x86_64", target_feature = "avx512f", not(feature = "nightly-simd")))]
71+
pub use crate::simd_avx512::{...};
72+
73+
// 3. AVX2 baseline (the v3 / GitHub-CI default)
74+
#[cfg(all(target_arch = "x86_64", target_feature = "avx2", not(target_feature = "avx512f"), not(feature = "nightly-simd")))]
75+
pub use crate::simd_avx2::{...};
76+
77+
// 4. NEON (aarch64)
78+
#[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))]
79+
pub use crate::simd_neon::aarch64_simd::{...};
80+
81+
// 5. Scalar fallback (everything else: wasm32, riscv, x86_64 without
82+
// AVX2, etc.). The predicate is the negation of arms 1-4 so that
83+
// *exactly one* arm matches on every (target, feature) pair.
84+
#[cfg(not(any(
85+
feature = "nightly-simd",
86+
all(target_arch = "x86_64", target_feature = "avx2"),
87+
target_arch = "aarch64",
88+
)))]
89+
pub use scalar::{...};
90+
```
91+
92+
Runtime dispatch via `LazyLock<CpuCaps>` lives in a separate
93+
`simd_runtime` module (TBD per § 7.1) reached by a `--features runtime-dispatch`
94+
flag, mutually exclusive with the compile-time arms above.
95+
96+
## 3. Module roles
97+
98+
```
99+
crate::simd::* ← user-facing key registry (re-exports only)
100+
101+
├── simd.rs = dispatch arms; no implementation, only `pub use`
102+
103+
├── simd_ops.rs = slice-level ops over crate::simd::* primitives
104+
│ (add_f32, scale_f64, array_chunks, …)
105+
106+
├── simd_avx512.rs = __m512* values, native 512-bit registers
107+
├── simd_avx2.rs = __m256* values + (F32x16, F64x8) as two-half
108+
│ wrappers (struct F32x16(pub f32x8, pub f32x8))
109+
├── simd_neon.rs = float32x4_t / uint64x2_t natives + larger shapes
110+
│ composed as [float32x4_t; 4] etc.
111+
├── simd_nightly/ = std::simd::* polyfill — portable, miri-executable
112+
│ ├── f32_types.rs F32x16, F32x8
113+
│ ├── f64_types.rs F64x8, F64x4
114+
│ ├── u8_types.rs U8x64, U8x32
115+
│ ├── u_word_types.rs U16x32, U32x16, U64x8
116+
│ ├── i8_types.rs I8x64, I8x32
117+
│ ├── i_word_types.rs I16x16, I16x32, I32x16, I64x8
118+
│ ├── bf16_types.rs BF16x16, BF16x8
119+
│ ├── f16_types.rs F16x16
120+
│ ├── masks.rs F32Mask16, F32Mask8, F64Mask4, F64Mask8
121+
│ └── ops.rs op impls
122+
└── scalar (inline `mod scalar` in simd.rs)
123+
= pure-Rust fallback for unknown arch
124+
```
125+
126+
Every `simd_<arch>.rs` is just a SOURCE of typed primitives. `simd.rs`
127+
chooses the source; the cargo config chooses how `simd.rs` chooses.
128+
129+
## 4. Parity matrix — typed lane primitives per backend
130+
131+
Legend: ✅ native, 🟡 composed wrapper (two-half / four-quarter), 🔵
132+
scalar polyfill via `core::simd`, ❌ missing, ⛔ N/A for this arch.
133+
134+
| Lane type | `simd_avx512` (v4) | `simd_avx2` (v3) | `simd_neon` (aarch64) | `simd_nightly` | `scalar` |
135+
|---|---|---|---|---|---|
136+
| `F32x16` |`__m512` | 🟡 `(f32x8, f32x8)` | 🟡 `[float32x4_t; 4]` | 🔵 `core::simd::f32x16` |`[f32; 16]` |
137+
| `F32x8` |`__m256` ||| 🔵 ||
138+
| `F64x8` |`__m512d` | 🟡 `(f64x4, f64x4)` | 🟡 `[float64x2_t; 4]` | 🔵 ||
139+
| `F64x4` |`__m256d` ||| 🔵 ||
140+
| `U8x64` |`__m512i` ||| 🔵 ||
141+
| `U8x32` |`__m256i` |`__m256i` || 🔵 ||
142+
| `U16x32` |`__m512i` ||| 🔵 ||
143+
| `U32x16` |`__m512i` ||| 🔵 ||
144+
| `U64x8` |`__m512i` ||| 🔵 ||
145+
| `I8x32` |`__m256i` ||| 🔵 ||
146+
| `I8x64` |`__m512i` ||| 🔵 ||
147+
| `I16x16` |`__m256i` ||| 🔵 ||
148+
| `I16x32` |`__m512i` ||| 🔵 ||
149+
| `I32x16` |`__m512i` ||| 🔵 ||
150+
| `I64x8` |`__m512i` ||| 🔵 ||
151+
| `BF16x8` |`__m128bh` ||| 🔵 ||
152+
| `BF16x16` |`__m256bh` ||| 🔵 ||
153+
| `F16x16` || 🟡 `F16Scaler` (scalar) || 🔵 ||
154+
| `F32Mask16` |`__mmask16` |`u16` bitmask |`u16` bitmask | 🔵 ||
155+
| `F64Mask8` |`__mmask8` |`u8` bitmask |`u8` bitmask | 🔵 ||
156+
157+
**Aarch64-native narrower types** (only useful directly when the
158+
consumer wants 128-bit shapes): `I8x16`, `I16x8`, `U8x16`, `U16x8`,
159+
`U32x4`, `U64x2`, `I32x4`, `I64x2`. These are not in the cross-arch
160+
parity surface — consumers requesting 256-bit / 512-bit shapes go
161+
through the composed wrappers.
162+
163+
### Read of the matrix
164+
165+
- **F32x16 + F64x8 are universal** — all four backends ship them. Hot
166+
paths can rely on these without branching.
167+
- **`simd_avx2` is the bottleneck.** It only exposes `F32x16`, `F64x8`,
168+
`F32Mask16`, `F64Mask8`, `U8x32`, and an `F16Scaler`. Every other
169+
cross-arch lane is missing — making the v3 default config crash any
170+
consumer that reaches for `U64x8`, `I32x16`, `U16x32`, etc.
171+
- **NEON is even sparser** at the 256/512-bit level.
172+
- **`simd_nightly` is the most complete** but is unreachable today
173+
because `simd.rs` has no arm wiring `feature = "nightly-simd"` to its
174+
re-exports.
175+
- **`scalar`** has comprehensive cover and is the safest fallback for
176+
any arch the others miss, but lives inline in `simd.rs` rather than
177+
in a dedicated `simd_scalar.rs`. Symmetry would help.
178+
179+
## 5. Technical debt matrix
180+
181+
Ranked by P0 (blocks current CI / consumers) → P3 (nice-to-have).
182+
183+
| ID | Severity | Description | Detection | Fix scope |
184+
|---|---|---|---|---|
185+
| **TD-SIMD-1** | **P0** | `.cargo/config.toml` defaults to `x86-64-v4` → every CI runner without AVX-512 silicon SIGILLs on the first SIMD op. 38 tests fail at 19 s timeout each on `tests/1.95.0`. | PR #170 CI run | Change default to `x86-64-v3`; add `.cargo/config-avx512.toml` for the opt-in AVX-512 path. ~5 LoC. |
186+
| **TD-SIMD-2** | **P0** | `simd_avx2.rs` ships `F32x16`/`F64x8`/`U8x32` only. Consumers requesting `U64x8`, `I32x16`, `U16x32`, `BF16x16`, etc. fail to compile on the v3 path. | grep `pub use crate::simd_avx2::` then cross-ref against the parity matrix | Add the missing types as two-half wrappers (`U64x8(pub u64x4, pub u64x4)` etc.) over native `__m256i` halves. ~500 LoC. |
187+
| **TD-SIMD-3** | **P1** | `simd.rs` has no dispatch arm for `#[cfg(feature = "nightly-simd")]` → the `simd_nightly` polyfill is unreachable. miri / cargo-careful jobs that should exercise the portable path fall through to whatever cfg cascade matches, never to `std::simd::*`. | grep `simd_nightly` in `simd.rs` (returns 0 dispatch arms) | Add the `feature = "nightly-simd"` arm at the top of the cascade per § 2. ~30 LoC. |
188+
| **TD-SIMD-4** | **P1** | `simd_neon.rs` only ships `F32x16` / `F64x8` cross-arch shapes. Consumers reaching for `U8x64`, `U64x8`, `I32x16`, etc. on aarch64 have no path. | grep + parity matrix | Compose larger shapes from native NEON 128-bit lanes (`U8x64([uint8x16_t; 4])`, `U64x8([uint64x2_t; 4])`, etc.). ~400 LoC. |
189+
| **TD-SIMD-5** | **P1** | Scalar fallback inline in `simd.rs` (`pub(crate) mod scalar`) makes symmetry hard — every other backend is its own file. | inspection | Promote to `src/simd_scalar.rs`; `simd.rs` becomes pure dispatch. ~mechanical refactor. |
190+
| **TD-SIMD-6** | **P2** | No `runtime-dispatch` feature / `simd_runtime` module exists yet. Release-binary distribution to heterogeneous silicon requires recompile per target today. | `grep -r "LazyLock<CpuCaps>"` only matches reporting code in `simd.rs:52-55` | New module wiring per-op trampolines from the compiled-in backends. ~300 LoC + one new cargo feature. |
191+
| **TD-SIMD-7** | **P2** | Compile-time arms in `simd.rs:153-194` are duplicated four times (one per type group: F32x16, F64x8, U8x32, BF16x16). Adding a new lane requires copy-pasting four `#[cfg(...)]` arms. | inspection | Single source-of-truth macro emitting the arms. ~one macro_rules!, 50 LoC. |
192+
| **TD-SIMD-8** | **P2** | `F16Scaler` in `simd_avx2.rs:2566` is a scalar implementation masquerading as a SIMD type. Consumers using `F16x16` on v3 get scalar perf without warning. | grep `F16Scaler` | Either gate `F16x16` behind `target_feature = "f16c"` or rename / document the scalar nature. ~20 LoC + docs. |
193+
| **TD-SIMD-9** | **P3** | No CI matrix entry for the `nightly-simd` polyfill path. | `.github/workflows/ci.yaml` | Add a `nightly-simd-polyfill` job that builds with `--features nightly-simd` on nightly rustc. ~20 LoC YAML. |
194+
| **TD-SIMD-10** | **P3** | No CI matrix entry for `.cargo/config-avx512.toml`. AVX-512 deployment path silently bit-rots between PRs. | `.github/workflows/ci.yaml` | Add an `avx-512-explicit` job using a runner with AVX-512 silicon. ~20 LoC YAML; runner availability TBD. |
195+
196+
## 6. Integration plan — sequenced sprints
197+
198+
Each phase is a single-PR worker (sized for one Sonnet impl-sprint per
199+
the `.claude/EN/agents/worker-template.md` shape). Phases sequence so
200+
each lands a green CI; the next phase depends only on shipped state.
201+
202+
### Phase 1 — Unblock CI (P0 fixes)
203+
204+
**Goal:** GitHub `tests/1.95.0` job green. The default `.cargo/config.toml`
205+
build runs end-to-end on AVX2-only silicon.
206+
207+
| # | Worker | Scope | Files | Acceptance |
208+
|---|---|---|---|---|
209+
| 1.1 | flip baseline | Change `target-cpu` from `v4``v3`. Add `.cargo/config-avx512.toml` with the old `v4` value. | `.cargo/config.toml`, `.cargo/config-avx512.toml` | `cargo check` clean on default; `tests/1.95.0` no longer SIGILLs |
210+
| 1.2 | AVX2 two-half wrappers — float | Add `U8x64`, `U64x8`, `U32x16`, `U16x32`, `I8x32`, `I8x64`, `I16x16`, `I16x32`, `I32x16`, `I64x8` as two-half wrappers over native AVX2 `__m256i` halves. | `src/simd_avx2.rs` | per-type parity test vs `simd_avx512` on AVX-512 host; per-type unit test on AVX2-only |
211+
| 1.3 | simd.rs dispatch refresh | Add the AVX2 cfg arm wiring the new wrappers; tighten existing arms with the new precedence (per § 2). | `src/simd.rs` | `cargo check --features approx,serde,rayon` clean on default config; `cargo check` clean on `--config .cargo/config-avx512.toml` |
212+
213+
After Phase 1, PR #170 (PR-X12 A1) and any future consumer PR ships
214+
green CI by default. AVX-512 testing becomes an explicit job.
215+
216+
### Phase 2 — Unblock the polyfill (P1: `nightly-simd`)
217+
218+
**Goal:** `cargo +nightly check --features nightly-simd` reaches
219+
`simd_nightly/*` via `crate::simd::*`. miri can execute the portable
220+
path.
221+
222+
| # | Worker | Scope | Files | Acceptance |
223+
|---|---|---|---|---|
224+
| 2.1 | nightly-simd dispatch arm | Add `#[cfg(feature = "nightly-simd")]` arms in `simd.rs` re-exporting every typed lane from `crate::simd_nightly::*`. | `src/simd.rs` | `crate::simd::F32x16` resolves to `core::simd::f32x16` under the feature |
225+
| 2.2 | nightly-simd parity tests | Run the existing simd_ops / simd_soa test suite against the polyfill backend. | `src/simd_nightly/tests.rs` | all simd_ops + simd_soa tests pass under `--features nightly-simd` |
226+
| 2.3 | CI matrix | Add `nightly-simd-polyfill` job to `.github/workflows/ci.yaml`. | `.github/workflows/ci.yaml` | job green on nightly rustc with the feature |
227+
228+
### Phase 3 — NEON parity (P1)
229+
230+
**Goal:** aarch64 build reaches the same cross-arch lane set as the v3
231+
config.
232+
233+
| # | Worker | Scope | Files | Acceptance |
234+
|---|---|---|---|---|
235+
| 3.1 | NEON quartet wrappers | Compose `U8x64`, `U64x8`, `U32x16`, `U16x32`, `I8x32`, `I8x64`, `I16x16`, `I16x32`, `I32x16`, `I64x8` from native 128-bit NEON lanes. | `src/simd_neon.rs` | parity vs `simd_avx2` two-half wrappers on a 16-pair fixture |
236+
| 3.2 | simd.rs aarch64 arms | Extend `aarch64` arms to re-export the new types. | `src/simd.rs` | `cargo check --target aarch64-unknown-linux-gnu` clean |
237+
238+
### Phase 4 — Symmetry + ergonomics (P1/P2)
239+
240+
| # | Worker | Scope | Files | Acceptance |
241+
|---|---|---|---|---|
242+
| 4.1 | scalar → file | Promote `mod scalar` to `src/simd_scalar.rs`. | `src/simd.rs`, new `src/simd_scalar.rs` | no behaviour change; `cargo check` clean on all configs |
243+
| 4.2 | dispatch macro | Collapse the 4× duplicated `#[cfg(...)]` blocks into one macro. | `src/simd.rs` | adding a new lane type is one macro invocation |
244+
| 4.3 | F16 honesty | Either rename `F16Scaler` or gate `F16x16` behind `f16c`. | `src/simd_avx2.rs` | scalar perf no longer surprises hot-path consumers |
245+
246+
### Phase 5 — Runtime dispatch (P2, opt-in)
247+
248+
**Goal:** ship-once binaries that adapt across heterogeneous deployment
249+
silicon.
250+
251+
| # | Worker | Scope | Files | Acceptance |
252+
|---|---|---|---|---|
253+
| 5.1 | `simd_runtime` module | New module compiling all backends in and selecting per-op trampolines via `LazyLock<CpuCaps>`. | `src/simd_runtime.rs` | one binary runs on AVX-512 + AVX2-only hosts from the same artifact |
254+
| 5.2 | feature flag | New `runtime-dispatch` cargo feature, mutually exclusive with `nightly-simd`. | `Cargo.toml`, `src/simd.rs` | `cargo check --features runtime-dispatch` clean on the v3 baseline |
255+
| 5.3 | CI matrix | Add a `runtime-dispatch-portable` job. | `.github/workflows/ci.yaml` | job green |
256+
257+
### Phase 6 — CI matrix for explicit AVX-512 (P3)
258+
259+
| # | Worker | Scope | Files | Acceptance |
260+
|---|---|---|---|---|
261+
| 6.1 | AVX-512 explicit job | Add `avx-512-explicit` to `.github/workflows/ci.yaml` using `--config .cargo/config-avx512.toml`. Requires AVX-512-capable runner. | `.github/workflows/ci.yaml` | green on the AVX-512 runner |
262+
263+
## 7. Open questions
264+
265+
1. **Runtime trampoline cost class.** Phase 5's per-op indirection
266+
adds one indirect call per `F32x16::add(...)`. Acceptable for the
267+
typical 100+ cycle SIMD-op cost, but consumer benchmarks should
268+
sanity-check before declaring the path production-ready.
269+
2. **`feature = "nightly-simd"` precedence.** § 2 puts it at the top
270+
of the cascade; alternative reading is "polyfill is for miri only,
271+
so put it BELOW the arch-specific arms and only fire on non-x86_64,
272+
non-aarch64 targets." The current proposal matches the user's
273+
"explicit opt-in wins" framing; revisit if there's a use case for
274+
`--features nightly-simd` on an AVX-512 host wanting the AVX-512
275+
path.
276+
3. **AMX status.** `simd_amx.rs` (Sapphire Rapids+ tile ops) is
277+
x86_64-only and orthogonal to the F32x16 / U8x64 cross-arch surface.
278+
Out of scope for this document; tracked under PR-X10 A6
279+
(`linalg::distance`) follow-ups.
280+
281+
## 8. Cross-references
282+
283+
- `.claude/knowledge/vertical-simd-consumer-contract.md` — W1a consumer
284+
contract every new SIMD primitive follows (struct methods on typed
285+
wrappers, three-backend parity test, saturating/overflow semantics
286+
documented).
287+
- `.claude/knowledge/databend-ndarray-simd-prompt.md` — Databend
288+
integration consumer of `crate::simd::*`.
289+
- `.claude/knowledge/ndarray-simd-trojan-horse-prompt.md` — ClickHouse +
290+
Tantivy injection plan; depends on Phase 1 + 2 landing.
291+
- `src/simd.rs` lines 52-55 — existing `is_x86_feature_detected!`
292+
reporting (NOT dispatch) — repurpose for Phase 5 trampoline.
293+
- `src/simd_nightly/mod.rs` lines 37-44 — complete `pub use` set
294+
ready to be wired into `simd.rs` dispatch (Phase 2).
295+
296+
## 9. TL;DR
297+
298+
Default cargo config drops to **`x86-64-v3`** (AVX2) → GitHub CI green by
299+
default. **`.cargo/config-avx512.toml`** is the explicit AVX-512 path.
300+
`simd_avx2.rs` needs ~10 missing two-half wrappers (P0, Phase 1).
301+
`simd.rs` needs a `nightly-simd` dispatch arm so `simd_nightly/*`
302+
becomes reachable (P1, Phase 2). NEON gets quartet wrappers (P1, Phase
303+
3). Scalar / macros / runtime-dispatch / explicit-AVX-512 CI are
304+
P2-P3 follow-ups (Phases 4-6). Each phase is one PR; landing
305+
in order keeps every step green.

0 commit comments

Comments
 (0)