Merge pull request #171 from AdaWorldAPI/claude/simd-dispatch-architecture-doc

AdaWorldAPI · web-flow · commit 8d3c1d7006d5 · 2026-05-20T13:39:13.000+02:00
docs(simd): dispatch architecture + parity matrix + tech debt + integration plan
diff --git a/.claude/knowledge/simd-dispatch-architecture.md b/.claude/knowledge/simd-dispatch-architecture.md
@@ -0,0 +1,305 @@
+# SIMD Dispatch Architecture — design, parity, tech debt, integration plan
+
+> Date: 2026-05-20 · Status: design v1 (post PR #170 PR-X12 A1 discussion).
+> Companion to: `vertical-simd-consumer-contract.md` (W1a consumer contract),
+> `databend-ndarray-simd-prompt.md`, `ndarray-simd-trojan-horse-prompt.md`.
+
+## 1. Why this exists
+
+`ndarray::simd::*` is the single public surface every cognitive-shader,
+splat, codec, BLAS, and FFI consumer reaches for. The current dispatch
+in `src/simd.rs` is **compile-time-only** with arms keyed off
+`target_feature = "avx512f"` / `target_arch = "aarch64"` / scalar
+fallback. `.cargo/config.toml` pins `target-cpu = x86-64-v4`, baking
+AVX-512 into every compiled artifact.
+
+The consequence surfaced on PR #170 (`tests/1.95.0` CI run
+[26151746204/76920666348](https://github.com/AdaWorldAPI/ndarray/actions/runs/26151746204/job/76920666348)):
+**38 tests in `simd_avx2`, `simd_amx`, `simd_ops`, `simd_soa` SIGILL** on
+a GitHub runner without AVX-512 silicon, all timing out uniformly
+~19 s — the symptom of "binary cannot execute" rather than assertion
+failure. The same configuration also leaves `simd_nightly/*` (the
+portable-SIMD polyfill backend) unreachable because no dispatch arm in
+`simd.rs` re-exports from it.
+
+This document pins the target architecture, captures the parity gaps,
+ranks the technical debt, and sequences the integration.
+
+## 2. Dispatch model — three build configs, one runtime mode
+
+Each build mode is a **conscious cargo invocation** via a distinct
+`.cargo/config*.toml`. No silent fallbacks, no surprise hardware
+mismatch. Whoever builds with `v3` / `v4` / `native` / `nightly-simd`
+chose it deliberately.
+
+| Config file | `target-cpu` | Dispatch strategy | Default? | Use case |
+|---|---|---|---|---|
+| `.cargo/config.toml` | `x86-64-v3` (AVX2) | compile-time → `simd_avx2` | ✅ default, GitHub CI | portable artifact across all x86_64 silicon ≥ 2013 |
+| `.cargo/config-avx512.toml` | `x86-64-v4` (AVX-512) | compile-time → `simd_avx512` | explicit | benchmarking, AVX-512 deployment |
+| `.cargo/config-native.toml` | `native` | compile-time, build-machine CPUID resolved at rustc invocation → whatever arm matches the build host | explicit | developer machine builds |
+| `.cargo/config-nightly.toml` (+ `--features nightly-simd`) | `x86-64-v3` (or any) | compile-time → `simd_nightly` (`std::simd::*` polyfill) | explicit | miri / cargo-careful / portable-SIMD experiments |
+
+The aarch64 path is automatic: any `target_arch = "aarch64"` build
+selects `simd_neon` regardless of the config above.
+
+**Runtime LazyLock dispatch** is a separate, fifth opt-in mode used
+when shipping a single release binary that must adapt at process
+start across heterogeneous deployment silicon (one binary running on
+AVX-512 + AVX2-only machines from the same artifact). It compiles all
+backends in and uses `LazyLock<CpuCaps>` trampolines. Reserved for the
+release-binary distribution path; never the dev / CI default.
+
+### Dispatch precedence in `simd.rs`
+
+Compile-time arms read like a cascade, **not** like priority overrides
+— each cargo config sets exactly one `target_feature` / `feature` such
+that exactly one arm matches. The order below is the source-of-truth
+ranking the compiler walks:
+
+```rust
+// 1. Explicit portable-SIMD polyfill (nightly + opt-in feature).
+//    No `target_arch` constraint — `core::simd` is portable, so this
+//    arm is the one true backend on wasm32 / riscv / any other target
+//    as soon as `nightly-simd` is on. Keeping it unconditional on
+//    `feature = "nightly-simd"` is what makes the `not(feature =
+//    "nightly-simd")` exclusion on every other arm sound.
+#[cfg(feature = "nightly-simd")]
+pub use crate::simd_nightly::{F32x16, F64x8, U8x32, U8x64, U16x32, U32x16, U64x8, I8x32, I8x64, I16x16, I16x32, I32x16, I64x8, F32Mask16, F64Mask8, BF16x16, BF16x8};
+
+// 2. AVX-512 (target_feature = "avx512f"; set by `v4` and `native` configs on AVX-512 hosts)
+#[cfg(all(target_arch = "x86_64", target_feature = "avx512f", not(feature = "nightly-simd")))]
+pub use crate::simd_avx512::{...};
+
+// 3. AVX2 baseline (the v3 / GitHub-CI default)
+#[cfg(all(target_arch = "x86_64", target_feature = "avx2", not(target_feature = "avx512f"), not(feature = "nightly-simd")))]
+pub use crate::simd_avx2::{...};
+
+// 4. NEON (aarch64)
+#[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))]
+pub use crate::simd_neon::aarch64_simd::{...};
+
+// 5. Scalar fallback (everything else: wasm32, riscv, x86_64 without
+//    AVX2, etc.). The predicate is the negation of arms 1-4 so that
+//    *exactly one* arm matches on every (target, feature) pair.
+#[cfg(not(any(
+    feature = "nightly-simd",
+    all(target_arch = "x86_64", target_feature = "avx2"),
+    target_arch = "aarch64",
+)))]
+pub use scalar::{...};
+```
+
+Runtime dispatch via `LazyLock<CpuCaps>` lives in a separate
+`simd_runtime` module (TBD per § 7.1) reached by a `--features runtime-dispatch`
+flag, mutually exclusive with the compile-time arms above.
+
+## 3. Module roles
+
+```
+crate::simd::*                ← user-facing key registry (re-exports only)
+        │
+        ├── simd.rs           = dispatch arms; no implementation, only `pub use`
+        │
+        ├── simd_ops.rs       = slice-level ops over crate::simd::* primitives
+        │                       (add_f32, scale_f64, array_chunks, …)
+        │
+        ├── simd_avx512.rs    = __m512* values, native 512-bit registers
+        ├── simd_avx2.rs      = __m256* values + (F32x16, F64x8) as two-half
+        │                       wrappers (struct F32x16(pub f32x8, pub f32x8))
+        ├── simd_neon.rs      = float32x4_t / uint64x2_t natives + larger shapes
+        │                       composed as [float32x4_t; 4] etc.
+        ├── simd_nightly/     = std::simd::* polyfill — portable, miri-executable
+        │   ├── f32_types.rs    F32x16, F32x8
+        │   ├── f64_types.rs    F64x8, F64x4
+        │   ├── u8_types.rs     U8x64, U8x32
+        │   ├── u_word_types.rs U16x32, U32x16, U64x8
+        │   ├── i8_types.rs     I8x64, I8x32
+        │   ├── i_word_types.rs I16x16, I16x32, I32x16, I64x8
+        │   ├── bf16_types.rs   BF16x16, BF16x8
+        │   ├── f16_types.rs    F16x16
+        │   ├── masks.rs        F32Mask16, F32Mask8, F64Mask4, F64Mask8
+        │   └── ops.rs          op impls
+        └── scalar (inline `mod scalar` in simd.rs)
+                              = pure-Rust fallback for unknown arch
+```
+
+Every `simd_<arch>.rs` is just a SOURCE of typed primitives. `simd.rs`
+chooses the source; the cargo config chooses how `simd.rs` chooses.
+
+## 4. Parity matrix — typed lane primitives per backend
+
+Legend: ✅ native, 🟡 composed wrapper (two-half / four-quarter), 🔵
+scalar polyfill via `core::simd`, ❌ missing, ⛔ N/A for this arch.
+
+| Lane type | `simd_avx512` (v4) | `simd_avx2` (v3) | `simd_neon` (aarch64) | `simd_nightly` | `scalar` |
+|---|---|---|---|---|---|
+| `F32x16` | ✅ `__m512` | 🟡 `(f32x8, f32x8)` | 🟡 `[float32x4_t; 4]` | 🔵 `core::simd::f32x16` | ✅ `[f32; 16]` |
+| `F32x8` | ✅ `__m256` | ❌ | ⛔ | 🔵 | ✅ |
+| `F64x8` | ✅ `__m512d` | 🟡 `(f64x4, f64x4)` | 🟡 `[float64x2_t; 4]` | 🔵 | ✅ |
+| `F64x4` | ✅ `__m256d` | ❌ | ⛔ | 🔵 | ✅ |
+| `U8x64` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ |
+| `U8x32` | ✅ `__m256i` | ✅ `__m256i` | ❌ | 🔵 | ✅ |
+| `U16x32` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ |
+| `U32x16` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ |
+| `U64x8` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ |
+| `I8x32` | ✅ `__m256i` | ❌ | ❌ | 🔵 | ✅ |
+| `I8x64` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ |
+| `I16x16` | ✅ `__m256i` | ❌ | ❌ | 🔵 | ✅ |
+| `I16x32` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ |
+| `I32x16` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ |
+| `I64x8` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ |
+| `BF16x8` | ✅ `__m128bh` | ❌ | ❌ | 🔵 | ✅ |
+| `BF16x16` | ✅ `__m256bh` | ❌ | ❌ | 🔵 | ✅ |
+| `F16x16` | ❌ | 🟡 `F16Scaler` (scalar) | ❌ | 🔵 | ✅ |
+| `F32Mask16` | ✅ `__mmask16` | ✅ `u16` bitmask | ✅ `u16` bitmask | 🔵 | ✅ |
+| `F64Mask8` | ✅ `__mmask8` | ✅ `u8` bitmask | ✅ `u8` bitmask | 🔵 | ✅ |
+
+**Aarch64-native narrower types** (only useful directly when the
+consumer wants 128-bit shapes): `I8x16`, `I16x8`, `U8x16`, `U16x8`,
+`U32x4`, `U64x2`, `I32x4`, `I64x2`. These are not in the cross-arch
+parity surface — consumers requesting 256-bit / 512-bit shapes go
+through the composed wrappers.
+
+### Read of the matrix
+
+- **F32x16 + F64x8 are universal** — all four backends ship them. Hot
+  paths can rely on these without branching.
+- **`simd_avx2` is the bottleneck.** It only exposes `F32x16`, `F64x8`,
+  `F32Mask16`, `F64Mask8`, `U8x32`, and an `F16Scaler`. Every other
+  cross-arch lane is missing — making the v3 default config crash any
+  consumer that reaches for `U64x8`, `I32x16`, `U16x32`, etc.
+- **NEON is even sparser** at the 256/512-bit level.
+- **`simd_nightly` is the most complete** but is unreachable today
+  because `simd.rs` has no arm wiring `feature = "nightly-simd"` to its
+  re-exports.
+- **`scalar`** has comprehensive cover and is the safest fallback for
+  any arch the others miss, but lives inline in `simd.rs` rather than
+  in a dedicated `simd_scalar.rs`. Symmetry would help.
+
+## 5. Technical debt matrix
+
+Ranked by P0 (blocks current CI / consumers) → P3 (nice-to-have).
+
+| ID | Severity | Description | Detection | Fix scope |
+|---|---|---|---|---|
+| **TD-SIMD-1** | **P0** | `.cargo/config.toml` defaults to `x86-64-v4` → every CI runner without AVX-512 silicon SIGILLs on the first SIMD op. 38 tests fail at 19 s timeout each on `tests/1.95.0`. | PR #170 CI run | Change default to `x86-64-v3`; add `.cargo/config-avx512.toml` for the opt-in AVX-512 path. ~5 LoC. |
+| **TD-SIMD-2** | **P0** | `simd_avx2.rs` ships `F32x16`/`F64x8`/`U8x32` only. Consumers requesting `U64x8`, `I32x16`, `U16x32`, `BF16x16`, etc. fail to compile on the v3 path. | grep `pub use crate::simd_avx2::` then cross-ref against the parity matrix | Add the missing types as two-half wrappers (`U64x8(pub u64x4, pub u64x4)` etc.) over native `__m256i` halves. ~500 LoC. |
+| **TD-SIMD-3** | **P1** | `simd.rs` has no dispatch arm for `#[cfg(feature = "nightly-simd")]` → the `simd_nightly` polyfill is unreachable. miri / cargo-careful jobs that should exercise the portable path fall through to whatever cfg cascade matches, never to `std::simd::*`. | grep `simd_nightly` in `simd.rs` (returns 0 dispatch arms) | Add the `feature = "nightly-simd"` arm at the top of the cascade per § 2. ~30 LoC. |
+| **TD-SIMD-4** | **P1** | `simd_neon.rs` only ships `F32x16` / `F64x8` cross-arch shapes. Consumers reaching for `U8x64`, `U64x8`, `I32x16`, etc. on aarch64 have no path. | grep + parity matrix | Compose larger shapes from native NEON 128-bit lanes (`U8x64([uint8x16_t; 4])`, `U64x8([uint64x2_t; 4])`, etc.). ~400 LoC. |
+| **TD-SIMD-5** | **P1** | Scalar fallback inline in `simd.rs` (`pub(crate) mod scalar`) makes symmetry hard — every other backend is its own file. | inspection | Promote to `src/simd_scalar.rs`; `simd.rs` becomes pure dispatch. ~mechanical refactor. |
+| **TD-SIMD-6** | **P2** | No `runtime-dispatch` feature / `simd_runtime` module exists yet. Release-binary distribution to heterogeneous silicon requires recompile per target today. | `grep -r "LazyLock<CpuCaps>"` only matches reporting code in `simd.rs:52-55` | New module wiring per-op trampolines from the compiled-in backends. ~300 LoC + one new cargo feature. |
+| **TD-SIMD-7** | **P2** | Compile-time arms in `simd.rs:153-194` are duplicated four times (one per type group: F32x16, F64x8, U8x32, BF16x16). Adding a new lane requires copy-pasting four `#[cfg(...)]` arms. | inspection | Single source-of-truth macro emitting the arms. ~one macro_rules!, 50 LoC. |
+| **TD-SIMD-8** | **P2** | `F16Scaler` in `simd_avx2.rs:2566` is a scalar implementation masquerading as a SIMD type. Consumers using `F16x16` on v3 get scalar perf without warning. | grep `F16Scaler` | Either gate `F16x16` behind `target_feature = "f16c"` or rename / document the scalar nature. ~20 LoC + docs. |
+| **TD-SIMD-9** | **P3** | No CI matrix entry for the `nightly-simd` polyfill path. | `.github/workflows/ci.yaml` | Add a `nightly-simd-polyfill` job that builds with `--features nightly-simd` on nightly rustc. ~20 LoC YAML. |
+| **TD-SIMD-10** | **P3** | No CI matrix entry for `.cargo/config-avx512.toml`. AVX-512 deployment path silently bit-rots between PRs. | `.github/workflows/ci.yaml` | Add an `avx-512-explicit` job using a runner with AVX-512 silicon. ~20 LoC YAML; runner availability TBD. |
+
+## 6. Integration plan — sequenced sprints
+
+Each phase is a single-PR worker (sized for one Sonnet impl-sprint per
+the `.claude/EN/agents/worker-template.md` shape). Phases sequence so
+each lands a green CI; the next phase depends only on shipped state.
+
+### Phase 1 — Unblock CI (P0 fixes)
+
+**Goal:** GitHub `tests/1.95.0` job green. The default `.cargo/config.toml`
+build runs end-to-end on AVX2-only silicon.
+
+| # | Worker | Scope | Files | Acceptance |
+|---|---|---|---|---|
+| 1.1 | flip baseline | Change `target-cpu` from `v4` → `v3`. Add `.cargo/config-avx512.toml` with the old `v4` value. | `.cargo/config.toml`, `.cargo/config-avx512.toml` | `cargo check` clean on default; `tests/1.95.0` no longer SIGILLs |
+| 1.2 | AVX2 two-half wrappers — float | Add `U8x64`, `U64x8`, `U32x16`, `U16x32`, `I8x32`, `I8x64`, `I16x16`, `I16x32`, `I32x16`, `I64x8` as two-half wrappers over native AVX2 `__m256i` halves. | `src/simd_avx2.rs` | per-type parity test vs `simd_avx512` on AVX-512 host; per-type unit test on AVX2-only |
+| 1.3 | simd.rs dispatch refresh | Add the AVX2 cfg arm wiring the new wrappers; tighten existing arms with the new precedence (per § 2). | `src/simd.rs` | `cargo check --features approx,serde,rayon` clean on default config; `cargo check` clean on `--config .cargo/config-avx512.toml` |
+
+After Phase 1, PR #170 (PR-X12 A1) and any future consumer PR ships
+green CI by default. AVX-512 testing becomes an explicit job.
+
+### Phase 2 — Unblock the polyfill (P1: `nightly-simd`)
+
+**Goal:** `cargo +nightly check --features nightly-simd` reaches
+`simd_nightly/*` via `crate::simd::*`. miri can execute the portable
+path.
+
+| # | Worker | Scope | Files | Acceptance |
+|---|---|---|---|---|
+| 2.1 | nightly-simd dispatch arm | Add `#[cfg(feature = "nightly-simd")]` arms in `simd.rs` re-exporting every typed lane from `crate::simd_nightly::*`. | `src/simd.rs` | `crate::simd::F32x16` resolves to `core::simd::f32x16` under the feature |
+| 2.2 | nightly-simd parity tests | Run the existing simd_ops / simd_soa test suite against the polyfill backend. | `src/simd_nightly/tests.rs` | all simd_ops + simd_soa tests pass under `--features nightly-simd` |
+| 2.3 | CI matrix | Add `nightly-simd-polyfill` job to `.github/workflows/ci.yaml`. | `.github/workflows/ci.yaml` | job green on nightly rustc with the feature |
+
+### Phase 3 — NEON parity (P1)
+
+**Goal:** aarch64 build reaches the same cross-arch lane set as the v3
+config.
+
+| # | Worker | Scope | Files | Acceptance |
+|---|---|---|---|---|
+| 3.1 | NEON quartet wrappers | Compose `U8x64`, `U64x8`, `U32x16`, `U16x32`, `I8x32`, `I8x64`, `I16x16`, `I16x32`, `I32x16`, `I64x8` from native 128-bit NEON lanes. | `src/simd_neon.rs` | parity vs `simd_avx2` two-half wrappers on a 16-pair fixture |
+| 3.2 | simd.rs aarch64 arms | Extend `aarch64` arms to re-export the new types. | `src/simd.rs` | `cargo check --target aarch64-unknown-linux-gnu` clean |
+
+### Phase 4 — Symmetry + ergonomics (P1/P2)
+
+| # | Worker | Scope | Files | Acceptance |
+|---|---|---|---|---|
+| 4.1 | scalar → file | Promote `mod scalar` to `src/simd_scalar.rs`. | `src/simd.rs`, new `src/simd_scalar.rs` | no behaviour change; `cargo check` clean on all configs |
+| 4.2 | dispatch macro | Collapse the 4× duplicated `#[cfg(...)]` blocks into one macro. | `src/simd.rs` | adding a new lane type is one macro invocation |
+| 4.3 | F16 honesty | Either rename `F16Scaler` or gate `F16x16` behind `f16c`. | `src/simd_avx2.rs` | scalar perf no longer surprises hot-path consumers |
+
+### Phase 5 — Runtime dispatch (P2, opt-in)
+
+**Goal:** ship-once binaries that adapt across heterogeneous deployment
+silicon.
+
+| # | Worker | Scope | Files | Acceptance |
+|---|---|---|---|---|
+| 5.1 | `simd_runtime` module | New module compiling all backends in and selecting per-op trampolines via `LazyLock<CpuCaps>`. | `src/simd_runtime.rs` | one binary runs on AVX-512 + AVX2-only hosts from the same artifact |
+| 5.2 | feature flag | New `runtime-dispatch` cargo feature, mutually exclusive with `nightly-simd`. | `Cargo.toml`, `src/simd.rs` | `cargo check --features runtime-dispatch` clean on the v3 baseline |
+| 5.3 | CI matrix | Add a `runtime-dispatch-portable` job. | `.github/workflows/ci.yaml` | job green |
+
+### Phase 6 — CI matrix for explicit AVX-512 (P3)
+
+| # | Worker | Scope | Files | Acceptance |
+|---|---|---|---|---|
+| 6.1 | AVX-512 explicit job | Add `avx-512-explicit` to `.github/workflows/ci.yaml` using `--config .cargo/config-avx512.toml`. Requires AVX-512-capable runner. | `.github/workflows/ci.yaml` | green on the AVX-512 runner |
+
+## 7. Open questions
+
+1. **Runtime trampoline cost class.** Phase 5's per-op indirection
+   adds one indirect call per `F32x16::add(...)`. Acceptable for the
+   typical 100+ cycle SIMD-op cost, but consumer benchmarks should
+   sanity-check before declaring the path production-ready.
+2. **`feature = "nightly-simd"` precedence.** § 2 puts it at the top
+   of the cascade; alternative reading is "polyfill is for miri only,
+   so put it BELOW the arch-specific arms and only fire on non-x86_64,
+   non-aarch64 targets." The current proposal matches the user's
+   "explicit opt-in wins" framing; revisit if there's a use case for
+   `--features nightly-simd` on an AVX-512 host wanting the AVX-512
+   path.
+3. **AMX status.** `simd_amx.rs` (Sapphire Rapids+ tile ops) is
+   x86_64-only and orthogonal to the F32x16 / U8x64 cross-arch surface.
+   Out of scope for this document; tracked under PR-X10 A6
+   (`linalg::distance`) follow-ups.
+
+## 8. Cross-references
+
+- `.claude/knowledge/vertical-simd-consumer-contract.md` — W1a consumer
+  contract every new SIMD primitive follows (struct methods on typed
+  wrappers, three-backend parity test, saturating/overflow semantics
+  documented).
+- `.claude/knowledge/databend-ndarray-simd-prompt.md` — Databend
+  integration consumer of `crate::simd::*`.
+- `.claude/knowledge/ndarray-simd-trojan-horse-prompt.md` — ClickHouse +
+  Tantivy injection plan; depends on Phase 1 + 2 landing.
+- `src/simd.rs` lines 52-55 — existing `is_x86_feature_detected!`
+  reporting (NOT dispatch) — repurpose for Phase 5 trampoline.
+- `src/simd_nightly/mod.rs` lines 37-44 — complete `pub use` set
+  ready to be wired into `simd.rs` dispatch (Phase 2).
+
+## 9. TL;DR
+
+Default cargo config drops to **`x86-64-v3`** (AVX2) → GitHub CI green by
+default. **`.cargo/config-avx512.toml`** is the explicit AVX-512 path.
+`simd_avx2.rs` needs ~10 missing two-half wrappers (P0, Phase 1).
+`simd.rs` needs a `nightly-simd` dispatch arm so `simd_nightly/*`
+becomes reachable (P1, Phase 2). NEON gets quartet wrappers (P1, Phase
+3). Scalar / macros / runtime-dispatch / explicit-AVX-512 CI are
+P2-P3 follow-ups (Phases 4-6). Each phase is one PR; landing
+in order keeps every step green.