|
| 1 | +# SIMD Dispatch Architecture — design, parity, tech debt, integration plan |
| 2 | + |
| 3 | +> Date: 2026-05-20 · Status: design v1 (post PR #170 PR-X12 A1 discussion). |
| 4 | +> Companion to: `vertical-simd-consumer-contract.md` (W1a consumer contract), |
| 5 | +> `databend-ndarray-simd-prompt.md`, `ndarray-simd-trojan-horse-prompt.md`. |
| 6 | +
|
| 7 | +## 1. Why this exists |
| 8 | + |
| 9 | +`ndarray::simd::*` is the single public surface every cognitive-shader, |
| 10 | +splat, codec, BLAS, and FFI consumer reaches for. The current dispatch |
| 11 | +in `src/simd.rs` is **compile-time-only** with arms keyed off |
| 12 | +`target_feature = "avx512f"` / `target_arch = "aarch64"` / scalar |
| 13 | +fallback. `.cargo/config.toml` pins `target-cpu = x86-64-v4`, baking |
| 14 | +AVX-512 into every compiled artifact. |
| 15 | + |
| 16 | +The consequence surfaced on PR #170 (`tests/1.95.0` CI run |
| 17 | +[26151746204/76920666348](https://github.com/AdaWorldAPI/ndarray/actions/runs/26151746204/job/76920666348)): |
| 18 | +**38 tests in `simd_avx2`, `simd_amx`, `simd_ops`, `simd_soa` SIGILL** on |
| 19 | +a GitHub runner without AVX-512 silicon, all timing out uniformly |
| 20 | +~19 s — the symptom of "binary cannot execute" rather than assertion |
| 21 | +failure. The same configuration also leaves `simd_nightly/*` (the |
| 22 | +portable-SIMD polyfill backend) unreachable because no dispatch arm in |
| 23 | +`simd.rs` re-exports from it. |
| 24 | + |
| 25 | +This document pins the target architecture, captures the parity gaps, |
| 26 | +ranks the technical debt, and sequences the integration. |
| 27 | + |
| 28 | +## 2. Dispatch model — three build configs, one runtime mode |
| 29 | + |
| 30 | +Each build mode is a **conscious cargo invocation** via a distinct |
| 31 | +`.cargo/config*.toml`. No silent fallbacks, no surprise hardware |
| 32 | +mismatch. Whoever builds with `v3` / `v4` / `native` / `nightly-simd` |
| 33 | +chose it deliberately. |
| 34 | + |
| 35 | +| Config file | `target-cpu` | Dispatch strategy | Default? | Use case | |
| 36 | +|---|---|---|---|---| |
| 37 | +| `.cargo/config.toml` | `x86-64-v3` (AVX2) | compile-time → `simd_avx2` | ✅ default, GitHub CI | portable artifact across all x86_64 silicon ≥ 2013 | |
| 38 | +| `.cargo/config-avx512.toml` | `x86-64-v4` (AVX-512) | compile-time → `simd_avx512` | explicit | benchmarking, AVX-512 deployment | |
| 39 | +| `.cargo/config-native.toml` | `native` | compile-time, build-machine CPUID resolved at rustc invocation → whatever arm matches the build host | explicit | developer machine builds | |
| 40 | +| `.cargo/config-nightly.toml` (+ `--features nightly-simd`) | `x86-64-v3` (or any) | compile-time → `simd_nightly` (`std::simd::*` polyfill) | explicit | miri / cargo-careful / portable-SIMD experiments | |
| 41 | + |
| 42 | +The aarch64 path is automatic: any `target_arch = "aarch64"` build |
| 43 | +selects `simd_neon` regardless of the config above. |
| 44 | + |
| 45 | +**Runtime LazyLock dispatch** is a separate, fifth opt-in mode used |
| 46 | +when shipping a single release binary that must adapt at process |
| 47 | +start across heterogeneous deployment silicon (one binary running on |
| 48 | +AVX-512 + AVX2-only machines from the same artifact). It compiles all |
| 49 | +backends in and uses `LazyLock<CpuCaps>` trampolines. Reserved for the |
| 50 | +release-binary distribution path; never the dev / CI default. |
| 51 | + |
| 52 | +### Dispatch precedence in `simd.rs` |
| 53 | + |
| 54 | +Compile-time arms read like a cascade, **not** like priority overrides |
| 55 | +— each cargo config sets exactly one `target_feature` / `feature` such |
| 56 | +that exactly one arm matches. The order below is the source-of-truth |
| 57 | +ranking the compiler walks: |
| 58 | + |
| 59 | +```rust |
| 60 | +// 1. Explicit portable-SIMD polyfill (nightly + opt-in feature). |
| 61 | +// No `target_arch` constraint — `core::simd` is portable, so this |
| 62 | +// arm is the one true backend on wasm32 / riscv / any other target |
| 63 | +// as soon as `nightly-simd` is on. Keeping it unconditional on |
| 64 | +// `feature = "nightly-simd"` is what makes the `not(feature = |
| 65 | +// "nightly-simd")` exclusion on every other arm sound. |
| 66 | +#[cfg(feature = "nightly-simd")] |
| 67 | +pub use crate::simd_nightly::{F32x16, F64x8, U8x32, U8x64, U16x32, U32x16, U64x8, I8x32, I8x64, I16x16, I16x32, I32x16, I64x8, F32Mask16, F64Mask8, BF16x16, BF16x8}; |
| 68 | + |
| 69 | +// 2. AVX-512 (target_feature = "avx512f"; set by `v4` and `native` configs on AVX-512 hosts) |
| 70 | +#[cfg(all(target_arch = "x86_64", target_feature = "avx512f", not(feature = "nightly-simd")))] |
| 71 | +pub use crate::simd_avx512::{...}; |
| 72 | + |
| 73 | +// 3. AVX2 baseline (the v3 / GitHub-CI default) |
| 74 | +#[cfg(all(target_arch = "x86_64", target_feature = "avx2", not(target_feature = "avx512f"), not(feature = "nightly-simd")))] |
| 75 | +pub use crate::simd_avx2::{...}; |
| 76 | + |
| 77 | +// 4. NEON (aarch64) |
| 78 | +#[cfg(all(target_arch = "aarch64", not(feature = "nightly-simd")))] |
| 79 | +pub use crate::simd_neon::aarch64_simd::{...}; |
| 80 | + |
| 81 | +// 5. Scalar fallback (everything else: wasm32, riscv, x86_64 without |
| 82 | +// AVX2, etc.). The predicate is the negation of arms 1-4 so that |
| 83 | +// *exactly one* arm matches on every (target, feature) pair. |
| 84 | +#[cfg(not(any( |
| 85 | + feature = "nightly-simd", |
| 86 | + all(target_arch = "x86_64", target_feature = "avx2"), |
| 87 | + target_arch = "aarch64", |
| 88 | +)))] |
| 89 | +pub use scalar::{...}; |
| 90 | +``` |
| 91 | + |
| 92 | +Runtime dispatch via `LazyLock<CpuCaps>` lives in a separate |
| 93 | +`simd_runtime` module (TBD per § 7.1) reached by a `--features runtime-dispatch` |
| 94 | +flag, mutually exclusive with the compile-time arms above. |
| 95 | + |
| 96 | +## 3. Module roles |
| 97 | + |
| 98 | +``` |
| 99 | +crate::simd::* ← user-facing key registry (re-exports only) |
| 100 | + │ |
| 101 | + ├── simd.rs = dispatch arms; no implementation, only `pub use` |
| 102 | + │ |
| 103 | + ├── simd_ops.rs = slice-level ops over crate::simd::* primitives |
| 104 | + │ (add_f32, scale_f64, array_chunks, …) |
| 105 | + │ |
| 106 | + ├── simd_avx512.rs = __m512* values, native 512-bit registers |
| 107 | + ├── simd_avx2.rs = __m256* values + (F32x16, F64x8) as two-half |
| 108 | + │ wrappers (struct F32x16(pub f32x8, pub f32x8)) |
| 109 | + ├── simd_neon.rs = float32x4_t / uint64x2_t natives + larger shapes |
| 110 | + │ composed as [float32x4_t; 4] etc. |
| 111 | + ├── simd_nightly/ = std::simd::* polyfill — portable, miri-executable |
| 112 | + │ ├── f32_types.rs F32x16, F32x8 |
| 113 | + │ ├── f64_types.rs F64x8, F64x4 |
| 114 | + │ ├── u8_types.rs U8x64, U8x32 |
| 115 | + │ ├── u_word_types.rs U16x32, U32x16, U64x8 |
| 116 | + │ ├── i8_types.rs I8x64, I8x32 |
| 117 | + │ ├── i_word_types.rs I16x16, I16x32, I32x16, I64x8 |
| 118 | + │ ├── bf16_types.rs BF16x16, BF16x8 |
| 119 | + │ ├── f16_types.rs F16x16 |
| 120 | + │ ├── masks.rs F32Mask16, F32Mask8, F64Mask4, F64Mask8 |
| 121 | + │ └── ops.rs op impls |
| 122 | + └── scalar (inline `mod scalar` in simd.rs) |
| 123 | + = pure-Rust fallback for unknown arch |
| 124 | +``` |
| 125 | + |
| 126 | +Every `simd_<arch>.rs` is just a SOURCE of typed primitives. `simd.rs` |
| 127 | +chooses the source; the cargo config chooses how `simd.rs` chooses. |
| 128 | + |
| 129 | +## 4. Parity matrix — typed lane primitives per backend |
| 130 | + |
| 131 | +Legend: ✅ native, 🟡 composed wrapper (two-half / four-quarter), 🔵 |
| 132 | +scalar polyfill via `core::simd`, ❌ missing, ⛔ N/A for this arch. |
| 133 | + |
| 134 | +| Lane type | `simd_avx512` (v4) | `simd_avx2` (v3) | `simd_neon` (aarch64) | `simd_nightly` | `scalar` | |
| 135 | +|---|---|---|---|---|---| |
| 136 | +| `F32x16` | ✅ `__m512` | 🟡 `(f32x8, f32x8)` | 🟡 `[float32x4_t; 4]` | 🔵 `core::simd::f32x16` | ✅ `[f32; 16]` | |
| 137 | +| `F32x8` | ✅ `__m256` | ❌ | ⛔ | 🔵 | ✅ | |
| 138 | +| `F64x8` | ✅ `__m512d` | 🟡 `(f64x4, f64x4)` | 🟡 `[float64x2_t; 4]` | 🔵 | ✅ | |
| 139 | +| `F64x4` | ✅ `__m256d` | ❌ | ⛔ | 🔵 | ✅ | |
| 140 | +| `U8x64` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ | |
| 141 | +| `U8x32` | ✅ `__m256i` | ✅ `__m256i` | ❌ | 🔵 | ✅ | |
| 142 | +| `U16x32` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ | |
| 143 | +| `U32x16` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ | |
| 144 | +| `U64x8` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ | |
| 145 | +| `I8x32` | ✅ `__m256i` | ❌ | ❌ | 🔵 | ✅ | |
| 146 | +| `I8x64` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ | |
| 147 | +| `I16x16` | ✅ `__m256i` | ❌ | ❌ | 🔵 | ✅ | |
| 148 | +| `I16x32` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ | |
| 149 | +| `I32x16` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ | |
| 150 | +| `I64x8` | ✅ `__m512i` | ❌ | ❌ | 🔵 | ✅ | |
| 151 | +| `BF16x8` | ✅ `__m128bh` | ❌ | ❌ | 🔵 | ✅ | |
| 152 | +| `BF16x16` | ✅ `__m256bh` | ❌ | ❌ | 🔵 | ✅ | |
| 153 | +| `F16x16` | ❌ | 🟡 `F16Scaler` (scalar) | ❌ | 🔵 | ✅ | |
| 154 | +| `F32Mask16` | ✅ `__mmask16` | ✅ `u16` bitmask | ✅ `u16` bitmask | 🔵 | ✅ | |
| 155 | +| `F64Mask8` | ✅ `__mmask8` | ✅ `u8` bitmask | ✅ `u8` bitmask | 🔵 | ✅ | |
| 156 | + |
| 157 | +**Aarch64-native narrower types** (only useful directly when the |
| 158 | +consumer wants 128-bit shapes): `I8x16`, `I16x8`, `U8x16`, `U16x8`, |
| 159 | +`U32x4`, `U64x2`, `I32x4`, `I64x2`. These are not in the cross-arch |
| 160 | +parity surface — consumers requesting 256-bit / 512-bit shapes go |
| 161 | +through the composed wrappers. |
| 162 | + |
| 163 | +### Read of the matrix |
| 164 | + |
| 165 | +- **F32x16 + F64x8 are universal** — all four backends ship them. Hot |
| 166 | + paths can rely on these without branching. |
| 167 | +- **`simd_avx2` is the bottleneck.** It only exposes `F32x16`, `F64x8`, |
| 168 | + `F32Mask16`, `F64Mask8`, `U8x32`, and an `F16Scaler`. Every other |
| 169 | + cross-arch lane is missing — making the v3 default config crash any |
| 170 | + consumer that reaches for `U64x8`, `I32x16`, `U16x32`, etc. |
| 171 | +- **NEON is even sparser** at the 256/512-bit level. |
| 172 | +- **`simd_nightly` is the most complete** but is unreachable today |
| 173 | + because `simd.rs` has no arm wiring `feature = "nightly-simd"` to its |
| 174 | + re-exports. |
| 175 | +- **`scalar`** has comprehensive cover and is the safest fallback for |
| 176 | + any arch the others miss, but lives inline in `simd.rs` rather than |
| 177 | + in a dedicated `simd_scalar.rs`. Symmetry would help. |
| 178 | + |
| 179 | +## 5. Technical debt matrix |
| 180 | + |
| 181 | +Ranked by P0 (blocks current CI / consumers) → P3 (nice-to-have). |
| 182 | + |
| 183 | +| ID | Severity | Description | Detection | Fix scope | |
| 184 | +|---|---|---|---|---| |
| 185 | +| **TD-SIMD-1** | **P0** | `.cargo/config.toml` defaults to `x86-64-v4` → every CI runner without AVX-512 silicon SIGILLs on the first SIMD op. 38 tests fail at 19 s timeout each on `tests/1.95.0`. | PR #170 CI run | Change default to `x86-64-v3`; add `.cargo/config-avx512.toml` for the opt-in AVX-512 path. ~5 LoC. | |
| 186 | +| **TD-SIMD-2** | **P0** | `simd_avx2.rs` ships `F32x16`/`F64x8`/`U8x32` only. Consumers requesting `U64x8`, `I32x16`, `U16x32`, `BF16x16`, etc. fail to compile on the v3 path. | grep `pub use crate::simd_avx2::` then cross-ref against the parity matrix | Add the missing types as two-half wrappers (`U64x8(pub u64x4, pub u64x4)` etc.) over native `__m256i` halves. ~500 LoC. | |
| 187 | +| **TD-SIMD-3** | **P1** | `simd.rs` has no dispatch arm for `#[cfg(feature = "nightly-simd")]` → the `simd_nightly` polyfill is unreachable. miri / cargo-careful jobs that should exercise the portable path fall through to whatever cfg cascade matches, never to `std::simd::*`. | grep `simd_nightly` in `simd.rs` (returns 0 dispatch arms) | Add the `feature = "nightly-simd"` arm at the top of the cascade per § 2. ~30 LoC. | |
| 188 | +| **TD-SIMD-4** | **P1** | `simd_neon.rs` only ships `F32x16` / `F64x8` cross-arch shapes. Consumers reaching for `U8x64`, `U64x8`, `I32x16`, etc. on aarch64 have no path. | grep + parity matrix | Compose larger shapes from native NEON 128-bit lanes (`U8x64([uint8x16_t; 4])`, `U64x8([uint64x2_t; 4])`, etc.). ~400 LoC. | |
| 189 | +| **TD-SIMD-5** | **P1** | Scalar fallback inline in `simd.rs` (`pub(crate) mod scalar`) makes symmetry hard — every other backend is its own file. | inspection | Promote to `src/simd_scalar.rs`; `simd.rs` becomes pure dispatch. ~mechanical refactor. | |
| 190 | +| **TD-SIMD-6** | **P2** | No `runtime-dispatch` feature / `simd_runtime` module exists yet. Release-binary distribution to heterogeneous silicon requires recompile per target today. | `grep -r "LazyLock<CpuCaps>"` only matches reporting code in `simd.rs:52-55` | New module wiring per-op trampolines from the compiled-in backends. ~300 LoC + one new cargo feature. | |
| 191 | +| **TD-SIMD-7** | **P2** | Compile-time arms in `simd.rs:153-194` are duplicated four times (one per type group: F32x16, F64x8, U8x32, BF16x16). Adding a new lane requires copy-pasting four `#[cfg(...)]` arms. | inspection | Single source-of-truth macro emitting the arms. ~one macro_rules!, 50 LoC. | |
| 192 | +| **TD-SIMD-8** | **P2** | `F16Scaler` in `simd_avx2.rs:2566` is a scalar implementation masquerading as a SIMD type. Consumers using `F16x16` on v3 get scalar perf without warning. | grep `F16Scaler` | Either gate `F16x16` behind `target_feature = "f16c"` or rename / document the scalar nature. ~20 LoC + docs. | |
| 193 | +| **TD-SIMD-9** | **P3** | No CI matrix entry for the `nightly-simd` polyfill path. | `.github/workflows/ci.yaml` | Add a `nightly-simd-polyfill` job that builds with `--features nightly-simd` on nightly rustc. ~20 LoC YAML. | |
| 194 | +| **TD-SIMD-10** | **P3** | No CI matrix entry for `.cargo/config-avx512.toml`. AVX-512 deployment path silently bit-rots between PRs. | `.github/workflows/ci.yaml` | Add an `avx-512-explicit` job using a runner with AVX-512 silicon. ~20 LoC YAML; runner availability TBD. | |
| 195 | + |
| 196 | +## 6. Integration plan — sequenced sprints |
| 197 | + |
| 198 | +Each phase is a single-PR worker (sized for one Sonnet impl-sprint per |
| 199 | +the `.claude/EN/agents/worker-template.md` shape). Phases sequence so |
| 200 | +each lands a green CI; the next phase depends only on shipped state. |
| 201 | + |
| 202 | +### Phase 1 — Unblock CI (P0 fixes) |
| 203 | + |
| 204 | +**Goal:** GitHub `tests/1.95.0` job green. The default `.cargo/config.toml` |
| 205 | +build runs end-to-end on AVX2-only silicon. |
| 206 | + |
| 207 | +| # | Worker | Scope | Files | Acceptance | |
| 208 | +|---|---|---|---|---| |
| 209 | +| 1.1 | flip baseline | Change `target-cpu` from `v4` → `v3`. Add `.cargo/config-avx512.toml` with the old `v4` value. | `.cargo/config.toml`, `.cargo/config-avx512.toml` | `cargo check` clean on default; `tests/1.95.0` no longer SIGILLs | |
| 210 | +| 1.2 | AVX2 two-half wrappers — float | Add `U8x64`, `U64x8`, `U32x16`, `U16x32`, `I8x32`, `I8x64`, `I16x16`, `I16x32`, `I32x16`, `I64x8` as two-half wrappers over native AVX2 `__m256i` halves. | `src/simd_avx2.rs` | per-type parity test vs `simd_avx512` on AVX-512 host; per-type unit test on AVX2-only | |
| 211 | +| 1.3 | simd.rs dispatch refresh | Add the AVX2 cfg arm wiring the new wrappers; tighten existing arms with the new precedence (per § 2). | `src/simd.rs` | `cargo check --features approx,serde,rayon` clean on default config; `cargo check` clean on `--config .cargo/config-avx512.toml` | |
| 212 | + |
| 213 | +After Phase 1, PR #170 (PR-X12 A1) and any future consumer PR ships |
| 214 | +green CI by default. AVX-512 testing becomes an explicit job. |
| 215 | + |
| 216 | +### Phase 2 — Unblock the polyfill (P1: `nightly-simd`) |
| 217 | + |
| 218 | +**Goal:** `cargo +nightly check --features nightly-simd` reaches |
| 219 | +`simd_nightly/*` via `crate::simd::*`. miri can execute the portable |
| 220 | +path. |
| 221 | + |
| 222 | +| # | Worker | Scope | Files | Acceptance | |
| 223 | +|---|---|---|---|---| |
| 224 | +| 2.1 | nightly-simd dispatch arm | Add `#[cfg(feature = "nightly-simd")]` arms in `simd.rs` re-exporting every typed lane from `crate::simd_nightly::*`. | `src/simd.rs` | `crate::simd::F32x16` resolves to `core::simd::f32x16` under the feature | |
| 225 | +| 2.2 | nightly-simd parity tests | Run the existing simd_ops / simd_soa test suite against the polyfill backend. | `src/simd_nightly/tests.rs` | all simd_ops + simd_soa tests pass under `--features nightly-simd` | |
| 226 | +| 2.3 | CI matrix | Add `nightly-simd-polyfill` job to `.github/workflows/ci.yaml`. | `.github/workflows/ci.yaml` | job green on nightly rustc with the feature | |
| 227 | + |
| 228 | +### Phase 3 — NEON parity (P1) |
| 229 | + |
| 230 | +**Goal:** aarch64 build reaches the same cross-arch lane set as the v3 |
| 231 | +config. |
| 232 | + |
| 233 | +| # | Worker | Scope | Files | Acceptance | |
| 234 | +|---|---|---|---|---| |
| 235 | +| 3.1 | NEON quartet wrappers | Compose `U8x64`, `U64x8`, `U32x16`, `U16x32`, `I8x32`, `I8x64`, `I16x16`, `I16x32`, `I32x16`, `I64x8` from native 128-bit NEON lanes. | `src/simd_neon.rs` | parity vs `simd_avx2` two-half wrappers on a 16-pair fixture | |
| 236 | +| 3.2 | simd.rs aarch64 arms | Extend `aarch64` arms to re-export the new types. | `src/simd.rs` | `cargo check --target aarch64-unknown-linux-gnu` clean | |
| 237 | + |
| 238 | +### Phase 4 — Symmetry + ergonomics (P1/P2) |
| 239 | + |
| 240 | +| # | Worker | Scope | Files | Acceptance | |
| 241 | +|---|---|---|---|---| |
| 242 | +| 4.1 | scalar → file | Promote `mod scalar` to `src/simd_scalar.rs`. | `src/simd.rs`, new `src/simd_scalar.rs` | no behaviour change; `cargo check` clean on all configs | |
| 243 | +| 4.2 | dispatch macro | Collapse the 4× duplicated `#[cfg(...)]` blocks into one macro. | `src/simd.rs` | adding a new lane type is one macro invocation | |
| 244 | +| 4.3 | F16 honesty | Either rename `F16Scaler` or gate `F16x16` behind `f16c`. | `src/simd_avx2.rs` | scalar perf no longer surprises hot-path consumers | |
| 245 | + |
| 246 | +### Phase 5 — Runtime dispatch (P2, opt-in) |
| 247 | + |
| 248 | +**Goal:** ship-once binaries that adapt across heterogeneous deployment |
| 249 | +silicon. |
| 250 | + |
| 251 | +| # | Worker | Scope | Files | Acceptance | |
| 252 | +|---|---|---|---|---| |
| 253 | +| 5.1 | `simd_runtime` module | New module compiling all backends in and selecting per-op trampolines via `LazyLock<CpuCaps>`. | `src/simd_runtime.rs` | one binary runs on AVX-512 + AVX2-only hosts from the same artifact | |
| 254 | +| 5.2 | feature flag | New `runtime-dispatch` cargo feature, mutually exclusive with `nightly-simd`. | `Cargo.toml`, `src/simd.rs` | `cargo check --features runtime-dispatch` clean on the v3 baseline | |
| 255 | +| 5.3 | CI matrix | Add a `runtime-dispatch-portable` job. | `.github/workflows/ci.yaml` | job green | |
| 256 | + |
| 257 | +### Phase 6 — CI matrix for explicit AVX-512 (P3) |
| 258 | + |
| 259 | +| # | Worker | Scope | Files | Acceptance | |
| 260 | +|---|---|---|---|---| |
| 261 | +| 6.1 | AVX-512 explicit job | Add `avx-512-explicit` to `.github/workflows/ci.yaml` using `--config .cargo/config-avx512.toml`. Requires AVX-512-capable runner. | `.github/workflows/ci.yaml` | green on the AVX-512 runner | |
| 262 | + |
| 263 | +## 7. Open questions |
| 264 | + |
| 265 | +1. **Runtime trampoline cost class.** Phase 5's per-op indirection |
| 266 | + adds one indirect call per `F32x16::add(...)`. Acceptable for the |
| 267 | + typical 100+ cycle SIMD-op cost, but consumer benchmarks should |
| 268 | + sanity-check before declaring the path production-ready. |
| 269 | +2. **`feature = "nightly-simd"` precedence.** § 2 puts it at the top |
| 270 | + of the cascade; alternative reading is "polyfill is for miri only, |
| 271 | + so put it BELOW the arch-specific arms and only fire on non-x86_64, |
| 272 | + non-aarch64 targets." The current proposal matches the user's |
| 273 | + "explicit opt-in wins" framing; revisit if there's a use case for |
| 274 | + `--features nightly-simd` on an AVX-512 host wanting the AVX-512 |
| 275 | + path. |
| 276 | +3. **AMX status.** `simd_amx.rs` (Sapphire Rapids+ tile ops) is |
| 277 | + x86_64-only and orthogonal to the F32x16 / U8x64 cross-arch surface. |
| 278 | + Out of scope for this document; tracked under PR-X10 A6 |
| 279 | + (`linalg::distance`) follow-ups. |
| 280 | + |
| 281 | +## 8. Cross-references |
| 282 | + |
| 283 | +- `.claude/knowledge/vertical-simd-consumer-contract.md` — W1a consumer |
| 284 | + contract every new SIMD primitive follows (struct methods on typed |
| 285 | + wrappers, three-backend parity test, saturating/overflow semantics |
| 286 | + documented). |
| 287 | +- `.claude/knowledge/databend-ndarray-simd-prompt.md` — Databend |
| 288 | + integration consumer of `crate::simd::*`. |
| 289 | +- `.claude/knowledge/ndarray-simd-trojan-horse-prompt.md` — ClickHouse + |
| 290 | + Tantivy injection plan; depends on Phase 1 + 2 landing. |
| 291 | +- `src/simd.rs` lines 52-55 — existing `is_x86_feature_detected!` |
| 292 | + reporting (NOT dispatch) — repurpose for Phase 5 trampoline. |
| 293 | +- `src/simd_nightly/mod.rs` lines 37-44 — complete `pub use` set |
| 294 | + ready to be wired into `simd.rs` dispatch (Phase 2). |
| 295 | + |
| 296 | +## 9. TL;DR |
| 297 | + |
| 298 | +Default cargo config drops to **`x86-64-v3`** (AVX2) → GitHub CI green by |
| 299 | +default. **`.cargo/config-avx512.toml`** is the explicit AVX-512 path. |
| 300 | +`simd_avx2.rs` needs ~10 missing two-half wrappers (P0, Phase 1). |
| 301 | +`simd.rs` needs a `nightly-simd` dispatch arm so `simd_nightly/*` |
| 302 | +becomes reachable (P1, Phase 2). NEON gets quartet wrappers (P1, Phase |
| 303 | +3). Scalar / macros / runtime-dispatch / explicit-AVX-512 CI are |
| 304 | +P2-P3 follow-ups (Phases 4-6). Each phase is one PR; landing |
| 305 | +in order keeps every step green. |
0 commit comments