Skip to content

Commit b0fcc07

Browse files
committed
docs: fix dispatch model in refactor docs — cfg routing, not runtime detection
UNIFIED_REFACTOR_SEQUENCE.md: - Add "Dispatch Model" section documenting compile-time cfg(target_feature) routing - Wave 1 contract: replace "three backends" with correct per-file impl rule - Replace rules 6-7 with: no is_x86_feature_detected, no #[target_feature(enable)] - Wave 5: reframe as "delete dead detection code" not "unify runtime singleton" - Add rules 9-10 (don't touch simd_avx2.rs, don't reach for rayon) REFACTOR_HPC_INTEGRATION.md: - §3.2: replace LazyLock<CpuCaps> proposal with "delete 877 lines of dead code" - Architecture diagram: "backend dispatch" → "cfg(target_feature) routing" - Phase C execution order updated to match Keeps: all type bridges, extension traits, SoA cascade, Wave sequencing, W1a primitive specs, VPABSB correction, palette-256 priority, NEON 2×128-bit, Arrow integration, codegen macros, namespace restructure, effort estimates. https://claude.ai/code/session_01EHNZhSmJ52FGyDxtCFgzXo
1 parent 9b80bc4 commit b0fcc07

2 files changed

Lines changed: 103 additions & 54 deletions

File tree

REFACTOR_HPC_INTEGRATION.md

Lines changed: 35 additions & 36 deletions
Original file line numberDiff line numberDiff line change
@@ -36,26 +36,32 @@ The refactoring creates **bidirectional bridges** without removing the raw layer
3636
│ Array<f32, Ix2>, ArrayView, Zip, Broadcasting │
3737
└───────────────┬──────────────────────▲──────────────┘
3838
│ │
39-
.as_slice() backend dispatch
40-
39+
.as_slice() cfg(target_feature) routing
40+
(compile-time, zero-cost)
4141
▼ │
4242
┌─────────────────────────────────────────────────────┐
4343
│ hpc/ bridge layer (NEW) │
4444
│ Extension traits on ArrayBase<S, D> │
4545
│ From/Into impls for domain types │
46-
│ Core reductions route to SIMD backends
46+
│ Core reductions call typed SIMD wrappers
4747
└───────────────┬──────────────────────▲──────────────┘
4848
│ │
4949
delegates to implements
5050
│ │
5151
▼ │
5252
┌─────────────────────────────────────────────────────┐
5353
│ hpc/ raw compute (unchanged) │
54-
│ &[u8], &[u64], Fingerprint<N>, SIMD dispatch
54+
│ &[u8], &[u64], Fingerprint<N>, typed SIMD
5555
│ K0/K1/K2, BF16 GEMM, VNNI, VML │
56+
│ Uses crate::simd::* → resolves to simd_avx512.rs │
5657
└─────────────────────────────────────────────────────┘
5758
```
5859

60+
**Dispatch model**: `crate::simd::U64x8` resolves at compile time via
61+
`cfg(target_feature = "avx512f")` in `src/simd.rs` to `simd_avx512::U64x8`.
62+
No runtime detection, no match, no branching. The target-cpu pin in
63+
`.cargo/config.toml` makes the cfg gate TRUE at compile time.
64+
5965
---
6066

6167
## Refactoring Shopping List
@@ -608,44 +614,37 @@ faster. Zero API change for users.
608614

609615
---
610616

611-
#### 3.2 Unify SIMD Detection (Delete Duplicates)
617+
#### 3.2 Delete Dead SIMD Detection Code
612618

613-
**Files to merge**: `src/hpc/simd_dispatch.rs` (362 lines), `src/hpc/simd_caps.rs` (515 lines)
614-
**Into**: `src/simd.rs` (the existing core SIMD module)
619+
**Files**: `src/hpc/simd_dispatch.rs` (362 lines), `src/hpc/simd_caps.rs` (515 lines)
615620

616-
**Current state**: Three overlapping detection systems:
617-
1. `src/simd.rs``LazyLock<Tier>` — detects AVX-512/AVX2/NEON
618-
2. `src/hpc/simd_caps.rs``CpuCaps` struct — detects same capabilities
619-
3. `src/hpc/simd_dispatch.rs``SimdDispatch` — another LazyLock with function pointers
621+
**Current state**: Three overlapping detection systems exist:
622+
1. `src/simd.rs``LazyLock<Tier>` + `cfg(target_feature)` re-exports
623+
2. `src/hpc/simd_caps.rs``CpuCaps` struct with runtime detection
624+
3. `src/hpc/simd_dispatch.rs``SimdDispatch` with function pointers
620625

621-
**Transform**:
626+
**Reality with target-cpu pinned**: When `.cargo/config.toml` pins
627+
`target-cpu=sapphirerapids`, all `cfg(target_feature = "avx512f")` gates resolve
628+
TRUE at compile time. `simd.rs` re-exports route directly to `simd_avx512.rs`.
629+
The `LazyLock<Tier>` is dead code (const-folded away). The CpuCaps struct and
630+
SimdDispatch function pointers are completely unreachable.
622631

623-
```rust
624-
// 1. In src/simd.rs, export the unified detection:
625-
pub static CPU_CAPS: LazyLock<CpuCaps> = LazyLock::new(|| detect_cpu_caps());
626-
627-
pub struct CpuCaps {
628-
pub tier: Tier, // existing
629-
pub avx512f: bool,
630-
pub avx512bw: bool,
631-
pub avx512vpopcntdq: bool,
632-
pub avx512vnni: bool,
633-
pub avx2: bool,
634-
pub fma: bool,
635-
pub popcnt: bool,
636-
// ARM
637-
pub neon: bool,
638-
pub sve: bool,
639-
}
632+
**Transform**:
640633

641-
// 2. In src/hpc/simd_dispatch.rs, replace with:
642-
pub use crate::simd::CPU_CAPS;
643-
// Keep function-pointer dispatch but source caps from unified singleton
634+
```
635+
1. Delete src/hpc/simd_caps.rs (515 lines — all dead under cfg pin)
636+
2. Delete src/hpc/simd_dispatch.rs (362 lines — all dead under cfg pin)
637+
3. Leave src/simd.rs as-is (its cfg gates ARE the dispatch mechanism)
638+
```
644639

645-
// 3. Delete src/hpc/simd_caps.rs (or make it a thin re-export)
640+
If CI fallback (no target-cpu pin) is needed, gate these behind:
641+
```rust
642+
#[cfg(not(target_feature = "avx512f"))]
643+
mod simd_caps; // only compiles when features aren't pinned
646644
```
647645

648-
**Result**: One CPUID check, one atomic, one struct — shared by core and all hpc modules.
646+
**Result**: -877 lines of dead code. The dispatch mechanism is `cfg(target_feature)`
647+
in `simd.rs` — no runtime anything.
649648

650649
---
651650

@@ -913,8 +912,8 @@ Phase B (2 weeks) — Extension Traits:
913912
2.5 SimdMath trait (VML) (standalone)
914913
915914
Phase C (2 weeks) — Backend Wiring:
916-
3.1 Core sum/mean → SIMD dispatch (depends on 3.2)
917-
3.2 Unified SIMD detection (standalone, delete duplicates)
915+
3.1 Core sum/mean → typed SIMD (standalone — calls crate::simd::F32x16 directly)
916+
3.2 Delete dead detection code (standalone, -877 lines unreachable under cfg pin)
918917
3.3 INT8 Matmul via trait (depends on 1.3 pattern)
919918
920919
Phase D (1 week) — View Factories:

UNIFIED_REFACTOR_SEQUENCE.md

Lines changed: 68 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,35 @@
1111
1212
---
1313

14+
## Dispatch Model (the ground truth)
15+
16+
All SIMD in this repo resolves via **compile-time `cfg(target_feature)` routing**:
17+
18+
```
19+
.cargo/config.toml pins target-cpu=sapphirerapids
20+
→ cfg(target_feature = "avx512f") = TRUE at compile time
21+
→ simd.rs re-exports resolve DIRECTLY to simd_avx512.rs types
22+
→ zero runtime detection, zero branching, zero LazyLock in hot path
23+
24+
Consumer writes: crate::simd::U64x8
25+
simd.rs routes: pub use crate::simd_avx512::U64x8; // compile-time, no match
26+
```
27+
28+
The `LazyLock<Tier>` in `simd.rs` exists for the `std` path but is **dead code**
29+
when target features are pinned — the compiler const-folds `detect_tier()` and
30+
the `cfg` gates resolve the re-exports statically.
31+
32+
**On aarch64**: NEON is mandatory. `cfg(target_arch = "aarch64")` routes to
33+
`simd_neon.rs`. 256-bit types are 2×128-bit paired dispatch (e.g. `F32x16` =
34+
`float32x4_t`). No runtime detection needed — NEON is always there.
35+
36+
**simd_avx2.rs**: Dead code on x86-64-v4. Only reached when
37+
`cfg(not(target_feature = "avx512f"))` — i.e., when someone builds without the
38+
target-cpu pin (CI fallback, cross-compile). Never add new methods to it for
39+
AVX-512 targets; it's a waste.
40+
41+
---
42+
1443
## Wave 0 — Conventions & Foundations (3 days)
1544

1645
Unlocks everything else. No code changes to hot paths. Pure contract + tooling.
@@ -46,13 +75,14 @@ The lance-graph `simd-savant` agent runs PRE-MERGE against each PR.
4675
### Per-primitive implementation contract
4776

4877
Every PR MUST:
49-
1. **Three backends**: AVX-512, NEON, scalar. Missing scalar = P0 reject.
78+
1. **Implement on the backing type in `simd_avx512.rs`** (the only live file on x86-64-v4). NEON impl goes in `simd_neon.rs`. Scalar fallback in the `scalar` module inside `simd.rs`.
5079
2. **Edge-case semantics documented** in doc-comment (`i8::MIN`, empty slices, OOB indices).
51-
3. **Parity test**: all backends produce identical output on randomized corpus including edge cases.
80+
3. **Parity test**: all cfg-routed backends produce identical output on randomized corpus including edge cases.
5281
4. **Bench against scalar**: record AVX-512/NEON speedup ratios in PR body.
5382
5. **`// SAFETY:` on every `unsafe` block**.
54-
6. **No new `is_x86_feature_detected!`** outside `simd_caps.rs`.
55-
7. **Consumer site cited** in PR description.
83+
6. **No `is_x86_feature_detected!` anywhere** — dispatch is at `cfg(target_feature)` in `simd.rs` re-exports, not per-call runtime checks.
84+
7. **No `#[target_feature(enable = ...)]` on functions** — the cargo target-cpu pin handles this globally.
85+
8. **Consumer site cited** in PR description.
5686

5787
### W1.1 — I8x16::from_i4_packed_u64 (nibble unpack + sign extend)
5888

@@ -110,7 +140,18 @@ impl U16x8 {
110140
pub fn palette_lookup_u8x8(idx_v: U16x8, lut: &[u8]) -> U8x8;
111141
```
112142

113-
**AVX2**: `_mm256_i32gather_epi32` with index widening + downcast.
143+
**Palette-256 is the dominant use case** — every index fits in u8, table is always
144+
256 or 512 bytes. No bounds risk at palette-256 widths. The gather_u16 API must
145+
handle arbitrary tables too, but palette-256 should be the fast path (no OOB
146+
possible when table.len() == 256 and indices are u8-sourced).
147+
148+
**Codex P2 fix (gather_u16 OOB)**: `_mm256_i32gather_epi32` reads 4 bytes per slot
149+
from a `&[u16]` table — overreads 2 bytes at `table[len-1]`. For palette-256 this
150+
is harmless (256 × 2 = 512 bytes, 4-byte read at index 255 reads bytes 510-513,
151+
which is within a 512-byte aligned allocation + padding). For arbitrary tables,
152+
use scalar fallback or pad the table allocation.
153+
154+
**AVX-512**: `_mm512_i32gather_epi32` with index widening + mask to u16.
114155
**NEON / Scalar**: loop `(0..8).map(|i| table[indices.lane(i)])`.
115156

116157
### W1.4 — Prefetch hints (cross-arch)
@@ -134,12 +175,13 @@ impl U64x8 {
134175
/// XOR + lane-wise popcount + horizontal sum. Optimized for Hamming distance.
135176
pub fn xor_popcount(self, other: Self) -> u64;
136177
}
137-
impl U64x4 { pub fn popcnt(self) -> Self; } // AVX2 parity
178+
impl U64x4 { pub fn popcnt(self) -> Self; } // NEON/scalar parity
138179
```
139180

140-
**AVX-512 VPOPCNTDQ**: `_mm512_popcnt_epi64` directly (feature `avx512vpopcntdq`).
141-
**AVX-512 without VPOPCNTDQ**: Mula's algorithm via `VPSHUFB` + `VPSADBW` per-byte LUT.
142-
**NEON**: `vcntq_u8``vpaddlq_u8` cascade to sum within each u64.
181+
**AVX-512 VPOPCNTDQ**: `_mm512_popcnt_epi64` directly (feature `avx512vpopcntdq`
182+
available on sapphirerapids, enabled by the target-cpu pin).
183+
**NEON popcount per-u64**: `vcntq_u8``vpaddlq_u8``vpaddlq_u16``vpaddlq_u32`
184+
(NOT `vaddvq_u8` which merges ALL lanes to a single scalar — Codex P2 fix).
143185
**Scalar**: `u64::count_ones()` fused loop.
144186

145187
### W1.5+ — Deferred primitives (gated on sigker certification)
@@ -170,7 +212,7 @@ Wave 1 primitives become the first consumers of these macros (retrofit optional)
170212
| ID | Item | Source | Effort | What It Produces |
171213
|----|------|--------|--------|------------------|
172214
| W2.1 | **Dtype-parity macro** (`reductions_for!`) | R3.1 | 1d | One line = 7 reductions for a dtype. Cuts 700→150 lines. |
173-
| W2.2 | **Per-arch dispatch macro** (`simd_dispatch!`) | R3.2 | 4h | Eliminates dispatch skeleton copy-paste. |
215+
| W2.2 | **Per-arch impl macro** (`simd_impl!`) | R3.2 | 4h | Generates the struct + methods for a type in both simd_avx512.rs and simd_neon.rs from one body. |
174216
| W2.3 | **Reduction kernel template** (`reduce_simd()`) | R3.3 | 4h | Generic chunk-loop; sum/max/nrm2 become 5-line callers. |
175217
| W2.4 | **Dual-form fusion** (`kernel_simd_dual!`) | R3.4 | 1d | One body → `_into`, Vec, `_ptr`, all arch variants. |
176218

@@ -219,10 +261,17 @@ From REFACTOR_HPC_INTEGRATION.md Tier 3. Core ndarray operations silently accele
219261

220262
| ID | Item | Source | Effort | Impact |
221263
|----|------|--------|--------|--------|
222-
| W5.1 | **Unified SIMD detection** (merge simd_caps + simd_dispatch into core) | Tier 3.2 | 4h | Deletes 877 lines of duplication |
264+
| W5.1 | **Delete duplicate detection code** (simd_caps + simd_dispatch → dead code under cfg pin) | Tier 3.2 | 4h | Deletes 877 lines that are unreachable when target-cpu is pinned |
223265
| W5.2 | **Core sum/mean → SIMD dispatch** | Tier 3.1 | 4h | 16x faster `.sum()` on contiguous f32/f64 |
224266
| W5.3 | **SIMD axis reductions** (sum_axis with SIMD lanes) | Tier 6.1 | 1d | ML training hot path |
225267

268+
**Note on W5.1**: With `target-cpu=sapphirerapids` in `.cargo/config.toml`, the
269+
`cfg(target_feature = "avx512f")` branch in `simd.rs` is the only live path.
270+
`hpc/simd_caps.rs` (515 lines) and `hpc/simd_dispatch.rs` (362 lines) exist for
271+
a multi-binary world that doesn't apply when the target is pinned. These can be
272+
deleted or feature-gated behind `cfg(not(target_feature = "avx512f"))` for CI
273+
fallback builds.
274+
226275
**Gate**: `cargo bench` shows measurable improvement on contiguous arrays.
227276
Non-contiguous arrays unchanged (fallback to generic fold).
228277

@@ -299,14 +348,14 @@ Wave 0 (conventions)
299348
│ ├──→ Wave 6.7 (QualiaColumns uses I8x16::from_i4_packed from W1.1)
300349
│ └──→ Wave 8.4 (primitive parity benches)
301350
302-
─→ Wave 2 (codegen macros)
351+
���─→ Wave 2 (codegen macros)
303352
│ │
304353
│ └──→ Wave 5 (backend wiring uses macros)
305354
306355
├──→ Wave 3 (type bridges)
307356
│ │
308357
│ ├──→ Wave 4 (extension traits need bridges)
309-
│ ├─→ Wave 5 (backend dispatch needs BlasFloat for BF16)
358+
│ ├���─→ Wave 5 (backend dispatch needs BlasFloat for BF16)
310359
│ └──→ Wave 6 (SoA needs Fingerprint↔Array, Arrow views)
311360
312361
├──→ Wave 7 (namespace) — independent of 1-6
@@ -320,7 +369,7 @@ Wave 0 (conventions)
320369

321370
**For lance-graph (P0 consumer)**: 0 → 1 = **6 days**. Consumer migration PRs unblocked.
322371

323-
**For full release**: 0 → 1 → 3 → 5 → 6 8 → 9 = **18 days** serial.
372+
**For full release**: 0 → 1 → 3 → 5 → 6 �� 8 → 9 = **18 days** serial.
324373
With parallelism (1∥2, 3∥4, 6∥7, 8∥backlog): **~12 working days**.
325374

326375
---
@@ -333,7 +382,7 @@ The 5 SIMD primitives aren't isolated additions — they're load-bearing for lat
333382
|-----------|----------------------------------|--------------------------|
334383
| `I8x16::from_i4_packed_u64` | `mul.rs::i4_eval::batch` (5 batch fns) | **W6.7 QualiaColumns** — unpack i4 qualia from packed storage without scalar loop |
335384
| `I8x16::saturating_abs` | Direction-B fix, ValleyOfDespair classifier | **W4.2 Quantize trait** — safe abs in quantization error metrics |
336-
| `U16x8::gather_u16` | `bgz17/simd.rs` palette lookup | **W6.6 BF16FieldDatabase** gather exponent fields from lookup table |
385+
| `U16x8::gather_u16` | `bgz17/simd.rs` palette lookup | **W6.6 BF16FieldDatabase** �� gather exponent fields from lookup table |
337386
| `prefetch_read_t0/t1/t2` | `bgz17/prefetch.rs` tile prefetch | **W6.2 k0_columnar_simd** — prefetch next column chunk during K0 scan |
338387
| `U64x8::popcnt` + `xor_popcount` | `holograph/hamming.rs` + `blasgraph/types.rs` | **W6.2-W6.5 entire SoA cascade** — columnar XOR+popcount is THE operation |
339388

@@ -351,12 +400,11 @@ slices, losing the typed-wrapper discipline and duplicating dispatch logic.
351400
| `I8x16::from_i4_packed_u64()` for qualia | **Wave 1.1** (SIMD primitives) |
352401
| `prefetch_read_t0()` for column prefetch | **Wave 1.4** (SIMD primitives) |
353402
| F-order Array2<u64> database | **Wave 3** (type bridges — Fingerprint converts to Array) |
354-
| Dispatch macro for k0_columnar_simd | **Wave 2** (codegen macros) |
403+
| Impl macro for k0_columnar_simd | **Wave 2** (codegen macros) |
355404
| `_into` form for columnar kernels | **Wave 0** (signature convention) |
356405
| Extension trait: `database.cascade_soa(&query, &gate)` | **Wave 4** (HdcOps trait extended) |
357406
| BF16FieldDatabase uses Quantize trait | **Wave 4** (Quantize extension) |
358407
| Arrow columns → direct scan | **Wave 3** (Arrow view factories) |
359-
| Unified SIMD detection for dispatch | **Wave 5** (backend wiring) |
360408
| Benchmark harness to prove 4-8x | **Wave 8** (bench infrastructure) |
361409

362410
---
@@ -437,5 +485,7 @@ that applies recursively across the entire surface:
437485
4. **Don't couple SoA with module restructure** — they're independent; merge separately
438486
5. **Don't break downstream in one shot** — deprecation shims for one release minimum
439487
6. **Don't ship W1a primitives without parity tests** — the codex P2 i8::MIN divergence on PR #398 happened because no such test existed
440-
7. **Don't add `is_x86_feature_detected!` outside simd_caps.rs**dispatch through the singleton
488+
7. **Don't use `is_x86_feature_detected!` or `#[target_feature(enable=...)]`**cfg(target_feature) at the re-export level handles dispatch; per-function annotations and per-call runtime checks are wrong
441489
8. **Don't implement W1.5+ (deferred primitives) until sigker certification** — they're gated
490+
9. **Don't add methods to simd_avx2.rs for AVX-512 targets** — it's dead code on x86-64-v4, never reached via cfg routing
491+
10. **Don't use rayon work-stealing** if the type-system integration (Wave 3-4) is the global lever — typed SIMD dispatch across the full surface eliminates the slicing problem that rayon would paper over

0 commit comments

Comments
 (0)