|
| 1 | +# SIMD Wishlist Audit: AdaWorldAPI/ndarray (March 2026) |
| 2 | + |
| 3 | +## Codebase Snapshot |
| 4 | + |
| 5 | +- **57 HPC modules** in `src/hpc/` (52K+ lines) |
| 6 | +- **2,846 lines** of portable SIMD polyfill: `src/simd.rs` → `src/simd_avx512.rs` / `src/simd_avx2.rs` → scalar fallback |
| 7 | +- SIMD types: `F32x16`, `F64x8`, `U8x64`, `I32x16`, `U32x16`, `U64x8` (AVX-512) + `f32x8`, `f64x4` (AVX2) |
| 8 | +- Full operator overloading, `mul_add` (FMA), `sqrt`, `reduce_sum/min/max`, `simd_clamp` |
| 9 | +- AVX2 dot product with 4× unrolled accumulators (`src/simd_avx2.rs:52`) |
| 10 | +- Runtime dispatch via `is_x86_feature_detected!` (65+ sites in `bitwise.rs`) |
| 11 | + |
| 12 | +--- |
| 13 | + |
| 14 | +## Wishlist Scorecard |
| 15 | + |
| 16 | +| # | Item | Status | Key Evidence | |
| 17 | +|---|---|---|---| |
| 18 | +| 1 | `simd_map()` | **PARTIAL** | SIMD types exist (`F32x16` etc.), but the `vml.rs` functions are scalar loops. Missing: generic lane iteration | |
| 19 | +| 2 | `SpatialArray3<T>` | **PARTIAL** | `cam_index.rs` CAM + `dn_tree.rs` spatial tree. Missing: f32 3D coordinates | |
| 20 | +| 3 | `xor_diff()` | **DONE** | `bitwise.rs` AVX-512BW/AVX2/scalar XOR + popcount | |
| 21 | +| 4 | `gather_scatter()` | **MISSING** | Only vpshufb nibble gathers in bitwise.rs | |
| 22 | +| 5 | `columnar_view()` | **PARTIAL** | `arrow_bridge.rs` schema + `SoakingBuffer`. Missing: `ArrayView` bridge | |
| 23 | +| 6 | `Zip::simd_apply()` | **PARTIAL** | `kernels.rs` K0→K1→K2 fusion. Missing: generic over closures | |
| 24 | +| 7 | `runtime_dispatch()` | **DONE** | 65+ `is_x86_feature_detected!` sites + scalar fallbacks | |
| 25 | +| 8 | `stencil()` | **MISSING** | BNN has neighbor patterns but no 3D stencil API | |
| 26 | +| 9 | `compact_palette()` | **PARTIAL** | `palette_distance.rs` 256-entry codebook + `quantized.rs` + vpshufb | |
| 27 | +| 10 | `prefetch_region()` + `stream_store()` | **PARTIAL** | `packed.rs` layout-for-prefetch. No explicit `_mm_prefetch` or `_mm_stream_ps` | |
| 28 | + |
| 29 | +--- |
| 30 | + |
| 31 | +## Per-Item Detail |
| 32 | + |
| 33 | +### 1. `simd_map()` — Lane-Native SIMD Iteration |
| 34 | + |
| 35 | +**Exists:** `F32x16::from_slice()`, `copy_to_slice()`, `mul_add()`, `sqrt()`, all operators. AVX2 `dot_f32()` with 4-accumulator unrolling in `simd_avx2.rs:52-90`. |
| 36 | + |
| 37 | +**Exists:** `src/hpc/vml.rs` has `vsexp`, `vssqrt`, `vsln`, `vsabs`, `vsadd`, `vsmul`, `vsdiv` — but ALL are scalar loops. |
| 38 | + |
| 39 | +**Gap:** Need `vml.rs` to use `F32x16`/`f32x8` types. The types exist, the functions exist, they just aren't connected. Example: |
| 40 | +```rust |
| 41 | +// Current vml.rs: |
| 42 | +pub fn vssqrt(x: &[f32], out: &mut [f32]) { |
| 43 | + for (o, &v) in out.iter_mut().zip(x.iter()) { *o = v.sqrt(); } |
| 44 | +} |
| 45 | +// Should be: |
| 46 | +pub fn vssqrt(x: &[f32], out: &mut [f32]) { |
| 47 | + let chunks = x.len() / 16; |
| 48 | + for i in 0..chunks { |
| 49 | + let v = F32x16::from_slice(&x[i*16..]); |
| 50 | + v.sqrt().copy_to_slice(&mut out[i*16..]); |
| 51 | + } |
| 52 | + for i in chunks * 16..x.len() { out[i] = x[i].sqrt(); } // scalar remainder |
| 53 | +} |
| 54 | +``` |
| 55 | + |
| 56 | +### 2. `SpatialArray3<T>` — Content-Addressable Memory |
| 57 | + |
| 58 | +**Exists:** `cam_index.rs` — multi-probe LSH CAM for 49,152-bit `GraphHV` binary vectors. `dn_tree.rs` — hierarchical spatial partitioning (739 lines). `parallel_search.rs` — dual-path HHTL + CLAM tree search. |
| 59 | + |
| 60 | +**Gap:** All CAM infrastructure operates on binary hypervectors, not f32 spatial coordinates. Need a `SpatialCam3D` adapter that uses spatial hashing (`floor(x / cell_size)` per axis) for the Pumpkin entity bind/unbind pattern. |
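
A minimal sketch of what such an adapter could look like, assuming a uniform grid keyed by integer cell coordinates. The `SpatialCam3D` name comes from the gap note above; the fields and the `bind`/`unbind`/`query_cell` methods are hypothetical and are not part of the audited crate.

```rust
use std::collections::HashMap;

/// Integer grid coordinates of the cell that contains a point.
type Cell = (i32, i32, i32);

/// Hypothetical spatial-hash adapter (illustration only, not in the audited crate).
struct SpatialCam3D {
    cell_size: f32,
    cells: HashMap<Cell, Vec<u32>>, // entity ids stored per cell
}

impl SpatialCam3D {
    fn new(cell_size: f32) -> Self {
        assert!(cell_size > 0.0, "cell_size must be positive");
        Self { cell_size, cells: HashMap::new() }
    }

    /// floor(coordinate / cell_size) on each axis; floor keeps negatives consistent.
    fn cell_of(&self, p: [f32; 3]) -> Cell {
        (
            (p[0] / self.cell_size).floor() as i32,
            (p[1] / self.cell_size).floor() as i32,
            (p[2] / self.cell_size).floor() as i32,
        )
    }

    /// Register an entity id at position `p`.
    fn bind(&mut self, id: u32, p: [f32; 3]) {
        let cell = self.cell_of(p);
        self.cells.entry(cell).or_default().push(id);
    }

    /// Remove an entity id from the cell containing `p`; returns false if absent.
    fn unbind(&mut self, id: u32, p: [f32; 3]) -> bool {
        let cell = self.cell_of(p);
        if let Some(ids) = self.cells.get_mut(&cell) {
            if let Some(pos) = ids.iter().position(|&e| e == id) {
                ids.swap_remove(pos);
                return true;
            }
        }
        false
    }

    /// All entity ids in the cell that contains `p`.
    fn query_cell(&self, p: [f32; 3]) -> &[u32] {
        self.cells.get(&self.cell_of(p)).map(|v| v.as_slice()).unwrap_or(&[])
    }
}
```

An entity that moves would be unbound at its old position and bound at the new one; a radius query would visit the neighboring cells as well.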
| 61 | + |
| 62 | +### 3. `xor_diff()` — SIMD XOR Change Detection |
| 63 | + |
| 64 | +**DONE.** `bitwise.rs`: |
| 65 | +- `hamming_avx2()` (line 62): 32 bytes/iter via vpshufb |
| 66 | +- `hamming_avx512bw()` (line 117): 64 bytes/iter via vpshufb-512 |
| 67 | +- `hamming_avx512_vpopcnt()`: native VPOPCNTDQ when available |
| 68 | +- Runtime dispatch (line 234): `avx512vpopcntdq` → `avx512bw` → `avx2` → scalar |
| 69 | +- `hamming_query_batch()`: batch mode for tick N vs N+1 comparison |
| 70 | + |
| 71 | +**Only gap:** No sparse `nonzero_iter()` returning positions of changed elements. |
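
One way the missing sparse iterator might be sketched, assuming the two snapshots are available as `u64` words. The name `changed_bits` is hypothetical; this is a scalar illustration of the idea, not the crate's API.

```rust
/// Hypothetical helper: yields the bit positions at which `a` and `b` differ.
/// Positions are counted from bit 0 of word 0; extra words in the longer slice are ignored.
fn changed_bits<'a>(a: &'a [u64], b: &'a [u64]) -> impl Iterator<Item = usize> + 'a {
    a.iter()
        .zip(b.iter())
        .enumerate()
        .flat_map(|(word_idx, (&x, &y))| {
            let mut diff = x ^ y; // set bits mark changed positions
            std::iter::from_fn(move || {
                if diff == 0 {
                    return None;
                }
                let bit = diff.trailing_zeros() as usize;
                diff &= diff - 1; // clear the lowest set bit
                Some(word_idx * 64 + bit)
            })
        })
}
```

Because unchanged words produce `diff == 0` and are skipped immediately, the cost is roughly proportional to the number of changed bits plus the number of words scanned.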
| 72 | + |
| 73 | +### 4. `gather_scatter()` — Vectorized Gather |
| 74 | + |
| 75 | +**MISSING.** `tekamolo.rs` and `cam_index.rs` use hash-based lookups (conceptually a gather), but there are no `VPGATHERDD`/`VGATHERDPS` intrinsics anywhere. The closest existing code is the vpshufb nibble lookup in `bitwise.rs`. |
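
For orientation, a hedged sketch of an f32 gather built on the AVX2 `_mm256_i32gather_ps` intrinsic (which maps to `VGATHERDPS`), with a scalar fallback. The function names `gather`, `gather_avx2`, and `gather_scalar` are hypothetical, and the per-chunk bounds check is a safety choice made for this sketch rather than anything taken from the repository.

```rust
/// Scalar reference: out[i] = table[idx[i]] for the overlapping length.
fn gather_scalar(table: &[f32], idx: &[u32], out: &mut [f32]) {
    for (o, &i) in out.iter_mut().zip(idx) {
        *o = table[i as usize];
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn gather_avx2(table: &[f32], idx: &[u32], out: &mut [f32]) {
    use std::arch::x86_64::*;
    // The hardware treats indices as signed 32-bit, so the table must fit in i32 range.
    assert!(table.len() <= i32::MAX as usize);
    let n = idx.len().min(out.len());
    let chunks = n / 8;
    for c in 0..chunks {
        let base = c * 8;
        // Validate indices first: the gather instruction itself does no bounds checking.
        for &i in &idx[base..base + 8] {
            assert!((i as usize) < table.len(), "gather index out of bounds");
        }
        let vi = _mm256_loadu_si256(idx.as_ptr().add(base) as *const __m256i);
        // SCALE = 4 bytes per f32 element.
        let v = _mm256_i32gather_ps::<4>(table.as_ptr(), vi);
        _mm256_storeu_ps(out.as_mut_ptr().add(base), v);
    }
    // Remaining elements that do not fill a full 8-lane vector.
    gather_scalar(table, &idx[chunks * 8..n], &mut out[chunks * 8..n]);
}

/// Hypothetical entry point: AVX2 when detected at runtime, scalar otherwise.
pub fn gather(table: &[f32], idx: &[u32], out: &mut [f32]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed at runtime.
            unsafe { gather_avx2(table, idx, out) };
            return;
        }
    }
    gather_scalar(table, idx, out);
}
```

A matching scatter has no AVX2 instruction (scatter arrived with AVX-512), so that half would need either an AVX-512 path or a scalar store loop.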
| 76 | + |
| 77 | +### 5. `columnar_view()` — Zero-Copy Arrow Interop |
| 78 | + |
| 79 | +**Exists:** `arrow_bridge.rs` has schema constants (`s_binary`, `p_binary`, `o_binary`, `node_id`), `GateState` lifecycle (Form→Flow→Freeze), `SoakingBuffer { data: Vec<i8>, n_entries, n_dims }`. |
| 80 | + |
| 81 | +**Gap:** Missing the one-liner: `unsafe { ArrayView1::from_shape_ptr(len, arrow_buf.as_ptr()) }`. |
| 82 | + |
| 83 | +### 6. `Zip::simd_apply()` — Multi-Array Fused SIMD Kernel |
| 84 | + |
| 85 | +**Exists:** `kernels.rs` K0→K1→K2 fused cascade (1589 lines). `packed.rs` stroke-aligned cascade query. Both fuse multiple passes into one traversal. |
| 86 | + |
| 87 | +**Gap:** Fusion is hardcoded for binary Hamming. Need a generic version accepting `Fn(F32x16, F32x16) -> F32x16`. |
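
A sketch of the shape such a generic kernel could take. To keep it self-contained, `[f32; 16]` stands in for `F32x16`; the `zip_simd_apply` name, the `Lane` alias, and the separate scalar closure for the tail are all assumptions made for illustration.

```rust
use std::convert::TryInto;

const LANES: usize = 16;
/// Stand-in for the crate's `F32x16` so the sketch compiles on its own.
type Lane = [f32; LANES];

/// Hypothetical fused two-input kernel: applies `vec_op` to full 16-wide chunks
/// and `scalar_op` to the tail, in a single pass over the inputs.
fn zip_simd_apply<F, G>(a: &[f32], b: &[f32], out: &mut [f32], vec_op: F, scalar_op: G)
where
    F: Fn(Lane, Lane) -> Lane,
    G: Fn(f32, f32) -> f32,
{
    assert!(a.len() == b.len() && a.len() == out.len(), "length mismatch");
    let full = a.len() / LANES * LANES;
    for i in (0..full).step_by(LANES) {
        let va: Lane = a[i..i + LANES].try_into().unwrap();
        let vb: Lane = b[i..i + LANES].try_into().unwrap();
        out[i..i + LANES].copy_from_slice(&vec_op(va, vb));
    }
    // Tail elements that do not fill a whole lane group.
    for i in full..a.len() {
        out[i] = scalar_op(a[i], b[i]);
    }
}
```

With the real types, `vec_op` would receive `F32x16` values loaded via `from_slice` and written back via `copy_to_slice`; the two closures must compute the same function for the result to be consistent across the chunk boundary.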
| 88 | + |
| 89 | +### 7. `runtime_dispatch()` — CPU Feature Detection |
| 90 | + |
| 91 | +**DONE.** Two complementary systems: |
| 92 | +1. `bitwise.rs`: 65+ `is_x86_feature_detected!` sites with a 4-tier fallback |
| 93 | +2. `simd.rs` polyfill: compile-time dispatch via `#[cfg(target_arch)]` with scalar fallback types |
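
The 4-tier runtime pattern can be sketched roughly as follows. The `Tier` enum and the `detect_tier`, `hamming`, and `hamming_scalar` names are hypothetical; the accelerated arms are placeholders that call the scalar routine, since the real kernels live in `bitwise.rs`.

```rust
/// Hypothetical tier labels mirroring the dispatch order described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Avx512Vpopcntdq,
    Avx512Bw,
    Avx2,
    Scalar,
}

/// Pick the best tier the running CPU supports, checked from widest to narrowest.
fn detect_tier() -> Tier {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx512vpopcntdq") {
            return Tier::Avx512Vpopcntdq;
        }
        if is_x86_feature_detected!("avx512bw") {
            return Tier::Avx512Bw;
        }
        if is_x86_feature_detected!("avx2") {
            return Tier::Avx2;
        }
    }
    Tier::Scalar
}

/// Portable reference: XOR each byte pair and count the set bits.
fn hamming_scalar(a: &[u8], b: &[u8]) -> u64 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones() as u64).sum()
}

/// Dispatching entry point. In this sketch every arm uses the scalar routine;
/// a real implementation would call a `#[target_feature]` kernel per tier.
fn hamming(a: &[u8], b: &[u8]) -> u64 {
    match detect_tier() {
        Tier::Avx512Vpopcntdq | Tier::Avx512Bw | Tier::Avx2 => hamming_scalar(a, b),
        Tier::Scalar => hamming_scalar(a, b),
    }
}
```

On non-x86 targets the `cfg` block compiles away and `detect_tier` always returns `Tier::Scalar`, which is the same role the compile-time fallback types play in `simd.rs`.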
| 94 | + |
| 95 | +### 8. `stencil()` — 3D Neighbor-Aware SIMD |
| 96 | + |
| 97 | +**MISSING.** `bnn_causal_trajectory.rs`, `deepnsm.rs`, `clam_search.rs` have neighbor traversal patterns but nothing 3D-stencil-specific. |
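
To make the gap concrete, here is a scalar sketch of a 6-neighbor (von Neumann) stencil over a dense 3D grid. `Grid3`, `stencil6`, and the x-fastest layout are assumptions for illustration; a SIMD version would process runs along x with shifted loads.

```rust
/// Hypothetical dense 3D grid stored flat, with x varying fastest.
struct Grid3 {
    nx: usize,
    ny: usize,
    nz: usize,
    data: Vec<f32>,
}

impl Grid3 {
    fn new(nx: usize, ny: usize, nz: usize, fill: f32) -> Self {
        Self { nx, ny, nz, data: vec![fill; nx * ny * nz] }
    }

    /// Flat index of cell (x, y, z).
    fn idx(&self, x: usize, y: usize, z: usize) -> usize {
        (z * self.ny + y) * self.nx + x
    }

    /// Apply `f(center, [-x, +x, -y, +y, -z, +z])` to every interior cell.
    /// Boundary cells are copied unchanged.
    fn stencil6<F>(&self, f: F) -> Grid3
    where
        F: Fn(f32, [f32; 6]) -> f32,
    {
        let mut out = self.data.clone();
        let (sy, sz) = (self.nx, self.nx * self.ny); // strides for y and z
        for z in 1..self.nz.saturating_sub(1) {
            for y in 1..self.ny.saturating_sub(1) {
                for x in 1..self.nx.saturating_sub(1) {
                    let c = self.idx(x, y, z);
                    let d = &self.data;
                    let n = [d[c - 1], d[c + 1], d[c - sy], d[c + sy], d[c - sz], d[c + sz]];
                    out[c] = f(d[c], n);
                }
            }
        }
        Grid3 { nx: self.nx, ny: self.ny, nz: self.nz, data: out }
    }
}
```

Writing into a separate output buffer keeps each cell's result independent of update order, which is also what makes the inner x loop a candidate for vectorization.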
| 98 | + |
| 99 | +### 9. `compact_palette()` — Bit-Packed SIMD |
| 100 | + |
| 101 | +**PARTIAL.** Three relevant modules: |
| 102 | +- `palette_distance.rs`: 256-entry `Palette` codebook with precomputed pairwise L1 distance matrix |
| 103 | +- `quantized.rs`: f32→u8 quantization with scale/zero-point |
| 104 | +- `bitwise.rs`: vpshufb nibble lookup (4-bit table, proven in SIMD) |
| 105 | + |
| 106 | +**Gap:** No variable-width (4-15 bit) pack/unpack for Minecraft block state encoding. |
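
A scalar sketch of variable-width packing, assuming a layout in which entries never straddle a 64-bit word (so each word holds `64 / bits` entries and any leftover high bits are unused). The `pack` and `unpack` names are hypothetical, and whether this layout matches the target block-state format should be checked against the format actually in use.

```rust
/// Pack `values` at `bits` bits each into u64 words; entries never cross a word boundary.
fn pack(values: &[u16], bits: u32) -> Vec<u64> {
    assert!((4..=15).contains(&bits), "bits must be in 4..=15");
    let per_word = (64 / bits) as usize;
    let mut words = vec![0u64; (values.len() + per_word - 1) / per_word];
    for (i, &v) in values.iter().enumerate() {
        assert!(u32::from(v) < (1u32 << bits), "value does not fit in `bits` bits");
        let shift = (i % per_word) as u32 * bits;
        words[i / per_word] |= u64::from(v) << shift;
    }
    words
}

/// Inverse of `pack`: read back `len` entries of `bits` bits each.
fn unpack(words: &[u64], bits: u32, len: usize) -> Vec<u16> {
    assert!((4..=15).contains(&bits), "bits must be in 4..=15");
    let per_word = (64 / bits) as usize;
    let mask = (1u64 << bits) - 1;
    (0..len)
        .map(|i| {
            let shift = (i % per_word) as u32 * bits;
            ((words[i / per_word] >> shift) & mask) as u16
        })
        .collect()
}
```

For example, 4096 entries at 5 bits fit 12 per word and need 342 words. The 4-bit case lines up with the existing vpshufb nibble path; the other widths are where a SIMD shift-and-mask kernel would have to be added.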
| 107 | + |
| 108 | +### 10. `prefetch_region()` + `stream_store()` |
| 109 | + |
| 110 | +**PARTIAL.** `packed.rs` uses a stroke-aligned layout and relies on the hardware prefetcher ("the prefetcher handles sequential access"). There are no explicit `_mm_prefetch` or `_mm_stream_ps` calls. |
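
For reference, an explicit software prefetch might look like the sketch below. The `sum_with_prefetch` name and the 64-element lookahead are arbitrary choices for illustration; streaming stores are left out because `_mm_stream_ps` requires 16-byte-aligned destinations, which this sketch does not set up.

```rust
/// Hypothetical example: sum a slice while prefetching a fixed distance ahead.
#[cfg(target_arch = "x86_64")]
fn sum_with_prefetch(data: &[f32]) -> f32 {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    const CHUNK: usize = 16; // one 64-byte cache line of f32
    const AHEAD: usize = 64; // lookahead in elements; an untuned guess
    let mut acc = 0.0f32;
    for (i, chunk) in data.chunks(CHUNK).enumerate() {
        let ahead = i * CHUNK + AHEAD;
        if ahead < data.len() {
            // SAFETY: SSE is baseline on x86_64 and `ahead` is within the slice.
            unsafe { _mm_prefetch::<{ _MM_HINT_T0 }>(data.as_ptr().add(ahead) as *const i8) };
        }
        acc += chunk.iter().sum::<f32>();
    }
    acc
}

/// Fallback for other architectures: no prefetch hint available here.
#[cfg(not(target_arch = "x86_64"))]
fn sum_with_prefetch(data: &[f32]) -> f32 {
    data.iter().sum()
}
```

For purely sequential access like this the hardware prefetcher usually does the job already, which matches the note in `packed.rs`; explicit hints tend to matter for strided or index-driven access.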
| 111 | + |
| 112 | +--- |
| 113 | + |
| 114 | +## What Changed Since Last Audit |
| 115 | + |
| 116 | +| New Module | Lines | Wishlist Impact | |
| 117 | +|---|---|---| |
| 118 | +| `src/simd.rs` | 829 | #1 #6 #7 — portable SIMD types with scalar fallback | |
| 119 | +| `src/simd_avx512.rs` | 1399 | #1 — F32x16/F64x8/U8x64 with FMA, sqrt, reduce | |
| 120 | +| `src/simd_avx2.rs` | 618 | #1 — f32x8/f64x4 dot product, GEMM tile sizes | |
| 121 | +| `hpc/holo.rs` | new | Phase + focus + carrier (94 tests) | |
| 122 | +| `hpc/zeck.rs` | new | Zeckendorf encoding + batch/top_k | |
| 123 | +| `hpc/palette_distance.rs` | new | #9 — 256-entry palette with O(1) distance | |
| 124 | +| `hpc/parallel_search.rs` | new | #2 — dual-path HHTL + CLAM search | |
| 125 | +| `hpc/layered_distance.rs` | new | O(1) distance via palette index + precomputed matrix | |
| 126 | +| `hpc/bgz17_bridge.rs` | new | Base17 bridge for palette interop | |
| 127 | + |
| 128 | +--- |
| 129 | + |
| 130 | +*Audit generated 2026-03-23. AdaWorldAPI/ndarray master @ 11633d06.* |