Commit c5bf928

Merge pull request #28 from AdaWorldAPI/claude/complete-ndarray-library-Op9kK
Claude/complete ndarray library op9k k
2 parents 9b65097 + 62f5574 commit c5bf928

2 files changed

Lines changed: 509 additions & 0 deletions

File tree

SIMD_COMPARISON.md

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
# SIMD Wishlist Audit: AdaWorldAPI/ndarray (March 2026)

## Codebase Snapshot

- **57 HPC modules** in `src/hpc/` (52K+ lines)
- **2,846 lines** of portable SIMD polyfill: `src/simd.rs` → `src/simd_avx512.rs` → scalar fallback
- SIMD types: `F32x16`, `F64x8`, `U8x64`, `I32x16`, `U32x16`, `U64x8` (AVX-512) + `f32x8`, `f64x4` (AVX2)
- Full operator overloading, `mul_add` (FMA), `sqrt`, `reduce_sum/min/max`, `simd_clamp`
- AVX2 dot product with 4× unrolled accumulators (`src/simd_avx2.rs:52`)
- Runtime dispatch via `is_x86_feature_detected!` (65+ sites in `bitwise.rs`)

---

## Wishlist Scorecard

| # | Item | Status | Key Evidence |
|---|---|---|---|
| 1 | `simd_map()` | **PARTIAL** | SIMD types exist (`F32x16` etc.), VML scalar. Missing: generic lane iteration |
| 2 | `SpatialArray3<T>` | **PARTIAL** | `cam_index.rs` CAM + `dn_tree.rs` spatial tree. Missing: f32 3D coordinates |
| 3 | `xor_diff()` | **DONE** | `bitwise.rs` AVX-512BW/AVX2/scalar XOR + popcount |
| 4 | `gather_scatter()` | **MISSING** | Only vpshufb nibble gathers in `bitwise.rs` |
| 5 | `columnar_view()` | **PARTIAL** | `arrow_bridge.rs` schema + `SoakingBuffer`. Missing: `ArrayView` bridge |
| 6 | `Zip::simd_apply()` | **PARTIAL** | `kernels.rs` K0→K1→K2 fusion. Missing: generic over closures |
| 7 | `runtime_dispatch()` | **DONE** | 65+ `is_x86_feature_detected!` sites + scalar fallbacks |
| 8 | `stencil()` | **MISSING** | BNN has neighbor patterns but no 3D stencil API |
| 9 | `compact_palette()` | **PARTIAL** | `palette_distance.rs` 256-entry codebook + `quantized.rs` + vpshufb |
| 10 | `prefetch/stream` | **PARTIAL** | `packed.rs` layout-for-prefetch. No explicit `_mm_prefetch` |

---

## Per-Item Detail

### 1. `simd_map()` — Lane-Native SIMD Iteration

**Exists:** `F32x16::from_slice()`, `copy_to_slice()`, `mul_add()`, `sqrt()`, all operators. AVX2 `dot_f32()` with 4-accumulator unrolling in `simd_avx2.rs:52-90`.

**Exists:** `src/hpc/vml.rs` has `vsexp`, `vssqrt`, `vsln`, `vsabs`, `vsadd`, `vsmul`, `vsdiv`, but all of them are scalar loops.

**Gap:** `vml.rs` needs to use the `F32x16`/`f32x8` types. The types exist and the functions exist; they just aren't connected. Example:
```rust
// Current vml.rs:
pub fn vssqrt(x: &[f32], out: &mut [f32]) {
    for (o, &v) in out.iter_mut().zip(x.iter()) { *o = v.sqrt(); }
}
// Should be:
pub fn vssqrt(x: &[f32], out: &mut [f32]) {
    // Process only the common prefix, matching the zip-based version above.
    let n = x.len().min(out.len());
    let chunks = n / 16;
    for i in 0..chunks {
        // Full 16-lane blocks go through the SIMD type.
        let v = F32x16::from_slice(&x[i * 16..]);
        v.sqrt().copy_to_slice(&mut out[i * 16..]);
    }
    // Scalar remainder for the last n % 16 elements.
    for j in chunks * 16..n {
        out[j] = x[j].sqrt();
    }
}
```

### 2. `SpatialArray3<T>` — Content-Addressable Memory

**Exists:** `cam_index.rs`: multi-probe LSH CAM for 49,152-bit `GraphHV` binary vectors. `dn_tree.rs`: hierarchical spatial partitioning (739 lines). `parallel_search.rs`: dual-path HHTL + CLAM tree search.

**Gap:** All CAM infrastructure operates on binary hypervectors, not f32 spatial coordinates. Need a `SpatialCam3D` adapter that uses spatial hashing (`floor(x / cell_size)`) for the Pumpkin entity bind/unbind pattern.
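
The spatial-hashing idea can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the crate's design: the struct layout, the `bind`/`unbind`/`query_cell` method names, and the use of `u32` entity ids are all hypothetical. Only the `floor(x / cell_size)` keying comes from the audit.

```rust
use std::collections::HashMap;

/// Integer cell coordinates obtained by flooring each axis.
type CellKey = (i32, i32, i32);

/// Hypothetical adapter: buckets entity ids by the grid cell that contains them.
pub struct SpatialCam3D {
    cell_size: f32,
    cells: HashMap<CellKey, Vec<u32>>,
}

impl SpatialCam3D {
    pub fn new(cell_size: f32) -> Self {
        assert!(cell_size > 0.0, "cell_size must be positive");
        Self { cell_size, cells: HashMap::new() }
    }

    /// floor(v / cell_size) per axis; floor keeps negative coordinates in the right cell.
    fn key(&self, p: [f32; 3]) -> CellKey {
        let c = |v: f32| (v / self.cell_size).floor() as i32;
        (c(p[0]), c(p[1]), c(p[2]))
    }

    /// Bind an entity id to the cell containing `p`.
    pub fn bind(&mut self, id: u32, p: [f32; 3]) {
        let k = self.key(p);
        self.cells.entry(k).or_default().push(id);
    }

    /// Unbind an entity id from the cell containing `p`; returns whether it was present.
    pub fn unbind(&mut self, id: u32, p: [f32; 3]) -> bool {
        let k = self.key(p);
        let Some(bucket) = self.cells.get_mut(&k) else { return false };
        let Some(pos) = bucket.iter().position(|&e| e == id) else { return false };
        bucket.swap_remove(pos);
        let now_empty = bucket.is_empty();
        if now_empty {
            self.cells.remove(&k); // drop empty buckets so the map stays sparse
        }
        true
    }

    /// Ids currently bound to the cell containing `p`.
    pub fn query_cell(&self, p: [f32; 3]) -> &[u32] {
        self.cells.get(&self.key(p)).map(|v| v.as_slice()).unwrap_or(&[])
    }
}
```

A neighbor query would visit the 27 cells around a key. That part is omitted here because the audit does not say what radius the bind/unbind pattern needs.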

### 3. `xor_diff()` — SIMD XOR Change Detection

**DONE.** `bitwise.rs`:
- `hamming_avx2()` (line 62): 32 bytes/iter via vpshufb
- `hamming_avx512bw()` (line 117): 64 bytes/iter via vpshufb-512
- `hamming_avx512_vpopcnt()`: native VPOPCNTDQ when available
- Runtime dispatch (line 234): `avx512vpopcntdq` → `avx512bw` → `avx2` → scalar
- `hamming_query_batch()`: batch mode for tick N vs N+1 comparison

**Only gap:** No sparse `nonzero_iter()` returning positions of changed elements.
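
One way to picture the missing sparse iteration is a scalar pass that XORs two snapshots word by word and peels off set bits. The function name `changed_bits` and the choice of bit-level positions over `u64` words are assumptions made for this sketch; the audit does not specify the element granularity.

```rust
/// Hypothetical helper: bit positions at which `prev` and `next` differ.
pub fn changed_bits(prev: &[u64], next: &[u64]) -> Vec<usize> {
    assert_eq!(prev.len(), next.len(), "snapshots must have equal length");
    let mut positions = Vec::new();
    for (word_idx, (&a, &b)) in prev.iter().zip(next.iter()).enumerate() {
        let mut diff = a ^ b; // set bits mark changed positions
        while diff != 0 {
            let bit = diff.trailing_zeros() as usize;
            positions.push(word_idx * 64 + bit);
            diff &= diff - 1; // clear the lowest set bit
        }
    }
    positions
}
```

Words that are unchanged cost a single XOR and compare, so the work scales with the number of changed bits and not with the total bit count.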

### 4. `gather_scatter()` — Vectorized Gather

**MISSING.** `tekamolo.rs` and `cam_index.rs` use hash-based lookups (conceptually a gather), but there are no `VPGATHERDD`/`VGATHERDPS` intrinsics anywhere.
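
For orientation, a hardware gather over `f32` can be reached from stable Rust through `std::arch`. The sketch below is an assumption about what such a routine could look like, not code from the repository: the function names are hypothetical, and it uses `_mm256_i32gather_ps` (the AVX2 form of `VGATHERDPS`) behind runtime detection with a scalar fallback.

```rust
/// Scalar reference: out[i] = table[idx[i]].
fn gather_scalar(table: &[f32], idx: &[i32], out: &mut [f32]) {
    for (o, &i) in out.iter_mut().zip(idx) {
        *o = table[i as usize];
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn gather_avx2(table: &[f32], idx: &[i32], out: &mut [f32]) {
    use std::arch::x86_64::*;
    let full = idx.len() / 8 * 8;
    for start in (0..full).step_by(8) {
        // SAFETY: start + 8 <= idx.len() == out.len(), and the caller has
        // validated every index against table.len().
        unsafe {
            let vi = _mm256_loadu_si256(idx.as_ptr().add(start) as *const __m256i);
            // Scale 4 = size_of::<f32>(): indices count elements, not bytes.
            let v = _mm256_i32gather_ps::<4>(table.as_ptr(), vi);
            _mm256_storeu_ps(out.as_mut_ptr().add(start), v);
        }
    }
    // Fewer than 8 indices left: finish with the scalar loop.
    gather_scalar(table, &idx[full..], &mut out[full..]);
}

/// Hypothetical entry point: validates indices, then dispatches at runtime.
pub fn gather_f32(table: &[f32], idx: &[i32], out: &mut [f32]) {
    assert_eq!(idx.len(), out.len(), "idx and out must have equal length");
    // Validate up front: the hardware gather performs no bounds checks.
    assert!(
        idx.iter().all(|&i| i >= 0 && (i as usize) < table.len()),
        "index out of range"
    );
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 was detected and all indices were validated above.
            unsafe { gather_avx2(table, idx, out) };
            return;
        }
    }
    gather_scalar(table, idx, out);
}
```

Whether a hardware gather beats scalar loads depends on the microarchitecture and the access pattern, so any real integration would need benchmarking first.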

### 5. `columnar_view()` — Zero-Copy Arrow Interop

**Exists:** `arrow_bridge.rs` has schema constants (`s_binary`, `p_binary`, `o_binary`, `node_id`), `GateState` lifecycle (Form→Flow→Freeze), `SoakingBuffer { data: Vec<i8>, n_entries, n_dims }`.

**Gap:** Missing the one-liner: `unsafe { ArrayView1::from_shape_ptr(len, arrow_buf.as_ptr()) }`.
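
The `unsafe` one-liner is only sound when the buffer length and layout match the claimed shape. The std-only sketch below shows the same zero-copy idea with that check made explicit. The field names follow the audit's description of `SoakingBuffer`, but the field types for `n_entries` and `n_dims`, the row-major layout, and the `RowView` type are assumptions for illustration.

```rust
/// Field names follow the audit; `usize` counts and row-major layout are assumed.
pub struct SoakingBuffer {
    pub data: Vec<i8>,
    pub n_entries: usize,
    pub n_dims: usize,
}

/// Hypothetical borrowed 2-D view over the buffer: no allocation, no copy.
pub struct RowView<'a> {
    data: &'a [i8],
    n_dims: usize,
}

impl SoakingBuffer {
    /// Returns `None` when `data.len()` does not equal `n_entries * n_dims`.
    pub fn view(&self) -> Option<RowView<'_>> {
        let expected = self.n_entries.checked_mul(self.n_dims)?;
        (self.n_dims > 0 && self.data.len() == expected)
            .then(|| RowView { data: &self.data, n_dims: self.n_dims })
    }
}

impl<'a> RowView<'a> {
    /// Number of rows (entries) visible through the view.
    pub fn n_rows(&self) -> usize {
        self.data.len() / self.n_dims
    }

    /// Borrow row `i` as a slice of `n_dims` values, or `None` if out of range.
    pub fn row(&self, i: usize) -> Option<&'a [i8]> {
        self.data.chunks_exact(self.n_dims).nth(i)
    }
}
```

With the `ndarray` crate available, the safe constructor `ArrayView2::from_shape((n_entries, n_dims), &data)` performs an equivalent length check and may be preferable to the pointer-based form wherever a slice is on hand.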

### 6. `Zip::simd_apply()` — Multi-Array Fused SIMD Kernel

**Exists:** `kernels.rs` K0→K1→K2 fused cascade (1589 lines). `packed.rs` stroke-aligned cascade query. Both fuse multiple passes into one traversal.

**Gap:** Fusion is hardcoded for binary Hamming. Need generic version accepting `Fn(F32x16, F32x16) -> F32x16`.
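
A closure-generic kernel of the requested shape might look like the following. To stay self-contained, the sketch substitutes a plain `[f32; 16]` for the crate's `F32x16`; the function name `zip_apply_lanes` and the zero-padded tail handling are assumptions, not the repository's API.

```rust
use std::convert::TryInto;

/// Stand-in for the crate's `F32x16`; a plain array keeps the sketch self-contained.
type Lanes = [f32; 16];

/// Apply `f` to 16-lane groups of `a` and `b`, writing results to `out`.
pub fn zip_apply_lanes<F>(a: &[f32], b: &[f32], out: &mut [f32], f: F)
where
    F: Fn(Lanes, Lanes) -> Lanes,
{
    assert!(a.len() == b.len() && a.len() == out.len(), "length mismatch");
    let mut ca = a.chunks_exact(16);
    let mut cb = b.chunks_exact(16);
    let mut co = out.chunks_exact_mut(16);
    for ((xa, xb), xo) in (&mut ca).zip(&mut cb).zip(&mut co) {
        let la: Lanes = xa.try_into().expect("chunk of 16");
        let lb: Lanes = xb.try_into().expect("chunk of 16");
        xo.copy_from_slice(&f(la, lb));
    }
    // Tail: pad to a full lane group, run the same kernel, keep the valid prefix.
    let (ra, rb) = (ca.remainder(), cb.remainder());
    let ro = co.into_remainder();
    if !ra.is_empty() {
        let (mut la, mut lb) = ([0.0f32; 16], [0.0f32; 16]);
        la[..ra.len()].copy_from_slice(ra);
        lb[..rb.len()].copy_from_slice(rb);
        ro.copy_from_slice(&f(la, lb)[..ra.len()]);
    }
}
```

Padding the tail means the closure also runs on dummy zero lanes, which is harmless for pure arithmetic but would matter for closures with side effects.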

### 7. `runtime_dispatch()` — CPU Feature Detection

**DONE.** Two complementary systems:
1. `bitwise.rs`: 65+ `is_x86_feature_detected!` with 4-tier fallback
2. `simd.rs` polyfill: compile-time dispatch via `#[cfg(target_arch)]` with scalar fallback types
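
The two mechanisms combine roughly as sketched here: `#[cfg(target_arch)]` decides what is compiled, and `is_x86_feature_detected!` decides what runs. The tier order is the one listed for `bitwise.rs`. The `Tier` enum, the function names, and the single auto-vectorized AVX2 kernel standing in for all three SIMD tiers are assumptions made for brevity.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Avx512Vpopcntdq,
    Avx512Bw,
    Avx2,
    Scalar,
}

/// Probe CPU features from the most capable tier down to the scalar fallback.
pub fn detect_tier() -> Tier {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx512vpopcntdq") {
            return Tier::Avx512Vpopcntdq;
        }
        if is_x86_feature_detected!("avx512bw") {
            return Tier::Avx512Bw;
        }
        if is_x86_feature_detected!("avx2") {
            return Tier::Avx2;
        }
    }
    Tier::Scalar
}

/// Portable Hamming distance: XOR each byte pair and count set bits.
#[inline(always)]
fn hamming_body(a: &[u8], b: &[u8]) -> u64 {
    a.iter().zip(b).map(|(&x, &y)| u64::from((x ^ y).count_ones())).sum()
}

/// Same loop compiled with AVX2 enabled, so the optimizer may vectorize it.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn hamming_avx2_autovec(a: &[u8], b: &[u8]) -> u64 {
    hamming_body(a, b)
}

pub fn hamming(a: &[u8], b: &[u8]) -> u64 {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    match detect_tier() {
        // A full implementation would give each tier its own kernel.
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
        Tier::Avx512Vpopcntdq | Tier::Avx512Bw | Tier::Avx2 => {
            // SAFETY: every tier in this arm implies AVX2 was detected.
            unsafe { hamming_avx2_autovec(a, b) }
        }
        _ => hamming_body(a, b),
    }
}
```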

### 8. `stencil()` — 3D Neighbor-Aware SIMD

**MISSING.** `bnn_causal_trajectory.rs`, `deepnsm.rs`, `clam_search.rs` have neighbor traversal patterns but nothing 3D-stencil-specific.
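
As a reference point for what a 3D stencil API would compute, here is a scalar 6-neighbor sum over a dense grid. Everything in it is a hypothetical illustration: the `Grid3` type, the x-fastest memory order, and the choice to leave boundary cells at zero are assumptions, since the audit does not define the intended stencil shape or boundary policy.

```rust
/// Hypothetical dense 3-D grid in x-fastest order: index = x + nx * (y + ny * z).
pub struct Grid3 {
    pub nx: usize,
    pub ny: usize,
    pub nz: usize,
    pub data: Vec<f32>,
}

impl Grid3 {
    fn idx(&self, x: usize, y: usize, z: usize) -> usize {
        x + self.nx * (y + self.ny * z)
    }

    /// Sum of the 6 face neighbors for interior cells; boundary cells stay 0.
    pub fn neighbor_sum6(&self) -> Vec<f32> {
        assert_eq!(self.data.len(), self.nx * self.ny * self.nz, "shape mismatch");
        let mut out = vec![0.0f32; self.data.len()];
        if self.nx < 3 || self.ny < 3 || self.nz < 3 {
            return out; // no interior cells
        }
        let d = &self.data;
        // Strides to the neighbor along each axis.
        let (sx, sy, sz) = (1, self.nx, self.nx * self.ny);
        for z in 1..self.nz - 1 {
            for y in 1..self.ny - 1 {
                // The x loop is unit-stride: this is the part a SIMD version
                // would process 16 lanes at a time.
                for x in 1..self.nx - 1 {
                    let c = self.idx(x, y, z);
                    out[c] = d[c - sx] + d[c + sx]
                        + d[c - sy] + d[c + sy]
                        + d[c - sz] + d[c + sz];
                }
            }
        }
        out
    }
}
```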

### 9. `compact_palette()` — Bit-Packed SIMD

**PARTIAL.** Three relevant modules:
- `palette_distance.rs`: 256-entry `Palette` codebook with precomputed pairwise L1 distance matrix
- `quantized.rs`: f32→u8 quantization with scale/zero-point
- `bitwise.rs`: vpshufb nibble lookup (4-bit table, proven in SIMD)

**Gap:** No variable-width (4 to 15 bit) pack/unpack for Minecraft block state encoding.
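
A scalar baseline for variable-width packing is sketched below. It assumes entries are packed into `u64` words and never straddle a word boundary, so any leftover high bits in each word are padding. Whether the target encoding uses that padded layout or a tightly packed one is not stated in the audit, and the function names are hypothetical.

```rust
/// Pack `values` at `bits` per entry into u64 words (no entry spans two words).
pub fn pack(values: &[u16], bits: u32) -> Vec<u64> {
    assert!((4..=15).contains(&bits), "bits must be in 4..=15");
    let per_word = (64 / bits) as usize;
    let mask = (1u64 << bits) - 1;
    let mut words = vec![0u64; (values.len() + per_word - 1) / per_word];
    for (i, &v) in values.iter().enumerate() {
        assert!(u64::from(v) <= mask, "value does not fit in {} bits", bits);
        let shift = (i % per_word) as u32 * bits;
        words[i / per_word] |= u64::from(v) << shift;
    }
    words
}

/// Inverse of `pack`: read `len` entries of `bits` width back out.
pub fn unpack(words: &[u64], bits: u32, len: usize) -> Vec<u16> {
    assert!((4..=15).contains(&bits), "bits must be in 4..=15");
    let per_word = (64 / bits) as usize;
    let mask = (1u64 << bits) - 1;
    (0..len)
        .map(|i| {
            let shift = (i % per_word) as u32 * bits;
            ((words[i / per_word] >> shift) & mask) as u16
        })
        .collect()
}
```

At 5 bits per entry a word holds 12 entries and wastes 4 bits; a tightly packed layout would avoid the waste at the cost of entries that cross word boundaries.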

### 10. `prefetch_region()` + `stream_store()`

**PARTIAL.** `packed.rs` uses stroke-aligned layout for hardware prefetcher ("the prefetcher handles sequential access"). No explicit `_mm_prefetch` or `_mm_stream_ps`.
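
For reference, the explicit intrinsics can be combined as in the sketch below, which copies a slice with software prefetch on the source and non-temporal stores to the destination. The function name `stream_copy` and the prefetch distance are assumptions; a suitable distance depends on the workload and would need measurement.

```rust
/// Hypothetical copy using `_mm_prefetch` on reads and `_mm_stream_ps` on writes.
pub fn stream_copy(src: &[f32], dst: &mut [f32]) {
    assert_eq!(src.len(), dst.len(), "length mismatch");
    #[cfg(target_arch = "x86_64")]
    {
        use std::arch::x86_64::*;
        const AHEAD: usize = 64; // prefetch distance in f32 elements (tuning assumption)
        let n = src.len();
        // _mm_stream_ps needs a 16-byte-aligned destination: copy a scalar head first.
        let head = dst.as_ptr().align_offset(16).min(n);
        dst[..head].copy_from_slice(&src[..head]);
        let mut i = head;
        while i + 4 <= n {
            // SAFETY: i + 4 <= n bounds the load and the store, and dst + i is
            // 16-byte aligned. The prefetch is only a hint and never faults, so
            // its address may lie past the end (hence wrapping_add).
            unsafe {
                _mm_prefetch::<_MM_HINT_T0>(src.as_ptr().wrapping_add(i + AHEAD) as *const i8);
                let v = _mm_loadu_ps(src.as_ptr().add(i));
                _mm_stream_ps(dst.as_mut_ptr().add(i), v);
            }
            i += 4;
        }
        // Order the non-temporal stores before any later access to `dst`.
        unsafe { _mm_sfence() };
        dst[i..].copy_from_slice(&src[i..]);
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        dst.copy_from_slice(src);
    }
}
```

Non-temporal stores bypass the cache, so they tend to help only when the destination is large and will not be read again soon.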

---

## What Changed Since Last Audit

| New Module | Lines | Wishlist Impact |
|---|---|---|
| `src/simd.rs` | 829 | #1 #6 #7 — portable SIMD types with scalar fallback |
| `src/simd_avx512.rs` | 1399 | #1 — F32x16/F64x8/U8x64 with FMA, sqrt, reduce |
| `src/simd_avx2.rs` | 618 | #1 — f32x8/f64x4 dot product, GEMM tile sizes |
| `hpc/holo.rs` | new | Phase + focus + carrier (94 tests) |
| `hpc/zeck.rs` | new | Zeckendorf encoding + batch/top_k |
| `hpc/palette_distance.rs` | new | #9 — 256-entry palette with O(1) distance |
| `hpc/parallel_search.rs` | new | #2 — dual-path HHTL + CLAM search |
| `hpc/layered_distance.rs` | new | O(1) distance via palette index + precomputed matrix |
| `hpc/bgz17_bridge.rs` | new | Base17 bridge for palette interop |

---

*Audit generated 2026-03-23. AdaWorldAPI/ndarray master @ 11633d06.*
