Skip to content

Commit d0627b8

Browse files
authored
Merge pull request #149 from AdaWorldAPI/claude/vertical-simd-consumer-contract-w1a-spec
docs(simd): W1a consumer contract — 5 primitive specs + VPABSB correction
2 parents 3441060 + 7c2161b commit d0627b8

2 files changed

Lines changed: 329 additions & 0 deletions

File tree

Lines changed: 328 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,328 @@
1+
# KNOWLEDGE: Vertical SIMD — W1a Consumer Contract
2+
3+
## READ BY:
4+
- `savant-architect` agent — before designing any new public `pub fn` in `src/simd_*.rs`
5+
- `sentinel-qa` agent — when auditing the saturating / bounds-aware / scalar-fallback discipline on a SIMD addition
6+
- Any contributor opening a PR that adds an `impl` block on `F32x16` / `I8x16` / `U8x32` / `U64x8` etc.
7+
- Any contributor adding a new public function under `src/simd_ops.rs` or `src/simd_int_ops.rs`
8+
9+
## P0 TRIGGERS:
10+
- About to file a PR adding `pub fn` to `src/simd_*.rs` → read this first
11+
- About to claim "X SIMD instruction saturates by ISA" → read §"VPABSB correction" first
12+
- Five `TD-NDARRAY-SIMD-*` issues are about to be filed against this repo from the `AdaWorldAPI/lance-graph` consumer contract → those are the W1a queue described below
13+
14+
---
15+
16+
## Why this doc exists
17+
18+
`AdaWorldAPI/lance-graph` (the obligatory spine for the Ada architecture) carries a hard architectural invariant: **all SIMD must come from `ndarray::simd` via the polyfill — `simd.rs` + `simd_ops.rs` > `simd_{type}.rs` per-arch. Raw intrinsics outside `ndarray/src/simd_*.rs` are a violation**, enforced by the `simd-savant` agent at `lance-graph:.claude/agents/simd-savant.md`.
19+
20+
A PRE-MERGE audit of `lance-graph` main on 2026-05-16 surfaced **158 raw-intrinsic violations across 5 consumer crates** plus **3 missing primitives** in `ndarray::simd` that block clean remediation. The lance-graph side is staged to migrate (in 5 sequential consumer PRs); the missing primitives must land in ndarray FIRST. This doc is the contract for what those primitives must do, with implementation details called out where consumer-side correctness depends on getting the semantics right.
21+
22+
The architectural shape this doc serves is captured in detail at:
23+
- `AdaWorldAPI/lance-graph:.claude/knowledge/ndarray-vertical-simd-alien-magic.md` — the canonical reference, "alien magic" framing
24+
- `AdaWorldAPI/lance-graph:.claude/agents/simd-savant.md` — the consumer-side enforcement card
25+
- `AdaWorldAPI/lance-graph:.claude/board/EPIPHANIES.md` § `E-SIMD-SWEEP-1` (2026-05-16) — the 158-violation finding
26+
27+
---
28+
29+
## The pattern (one paragraph)
30+
31+
ndarray's SIMD surface is shaped to fit exactly what the Ada stack vertically needs — not as a generic library that consumers wrap, but as **struct methods on typed wrappers** (`I8x16`, `U8x32`, `F32x16`, `U64x8`, …) plus **closure-parameterized batch primitives** that absorb the consumer's domain semantics. Consumers see zero raw intrinsics, zero `cfg(target_arch)`, zero runtime feature-detect — they call `I8x16::from_i4_packed_u64(...)`, `I8x16::saturating_abs(...)`, `batch_packed_i4_16(..., |lanes, aux| { ... })`. The polyfill owns the runtime feature dispatch, lane chunking, tail handling, and scalar fallback. Per-arch code lives in `simd_avx512.rs` / `simd_neon.rs` / `simd_wasm.rs`; nothing arch-specific leaks above the `src/simd*.rs` namespace.
32+
33+
---
34+
35+
## VPABSB correction (P0 — read before implementing saturating_abs)
36+
37+
**`_mm512_abs_epi8` (VPABSB) does NOT saturate `i8::MIN`.** The Intel intrinsic returns the same bit pattern for `0x80` — i.e., `abs(i8::MIN) = i8::MIN` because `+128` does not fit in `i8`. An earlier draft of the consumer contract (2026-05-16 morning) claimed the instruction saturated `i8::MIN → 127` by ISA. Codex caught this on `lance-graph` PR #400; the correction is binding.
38+
39+
**Correct AVX-512 implementation of `I8x16::saturating_abs`:**
40+
41+
```rust
42+
// AVX-512 path
43+
let raw_abs = unsafe { _mm512_abs_epi8(self.0) };
44+
let clamped = unsafe {
45+
_mm512_min_epu8(raw_abs, _mm512_set1_epi8(0x7f))
46+
};
47+
I8x16(clamped)
48+
```
49+
50+
The mechanic:
51+
1. **VPABSB** computes the bit-pattern absolute value lane-wise. For `0x80` it returns `0x80` (the bit pattern of `+128` interpreted as unsigned). For everything else, `abs(x) < 0x80`, so the result fits in `i8` correctly.
52+
2. **VPMINUB** (unsigned-byte min) then clamps `0x80` (=128 unsigned) down to `0x7f` (=127). All lanes with `abs(x) < 0x80` are unaffected because `min_epu8(x, 0x7f) = x` for `x ≤ 0x7f` and `min_epu8(0x80, 0x7f) = 0x7f`.
53+
54+
Equivalent NEON:
55+
```rust
56+
// vqabsq_s8 is hardware-saturating (the `q` suffix means saturating)
57+
I8x16(unsafe { vqabsq_s8(self.0) })
58+
// Returns 127 for i8::MIN, identical to the AVX-512 + clamp result
59+
```
60+
61+
Scalar fused-loop:
62+
```rust
63+
for lane in 0..16 {
64+
out[lane] = input[lane].saturating_abs(); // stdlib, well-defined
65+
}
66+
```
67+
68+
**Mandatory test** (binding for the PR):
69+
```rust
70+
#[test]
71+
fn saturating_abs_i8_min_matches_across_backends() {
72+
let input = I8x16::splat(i8::MIN);
73+
let result = input.saturating_abs();
74+
assert_eq!(result.lane_i8::<0>(), i8::MAX);
75+
// ... and assert all 16 lanes equal i8::MAX
76+
}
77+
```
78+
79+
Any saturating-abs primitive in ndarray that does NOT produce `i8::MAX` for `i8::MIN` input is broken. The widen-then-negate trick (i8 → i64, then negate, then compare against threshold) used in `lance-graph` PR #398's mul.rs is a different mechanism and **not a substitute** — the new `I8x16::saturating_abs` must produce the saturating result in the same byte-wide register without widening, because downstream consumers will rely on byte-wide semantics for tight i4/i8 packed loops.
80+
81+
---
82+
83+
## W1a queue — 5 primitives ndarray must ship
84+
85+
Each is a tight-scope PR. Recommended: one branch per primitive, parallel review.
86+
87+
### W1a-#1`TD-NDARRAY-SIMD-UNPACK-I4-16D`
88+
89+
**Purpose:** unpack a `u64` of 16 packed signed nibbles (i4) into an `I8x16` with sign extension. Plus the closure-batch entry that the consumer's `mul::i4_eval::batch` dispatch calls.
90+
91+
**API surface:**
92+
```rust
93+
impl I8x16 {
94+
/// Unpack 16 signed i4 nibbles from a u64 into 16 i8 lanes
95+
/// (sign-extended). Nibble layout: lane[i] = sign_extend_4((packed >> (4*i)) & 0xf, i8).
96+
pub fn from_i4_packed_u64(packed: u64) -> Self;
97+
98+
/// Const-folded lane extract.
99+
pub fn lane_i8<const N: usize>(self) -> i8;
100+
}
101+
102+
/// Closure-parameterized batch: run `f` over each (unpacked_i8x16, aux[i]) pair.
103+
/// Bounds-aware tail handling; scalar fallback on unsupported arch.
104+
pub fn batch_packed_i4_16<E, F>(
105+
packed: &[u64],
106+
aux: &[i8],
107+
out: &mut [E],
108+
f: F,
109+
)
110+
where
111+
F: Fn(I8x16, i8) -> E + Sync + Send,
112+
E: Copy;
113+
```
114+
115+
**Per-arch implementation hints:**
116+
- **AVX-512:** load 16 × i8 from u64 via `_mm_cvtsi64_si128` + extend with `_mm512_cvtepi8_epi16` + nibble shuffle (PEXTRB or VPSHUFB with a mask LUT), then sign-extend by `_mm_cvtepi8_epi16`. Bench against alternative: PDEP (`_pdep_u64` × 2) into two u64 halves, then load + `vpmovsxbw` for sign-extend. Pick whichever benches faster on Zen4 + Sapphire Rapids.
117+
- **NEON:** `vld1_u8` 8 bytes into `uint8x8_t`, then nibble-split via `vshl_n_s8(v, 4)` and `vshr_n_s8(v, 4)`. Sign-extension is automatic from `vshr_n_s8`.
118+
- **Scalar:** fused loop reading 16 nibbles via `((packed >> (4*i)) & 0xf) as i8` with manual sign-extend (`if x > 7 { x - 16 } else { x }`).
119+
120+
**Consumer call site:** `lance-graph:crates/lance-graph-contract/src/mul.rs::i4_eval::batch` (5 batch fns over `QualiaI4_16D(u64)`). The closure-batch absorbs the 5 fns into closures + classifier names.
121+
122+
**PR #398 codex P1 (NEON OOB at `len==2`) is closed by this primitive** because the batch entry owns tail handling; consumers no longer reach for raw `vld1q_u64(&qualia[i+1].0 as *const u64)`.
123+
124+
---
125+
126+
### W1a-#2`TD-NDARRAY-SIMD-SATURATING-ABS-I8`
127+
128+
**Purpose:** byte-wide saturating absolute value. Closes codex P2 i8::MIN divergence on `lance-graph` PR #398 by giving consumers a single source-of-truth.
129+
130+
**API surface:**
131+
```rust
132+
impl I8x16 {
133+
/// Lane-wise saturating absolute value. saturating_abs(i8::MIN) == i8::MAX.
134+
/// All lanes are independently saturated.
135+
pub fn saturating_abs(self) -> Self;
136+
}
137+
138+
impl I8x32 {
139+
pub fn saturating_abs(self) -> Self; // parity
140+
}
141+
```
142+
143+
**Per-arch implementation:** see § "VPABSB correction" above. The AVX-512 path is `_mm512_min_epu8(_mm512_abs_epi8(x), _mm512_set1_epi8(0x7f))`; NEON is `vqabsq_s8`; scalar is `i8::saturating_abs`.
144+
145+
**Consumer:** `lance-graph:crates/lance-graph-contract/src/mul.rs` (Direction-B fix from PP-16 preflight-drift-auditor 2026-05-16). Spec line 233 of `lance-graph:.claude/specs/pr-sprint-13-simd-i4.md`: `|signed_mantissa| ≤ 1 → ValleyOfDespair` represents weak rule signal, NOT sign-extreme; `i8::MIN` must classify as `Slope/Plateau`, not `ValleyOfDespair`. Scalar in PR #398 is buggy (uses `unsigned_abs() as i8` which wraps `i8::MIN → -128`); the new primitive lets the fix be a one-liner: `lanes.saturating_abs().lane_i8::<0>()` ≤ 1.
146+
147+
---
148+
149+
### W1a-#3`TD-NDARRAY-SIMD-GATHER`
150+
151+
**Purpose:** SIMD gather for palette / lookup-table consumers. Currently `bgz17/src/simd.rs:88` inlines `_mm256_i32gather_epi32` (AP-SIMD-1 violation).
152+
153+
**API surface:**
154+
```rust
155+
impl U16x8 {
156+
/// Gather 8 u16 values from `table` at the given indices.
157+
/// indices[i] >= table.len() => panic in debug, scalar-fallback safe in release.
158+
pub fn gather_u16(indices: U16x8, table: &[u16]) -> Self;
159+
}
160+
161+
/// Convenience: lookup 8 bytes from a u8 LUT by u16 indices.
162+
pub fn palette_lookup_u8x8(idx_v: U16x8, lut: &[u8]) -> U8x8;
163+
```
164+
165+
**Per-arch implementation:**
166+
- **AVX2/AVX-512:** `_mm256_i32gather_epi32` with index widening + downcast (caveat: `_mm256_i32gather_epi32` reads 32 bits per index; for u16 values pack two indices per gather slot, or downcast post-gather).
167+
- **NEON:** no native gather instruction. Scalar loop is fine for 8 lanes — `(0..8).map(|i| table[indices.lane(i) as usize])`.
168+
- **Scalar:** identical to the NEON fallback.
169+
170+
**Bounds:** `gather_u16` MUST validate `max(indices) < table.len()` before the SIMD gather call (debug panic; in release, fall through to scalar with `.get()` for safety).
171+
172+
---
173+
174+
### W1a-#4`TD-NDARRAY-SIMD-PREFETCH`
175+
176+
**Purpose:** cross-arch prefetch hint. Currently `bgz17/src/prefetch.rs:96,100` inlines `_mm_prefetch` and `_prefetch` directly.
177+
178+
**API surface:**
179+
```rust
180+
/// Hint that `ptr` will be read soon; load into L1 (T0) cache.
181+
pub fn prefetch_read_t0(ptr: *const u8);
182+
183+
/// Hint to load into L2 (T1) cache.
184+
pub fn prefetch_read_t1(ptr: *const u8);
185+
186+
/// Hint to load into L3 (T2) cache.
187+
pub fn prefetch_read_t2(ptr: *const u8);
188+
```
189+
190+
**Per-arch implementation:**
191+
- **x86_64:** `_mm_prefetch(ptr as *const i8, _MM_HINT_T0)` / `_T1` / `_T2`.
192+
- **aarch64:** `__pld(ptr)` via inline asm `prfm pldl1keep, [ptr]` (T0), `pldl2keep` (T1), `pldl3keep` (T2). Or wrap `core::intrinsics::prefetch_read_data` if/when stable.
193+
- **Other arches:** no-op (the prefetch contract is a hint, not a guarantee — silent no-op is correct).
194+
195+
**Safety:** `ptr` is allowed to be invalid (prefetch on an unmapped page is a hint that the CPU silently drops on x86). No `assert!` needed.
196+
197+
---
198+
199+
### W1a-#5`TD-NDARRAY-SIMD-POPCOUNT-U64`
200+
201+
**Purpose:** lane-wise popcount of u64 vectors. Currently `holograph/hamming.rs` and `lance-graph:crates/lance-graph/src/graph/blasgraph/types.rs` use `_mm512_popcnt_epi64` directly for Hamming-distance reduction.
202+
203+
**API surface:**
204+
```rust
205+
impl U64x8 {
206+
/// Lane-wise population count. Each lane returns its u64 bit-count (0..=64).
207+
pub fn popcnt(self) -> Self;
208+
209+
/// XOR + lane-wise popcount + horizontal sum across 8 lanes.
210+
/// Optimized for Hamming-distance reductions.
211+
pub fn xor_popcount(self, other: Self) -> u64;
212+
}
213+
214+
impl U64x4 {
215+
pub fn popcnt(self) -> Self; // AVX2 parity
216+
}
217+
```
218+
219+
**Per-arch implementation:**
220+
- **AVX-512 VPOPCNTDQ:** `_mm512_popcnt_epi64` directly. Feature flag `avx512vpopcntdq`.
221+
- **AVX-512 without VPOPCNTDQ:** fallback via `_mm512_sad_epu8` on a per-byte popcount LUT (Mula's algorithm using VPSHUFB).
222+
- **NEON:** `vcntq_u8` for byte popcount, then horizontal sum within each u64 via `vaddvq_u8` or `vpaddlq_u8` cascade.
223+
- **Scalar:** `u64::count_ones` fused loop.
224+
225+
**Note:** the existing `ndarray::hpc::bitwise::popcount_raw` and `hamming_distance_raw` cover the slice case but DO NOT expose a lane-wise method. The new `U64x8::popcnt` fills that gap so consumers can compose Hamming-distance pipelines without dropping back to slice ops.
226+
227+
---
228+
229+
## W1.5 — DEFERRED primitives (gated on `lance-graph:crates/sigker` certification)
230+
231+
Three more primitives are queued behind a certification gate. `crates/sigker` is `lance-graph`'s path-signature codec — it's pure-scalar Rust today (zero raw intrinsics, zero ndarray dep), and is positioned as the **Index-regime third encoding lane** alongside palette-distance (bgz17) and NSM tiling (deepnsm). It explicitly bypasses the `I-NOISE-FLOOR-JIRAK` iron rule (Jirak 2016 Berry-Esseen for weak-dependence data) via Hambly-Lyons 2010 path-signature uniqueness.
232+
233+
When `jc Pillar 11` (Hambly-Lyons signature uniqueness on lance-graph paths) activates and sigker is benchmarked at production carrier widths, the W1.5 queue lights up:
234+
235+
### W1.5-#6`TD-NDARRAY-SIMD-SIGNATURE-PDE-SWEEP`
236+
237+
**Purpose:** signature kernel `〈S(X), S(Y)〉` via Goursat PDE — depth-∞ in O(T₁·T₂) flops, no signature materialization.
238+
239+
**API surface (sketch):**
240+
```rust
241+
pub fn signature_pde_sweep<F>(
242+
x: &[F32x16],
243+
y: &[F32x16],
244+
kernel_fn: F,
245+
) -> f32
246+
where
247+
F: Fn(F32x16, F32x16) -> F32x16;
248+
```
249+
250+
2D banded grid sweep; closure-parameterized kernel evaluator per step.
251+
252+
### W1.5-#7`TD-NDARRAY-SIMD-RANDOMIZED-PROJECTION`
253+
254+
Cuchiero-Schmocker-Teichmann (2021) randomized signatures: Gaussian random-matrix-vector update with `F32x16` state. Same closure-batch shape as W1a-#1, different lane type.
255+
256+
### W1.5-#8`TD-NDARRAY-SIMD-LYNDON-PACK`
257+
258+
Log-signature compression in the Lyndon basis of the free Lie algebra (7-13× compression, lossless). Pack/unpack primitives on `I16x16` state with combinatorial-index awareness.
259+
260+
**No code needed today for W1.5.** Mentioned here so W1a additions are designed broad enough to compose with these later (in particular: the closure-batch shape introduced in W1a-#1 is the foundation for W1.5-#7).
261+
262+
---
263+
264+
## Acceptance criteria for each W1a PR
265+
266+
Every PR adding a primitive from this queue MUST:
267+
268+
1. **Implement all three backends** (AVX-512/AVX2/SSE, NEON, scalar). Missing scalar fallback is a P0 reject — the scalar path is the correctness anchor.
269+
2. **Document the saturating / overflow / signedness semantics** in the doc-comment. State explicitly what happens at edge cases (`i8::MIN`, `u8::MAX`, empty slices, indices out-of-range).
270+
3. **Mandatory parity test** asserting all three backends produce identical output on a fixed-seed randomized corpus that includes edge cases (`i8::MIN`, `0`, `i8::MAX`, mantissa = -128, etc.). Use `proptest` or `quickcheck` if available; otherwise hand-roll 50+ test inputs.
271+
4. **Bench against scalar** — record AVX-512 / NEON speedup ratios in the PR body. No SHIP/LAND gate required for the primitive PR itself (the consumer-side migration PRs will benchmark end-to-end), but a 0.5× anti-speedup ratio is a reject.
272+
5. **`// SAFETY:` comments on every `unsafe` block** per ndarray's existing discipline (`CLAUDE.md` § Hard Rules).
273+
6. **No new `is_*_feature_detected!` calls outside `src/hpc/simd_caps.rs`** — dispatch through the existing `simd_caps()` singleton.
274+
7. **PR description must include the consumer site** (`lance-graph:crates/lance-graph-contract/src/mul.rs:NNN`, etc.) so the post-merge consumer-PR has a known target.
275+
276+
The `simd-savant` agent on the `lance-graph` side runs PRE-MERGE against every W1a PR to verify compliance.
277+
278+
---
279+
280+
## Cross-references
281+
282+
**ndarray-side (this repo):**
283+
- `src/simd.rs` — the public re-export hub. New primitives surface here.
284+
- `src/simd_avx512.rs` — AVX-512 typed wrappers (`I64x8`, `U64x8`, `I8x32`, `F32x16`, `F64x8`, …).
285+
- `src/simd_avx2.rs` — AVX2 typed wrappers (`U8x32`).
286+
- `src/simd_neon.rs` — NEON typed wrappers.
287+
- `src/simd_ops.rs` — high-level vector→vector ops (`add_f32`, `mul_f32`, …).
288+
- `src/simd_int_ops.rs` — integer batch ops (`add_i8`, `dot_i8`, `min_i8`, …).
289+
- `src/hpc/simd_caps.rs` — runtime feature-detect singleton.
290+
- `src/hpc/bitwise.rs` — already-exposed `hamming_distance_raw` + `popcount_raw` (slice case).
291+
292+
**lance-graph-side (the consumer driving this contract):**
293+
- `AdaWorldAPI/lance-graph:.claude/knowledge/ndarray-vertical-simd-alien-magic.md` — full architectural doc + per-workload table
294+
- `AdaWorldAPI/lance-graph:.claude/agents/simd-savant.md` — PRE-MERGE audit gate
295+
- `AdaWorldAPI/lance-graph:.claude/board/EPIPHANIES.md` § `E-SIMD-SWEEP-1` — the 158-violation finding
296+
- `AdaWorldAPI/lance-graph:.claude/board/TECH_DEBT.md` § `TD-NDARRAY-SIMD-*` and § `TD-SIMD-SWEEP-W*` — full debt ledger
297+
- `AdaWorldAPI/lance-graph:.claude/specs/pr-sprint-13-simd-i4.md` — D-CSV-13b spec (the consumer workload spec)
298+
- PR #398 (sprint-13 W-I1 retry) — the codex P1 (NEON OOB) + P2 (i8::MIN divergence) origin
299+
- PR #399 (`simd-savant` card + autoattended-pattern doc) — invariant declaration
300+
- PR #400 (architectural capture commit) — the canonical reference + tech-debt entries
301+
302+
**External references:**
303+
- Intel Intrinsics Guide — `_mm512_abs_epi8` (VPABSB; does NOT saturate `i8::MIN`)
304+
- Intel Intrinsics Guide — `_mm512_min_epu8` (VPMINUB; unsigned-byte minimum, used to clamp the VPABSB result)
305+
- Intel Intrinsics Guide — `_mm512_popcnt_epi64` (VPOPCNTDQ; AVX-512 feature `avx512vpopcntdq`)
306+
- Intel Intrinsics Guide — `_mm256_i32gather_epi32` (VPGATHERDD AVX2)
307+
- ARM Architecture Reference — VQABS (`vqabsq_s8`, hardware-saturating)
308+
- ARM Architecture Reference — VCNT (`vcntq_u8`, byte-wise popcount)
309+
- Hambly & Lyons (2010), "Uniqueness for the signature of a path of bounded variation and the reduced path group"
310+
- Cuchiero, Schmocker & Teichmann (2021), "Random feature neural networks learn Black-Scholes type PDEs without curse of dimensionality"
311+
- Jirak (2016), "Berry-Esseen theorems under weak dependence" — the iron rule sigker bypasses
312+
313+
## Litmus tests (for any contributor proposing an addition to this queue)
314+
315+
> **Does the new primitive go on a typed-wrapper struct, or as a free function?**
316+
> Free function = reject; the surface fragments. Struct method = accept.
317+
318+
> **Does the doc-comment state the edge-case behavior (saturating? wrapping? UB? scalar-fallback?)?**
319+
> Missing = reject. The consumer needs to know without reading the code.
320+
321+
> **Are all three backends implemented (AVX*, NEON, scalar)?**
322+
> Missing scalar = reject. Scalar is the correctness anchor.
323+
324+
> **Is there a parity test asserting all three backends produce identical output on a fixed-seed randomized corpus including edge cases?**
325+
> Missing = reject. The codex P2 i8::MIN divergence on `lance-graph` PR #398 happened because no such test existed.
326+
327+
> **Is the consumer site cited in the PR description?**
328+
> Missing = reject. We're shipping primitives for known workloads, not speculative ones.

CLAUDE.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,6 +28,7 @@ This project uses specialized agents in `.claude/agents/`. Follow these rules:
2828
- Every `unsafe` block needs a `// SAFETY:` comment.
2929
- All public APIs need `///` doc comments with examples.
3030
- `cargo clippy -- -D warnings` must pass.
31+
- **All new public `pub fn` in `src/simd_*.rs` follows the W1a consumer contract** at `.claude/knowledge/vertical-simd-consumer-contract.md` — struct methods on typed wrappers, closure-parameterized batch primitives, all three backends (AVX*/NEON/scalar) implemented, parity test mandatory, saturating/overflow semantics documented. The Ada stack (lance-graph + downstream) enforces "all SIMD from `ndarray::simd`" via its `simd-savant` agent; missing primitives in ndarray force consumer-side raw-intrinsic violations, so additions here are gating the consumer-side sweep. **VPABSB does NOT saturate `i8::MIN`** — see § "VPABSB correction" in the contract doc before implementing `saturating_abs` or any abs primitive.
3132

3233
## Compaction Preservation
3334
When summarizing this conversation, preserve:

0 commit comments

Comments
 (0)