Skip to content

W1a vertical SIMD primitives + W2/W3/W4 polish (compile-time dispatch)#203

Merged
AdaWorldAPI merged 6 commits into
masterfrom
claude/splat3d-cpu-simd-renderer-MAOO0
May 26, 2026
Merged

W1a vertical SIMD primitives + W2/W3/W4 polish (compile-time dispatch)#203
AdaWorldAPI merged 6 commits into
masterfrom
claude/splat3d-cpu-simd-renderer-MAOO0

Conversation

@AdaWorldAPI
Copy link
Copy Markdown
Owner

@AdaWorldAPI AdaWorldAPI commented May 26, 2026

Summary

Lands the W1a vertical-SIMD consumer-contract primitives (the only outstanding "wave" — it gates the downstream lance-graph raw-intrinsic sweep) plus the deferred/cosmetic follow-ups from W2/W3/W4. One PR, four file-disjoint commits.

Posture (deliberate): compile-time #[cfg(target_feature)] dispatch only via simd.rssimd_{type}.rs. Runtime (LazyLock<CpuCaps>) dispatch is intentionally deferred — please don't flag its absence as a gap. Scalar is the mandatory correctness anchor; only one backend compiles per cargo config, so parity tests assert active-backend == inline scalar reference and cross-backend agreement is covered by running the suite under each config.

Commits

  • W1a SIMD (simd.rs, simd_avx512/avx2/neon/scalar.rs, simd_int_ops.rs)
    • I8x16::from_i4_packed_u64 + lane_i8 + batch_packed_i4_16 (i4→i8 sign-extend)
    • I8x16/I8x32::saturating_absVPABSB correction (_mm256_abs_epi8+_mm256_min_epu8; NEON vqabsq_s8; scalar i8::saturating_abs), binding saturating_abs(i8::MIN)==i8::MAX
    • U16x8::gather_u16 + palette_lookup_u8x8 (bounds-checked; scalar-safe in release)
    • prefetch_read_t0/t1/t2 (x86 _mm_prefetch; aarch64 prfm; no-op elsewhere)
    • U64x8::popcnt/xor_popcount + U64x4::popcnt (AVX-512 VPOPCNTDQ; scalar fallback)
  • W2softmax_axis_f32 / log_softmax_axis_f32 over ArrayView2; sum_axis_f32 skipped (ndarray's .sum_axis() covers it)
  • W3SoaVec::iter_rows() + SoaRowIter, #[inline], Clone/Debug
  • W4bulk_for_each (+ #[deprecated] bulk_scan alias), un-gated integration test, #[inline]

Consumer sites (per W1a acceptance #7)

lance-graph-contract::mul i4_eval batch; bgz17 gather/prefetch; holograph/blasgraph hamming.

Known follow-ups (non-blocking)

  • I8x16 on x86 is scalar-storage [i8;16]from_i4_packed_u64/saturating_abs are scalar loops; a __m128i (_mm_abs_epi8/_mm_min_epu8) vectorization is a follow-up. I8x32::saturating_abs already uses the real VPABSB path.
  • aarch64 prfm asm options + NEON paths are compile-checked here only on x86; rely on the NEON CI job.

Test plan

  • cargo build default v3 — clean
  • cargo build --config .cargo/config-avx512.toml (v4, _mm512_popcnt_epi64 path) — clean
  • cargo clippy --lib — clean (-D warnings)
  • cargo test --lib2136 passed / 0 failed; W1a parity tests green incl. i8::MIN/u64::MAX/OOB
  • new-item doctests — 24 passed
  • aarch64 NEON job (CI)
  • AVX-512 silicon run (CI; can't execute v4 binary on this runner)

Design: .claude/knowledge/w1a-simd-integration-plan.md, vertical-simd-consumer-contract.md.

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41


Generated by Claude Code

Summary by CodeRabbit

  • Documentation

    • Added W1a SIMD integration plan documentation.
  • New Features

    • Added axis-aware softmax and log-softmax functions for 2-D arrays.
    • Introduced row iteration support for SoaVec.
    • Added bulk_for_each for read-only chunked traversal.
    • Expanded SIMD utility primitives across all supported platforms.
  • Improvements

    • SoaVec now derives Clone and Debug for better usability.
  • Deprecations

    • bulk_scan superseded by bulk_for_each.

Review Change Stack

claude added 5 commits May 26, 2026 01:41
…register

Pin the compile-time-only / runtime-deferred posture, file-disjoint agent
assignments (W1 simd_*; W2 activations+reductions; W3 soa; W4 bulk), the
dependency map (missing x86 homes for I8x16/U16x8; U64x8 polyfill on AVX2),
and the dual-config verification strategy (only one backend compiles per
cargo config; parity tests assert active-backend == inline scalar reference).

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
… backends

Ship the 5 W1a consumer-contract primitives that unblock the lance-graph raw-
intrinsic sweep, surfaced through `simd.rs` compile-time `#[cfg(target_feature)]`
dispatch (no runtime dispatch — deferred), with scalar as the correctness anchor:

- I8x16::from_i4_packed_u64 + lane_i8 + batch_packed_i4_16 (i4→i8 sign-extend)
- I8x16/I8x32::saturating_abs (VPABSB correction: _mm256_abs_epi8 + _mm256_min_epu8;
  NEON vqabsq_s8; scalar i8::saturating_abs — saturating_abs(i8::MIN)==i8::MAX)
- U16x8::gather_u16 + palette_lookup_u8x8 (bounds-checked; scalar-safe in release)
- prefetch_read_t0/t1/t2 (x86 _mm_prefetch; aarch64 prfm; no-op elsewhere)
- U64x8::popcnt/xor_popcount + U64x4::popcnt (AVX-512 VPOPCNTDQ; scalar fallback)

Backends: AVX-512 (native where available), AVX2/scalar polyfill, NEON, scalar.
Parity tests assert active-backend == scalar reference (compile-time dispatch =
one backend per build) incl. i8::MIN / u64::MAX / OOB edge cases.
Consumer sites: lance-graph-contract::mul i4_eval, bgz17 gather/prefetch,
holograph/blasgraph hamming. Compiles clean on default v3 and config-avx512.

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
…_axis_f32

Add the deferred axis variants over ArrayView2 along an Axis, delegating per-lane
to the existing stable 1-D softmax/log_softmax kernels. sum_axis_f32 intentionally
not added (ndarray's ArrayBase::sum_axis already covers it — NOTE in reductions.rs).

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
Add SoaVec::iter_rows() (+ SoaRowIter, ExactSizeIterator), #[inline] on
aos_to_soa/soa_to_aos, and Clone/Debug on SoaVec. No SoaBatch alias (not in
the design doc).

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
Rename bulk_scan to bulk_for_each (bulk_scan kept as a #[deprecated] forwarding
alias), un-gate the bulk_apply x aos_to_soa integration test, add #[inline].
Updated the one in-repo caller in blocked_grid/tests.rs.

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 26, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: d4e9df13-f7e7-4b42-b3e9-c9805825cb9e

📥 Commits

Reviewing files that changed from the base of the PR and between 2f74c03 and 069640b.

📒 Files selected for processing (12)
  • .claude/knowledge/w1a-simd-integration-plan.md
  • src/hpc/activations.rs
  • src/hpc/blocked_grid/tests.rs
  • src/hpc/bulk.rs
  • src/hpc/reductions.rs
  • src/hpc/soa.rs
  • src/simd.rs
  • src/simd_avx2.rs
  • src/simd_avx512.rs
  • src/simd_int_ops.rs
  • src/simd_neon.rs
  • src/simd_scalar.rs

📝 Walkthrough

Walkthrough

This pull request introduces W1a SIMD primitives—packed-i4 unpacking, gather/LUT operations, and cache prefetching—across x86_64 (AVX-512), aarch64 (NEON), and scalar backends, alongside HPC improvements to bulk traversal naming, SoA vector iteration, and multidimensional activation functions.

Changes

W1a SIMD Primitives & HPC Integration

Layer / File(s) Summary
Integration Plan & HPC Context
.claude/knowledge/w1a-simd-integration-plan.md, src/hpc/reductions.rs
New integration-plan documentation specifies per-agent responsibilities, dependency entanglements, behavioral constraints, and verification gates. HPC reductions module documents why sum_axis_f32 wrapper is not added (ndarray's sum_axis already covers it).
HPC Bulk Traversal API Refactoring
src/hpc/bulk.rs, src/hpc/blocked_grid/tests.rs
bulk_scan is renamed to bulk_for_each and deprecated with an inline alias; documentation and tests updated to reflect the new canonical name; integration test in W4 composition changes from bulk_scan to bulk_for_each.
HPC SoA Vector Enhancement
src/hpc/soa.rs
SoaVec<T, N> now derives Clone and Debug; new iter_rows() method (for T: Copy) reconstructs rows as [T; N] via SoaRowIter iterator; aos_to_soa/soa_to_aos marked with #[inline]; tests added for row iteration, empty slices, and size hint behavior.
HPC Axis-aware Activation Functions
src/hpc/activations.rs
New public functions softmax_axis_f32 and log_softmax_axis_f32 apply existing 1-D SIMD kernels independently to each 1-D lane along a specified 2-D axis; includes axis bounds validation, shape matching, and tests validating row/column normalization and panic on invalid indices.
W1a SIMD Polyfill Primitives (AVX-512 Reference)
src/simd_avx512.rs
Introduces x86_64-backed scalar polyfill types I8x16, U8x8, U16x8 with constructors/conversions, packed-i4 nibble unpacking via from_i4_packed_u64, lane extraction, and lane-wise saturating_abs; defines U16x8::gather_u16 indexed gather with bounds-safe fallback; adds palette_lookup_u8x8 8-lane LUT helper and batch_packed_i4_16 generic closure-driven batch processor; implements I8x32::saturating_abs and x86_64 prefetch intrinsics prefetch_read_t0/t1/t2.
W1a SIMD Popcount Operations
src/simd_avx512.rs, src/simd_avx2.rs
Extends U64x8 with lane-wise popcnt and reduction-style xor_popcount; adds U64x4::popcnt; uses _mm512_popcnt_epi64 under compile-time avx512vpopcntdq, else falls back to per-lane scalar count_ones().
W1a SIMD Cross-Platform Backend Support
src/simd_neon.rs, src/simd_scalar.rs
Implements W1a primitives for aarch64 NEON and scalar fallback: same polyfill types, i4 unpacking, saturating abs, gather/LUT helpers; aarch64 uses inline asm! PRFM hints for prefetch, scalar backend uses no-ops; all backends include bounds-safe gather/LUT with debug assertions and release-time safe 0 fallback for out-of-range indices.
W1a SIMD Public API Surface & Comprehensive Tests
src/simd.rs, src/simd_int_ops.rs
Re-exports W1a primitives in architecture-conditional tiers: x86_64 AVX-512, AVX2 baseline, aarch64 NEON, and scalar fallback. Comprehensive test suite validates i4 unpacking and sign extension, saturating absolute value edge cases, gather/LUT correctness against fixed references, prefetch no-ops on valid and null pointers, popcount lane-wise and reduction operations, and batch i4 processing with zero and non-zero nibble patterns.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Whiskers twitching with SIMD delight,
W1a primitives shining ever bright,
Packed nibbles dance, gathers leap and play,
Cross-platform backends light the way,
HPC flows refined—a chorus of speed! 🚀

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/splat3d-cpu-simd-renderer-MAOO0

Comment @coderabbitai help to get the list of available commands and usage tips.

@AdaWorldAPI AdaWorldAPI marked this pull request as ready for review May 26, 2026 02:04
@AdaWorldAPI AdaWorldAPI merged commit 9ef918c into master May 26, 2026
17 of 18 checks passed
AdaWorldAPI pushed a commit that referenced this pull request May 26, 2026
…a tests

I8x16::saturating_abs now uses _mm_abs_epi8 + _mm_min_epu8 (the contract's
VPABSB correction: VPABSB returns 0x80 for i8::MIN, VPMINUB clamps to 0x7f)
instead of a per-lane branching scalar loop — 16 lanes branchless.

Also adds the binding W1a unit tests that #203 shipped without (only
rust,ignore doctests existed): saturating_abs(i8::MIN)==i8::MAX for I8x16
and I8x32, a scalar-reference corpus, i4 sign-extension, U64x8 popcnt /
xor_popcount, and gather_u16. All 6 pass on the v3 build.

Not changed (measured, not assumed): U64x8::popcnt on AVX2 already lowers
to hardware POPCNT via count_ones; gather_u16 stays scalar because a 32-bit
_mm256_i32gather over a &[u16] over-reads past the last index (no 16-bit
hardware gather exists).

https://claude.ai/code/session_017GFLBnDy23AWBqvkbHHC41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants