bench(splat3d): EWA-SYRK crossover — kill-or-justify the BLAS-backend premise (W1 #3) by AdaWorldAPI · Pull Request #207 · AdaWorldAPI/ndarray

AdaWorldAPI · 2026-05-26T04:17:04Z

What this is

PR #3 of the 7-PR cross-session program (Kernel lane), rebased on top of #205. Adds benches/ewa_syrk_crossover.rs — a Criterion bench that tests the 3DGS-EWA-SYRK-BLAS-MKL plan's premise with a number instead of an assertion:

"3DGS projection is a BLAS workload in disguise → route the EWA covariance sandwich Σ' = M·Σ·Mᵀ through an MKL / OpenBLAS / AMX backend."

The plan is inspiration, not authority (per #201 / the #200 evidence model). The evidence here is project.rs/spd3.rs (whole-read) + the measurement.

What it measures

Three kernel shapes for the M·N·Mᵀ sandwich, each run over batches of 1k / 100k / 1M elements:

| shape | what |
|---|---|
| `scalar` | hand upper-triangle `spd3::sandwich`, per element |
| `simd_x16` | the shipped SoA `sandwich_x16` (the renderer's kernel) |
| `gemm_shape` | two dense 3×3 matmuls per element: the shape a per-matrix CPU BLAS call imposes, in-process, no FFI (⇒ a lower bound on real `cblas`) |

plus project_batch end-to-end throughput.
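To make the difference between the `scalar` and `gemm_shape` rows concrete, here is a minimal sketch of the two shapes on plain arrays. It assumes a row-major `[[f32; 3]; 3]` layout; `Mat3`, `matmul3`, `sandwich_dense` and `sandwich_upper` are hypothetical names for illustration, not the crate's `Spd3` type or its `spd3::sandwich` implementation.

```rust
/// Row-major dense 3x3 matrix (hypothetical helper type, not the crate's `Spd3`).
type Mat3 = [[f32; 3]; 3];

/// Dense 3x3 product, the building block of a GEMM-shaped path.
fn matmul3(a: &Mat3, b: &Mat3) -> Mat3 {
    let mut c = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            for k in 0..3 {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    c
}

/// Plain transpose, needed only by the dense path.
fn transpose3(a: &Mat3) -> Mat3 {
    let mut t = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            t[j][i] = a[i][j];
        }
    }
    t
}

/// GEMM-shaped sandwich: two full dense products, all nine outputs computed.
fn sandwich_dense(m: &Mat3, s: &Mat3) -> Mat3 {
    matmul3(&matmul3(m, s), &transpose3(m))
}

/// Upper-triangle sandwich: for symmetric `s` the result is symmetric, so only
/// the six unique entries [xx, xy, xz, yy, yz, zz] are computed.
fn sandwich_upper(m: &Mat3, s: &Mat3) -> [f32; 6] {
    let t = matmul3(m, s); // T = M * S
    // out[i][j] = sum_k T[i][k] * M[j][k], i.e. (T * M^T)[i][j]
    let dot = |i: usize, j: usize| t[i][0] * m[j][0] + t[i][1] * m[j][1] + t[i][2] * m[j][2];
    [dot(0, 0), dot(0, 1), dot(0, 2), dot(1, 1), dot(1, 2), dot(2, 2)]
}

fn main() {
    let m: Mat3 = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [4.0, 0.0, 1.0]];
    let s: Mat3 = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]];
    println!("dense: {:?}", sandwich_dense(&m, &s));
    println!("upper: {:?}", sandwich_upper(&m, &s));
}
```

In this sketch the dense path computes nine outputs in its second product and the upper-triangle path six; both agree on the shared entries when `s` is symmetric.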

Result — measured at `target-cpu=x86-64-v4` (AVX-512 native)

Committed .cargo/config.toml stays x86-64-v3 for GitHub/CI portability; benches run at the project's deployment tier v4 via RUSTFLAGS="-Ctarget-cpu=x86-64-v4". F32x16 is a single __m512 at v4.

M·N·Mᵀ sandwich (Melem/s, higher = better):

| N | scalar | simd_x16 | gemm_shape (BLAS-shape) |
|---|---|---|---|
| 1 024 | 85.2 | 175.2 | 90.1 |
| 100 000 | 76.3 | 169.6 | 85.4 |
| 1 000 000 | 81.9 | 172.0 | 87.1 |

Verdict — BLAS backend NOT justified at 3×3

- gemm_shape ≈ scalar, and ~2× slower than simd_x16 at every size 1k→1M. No crossover; the gap is flat, not closing with batch size.
- gemm_shape has no FFI: real cblas/MKL adds marshalling + dispatch on top, so it can only be worse. There is no efficient CPU batched-3×3 SYRK (that's a GPU pattern).
- ⇒ The EWA-SYRK backend is a pessimization at 3×3/2×3; the fused SoA SIMD already wins. The 3DGS-EWA-SYRK-BLAS-MKL plan row is idea-only: the sandwich is SYRK-shaped (true), but the actionable backend is killed by measurement.
- Tier-robust: v3 baseline is within ~5% of v4 for this transpose-bound kernel.
- Corroborates the splat3d PR-3 prediction of "1.5-2× SIMD-vs-scalar" for the projection kernel.
- Steelman left open: W·Σ·Wᵀ has a shared W across gaussians → a batched shared-W GEMM is the one form that could differ; benched as a follow-up. Per-gaussian J·Σ·Jᵀ does not batch that way.
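The ratios behind the verdict can be recomputed directly from the table. The sketch below hard-codes the Melem/s figures reported above; the names `ROWS` and `speedups` are illustrative only.

```rust
/// Throughput in Melem/s copied from the results table:
/// (N, scalar, simd_x16, gemm_shape).
const ROWS: [(u32, f64, f64, f64); 3] = [
    (1_024, 85.2, 175.2, 90.1),
    (100_000, 76.3, 169.6, 85.4),
    (1_000_000, 81.9, 172.0, 87.1),
];

/// Speedup of `simd_x16` over `scalar` and over `gemm_shape` for one row.
fn speedups(row: (u32, f64, f64, f64)) -> (f64, f64) {
    let (_, scalar, simd, gemm) = row;
    (simd / scalar, simd / gemm)
}

fn main() {
    for row in ROWS {
        let (vs_scalar, vs_gemm) = speedups(row);
        println!(
            "N = {:>9}: simd/scalar = {:.2}x, simd/gemm_shape = {:.2}x",
            row.0, vs_scalar, vs_gemm
        );
    }
}
```

A crossover would show up as the `simd/gemm_shape` ratio trending toward 1 as N grows; with these numbers it stays just under 2 at all three sizes.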

Scope / boundary

- Bench-only: benches/ewa_syrk_crossover.rs + its [[bench]] entry + a RESULTS.md section. No src/ changes. .cargo/config.toml untouched (stays v3).
- required-features = ["splat3d"], mirroring splat3d_bench.

Test plan

- cargo bench --features splat3d --bench ewa_syrk_crossover --no-run compiles clean on master (incl. feat(cesium): implement tileset.rs cold-import parser — Group-A entry, no-serde #205)
- Full run captured at v4; numbers + verdict in benches/RESULTS.md
- No src/ changes; config stays v3

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Generated by Claude Code

Summary by CodeRabbit

Tests
- Added performance benchmarks comparing three computation approaches (scalar, SIMD-batched, and dense-matrix) across multiple data sizes, plus end-to-end throughput measurements; benchmarks are gated behind an optional feature.
Documentation
- Added benchmark results and analysis with per-batch timings, throughput numbers, and a conclusion that a BLAS-style backend isn’t justified for the tested scenario, with notes for follow-up benchmarking.

coderabbitai · 2026-05-26T04:17:21Z

📝 Walkthrough

Walkthrough

Adds a Criterion benchmark and RESULTS documentation comparing three covariance "sandwich" kernels—scalar, SIMD-x16 batched, and a gemm-style two 3×3 matmul—measuring per-element and end-to-end throughput across batch sizes and reporting that a BLAS-style backend isn't justified for the tested 3×3 case.

Changes

EWA-SYRK Crossover Benchmark

| Layer / File(s) | Summary |
|---|---|
| Benchmark target setup and module documentation `Cargo.toml`, `benches/ewa_syrk_crossover.rs` (1–39) | Adds `ewa_syrk_crossover` bench target gated by `splat3d`; module docs describe the three sandwich kernel shapes and the experiment. |
| Input generation and helpers `benches/ewa_syrk_crossover.rs` (40–125) | Adds benchmark constants, RNG, `build_spd_pairs` with quaternion normalization, and `build_gaussians` to produce deterministic inputs for benches. |
| GEMM sandwich and bench_sandwich_paths `benches/ewa_syrk_crossover.rs` (83–171) | Implements `sandwich_gemm_shape` (two explicit 3×3 matmuls + symmetrize) and `bench_sandwich_paths` that compares scalar, SIMD-x16, and gemm_shape variants across multiple batch sizes. |
| End-to-end project_batch and results `benches/ewa_syrk_crossover.rs` (173–192), `benches/RESULTS.md` (126–185) | Adds `bench_project_batch`, registers Criterion groups via `criterion_group!`/`criterion_main!`, and documents AVX-512 timings, per-batch and end-to-end throughput tables, and the conclusion that BLAS backend crossover is not supported for the 3×3 sandwich at tested tier. |

🎯 2 (Simple) | ⏱️ ~12 minutes

🐰 A rabbit hops through code both old and new,
Three sandwich methods benchmarked, tried, and true.
SIMD and GEMM each take their turn,
To see which kernel makes our matrices burn!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a benchmark to evaluate whether a BLAS-backend approach is justified for EWA-SYRK covariance computation, which is the primary objective. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benches/RESULTS.md`:
- Around line 142-143: Update the RUSTFLAGS example in the bench docs to use the
canonical rustc codegen flag spacing; replace the current RUSTFLAGS string that
contains "-Ctarget-cpu=..." with the spaced form "-C target-cpu=..." in the
example near the ewa_syrk_crossover bench command so the documentation matches
rustc's documented syntax.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b2c454ad-3c01-4ebc-adae-f322143b01ad

📥 Commits

Reviewing files that changed from the base of the PR and between b58a284 and 2a35416.

📒 Files selected for processing (3)

Cargo.toml
benches/RESULTS.md
benches/ewa_syrk_crossover.rs

… premise

Adds benches/ewa_syrk_crossover.rs (W1 PR #3 of the cross-session program). Tests the 3DGS-EWA-SYRK-BLAS-MKL plan's premise ("projection is a BLAS workload in disguise → MKL/OpenBLAS/AMX backend for the covariance sandwich") with a number instead of an assertion.

Compares M·N·Mᵀ three ways over N = 1k/100k/1M:
- scalar: hand upper-triangle sandwich, per element
- simd_x16: shipped SoA F32x16 sandwich_x16 (the renderer's kernel)
- gemm_shape: two dense 3×3 matmuls per element, the shape a per-matrix CPU BLAS call imposes, in-process (no FFI ⇒ lower bound on cblas)

plus project_batch end-to-end throughput.

Result (target-cpu=x86-64-v3, AVX-512 via runtime dispatch): simd_x16 ~2× faster than BOTH scalar and gemm_shape at every size; no crossover up to 1M; gemm_shape ≈ scalar. Real MKL/cblas adds FFI on top.

Verdict: the EWA-SYRK *backend* is a pessimization at 3×3/2×3; fused SoA SIMD already wins. The plan row is idea-only (the sandwich IS SYRK-shaped) but the actionable backend is killed by measurement. Corroborates PR-3's predicted 1.5-2× SIMD-vs-scalar. Full numbers + verdict in benches/RESULTS.md.

Evidence-grounded per the program: source (project.rs/spd3.rs whole-read) + measurement, not the plan-as-authority. Steelman (shared-W batched GEMM) left as a documented follow-up.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Correction: the prior RESULTS reported v3 numbers and wrongly attributed AVX-512 to runtime dispatch. F32x16 is compile-time-selected by target-cpu, so v3 measured AVX2. Benches must run at the project's deployment tier v4 (AVX-512 native, F32x16 = __m512); committed .cargo/config.toml stays v3 for GitHub/CI portability, overridden locally via RUSTFLAGS=-Ctarget-cpu=x86-64-v4. v4 numbers (Melem/s): simd_x16 175/170/172 vs scalar 85/76/82 vs gemm_shape 90/85/87 at 1k/100k/1M. Verdict unchanged and tier-robust (v3 within ~5%): simd_x16 ~2x over both scalar and the BLAS-shape, no crossover — the EWA-SYRK backend is a pessimization at 3x3. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
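The distinction the correction draws, compile-time selection by `target-cpu` versus runtime dispatch, can be illustrated with standard-library facilities alone. This is a sketch, not the crate's `F32x16` selection logic; `compiled_tier` and `cpu_has_avx512f` are hypothetical helper names.

```rust
/// Widest x86 vector tier the compiler was allowed to assume for this build.
/// This is fixed by `-C target-cpu` / `-C target-feature` at compile time.
fn compiled_tier() -> &'static str {
    if cfg!(target_feature = "avx512f") {
        "avx512"
    } else if cfg!(target_feature = "avx2") {
        "avx2"
    } else {
        "baseline"
    }
}

/// Whether the CPU executing this binary reports AVX-512F at runtime.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn cpu_has_avx512f() -> bool {
    is_x86_feature_detected!("avx512f")
}

/// Non-x86 targets have no AVX-512 to detect.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn cpu_has_avx512f() -> bool {
    false
}

fn main() {
    println!("compiled tier:   {}", compiled_tier());
    println!("cpu has avx512f: {}", cpu_has_avx512f());
}
```

Built with `-C target-cpu=x86-64-v3`, `compiled_tier()` reports `avx2` even on an AVX-512 machine where `cpu_has_avx512f()` is true, which is the situation the correction describes for a type whose width is chosen at compile time.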

… RESULTS Fixes CI cargo fmt --all --check (rustfmt 1.95.0: wrap the use import, normalize_quat method chain, and Spd3::new args) and the CodeRabbit nit on RESULTS.md (canonical `-C target-cpu` spelling with a space). Docs/format only; no behavior change. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

benches/ewa_syrk_crossover.rs (1)
1-182: ⚠️ Potential issue | 🟠 Major

Fix Clippy doc lint failures in benches/ewa_syrk_crossover.rs

cargo clippy --bench ewa_syrk_crossover --features splat3d -- -D warnings fails due to doc comment formatting:

clippy::doc-overindented-list-items at benches/ewa_syrk_crossover.rs lines 21-22

clippy::doc-lazy-continuation at benches/ewa_syrk_crossover.rs line 27

Formatting gate can’t be checked here because cargo fmt/rustfmt isn’t available in the environment; run cargo fmt --all -- --check with rustfmt installed to confirm.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benches/ewa_syrk_crossover.rs` around lines 1 - 182, Top-level module doc
comments are tripping Clippy: fix the over-indented list under the "What it
compares" section and the lazy continuation sentence near the introductory
paragraph by reflowing the doc text so list bullets start at the normal comment
margin and wrapped lines are not treated as indented sub-items, and ensure
sentence continuations start with a capital letter or are merged into the
previous line; specifically adjust the doc block that introduces the three
kernels (the bullets `scalar`, `simd_x16`, `gemm_shape`) and the earlier
paragraph that begins "The plan proposes..." so bullets are flush and
continuations are proper sentences to satisfy
clippy::doc-overindented-list-items and clippy::doc-lazy-continuation.
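As a rough illustration of the layout this prompt asks for, the sketch below shows a doc block with bullets flush at the comment margin and continuation lines indented by exactly the marker width. The wording and the function `kernel_names` are hypothetical; the real module docs in the bench are not reproduced here.

```rust
/// Names of the compared kernels (illustrative only).
///
/// The plan proposes routing the covariance sandwich through a BLAS backend.
/// This paragraph wraps at the normal comment margin and is separated from
/// the list by a blank doc line.
///
/// What it compares:
///
/// - `scalar`: one element at a time.
///   A wrapped line is indented by two spaces, the width of the `- ` marker,
///   so it is neither a lazy continuation nor over-indented.
/// - `simd_x16`: sixteen elements per call.
/// - `gemm_shape`: two dense 3x3 products per element.
pub fn kernel_names() -> [&'static str; 3] {
    ["scalar", "simd_x16", "gemm_shape"]
}

fn main() {
    for name in kernel_names() {
        println!("{name}");
    }
}
```

A continuation line with no indent under a bullet is what `doc_lazy_continuation` reports, and one indented deeper than the marker width is what `doc_overindented_list_items` reports.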


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 01d82d53-9e4f-4221-8223-cfe0fde04696

📥 Commits

Reviewing files that changed from the base of the PR and between 2a35416 and 94b0009.

📒 Files selected for processing (3)

Cargo.toml
benches/RESULTS.md
benches/ewa_syrk_crossover.rs

✅ Files skipped from review due to trivial changes (1)

benches/RESULTS.md

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

benches/ewa_syrk_crossover.rs (1)

130-154: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make the x16 lane invariant explicit before unwrap().

Line 151–Line 153 rely on an implicit “multiple of 16” assumption. Add an explicit assertion near Line 130 so future size changes fail fast with a clear reason instead of a generic unwrap panic.

Suggested patch

 for &n in &SIZES {
+    assert_eq!(
+        n % 16,
+        0,
+        "ewa_syrk_crossover benchmark requires sizes divisible by 16 for sandwich_x16"
+    );
     let (ms, ns) = build_spd_pairs(n);
     grp.throughput(Throughput::Elements(n as u64));

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benches/ewa_syrk_crossover.rs` around lines 130 - 154, The simd_x16 bench
assumes arrays are multiples of 16 but currently panics via try_into unwrap; add
an explicit assertion before using chunks_exact in the simd benchmark (e.g.,
assert!(n % 16 == 0, "simd_x16 requires n to be a multiple of 16, got {}", n))
or assert!(ms.len() % 16 == 0 && ns.len() % 16 == 0) so the failure is clear;
update the closure that calls chunks_exact/try_into (referencing
ms.chunks_exact, ns.chunks_exact, out.chunks_exact_mut, sandwich_x16) to perform
this check before the for loop.
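A self-contained sketch of the pattern under review: an explicit divisibility assertion ahead of a `chunks_exact` loop that converts each chunk to a fixed-size array. `kernel_x16` is a stand-in that just doubles each lane, not the crate's `sandwich_x16`, and `run_x16` is a hypothetical driver.

```rust
use std::convert::TryInto;

/// Lane width of the hypothetical x16 kernel.
const LANES: usize = 16;

/// Stand-in for a 16-wide kernel: doubles each lane (NOT `sandwich_x16`).
fn kernel_x16(input: &[f32; LANES], out: &mut [f32; LANES]) {
    for (o, i) in out.iter_mut().zip(input.iter()) {
        *o = 2.0 * *i;
    }
}

/// Runs the kernel over `input` in chunks of 16, failing fast with a clear
/// message if the length is not a multiple of the lane width.
fn run_x16(input: &[f32], out: &mut [f32]) {
    assert_eq!(input.len(), out.len(), "input and output lengths differ");
    assert_eq!(
        input.len() % LANES,
        0,
        "x16 kernel requires a length divisible by {}, got {}",
        LANES,
        input.len()
    );
    for (i, o) in input.chunks_exact(LANES).zip(out.chunks_exact_mut(LANES)) {
        // Each chunk is exactly LANES long, so these conversions cannot fail.
        let i: &[f32; LANES] = i.try_into().unwrap();
        let o: &mut [f32; LANES] = o.try_into().unwrap();
        kernel_x16(i, o);
    }
}

fn main() {
    let input: Vec<f32> = (0..32).map(|x| x as f32).collect();
    let mut out = vec![0.0f32; 32];
    run_x16(&input, &mut out);
    println!("{:?}", &out[..4]);
}
```

Note that `chunks_exact` skips a trailing remainder instead of yielding a short chunk, so in a loop of this form the assertion also guards against silently processing fewer elements than the reported throughput assumes.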



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: c7d82154-8a28-4c41-9cc2-278c0d010f7c

📥 Commits

Reviewing files that changed from the base of the PR and between 94b0009 and d522a32.

📒 Files selected for processing (2)

benches/RESULTS.md
benches/ewa_syrk_crossover.rs

✅ Files skipped from review due to trivial changes (1)

benches/RESULTS.md

coderabbitai · 2026-05-26T04:38:14Z

+use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
+use ndarray::hpc::splat3d::{
+    project_batch, sandwich, sandwich_x16, Camera, Gaussian3D, GaussianBatch, ProjectedBatch, Spd3,
+};


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify stable toolchain pin required by guidelines
cargo +1.94.0 --version
rustc +1.94.0 --version

# Lint all benches with warnings denied
cargo +1.94.0 clippy --benches --features splat3d -- -D warnings

# Format check using nightly rustfmt options from rustfmt.toml
cargo +nightly fmt --all -- --check

Repository: AdaWorldAPI/ndarray

Length of output: 3598

Rust 1.94 + clippy/fmt checks can’t be executed for this repo as-is

cargo +1.94.0 ... fails before clippy runs: ndarray@0.17.2 requires rustc 1.95, so cargo clippy -- -D warnings and cargo fmt -- --check compliance for this bench can’t be validated under the Rust 1.94 gate.

Re-run the required clippy (-D warnings) and rustfmt checks with Rust 1.95 (or align the project’s required toolchain/guidelines) so the verification is meaningful.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benches/ewa_syrk_crossover.rs` around lines 40 - 43, The repo is being tested under Rust 1.94 but ndarray@0.17.2 requires Rust 1.95, so update the project toolchain to Rust 1.95 (or bump the CI/workflow matrix and any rust-toolchain/rust-toolchain.toml entries) and re-run the linter/formatter checks; ensure cargo clippy -- -D warnings and cargo fmt -- --check execute using the new toolchain so benches referencing ndarray@0.17.2 (e.g., benches/ewa_syrk_crossover.rs) can pass clippy/fmt validation.

AdaWorldAPI · 2026-05-26T04:54:17Z

Closing — wrong regime, not a fixable bench.

This benched the float spd3 EWA covariance sandwich and asked whether a BLAS/SYRK backend beats it. Grounding against the actual substrate (cognitive-distance-typing.md, the Distance contract, cam_pq, blasgraph) shows the premise is a category error: similarity here is integer — HDR popcount (the cosine replacement, by topology not value) → Base17 L1 → Palette256 — with Fisher-z as a palette-output normalization. There is no float dot product to accelerate, so "is BLAS faster" was never coherent. spd3 f32 is only the graphics rasterizer's consumer, not this path.

Net: float code built from an inspiration plan, never grounded against the real distance stack. Closing rather than de-scoping.

Generated by Claude Code

coderabbitai Bot reviewed May 26, 2026


Comment thread benches/RESULTS.md Outdated

claude added 2 commits May 26, 2026 04:25

AdaWorldAPI force-pushed the claude/ewa-syrk-bench-MAOO0 branch from 2a35416 to 94b0009 on May 26, 2026 04:25

coderabbitai Bot reviewed May 26, 2026


AdaWorldAPI closed this May 26, 2026

AdaWorldAPI mentioned this pull request May 27, 2026

docs(board): refresh blackboard to current epoch + append 2026-05-26 epiphanies #209

Merged

4 tasks

AdaWorldAPI wants to merge 3 commits into master from claude/ewa-syrk-bench-MAOO0

2 participants
