
bench(splat3d): EWA-SYRK crossover — kill-or-justify the BLAS-backend premise (W1 #3)#207

Closed
AdaWorldAPI wants to merge 3 commits into master from claude/ewa-syrk-bench-MAOO0

Conversation

Owner

@AdaWorldAPI AdaWorldAPI commented May 26, 2026

What this is

PR #3 of the 7-PR cross-session program (Kernel lane), rebased on top of #205. Adds benches/ewa_syrk_crossover.rs — a Criterion bench that tests the 3DGS-EWA-SYRK-BLAS-MKL plan's premise with a number instead of an assertion:

"3DGS projection is a BLAS workload in disguise → route the EWA covariance sandwich Σ' = M·Σ·Mᵀ through an MKL / OpenBLAS / AMX backend."

The plan is inspiration, not authority (per #201 / the #200 evidence model). The evidence here is project.rs/spd3.rs (whole-read) + the measurement.

What it measures

Three kernel shapes for the sandwich M·N·Mᵀ, each run over batches of 1k / 100k / 1M elements:

| shape | what |
|---|---|
| scalar | hand upper-triangle spd3::sandwich, per element |
| simd_x16 | the shipped SoA sandwich_x16 (the renderer's kernel) |
| gemm_shape | two dense 3×3 matmuls per element: the shape a per-matrix CPU BLAS call imposes, in-process, no FFI (⇒ a lower bound on real cblas) |

plus project_batch end-to-end throughput.
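To make the difference between the scalar and gemm_shape rows concrete, here is a minimal sketch of both forms on plain arrays. `Mat3`, `sandwich_gemm_shape_sketch`, and `sandwich_upper_sketch` are illustrative names, not the crate's `Spd3`, `sandwich`, or bench code, and the real kernels may be organized differently.

```rust
/// Row-major 3×3 matrix. Illustrative stand-in, not the crate's `Spd3`.
type Mat3 = [[f32; 3]; 3];

/// Dense product T = M·S, shared by both sketches.
fn mul(m: &Mat3, s: &Mat3) -> Mat3 {
    let mut t = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            for k in 0..3 {
                t[i][j] += m[i][k] * s[k][j];
            }
        }
    }
    t
}

/// GEMM-shaped path: two full dense products, T = M·S then R = T·Mᵀ.
/// All nine output entries are computed independently.
fn sandwich_gemm_shape_sketch(m: &Mat3, s: &Mat3) -> Mat3 {
    let t = mul(m, s);
    let mut r = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            for k in 0..3 {
                // Mᵀ[k][j] == m[j][k]
                r[i][j] += t[i][k] * m[j][k];
            }
        }
    }
    r
}

/// Upper-triangle path: for symmetric S the result is symmetric, so only
/// the six entries with j >= i are computed and then mirrored.
fn sandwich_upper_sketch(m: &Mat3, s: &Mat3) -> Mat3 {
    let t = mul(m, s);
    let mut r = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in i..3 {
            let mut v = 0.0f32;
            for k in 0..3 {
                v += t[i][k] * m[j][k];
            }
            r[i][j] = v;
            r[j][i] = v;
        }
    }
    r
}
```

Both functions return the same matrix whenever S is symmetric; the upper-triangle form simply skips the three redundant dot products that a generic dense call cannot know are redundant.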

Result — measured at target-cpu=x86-64-v4 (AVX-512 native)

Committed .cargo/config.toml stays x86-64-v3 for GitHub/CI portability; benches run at the project's deployment tier v4 via RUSTFLAGS="-Ctarget-cpu=x86-64-v4". F32x16 is a single __m512 at v4.
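Because the vector width is fixed by the compile-time target rather than detected at run time, a bench binary can report which tier it was built for. The sketch below is an assumption-laden illustration: `native_f32_lanes` and `registers_per_f32x16` are hypothetical helpers, not part of the crate, and the 4-lane fallback is chosen only for the example.

```rust
/// Number of f32 lanes in the widest vector register enabled at compile time.
/// Hypothetical helper; the 4-lane fallback is an assumption for illustration.
pub const fn native_f32_lanes() -> usize {
    if cfg!(target_feature = "avx512f") {
        16 // x86-64-v4 enables AVX-512F: one 512-bit register holds 16 f32
    } else if cfg!(target_feature = "avx2") {
        8 // x86-64-v3 enables AVX2: one 256-bit register holds 8 f32
    } else {
        4
    }
}

/// How many native registers a 16-lane f32 vector occupies on this build.
pub const fn registers_per_f32x16() -> usize {
    16 / native_f32_lanes()
}
```

Built with `-C target-cpu=x86-64-v4`, `registers_per_f32x16()` evaluates to 1, which is the "single __m512" case described above; built at v3 it evaluates to 2.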

M·N·Mᵀ sandwich (Melem/s, higher = better):

| N | scalar | simd_x16 | gemm_shape (BLAS-shape) |
|---|---|---|---|
| 1 024 | 85.2 | 175.2 | 90.1 |
| 100 000 | 76.3 | 169.6 | 85.4 |
| 1 000 000 | 81.9 | 172.0 | 87.1 |
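For readers unfamiliar with the unit, Melem/s is elements processed per second divided by 10^6. A minimal sketch of that conversion, using a hypothetical helper `melem_per_sec` that is not part of the bench (Criterion derives throughput itself from the declared element count):

```rust
use std::time::Duration;

/// Converts an element count and a wall-clock duration into Melem/s.
/// Hypothetical helper for illustration only.
fn melem_per_sec(elements: u64, elapsed: Duration) -> f64 {
    elements as f64 / elapsed.as_secs_f64() / 1.0e6
}
```

Read in the other direction, 172.0 Melem/s at N = 1 000 000 corresponds to roughly 5.8 ms per pass over the batch.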

Verdict — BLAS backend NOT justified at 3×3

  • gemm_shape ≈ scalar, and ~2× slower than simd_x16 at every size 1k→1M. No crossover; the gap is flat, not closing with batch size.
  • gemm_shape has no FFI — real cblas/MKL adds marshalling + dispatch on top, so it can only be worse. There is no efficient CPU batched-3×3 SYRK (that's a GPU pattern).
  • ⇒ The EWA-SYRK backend is a pessimization at 3×3/2×3; the fused SoA SIMD already wins. The 3DGS-EWA-SYRK-BLAS-MKL plan row is idea-only — the sandwich is SYRK-shaped (true), but the actionable backend is killed by measurement.
  • Tier-robust: v3 baseline is within ~5% of v4 for this transpose-bound kernel.
  • Corroborates the splat3d PR-3 prediction of "1.5-2× SIMD-vs-scalar" for the projection kernel.
  • Steelman left open: W·Σ·Wᵀ has a shared W across gaussians → a batched shared-W GEMM is the one form that could differ; benched as a follow-up. Per-gaussian J·Σ·Jᵀ does not batch that way.
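The "~2×" and "flat gap" claims can be checked directly from the reported throughput numbers. The sketch below only redoes that arithmetic; `RESULTS` copies the figures from the results table, and `speedups` is a hypothetical helper.

```rust
/// (N, scalar, simd_x16, gemm_shape) in Melem/s, copied from the results table.
const RESULTS: [(u32, f64, f64, f64); 3] = [
    (1_024, 85.2, 175.2, 90.1),
    (100_000, 76.3, 169.6, 85.4),
    (1_000_000, 81.9, 172.0, 87.1),
];

/// Returns (N, simd_x16 / scalar, simd_x16 / gemm_shape) for each batch size.
fn speedups() -> Vec<(u32, f64, f64)> {
    RESULTS
        .iter()
        .map(|&(n, scalar, simd, gemm)| (n, simd / scalar, simd / gemm))
        .collect()
}
```

The simd_x16 / scalar ratio comes out between about 2.06 and 2.22, and simd_x16 / gemm_shape between about 1.94 and 1.99, with no trend toward 1.0 as N grows.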

Scope / boundary

  • Bench-only: benches/ewa_syrk_crossover.rs + its [[bench]] entry + a RESULTS.md section. No src/ changes. .cargo/config.toml untouched (stays v3).
  • required-features = ["splat3d"], mirroring splat3d_bench.

Test plan

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u


Generated by Claude Code

Summary by CodeRabbit

  • Tests

    • Added performance benchmarks comparing three computation approaches (scalar, SIMD-batched, and dense-matrix) across multiple data sizes, plus end-to-end throughput measurements; benchmarks are gated behind an optional feature.
  • Documentation

    • Added benchmark results and analysis with per-batch timings, throughput numbers, and a conclusion that a BLAS-style backend isn’t justified for the tested scenario, with notes for follow-up benchmarking.


Contributor

coderabbitai Bot commented May 26, 2026

📝 Walkthrough


Adds a Criterion benchmark and RESULTS documentation comparing three covariance "sandwich" kernels—scalar, SIMD-x16 batched, and a gemm-style two 3×3 matmul—measuring per-element and end-to-end throughput across batch sizes and reporting that a BLAS-style backend isn't justified for the tested 3×3 case.

Changes

EWA-SYRK Crossover Benchmark

| Layer | File(s) | Summary |
|---|---|---|
| Benchmark target setup and module documentation | Cargo.toml, benches/ewa_syrk_crossover.rs (1–39) | Adds ewa_syrk_crossover bench target gated by splat3d; module docs describe the three sandwich kernel shapes and the experiment. |
| Input generation and helpers | benches/ewa_syrk_crossover.rs (40–125) | Adds benchmark constants, RNG, build_spd_pairs with quaternion normalization, and build_gaussians to produce deterministic inputs for benches. |
| GEMM sandwich and bench_sandwich_paths | benches/ewa_syrk_crossover.rs (83–171) | Implements sandwich_gemm_shape (two explicit 3×3 matmuls + symmetrize) and bench_sandwich_paths that compares scalar, SIMD-x16, and gemm_shape variants across multiple batch sizes. |
| End-to-end project_batch and results | benches/ewa_syrk_crossover.rs (173–192), benches/RESULTS.md (126–185) | Adds bench_project_batch, registers Criterion groups via criterion_group!/criterion_main!, and documents AVX-512 timings, per-batch and end-to-end throughput tables, and the conclusion that BLAS backend crossover is not supported for the 3×3 sandwich at tested tier. |

🎯 2 (Simple) | ⏱️ ~12 minutes

🐰 A rabbit hops through code both old and new,
Three sandwich methods benchmarked, tried, and true.
SIMD and GEMM each take their turn,
To see which kernel makes our matrices burn!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a benchmark to evaluate whether a BLAS-backend approach is justified for EWA-SYRK covariance computation, which is the primary objective. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benches/RESULTS.md`:
- Around line 142-143: Update the RUSTFLAGS example in the bench docs to use the
canonical rustc codegen flag spacing; replace the current RUSTFLAGS string that
contains "-Ctarget-cpu=..." with the spaced form "-C target-cpu=..." in the
example near the ewa_syrk_crossover bench command so the documentation matches
rustc's documented syntax.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b2c454ad-3c01-4ebc-adae-f322143b01ad

📥 Commits

Reviewing files that changed from the base of the PR and between b58a284 and 2a35416.

📒 Files selected for processing (3)
  • Cargo.toml
  • benches/RESULTS.md
  • benches/ewa_syrk_crossover.rs

Comment thread benches/RESULTS.md Outdated
claude added 2 commits May 26, 2026 04:25
… premise

Adds benches/ewa_syrk_crossover.rs (W1 PR #3 of the cross-session program).
Tests the 3DGS-EWA-SYRK-BLAS-MKL plan's premise — "projection is a BLAS
workload in disguise → MKL/OpenBLAS/AMX backend for the covariance sandwich"
— with a number instead of an assertion.

Compares M·N·Mᵀ three ways over N = 1k/100k/1M:
  scalar       hand upper-triangle sandwich, per element
  simd_x16     shipped SoA F32x16 sandwich_x16 (the renderer's kernel)
  gemm_shape   two dense 3×3 matmuls per element — the shape a per-matrix
               CPU BLAS call imposes, in-process (no FFI ⇒ lower bound on cblas)
plus project_batch end-to-end throughput.

Result (target-cpu=x86-64-v3, AVX-512 via runtime dispatch):
  simd_x16 ~2× faster than BOTH scalar and gemm_shape at every size; no
  crossover up to 1M; gemm_shape ≈ scalar. Real MKL/cblas adds FFI on top.

Verdict: the EWA-SYRK *backend* is a pessimization at 3×3/2×3 — fused SoA
SIMD already wins. The plan row is idea-only (the sandwich IS SYRK-shaped)
but the actionable backend is killed by measurement. Corroborates PR-3's
predicted 1.5-2× SIMD-vs-scalar. Full numbers + verdict in benches/RESULTS.md.

Evidence-grounded per the program: source (project.rs/spd3.rs whole-read) +
measurement, not the plan-as-authority. Steelman (shared-W batched GEMM)
left as a documented follow-up.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

Correction: the prior RESULTS reported v3 numbers and wrongly attributed
AVX-512 to runtime dispatch. F32x16 is compile-time-selected by target-cpu,
so v3 measured AVX2. Benches must run at the project's deployment tier v4
(AVX-512 native, F32x16 = __m512); committed .cargo/config.toml stays v3 for
GitHub/CI portability, overridden locally via RUSTFLAGS=-Ctarget-cpu=x86-64-v4.

v4 numbers (Melem/s): simd_x16 175/170/172 vs scalar 85/76/82 vs gemm_shape
90/85/87 at 1k/100k/1M. Verdict unchanged and tier-robust (v3 within ~5%):
simd_x16 ~2x over both scalar and the BLAS-shape, no crossover — the
EWA-SYRK backend is a pessimization at 3x3.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u

@AdaWorldAPI AdaWorldAPI force-pushed the claude/ewa-syrk-bench-MAOO0 branch from 2a35416 to 94b0009 on May 26, 2026 04:25
… RESULTS

Fixes CI cargo fmt --all --check (rustfmt 1.95.0: wrap the use import,
normalize_quat method chain, and Spd3::new args) and the CodeRabbit nit on
RESULTS.md (canonical `-C target-cpu` spelling with a space). Docs/format
only; no behavior change.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benches/ewa_syrk_crossover.rs (1)

1-182: ⚠️ Potential issue | 🟠 Major

Fix Clippy doc lint failures in benches/ewa_syrk_crossover.rs

cargo clippy --bench ewa_syrk_crossover --features splat3d -- -D warnings fails due to doc comment formatting:

  • clippy::doc-overindented-list-items at benches/ewa_syrk_crossover.rs lines 21-22
  • clippy::doc-lazy-continuation at benches/ewa_syrk_crossover.rs line 27

Formatting gate can’t be checked here because cargo fmt/rustfmt isn’t available in the environment; run cargo fmt --all -- --check with rustfmt installed to confirm.
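As a reference point for the two lints named above, here is a minimal sketch of a doc-comment list laid out the way they are generally understood to expect: bullets flush with the comment margin, and wrapped lines indented just enough to align with the item text. This is an illustration on a hypothetical function `kernel_names`, not the bench's actual module docs, and it has not been run against the Clippy version used in CI.

```rust
/// Names of the kernels compared by the bench.
///
/// - `scalar`: hand upper-triangle sandwich, evaluated once per element.
/// - `simd_x16`: the SoA kernel. A wrapped line like this one is indented
///   by two spaces so that it lines up with the item text above it.
/// - `gemm_shape`: two dense 3×3 products per element.
pub fn kernel_names() -> [&'static str; 3] {
    ["scalar", "simd_x16", "gemm_shape"]
}
```

A continuation line with no indent is the pattern the lazy-continuation lint is usually triggered by; one indented further than the item text is the over-indented case.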

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benches/ewa_syrk_crossover.rs` around lines 1 - 182, Top-level module doc
comments are tripping Clippy: fix the over-indented list under the "What it
compares" section and the lazy continuation sentence near the introductory
paragraph by reflowing the doc text so list bullets start at the normal comment
margin and wrapped lines are not treated as indented sub-items, and ensure
sentence continuations start with a capital letter or are merged into the
previous line; specifically adjust the doc block that introduces the three
kernels (the bullets `scalar`, `simd_x16`, `gemm_shape`) and the earlier
paragraph that begins "The plan proposes..." so bullets are flush and
continuations are proper sentences to satisfy
clippy::doc-overindented-list-items and clippy::doc-lazy-continuation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 01d82d53-9e4f-4221-8223-cfe0fde04696

📥 Commits

Reviewing files that changed from the base of the PR and between 2a35416 and 94b0009.

📒 Files selected for processing (3)
  • Cargo.toml
  • benches/RESULTS.md
  • benches/ewa_syrk_crossover.rs
✅ Files skipped from review due to trivial changes (1)
  • benches/RESULTS.md

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benches/ewa_syrk_crossover.rs (1)

130-154: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Make the x16 lane invariant explicit before unwrap().

Line 151–Line 153 rely on an implicit “multiple of 16” assumption. Add an explicit assertion near Line 130 so future size changes fail fast with a clear reason instead of a generic unwrap panic.

Suggested patch
 for &n in &SIZES {
+    assert_eq!(
+        n % 16,
+        0,
+        "ewa_syrk_crossover benchmark requires sizes divisible by 16 for sandwich_x16"
+    );
     let (ms, ns) = build_spd_pairs(n);
     grp.throughput(Throughput::Elements(n as u64));
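A self-contained sketch of the same guard outside the bench, assuming a stand-in kernel: `sum_x16` and `process_in_lanes` are hypothetical and only mimic the chunk-of-16 calling pattern, not `sandwich_x16` itself. One detail worth keeping in mind is that `chunks_exact` skips a trailing partial chunk rather than yielding it, so without the assertion a non-multiple length can go unprocessed without any error.

```rust
use std::convert::TryInto;

/// Hypothetical stand-in for a 16-lane kernel: sums one 16-element chunk.
fn sum_x16(chunk: &[f32; 16]) -> f32 {
    chunk.iter().sum()
}

/// Runs the stand-in kernel over `data` in chunks of 16.
/// Panics with an explicit message if the length is not a multiple of 16.
fn process_in_lanes(data: &[f32]) -> Vec<f32> {
    assert_eq!(
        data.len() % 16,
        0,
        "x16 path requires a length divisible by 16, got {}",
        data.len()
    );
    data.chunks_exact(16)
        .map(|c| {
            // Each chunk has exactly 16 elements, so this conversion cannot fail.
            let lanes: &[f32; 16] = c.try_into().expect("chunk of 16");
            sum_x16(lanes)
        })
        .collect()
}
```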
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benches/ewa_syrk_crossover.rs` around lines 130 - 154, The simd_x16 bench
assumes arrays are multiples of 16 but currently panics via try_into unwrap; add
an explicit assertion before using chunks_exact in the simd benchmark (e.g.,
assert!(n % 16 == 0, "simd_x16 requires n to be a multiple of 16, got {}", n))
or assert!(ms.len() % 16 == 0 && ns.len() % 16 == 0) so the failure is clear;
update the closure that calls chunks_exact/try_into (referencing
ms.chunks_exact, ns.chunks_exact, out.chunks_exact_mut, sandwich_x16) to perform
this check before the for loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@benches/ewa_syrk_crossover.rs`:
- Around line 40-43: The repo is being tested under Rust 1.94 but ndarray@0.17.2
requires Rust 1.95, so update the project toolchain to Rust 1.95 (or bump the
CI/workflow matrix and any rust-toolchain/rust-toolchain.toml entries) and
re-run the linter/formatter checks; ensure cargo clippy -- -D warnings and cargo
fmt -- --check execute using the new toolchain so benches referencing
ndarray@0.17.2 (e.g., benches/ewa_syrk_crossover.rs) can pass clippy/fmt
validation.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: c7d82154-8a28-4c41-9cc2-278c0d010f7c

📥 Commits

Reviewing files that changed from the base of the PR and between 94b0009 and d522a32.

📒 Files selected for processing (2)
  • benches/RESULTS.md
  • benches/ewa_syrk_crossover.rs
✅ Files skipped from review due to trivial changes (1)
  • benches/RESULTS.md

Comment on lines +40 to +43
use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
use ndarray::hpc::splat3d::{
    project_batch, sandwich, sandwich_x16, Camera, Gaussian3D, GaussianBatch, ProjectedBatch, Spd3,
};
Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify stable toolchain pin required by guidelines
cargo +1.94.0 --version
rustc +1.94.0 --version

# Lint all benches with warnings denied
cargo +1.94.0 clippy --benches --features splat3d -- -D warnings

# Format check using nightly rustfmt options from rustfmt.toml
cargo +nightly fmt --all -- --check

Repository: AdaWorldAPI/ndarray

Length of output: 3598


Rust 1.94 + clippy/fmt checks can’t be executed for this repo as-is

  • cargo +1.94.0 ... fails before clippy runs: ndarray@0.17.2 requires rustc 1.95, so cargo clippy -- -D warnings and cargo fmt -- --check compliance for this bench can’t be validated under the Rust 1.94 gate.
  • Re-run the required clippy (-D warnings) and rustfmt checks with Rust 1.95 (or align the project’s required toolchain/guidelines) so the verification is meaningful.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benches/ewa_syrk_crossover.rs` around lines 40 - 43, The repo is being tested
under Rust 1.94 but ndarray@0.17.2 requires Rust 1.95, so update the project
toolchain to Rust 1.95 (or bump the CI/workflow matrix and any
rust-toolchain/rust-toolchain.toml entries) and re-run the linter/formatter
checks; ensure cargo clippy -- -D warnings and cargo fmt -- --check execute
using the new toolchain so benches referencing ndarray@0.17.2 (e.g.,
benches/ewa_syrk_crossover.rs) can pass clippy/fmt validation.

Owner Author

Closing — wrong regime, not a fixable bench.

This benched the float spd3 EWA covariance sandwich and asked whether a BLAS/SYRK backend beats it. Grounding against the actual substrate (cognitive-distance-typing.md, the Distance contract, cam_pq, blasgraph) shows the premise is a category error: similarity here is integer — HDR popcount (the cosine replacement, by topology not value) → Base17 L1 → Palette256 — with Fisher-z as a palette-output normalization. There is no float dot product to accelerate, so "is BLAS faster" was never coherent. spd3 f32 is only the graphics rasterizer's consumer, not this path.

Net: float code built from an inspiration plan, never grounded against the real distance stack. Closing rather than de-scoping.
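To illustrate why an integer distance stack leaves no float dot product for BLAS to accelerate, here is a generic sketch of the two integer primitives the comment names: a popcount (Hamming) distance over packed bits and an L1 distance over small integers. These are textbook forms under assumed types, not the project's Distance contract, HDR encoding, or Base17 layout; `hamming_popcount` and `l1_i16` are hypothetical names.

```rust
/// Hamming distance between two equal-length bit vectors packed into u64 words.
/// Generic illustration; the word layout is an assumption.
fn hamming_popcount(a: &[u64], b: &[u64]) -> u32 {
    assert_eq!(a.len(), b.len(), "bit vectors must have the same word count");
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// L1 distance between two equal-length integer vectors.
/// The i16 element type is an assumption for the example.
fn l1_i16(a: &[i16], b: &[i16]) -> u32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    a.iter()
        .zip(b)
        .map(|(x, y)| (*x as i32 - *y as i32).unsigned_abs())
        .sum()
}
```

Both reduce to XOR, popcount, subtract, and add on integers, which is a different instruction mix from the fused multiply-adds a GEMM or SYRK backend is built around.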


Generated by Claude Code
