bench(splat3d): EWA-SYRK crossover — kill-or-justify the BLAS-backend premise (W1 #3)#207
bench(splat3d): EWA-SYRK crossover — kill-or-justify the BLAS-backend premise (W1 #3)#207AdaWorldAPI wants to merge 3 commits into
Conversation
📝 WalkthroughWalkthroughAdds a Criterion benchmark and RESULTS documentation comparing three covariance "sandwich" kernels—scalar, SIMD-x16 batched, and a gemm-style two 3×3 matmul—measuring per-element and end-to-end throughput across batch sizes and reporting that a BLAS-style backend isn't justified for the tested 3×3 case. ChangesEWA-SYRK Crossover Benchmark
🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benches/RESULTS.md`:
- Around line 142-143: Update the RUSTFLAGS example in the bench docs to use the
canonical rustc codegen flag spacing; replace the current RUSTFLAGS string that
contains "-Ctarget-cpu=..." with the spaced form "-C target-cpu=..." in the
example near the ewa_syrk_crossover bench command so the documentation matches
rustc's documented syntax.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: b2c454ad-3c01-4ebc-adae-f322143b01ad
📒 Files selected for processing (3)
Cargo.tomlbenches/RESULTS.mdbenches/ewa_syrk_crossover.rs
… premise Adds benches/ewa_syrk_crossover.rs (W1 PR #3 of the cross-session program). Tests the 3DGS-EWA-SYRK-BLAS-MKL plan's premise — "projection is a BLAS workload in disguise → MKL/OpenBLAS/AMX backend for the covariance sandwich" — with a number instead of an assertion. Compares M·N·Mᵀ three ways over N = 1k/100k/1M: scalar hand upper-triangle sandwich, per element simd_x16 shipped SoA F32x16 sandwich_x16 (the renderer's kernel) gemm_shape two dense 3×3 matmuls per element — the shape a per-matrix CPU BLAS call imposes, in-process (no FFI ⇒ lower bound on cblas) plus project_batch end-to-end throughput. Result (target-cpu=x86-64-v3, AVX-512 via runtime dispatch): simd_x16 ~2× faster than BOTH scalar and gemm_shape at every size; no crossover up to 1M; gemm_shape ≈ scalar. Real MKL/cblas adds FFI on top. Verdict: the EWA-SYRK *backend* is a pessimization at 3×3/2×3 — fused SoA SIMD already wins. The plan row is idea-only (the sandwich IS SYRK-shaped) but the actionable backend is killed by measurement. Corroborates PR-3's predicted 1.5-2× SIMD-vs-scalar. Full numbers + verdict in benches/RESULTS.md. Evidence-grounded per the program: source (project.rs/spd3.rs whole-read) + measurement, not the plan-as-authority. Steelman (shared-W batched GEMM) left as a documented follow-up. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Correction: the prior RESULTS reported v3 numbers and wrongly attributed AVX-512 to runtime dispatch. F32x16 is compile-time-selected by target-cpu, so v3 measured AVX2. Benches must run at the project's deployment tier v4 (AVX-512 native, F32x16 = __m512); committed .cargo/config.toml stays v3 for GitHub/CI portability, overridden locally via RUSTFLAGS=-Ctarget-cpu=x86-64-v4. v4 numbers (Melem/s): simd_x16 175/170/172 vs scalar 85/76/82 vs gemm_shape 90/85/87 at 1k/100k/1M. Verdict unchanged and tier-robust (v3 within ~5%): simd_x16 ~2x over both scalar and the BLAS-shape, no crossover — the EWA-SYRK backend is a pessimization at 3x3. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
2a35416 to
94b0009
Compare
… RESULTS Fixes CI cargo fmt --all --check (rustfmt 1.95.0: wrap the use import, normalize_quat method chain, and Spd3::new args) and the CodeRabbit nit on RESULTS.md (canonical `-C target-cpu` spelling with a space). Docs/format only; no behavior change. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benches/ewa_syrk_crossover.rs (1)
1-182:⚠️ Potential issue | 🟠 MajorFix Clippy doc lint failures in benches/ewa_syrk_crossover.rs
cargo clippy --bench ewa_syrk_crossover --features splat3d -- -D warningsfails due to doc comment formatting:
clippy::doc-overindented-list-itemsatbenches/ewa_syrk_crossover.rslines 21-22clippy::doc-lazy-continuationatbenches/ewa_syrk_crossover.rsline 27Formatting gate can’t be checked here because
cargo fmt/rustfmtisn’t available in the environment; runcargo fmt --all -- --checkwith rustfmt installed to confirm.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benches/ewa_syrk_crossover.rs` around lines 1 - 182, Top-level module doc comments are tripping Clippy: fix the over-indented list under the "What it compares" section and the lazy continuation sentence near the introductory paragraph by reflowing the doc text so list bullets start at the normal comment margin and wrapped lines are not treated as indented sub-items, and ensure sentence continuations start with a capital letter or are merged into the previous line; specifically adjust the doc block that introduces the three kernels (the bullets `scalar`, `simd_x16`, `gemm_shape`) and the earlier paragraph that begins "The plan proposes..." so bullets are flush and continuations are proper sentences to satisfy clippy::doc-overindented-list-items and clippy::doc-lazy-continuation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@benches/ewa_syrk_crossover.rs`:
- Around line 1-182: Top-level module doc comments are tripping Clippy: fix the
over-indented list under the "What it compares" section and the lazy
continuation sentence near the introductory paragraph by reflowing the doc text
so list bullets start at the normal comment margin and wrapped lines are not
treated as indented sub-items, and ensure sentence continuations start with a
capital letter or are merged into the previous line; specifically adjust the doc
block that introduces the three kernels (the bullets `scalar`, `simd_x16`,
`gemm_shape`) and the earlier paragraph that begins "The plan proposes..." so
bullets are flush and continuations are proper sentences to satisfy
clippy::doc-overindented-list-items and clippy::doc-lazy-continuation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 01d82d53-9e4f-4221-8223-cfe0fde04696
📒 Files selected for processing (3)
Cargo.tomlbenches/RESULTS.mdbenches/ewa_syrk_crossover.rs
✅ Files skipped from review due to trivial changes (1)
- benches/RESULTS.md
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benches/ewa_syrk_crossover.rs (1)
130-154:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winMake the x16 lane invariant explicit before
unwrap().Line 151–Line 153 rely on an implicit “multiple of 16” assumption. Add an explicit assertion near Line 130 so future size changes fail fast with a clear reason instead of a generic unwrap panic.
Suggested patch
for &n in &SIZES { + assert_eq!( + n % 16, + 0, + "ewa_syrk_crossover benchmark requires sizes divisible by 16 for sandwich_x16" + ); let (ms, ns) = build_spd_pairs(n); grp.throughput(Throughput::Elements(n as u64));🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@benches/ewa_syrk_crossover.rs` around lines 130 - 154, The simd_x16 bench assumes arrays are multiples of 16 but currently panics via try_into unwrap; add an explicit assertion before using chunks_exact in the simd benchmark (e.g., assert!(n % 16 == 0, "simd_x16 requires n to be a multiple of 16, got {}", n)) or assert!(ms.len() % 16 == 0 && ns.len() % 16 == 0) so the failure is clear; update the closure that calls chunks_exact/try_into (referencing ms.chunks_exact, ns.chunks_exact, out.chunks_exact_mut, sandwich_x16) to perform this check before the for loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benches/ewa_syrk_crossover.rs`:
- Around line 40-43: The repo is being tested under Rust 1.94 but ndarray@0.17.2
requires Rust 1.95, so update the project toolchain to Rust 1.95 (or bump the
CI/workflow matrix and any rust-toolchain/rust-toolchain.toml entries) and
re-run the linter/formatter checks; ensure cargo clippy -- -D warnings and cargo
fmt -- --check execute using the new toolchain so benches referencing
ndarray@0.17.2 (e.g., benches/ewa_syrk_crossover.rs) can pass clippy/fmt
validation.
---
Outside diff comments:
In `@benches/ewa_syrk_crossover.rs`:
- Around line 130-154: The simd_x16 bench assumes arrays are multiples of 16 but
currently panics via try_into unwrap; add an explicit assertion before using
chunks_exact in the simd benchmark (e.g., assert!(n % 16 == 0, "simd_x16
requires n to be a multiple of 16, got {}", n)) or assert!(ms.len() % 16 == 0 &&
ns.len() % 16 == 0) so the failure is clear; update the closure that calls
chunks_exact/try_into (referencing ms.chunks_exact, ns.chunks_exact,
out.chunks_exact_mut, sandwich_x16) to perform this check before the for loop.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: c7d82154-8a28-4c41-9cc2-278c0d010f7c
📒 Files selected for processing (2)
benches/RESULTS.mdbenches/ewa_syrk_crossover.rs
✅ Files skipped from review due to trivial changes (1)
- benches/RESULTS.md
| use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion, Throughput}; | ||
| use ndarray::hpc::splat3d::{ | ||
| project_batch, sandwich, sandwich_x16, Camera, Gaussian3D, GaussianBatch, ProjectedBatch, Spd3, | ||
| }; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify stable toolchain pin required by guidelines
cargo +1.94.0 --version
rustc +1.94.0 --version
# Lint all benches with warnings denied
cargo +1.94.0 clippy --benches --features splat3d -- -D warnings
# Format check using nightly rustfmt options from rustfmt.toml
cargo +nightly fmt --all -- --checkRepository: AdaWorldAPI/ndarray
Length of output: 3598
Rust 1.94 + clippy/fmt checks can’t be executed for this repo as-is
cargo +1.94.0 ...fails before clippy runs:ndarray@0.17.2 requires rustc 1.95, socargo clippy -- -D warningsandcargo fmt -- --checkcompliance for this bench can’t be validated under the Rust 1.94 gate.- Re-run the required clippy (
-D warnings) and rustfmt checks with Rust 1.95 (or align the project’s required toolchain/guidelines) so the verification is meaningful.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benches/ewa_syrk_crossover.rs` around lines 40 - 43, The repo is being tested
under Rust 1.94 but ndarray@0.17.2 requires Rust 1.95, so update the project
toolchain to Rust 1.95 (or bump the CI/workflow matrix and any
rust-toolchain/rust-toolchain.toml entries) and re-run the linter/formatter
checks; ensure cargo clippy -- -D warnings and cargo fmt -- --check execute
using the new toolchain so benches referencing ndarray@0.17.2 (e.g.,
benches/ewa_syrk_crossover.rs) can pass clippy/fmt validation.
|
Closing — wrong regime, not a fixable bench. This benched the float Net: float code built from an inspiration plan, never grounded against the real distance stack. Closing rather than de-scoping. Generated by Claude Code |
What this is
PR #3 of the 7-PR cross-session program (Kernel lane), rebased on top of #205. Adds
benches/ewa_syrk_crossover.rs— a Criterion bench that tests the3DGS-EWA-SYRK-BLAS-MKLplan's premise with a number instead of an assertion:The plan is inspiration, not authority (per #201 / the #200 evidence model). The evidence here is
project.rs/spd3.rs(whole-read) + the measurement.What it measures
M·N·Mᵀthree kernel shapes over N = 1k / 100k / 1M:scalarspd3::sandwich, per elementsimd_x16sandwich_x16(the renderer's kernel)gemm_shapecblas)plus
project_batchend-to-end throughput.Result — measured at
target-cpu=x86-64-v4(AVX-512 native)M·N·Mᵀsandwich (Melem/s, higher = better):Verdict — BLAS backend NOT justified at 3×3
gemm_shape≈scalar, and ~2× slower thansimd_x16at every size 1k→1M. No crossover; the gap is flat, not closing with batch size.gemm_shapehas no FFI — realcblas/MKL adds marshalling + dispatch on top, so it can only be worse. There is no efficient CPU batched-3×3 SYRK (that's a GPU pattern).3DGS-EWA-SYRK-BLAS-MKLplan row is idea-only — the sandwich is SYRK-shaped (true), but the actionable backend is killed by measurement.W·Σ·Wᵀhas a sharedWacross gaussians → a batched shared-WGEMM is the one form that could differ; benched as a follow-up. Per-gaussianJ·Σ·Jᵀdoes not batch that way.Scope / boundary
benches/ewa_syrk_crossover.rs+ its[[bench]]entry + a RESULTS.md section. Nosrc/changes..cargo/config.tomluntouched (stays v3).required-features = ["splat3d"], mirroringsplat3d_bench.Test plan
cargo bench --features splat3d --bench ewa_syrk_crossover --no-runcompiles clean on master (incl. feat(cesium): implement tileset.rs cold-import parser — Group-A entry, no-serde #205)benches/RESULTS.mdsrc/changes; config stays v3https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
Generated by Claude Code
Summary by CodeRabbit
Tests
Documentation