Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -175,6 +175,11 @@ name = "splat3d_bench"
harness = false
required-features = ["splat3d"]

[[bench]]
name = "ewa_syrk_crossover"
harness = false
required-features = ["splat3d"]

[features]
default = ["std", "hpc-extras"]

Expand Down
60 changes: 60 additions & 0 deletions benches/RESULTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -123,3 +123,63 @@ The demo binary `examples/splat3d_flex.rs` and integration test
Full-pipeline frame-time numbers (p50/p95/p99) await a Inria bicycle
scene download — left as a follow-up for the dedicated benchmarking
session against real-world data.

## EWA-SYRK crossover — kill-or-justify the BLAS-backend premise

Bench: `benches/ewa_syrk_crossover.rs`. Tests whether the
`3DGS-EWA-SYRK-BLAS-MKL` plan's premise — "projection is a BLAS workload
in disguise → route the covariance sandwich through an MKL/OpenBLAS/AMX
backend" — holds for the **3×3** EWA sandwich `Σ' = M·Σ·Mᵀ`.

### Hardware / build

Container, AVX-512F+BW+VL. The committed `.cargo/config.toml` pins
`target-cpu=x86-64-v3` (for GitHub/CI portability); **benches are run at the
project's deployment tier `x86-64-v4`** (AVX-512 native — `F32x16` is a
single `__m512`), via the documented override:

```bash
RUSTFLAGS="-C target-cpu=x86-64-v4" \
cargo bench --features splat3d --bench ewa_syrk_crossover
```

### `M·N·Mᵀ` sandwich — three kernel shapes (Melem/s, higher = better) @ v4

| N | scalar | `simd_x16` | `gemm_shape` (BLAS-shape) |
|---|---|---|---|
| 1 024 | 85.2 | **175.2** | 90.1 |
| 100 000 | 76.3 | **169.6** | 85.4 |
| 1 000 000 | 81.9 | **172.0** | 87.1 |

`gemm_shape` = two dense 3×3 matmuls per element (the shape a per-matrix
BLAS call imposes), **in-process, no FFI**. The v3 baseline is within ~5% of
these v4 numbers for this transpose-bound 6-field kernel — the verdict is
tier-robust.

### `project_batch` end-to-end @ v4

| N | throughput |
|---|---|
| 1 024 | 12.1 Melem/s (84 µs) |

(full pipeline incl. scalar `sh_eval_deg3` per visible gaussian — SH eval
dominates; the covariance sandwich is a small fraction of this.)

### Verdict — BLAS backend NOT justified at 3×3

- `gemm_shape` is statistically identical to `scalar` and **~2× slower than
the shipped `simd_x16`** at every size 1k→1M. **No crossover**; the gap is
flat, not closing with batch size.
- `gemm_shape` carries **no FFI** — a real `cblas`/MKL call adds marshalling
+ dispatch on top, so it can only be worse. There is no efficient CPU
batched-3×3 SYRK (that pattern is a GPU one).
- ⇒ The EWA-SYRK *backend* (native/MKL/OpenBLAS/AMX dispatch for the
covariance sandwich) is a **pessimization** at 3×3/2×3: fused SoA SIMD
already wins. The plan row is **idea-only** — the sandwich *is*
SYRK-shaped (true) but the actionable backend is killed by measurement.
- Corroborates PR-3's predicted "1.5-2× SIMD-vs-scalar": `simd_x16` is ~2×
over scalar at large N (transpose amortised, unlike the transpose-bound
N=16 PR-1 microbench).
- Steelman left open: `W·Σ·Wᵀ` has a *shared* `W` across gaussians → a
batched shared-`W` GEMM is the one form that could differ; benched as a
follow-up. Per-gaussian `J·Σ·Jᵀ` does not batch that way.
192 changes: 192 additions & 0 deletions benches/ewa_syrk_crossover.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,192 @@
//! EWA-SYRK crossover bench — kill-or-justify the "3DGS projection is a
//! BLAS workload in disguise" premise of `3DGS-EWA-SYRK-BLAS-MKL-plan.md`.
//!
//! # The question
//!
//! The plan proposes routing the EWA covariance sandwich `Σ' = M·Σ·Mᵀ`
//! through a batched SYRK/GEMM backend (native / MKL / OpenBLAS / AMX). But
//! the sandwich matrices are **3×3** (world→camera) and **2×3** (camera→
//! image), and `splat3d::project` already runs them 16-wide via the
//! `crate::simd` SoA polyfill. Batched *tiny* GEMM is exactly the regime
//! where CPU BLAS loses to a fused SIMD unroll (per-call overhead dominates;
//! there is no efficient CPU batched-3×3 SYRK — that pattern is a GPU one).
//!
//! This bench measures it instead of asserting it.
//!
//! # What it compares (same `M·N·Mᵀ` math, three kernel shapes)
//!
//! - `scalar` — the hand-unrolled upper-triangle `spd3::sandwich`, per element.
//! - `simd_x16` — the shipped `sandwich_x16` SoA path (the renderer's kernel).
//! - `gemm_shape` — two dense 3×3 matmuls per element (`M·N` then `·Mᵀ`),
//! with no SoA fusion. This models the **shape a CPU BLAS
//! backend imposes** — one small dense GEMM per matrix.
//!
//! `gemm_shape` runs **in-process, with no FFI**, so it is a *lower bound*
//! on a real `cblas`/MKL path: a genuine BLAS call adds argument marshalling
//! + dispatch overhead on top. If `gemm_shape` already loses to `simd_x16`,
//! the BLAS backend loses by more.
//!
//! Plus `project_batch` end-to-end throughput at the same batch sizes, so
//! the sandwich cost is seen in proportion to the full pipeline.
//!
//! Sweep: 1 024 / 100 000 / 1 000 000 (all exact multiples of 16, no tail).
//!
//! Caveat (honest scope): this measures the *per-element* kernel shape. The
//! one genuinely batchable step is `W·Σ·Wᵀ`, where `W` (the camera view
//! rotation) is shared across all gaussians — a shared-`W` batched GEMM is a
//! steelman left as a follow-up. The per-image Jacobian `J` differs per
//! gaussian, so `J·Σ·Jᵀ` does not batch that way.

use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};
use ndarray::hpc::splat3d::{
project_batch, sandwich, sandwich_x16, Camera, Gaussian3D, GaussianBatch, ProjectedBatch, Spd3,
};
Comment on lines +40 to +43
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify stable toolchain pin required by guidelines
cargo +1.94.0 --version
rustc +1.94.0 --version

# Lint all benches with warnings denied
cargo +1.94.0 clippy --benches --features splat3d -- -D warnings

# Format check using nightly rustfmt options from rustfmt.toml
cargo +nightly fmt --all -- --check

Repository: AdaWorldAPI/ndarray

Length of output: 3598


Rust 1.94 + clippy/fmt checks can’t be executed for this repo as-is

  • cargo +1.94.0 ... fails before clippy runs: ndarray@0.17.2 requires rustc 1.95, so cargo clippy -- -D warnings and cargo fmt -- --check compliance for this bench can’t be validated under the Rust 1.94 gate.
  • Re-run the required clippy (-D warnings) and rustfmt checks with Rust 1.95 (or align the project’s required toolchain/guidelines) so the verification is meaningful.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@benches/ewa_syrk_crossover.rs` around lines 40 - 43, The repo is being tested
under Rust 1.94 but ndarray@0.17.2 requires Rust 1.95, so update the project
toolchain to Rust 1.95 (or bump the CI/workflow matrix and any
rust-toolchain/rust-toolchain.toml entries) and re-run the linter/formatter
checks; ensure cargo clippy -- -D warnings and cargo fmt -- --check execute
using the new toolchain so benches referencing ndarray@0.17.2 (e.g.,
benches/ewa_syrk_crossover.rs) can pass clippy/fmt validation.


const SIZES: [usize; 3] = [1_024, 100_000, 1_000_000];

#[inline]
fn rng(s: &mut u32) -> f32 {
*s ^= *s << 13;
*s ^= *s >> 17;
*s ^= *s << 5;
(*s as f32) / (u32::MAX as f32)
}

/// `n` distinct SPD pairs via `from_scale_quat` (entry-wise different across
/// all 6 SoA channels so the SIMD transpose does real work).
fn build_spd_pairs(n: usize) -> (Vec<Spd3>, Vec<Spd3>) {
let mut st = 0x1234_5678u32;
let mut ms = Vec::with_capacity(n);
let mut ns = Vec::with_capacity(n);
for _ in 0..n {
let sm = [0.3 + 1.5 * rng(&mut st), 0.3 + 1.5 * rng(&mut st), 0.3 + 1.5 * rng(&mut st)];
let sn = [0.3 + 1.5 * rng(&mut st), 0.3 + 1.5 * rng(&mut st), 0.3 + 1.5 * rng(&mut st)];
let mut qm = [rng(&mut st) - 0.5, rng(&mut st) - 0.5, rng(&mut st) - 0.5, rng(&mut st) - 0.5];
let mut qn = [rng(&mut st) - 0.5, rng(&mut st) - 0.5, rng(&mut st) - 0.5, rng(&mut st) - 0.5];
normalize_quat(&mut qm);
normalize_quat(&mut qn);
ms.push(Spd3::from_scale_quat(sm, qm));
ns.push(Spd3::from_scale_quat(sn, qn));
}
(ms, ns)
}

fn normalize_quat(q: &mut [f32; 4]) {
let n = (q[0] * q[0] + q[1] * q[1] + q[2] * q[2] + q[3] * q[3])
.sqrt()
.max(1e-12);
for v in q.iter_mut() {
*v /= n;
}
}

/// `M·N·Mᵀ` via two dense 3×3 matmuls — the shape a per-matrix BLAS call
/// imposes (no upper-triangle fusion, no SoA). Output upper triangle, off-
/// diagonals averaged to match `sandwich`'s symmetrisation.
#[inline]
fn sandwich_gemm_shape(m: &Spd3, n: &Spd3) -> Spd3 {
let a = m.to_rows();
let b = n.to_rows();
let mut t = [[0.0f32; 3]; 3];
for i in 0..3 {
for j in 0..3 {
t[i][j] = a[i][0] * b[0][j] + a[i][1] * b[1][j] + a[i][2] * b[2][j];
}
}
// R = T · Mᵀ (Mᵀ[k][j] = A[j][k])
let mut r = [[0.0f32; 3]; 3];
for i in 0..3 {
for j in 0..3 {
r[i][j] = t[i][0] * a[j][0] + t[i][1] * a[j][1] + t[i][2] * a[j][2];
}
}
Spd3::new(
r[0][0],
0.5 * (r[0][1] + r[1][0]),
0.5 * (r[0][2] + r[2][0]),
r[1][1],
0.5 * (r[1][2] + r[2][1]),
r[2][2],
)
}

fn build_gaussians(n: usize) -> GaussianBatch {
let mut st = 0x9E37_79B9u32;
let mut batch = GaussianBatch::with_capacity(n);
for _ in 0..n {
let mut g = Gaussian3D::unit();
g.mean = [4.0 * rng(&mut st) - 2.0, 4.0 * rng(&mut st) - 2.0, 1.0 + 5.0 * rng(&mut st)];
g.scale = [0.05 + 0.2 * rng(&mut st); 3];
g.quat = [1.0, 0.0, 0.0, 0.0];
g.opacity = 0.5 + 0.5 * rng(&mut st);
batch.push(g);
}
batch
}

fn bench_sandwich_paths(c: &mut Criterion) {
let mut grp = c.benchmark_group("ewa_sandwich_paths");
grp.sample_size(20);
for &n in &SIZES {
let (ms, ns) = build_spd_pairs(n);
grp.throughput(Throughput::Elements(n as u64));

grp.bench_with_input(BenchmarkId::new("scalar", n), &n, |b, &n| {
let mut out = vec![Spd3::ZERO; n];
b.iter(|| {
for i in 0..n {
out[i] = sandwich(black_box(&ms[i]), black_box(&ns[i]));
}
black_box(&out[n - 1]);
});
});

grp.bench_with_input(BenchmarkId::new("simd_x16", n), &n, |b, &n| {
let mut out = vec![Spd3::ZERO; n];
b.iter(|| {
let mc = ms.chunks_exact(16);
let nc = ns.chunks_exact(16);
let oc = out.chunks_exact_mut(16);
for ((mm, nn), oo) in mc.zip(nc).zip(oc) {
let ma: &[Spd3; 16] = mm.try_into().unwrap();
let na: &[Spd3; 16] = nn.try_into().unwrap();
let oa: &mut [Spd3; 16] = oo.try_into().unwrap();
sandwich_x16(black_box(ma), black_box(na), oa);
}
black_box(&out[n - 1]);
});
});

grp.bench_with_input(BenchmarkId::new("gemm_shape", n), &n, |b, &n| {
let mut out = vec![Spd3::ZERO; n];
b.iter(|| {
for i in 0..n {
out[i] = sandwich_gemm_shape(black_box(&ms[i]), black_box(&ns[i]));
}
black_box(&out[n - 1]);
});
});
}
grp.finish();
}

fn bench_project_batch(c: &mut Criterion) {
let mut grp = c.benchmark_group("ewa_project_batch");
grp.sample_size(10);
let cam = Camera::identity_at_origin(1920, 1080);
for &n in &SIZES {
let batch = build_gaussians(n);
let mut out = ProjectedBatch::with_capacity(batch.capacity);
grp.throughput(Throughput::Elements(n as u64));
grp.bench_with_input(BenchmarkId::from_parameter(n), &n, |b, _| {
b.iter(|| {
project_batch(black_box(&batch), black_box(&cam), &mut out);
black_box(out.len);
});
});
}
grp.finish();
}

criterion_group!(ewa_syrk, bench_sandwich_paths, bench_project_batch);
criterion_main!(ewa_syrk);
Loading