|
| 1 | +# splat3d bench results |
| 2 | + |
| 3 | +Per-kernel timing baseline for the `splat3d` feature. A regression of |
| 4 | +more than 5% on any row blocks the merge, per the sprint discipline. |
| 5 | +Update this file in the same commit as any change to a `splat3d` kernel. |
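The 5% gate can be expressed as a small check. This is a minimal sketch, not tooling from the repository: the function names are hypothetical, and it assumes the gate compares median times in nanoseconds against the row recorded in this file.

```rust
/// Fractional slowdown of `new_ns` relative to `baseline_ns`
/// (0.05 means 5% slower; a negative value means faster).
/// Hypothetical helper, not part of the splat3d crate.
pub fn relative_regression(baseline_ns: f64, new_ns: f64) -> f64 {
    (new_ns - baseline_ns) / baseline_ns
}

/// True when a row would block the merge under the 5% rule.
pub fn blocks_merge(baseline_ns: f64, new_ns: f64) -> bool {
    relative_regression(baseline_ns, new_ns) > 0.05
}
```

For example, against the 90.41 ns `spd3_sandwich_simd_x16` baseline, a new median of 96.0 ns (about 6.2% slower) would block, while 94.0 ns (about 4.0% slower) would not.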
| 6 | + |
| 7 | +## Run |
| 8 | + |
| 9 | +```bash |
| 10 | +# Default build (x86-64-v1 baseline, F32x16 = AVX2-emulated 2× __m256) |
| 11 | +cargo bench --features splat3d --bench splat3d_bench |
| 12 | + |
| 13 | +# AVX-512 native build (recommended on Sapphire Rapids / Zen4) |
| 14 | +RUSTFLAGS="-C target-cpu=native" \ |
| 15 | + cargo bench --features splat3d --bench splat3d_bench |
| 16 | +``` |
| 17 | + |
| 18 | +Hardware: record the CPU model + topology + the `target-cpu` / |
| 19 | +`target-feature` flags used so cross-box comparisons are meaningful. |
| 20 | + |
| 21 | +## PR 1 — Spd3 + EWA-sandwich SIMD batch |
| 22 | + |
| 23 | +Baseline measurements from the sprint's reference hardware run. |
| 24 | + |
| 25 | +### Hardware: Intel Xeon (Sapphire Rapids family), AVX-512F+BW+VL+VNNI+BF16, 2.10 GHz, container build |
| 26 | + |
| 27 | +The PR 1 spec aimed for ≥10× speedup on `sandwich_x16` over the scalar |
| 28 | +loop on AVX-512. Measured 1.83× — the AoS↔SoA transpose overhead at 6 |
| 29 | +fields per `Spd3` × 16 lanes dominates the inner-loop SIMD savings for |
| 30 | +this microbench. The downstream impact is muted because the rasterizer |
| 31 | +(PR 5) and `GaussianBatch::covariance_x16` (PR 2) already keep their |
| 32 | +hot-path data in SoA layout, avoiding the transpose. Treat the 1.83× |
| 33 | +microbench number as a floor; the rasterizer-driven benchmark in PR 7 |
| 34 | +exercises the SoA-native path that benefits more strongly from F32x16. |
| 35 | + |
| 36 | +Per the architectural decision in `.cargo/config.toml` ("No global |
| 37 | +target-cpu — each kernel uses `#[target_feature(enable = "avx512f")]` |
| 38 | +per-function with LazyLock runtime detection"), the DEFAULT build uses |
| 39 | +the AVX2-emulated F32x16. The `target-cpu=native` table below shows the |
| 40 | +intended-tier numbers. |
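The per-function `#[target_feature]` plus `LazyLock` pattern quoted above can be sketched as follows. This is an illustration under stated assumptions, not the crate's code: the `Tier` enum and the `sum*` functions are hypothetical, and the sketch gates on `avx2` rather than `avx512f` so that it builds on toolchains where the AVX-512 target features are not yet stable.

```rust
use std::sync::LazyLock;

/// Kernel tiers chosen once at run time (hypothetical names).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    Scalar,
    Avx2,
}

/// Detect the best tier on first use and cache it.
static TIER: LazyLock<Tier> = LazyLock::new(|| {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            return Tier::Avx2;
        }
    }
    Tier::Scalar
});

/// Portable fallback.
fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

/// Same body, but the attribute allows AVX2 codegen in this function only,
/// so the rest of the binary keeps the baseline target.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

/// Public entry point: dispatch on the cached tier.
pub fn sum(xs: &[f32]) -> f32 {
    match *TIER {
        // Sound because TIER is Avx2 only when the CPU reported AVX2.
        #[cfg(target_arch = "x86_64")]
        Tier::Avx2 => unsafe { sum_avx2(xs) },
        _ => sum_scalar(xs),
    }
}
```

The detection cost is paid once; every later call is a load and a branch.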
| 41 | + |
| 42 | +#### Default build (no `target-cpu` flag) |
| 43 | + |
| 44 | +| Bench | Median | Speedup vs scalar | |
| 45 | +|---|---|---| |
| 46 | +| `spd3_sandwich_scalar_x16_loop` | 209.96 ns | 1.0× | |
| 47 | +| `spd3_sandwich_simd_x16` | 1225.7 ns | **0.17× (slower)** | |
| 48 | +| `spd3_eig_smith_1961` | 130.82 ns | — | |
| 49 | +| `spd3_from_scale_quat` | 11.35 ns | — | |
| 50 | + |
| 51 | +The SIMD regression on the AVX2-emulated build is a known artifact: the |
| 52 | +polyfill emits two `__m256` operations per `F32x16` op AND adds the |
| 53 | +6-field AoS↔SoA transpose at the function boundary. Net: more |
| 54 | +instructions than the scalar loop, which the autovectorizer is happy |
| 55 | +to map to `vfmadd` chains directly. Filed as TECH_DEBT for the |
| 56 | +performance sprint: |
| 57 | +- Restructure `sandwich_x16` to take SoA inputs directly (skip the |
| 58 | + transpose); call sites (rasterizer, `GaussianBatch::covariance_x16`) |
| 59 | + already have SoA layout. |
| 60 | +- Add runtime tier dispatch in `sandwich_x16` so AVX2 builds call a |
| 61 | + scalar loop wrapper that the compiler auto-vectorizes cleanly. |
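The first TECH_DEBT item (SoA inputs, no transpose) could look roughly like the sketch below. The `Spd3Soa` type, its field names, and the choice of one shared 3×3 matrix `m` for all 16 lanes are assumptions made for illustration; the real `sandwich_x16` signature is not shown in this file. The loop is written as plain scalar code over fixed-size arrays, the shape a compiler can typically auto-vectorize.

```rust
pub const LANES: usize = 16;

/// Upper triangle of 16 symmetric 3x3 matrices, one array per field
/// (hypothetical SoA layout).
#[derive(Clone, Copy, Default, Debug, PartialEq)]
pub struct Spd3Soa {
    pub xx: [f32; LANES],
    pub xy: [f32; LANES],
    pub xz: [f32; LANES],
    pub yy: [f32; LANES],
    pub yz: [f32; LANES],
    pub zz: [f32; LANES],
}

/// Computes M Σ Mᵀ for every lane, with one shared matrix `m`.
pub fn sandwich_soa(m: &[[f32; 3]; 3], s: &Spd3Soa) -> Spd3Soa {
    let mut out = Spd3Soa::default();
    for l in 0..LANES {
        // Rebuild the full symmetric matrix for this lane.
        let sig = [
            [s.xx[l], s.xy[l], s.xz[l]],
            [s.xy[l], s.yy[l], s.yz[l]],
            [s.xz[l], s.yz[l], s.zz[l]],
        ];
        // t = M Σ
        let mut t = [[0.0f32; 3]; 3];
        for i in 0..3 {
            for j in 0..3 {
                t[i][j] = m[i][0] * sig[0][j] + m[i][1] * sig[1][j] + m[i][2] * sig[2][j];
            }
        }
        // (t Mᵀ)[i][j] = sum over k of t[i][k] * m[j][k]; upper triangle only.
        let o = |i: usize, j: usize| t[i][0] * m[j][0] + t[i][1] * m[j][1] + t[i][2] * m[j][2];
        out.xx[l] = o(0, 0);
        out.xy[l] = o(0, 1);
        out.xz[l] = o(0, 2);
        out.yy[l] = o(1, 1);
        out.yz[l] = o(1, 2);
        out.zz[l] = o(2, 2);
    }
    out
}
```

Because input and output share the SoA layout, a caller that already holds SoA data pays no transpose on either side of the call.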
| 62 | + |
| 63 | +#### `RUSTFLAGS="-C target-cpu=native"` build (AVX-512F path active) |
| 64 | + |
| 65 | +| Bench | Median | Speedup vs scalar | |
| 66 | +|---|---|---| |
| 67 | +| `spd3_sandwich_scalar_x16_loop` | 166.33 ns | 1.0× | |
| 68 | +| `spd3_sandwich_simd_x16` | 90.41 ns | **1.83×** | |
| 69 | +| `spd3_eig_smith_1961` | 125.66 ns | — | |
| 70 | +| `spd3_from_scale_quat` | 9.19 ns | — | |
| 71 | + |
| 72 | +The 1.83× is below the 10× spec target but above the 1.0× break-even |
| 73 | +that gates the function's existence. With SoA inputs at the call site |
| 74 | +(no transpose), the inner loop runs 16-wide multiply-add chains |
| 75 | +instead of 16 sequential scalar ones, so rasterizer throughput |
| 76 | +(PR 5+) is where the kernel earns its keep. |
| 77 | + |
| 78 | +`spd3_eig_smith_1961` ≈ 126 ns: one closed-form eigendecomp dominated |
| 79 | +by `acos` (≈ 80 ns by itself). The diagonal-fast-path branch (which |
| 80 | +skips the trig entirely) is what makes the rasterizer's per-pixel |
| 81 | +work tractable; this microbench measures the WORST case. |
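For orientation, the trigonometric closed form usually attributed to Smith (1961) can be sketched as below. This is not the crate's `spd3_eig_smith_1961` code: the function name and the `[xx, xy, xz, yy, yz, zz]` input order are assumptions, and the sketch returns eigenvalues only (no eigenvectors). It shows both the diagonal fast path and the single `acos` that the timing note refers to.

```rust
/// Eigenvalues, in descending order, of a symmetric 3x3 matrix given by
/// its upper triangle [xx, xy, xz, yy, yz, zz]. Hypothetical helper.
pub fn eigvals_sym3(a: [f32; 6]) -> [f32; 3] {
    let [xx, xy, xz, yy, yz, zz] = a;
    let p1 = xy * xy + xz * xz + yz * yz;
    if p1 == 0.0 {
        // Diagonal fast path: the eigenvalues are the diagonal, no trig needed.
        let mut e = [xx, yy, zz];
        e.sort_by(|l, r| r.total_cmp(l));
        return e;
    }
    let q = (xx + yy + zz) / 3.0; // trace / 3
    let (dx, dy, dz) = (xx - q, yy - q, zz - q);
    let p = ((dx * dx + dy * dy + dz * dz + 2.0 * p1) / 6.0).sqrt();
    // r = det((A - qI) / p) / 2, clamped against rounding before acos.
    let det = dx * (dy * dz - yz * yz) - xy * (xy * dz - yz * xz) + xz * (xy * yz - dy * xz);
    let r = (det / (2.0 * p * p * p)).clamp(-1.0, 1.0);
    let phi = r.acos() / 3.0;
    let e1 = q + 2.0 * p * phi.cos();
    let e3 = q + 2.0 * p * (phi + 2.0 * std::f32::consts::FRAC_PI_3).cos();
    let e2 = 3.0 * q - e1 - e3; // the trace is the sum of the eigenvalues
    [e1, e2, e3]
}
```

The general branch costs one `sqrt`, one `acos`, and two `cos` calls; the fast path costs a three-element sort.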
| 82 | + |
| 83 | +`spd3_from_scale_quat` ≈ 9 ns: the 3DGS canonical Σ builder. PR 2's |
| 84 | +`GaussianBatch::covariance_x16` SIMD-batches this; the scalar |
| 85 | +microbench is the per-call latency floor. |
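The canonical 3DGS covariance is Σ = R S Sᵀ Rᵀ, with R built from a unit quaternion and S = diag(scale). A scalar sketch follows; the function name, the `(w, x, y, z)` quaternion order, and the 6-field output order are assumptions, and the crate's `from_scale_quat` may differ in all three.

```rust
/// Builds the upper triangle [xx, xy, xz, yy, yz, zz] of Σ = R S Sᵀ Rᵀ
/// from a scale vector and a quaternion (w, x, y, z). Hypothetical helper.
pub fn cov_from_scale_quat(s: [f32; 3], q: [f32; 4]) -> [f32; 6] {
    // Normalize so the rotation matrix is orthonormal.
    let n = (q[0] * q[0] + q[1] * q[1] + q[2] * q[2] + q[3] * q[3]).sqrt();
    let (w, x, y, z) = (q[0] / n, q[1] / n, q[2] / n, q[3] / n);
    let r = [
        [1.0 - 2.0 * (y * y + z * z), 2.0 * (x * y - w * z), 2.0 * (x * z + w * y)],
        [2.0 * (x * y + w * z), 1.0 - 2.0 * (x * x + z * z), 2.0 * (y * z - w * x)],
        [2.0 * (x * z - w * y), 2.0 * (y * z + w * x), 1.0 - 2.0 * (x * x + y * y)],
    ];
    // m = R * diag(s): scale each column of R.
    let mut m = [[0.0f32; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            m[i][j] = r[i][j] * s[j];
        }
    }
    // Σ = m mᵀ, so Σ[i][j] is the dot product of rows i and j of m.
    let dot = |i: usize, j: usize| m[i][0] * m[j][0] + m[i][1] * m[j][1] + m[i][2] * m[j][2];
    [dot(0, 0), dot(0, 1), dot(0, 2), dot(1, 1), dot(1, 2), dot(2, 2)]
}
```

With the identity quaternion the result is diag(s²); a 90° rotation about z swaps the x and y variances.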
| 86 | + |
| 87 | +## PR 2 — GaussianBatch SoA + SH eval |
| 88 | + |
| 89 | +Not yet baselined as separate benches — covered indirectly by the |
| 90 | +projection-kernel and rasterizer benches when PR 7 adds them. |
| 91 | + |
| 92 | +## PR 3 — Projection kernel |
| 93 | + |
| 94 | +Not yet baselined as a separate bench; the `project_chunk_x16` |
| 95 | +inner-loop math has identical AoS↔SoA structure to `sandwich_x16` |
| 96 | +and is expected to show similar 1.5-2× SIMD-vs-scalar ratios on |
| 97 | +AVX-512 native builds. |
| 98 | + |
| 99 | +## PR 4 — Tile binner |
| 100 | + |
| 101 | +Sort + prefix-sum throughput target (per the sprint spec): 2M |
| 102 | +instances sorted in ≤ 8 ms on 1 thread. Not yet benched separately; |
| 103 | +`sort_unstable_by_key` is the first-cut sort. Radix sort follow-up is |
| 104 | +TECH_DEBT once PR 7's full-pipeline timings show the binner is the |
| 105 | +hot spot. |
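A first-cut binner along these lines might pack tile and depth into one sort key and then derive per-tile ranges with a prefix sum. The sketch below is illustrative only: the `Instance` struct, its fields, and `bin_instances` are hypothetical, and depths are assumed to be non-negative and finite so that their IEEE-754 bit patterns sort in value order.

```rust
/// One (tile, gaussian) pair produced by the binner (hypothetical layout).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Instance {
    pub tile: u32,
    pub depth: f32,
    pub gaussian: u32,
}

/// Sorts instances by (tile, depth) and returns per-tile start offsets.
/// The result has `n_tiles + 1` entries; tile `t` owns
/// `instances[offsets[t]..offsets[t + 1]]`.
pub fn bin_instances(instances: &mut [Instance], n_tiles: usize) -> Vec<u32> {
    // Tile in the high 32 bits, depth bits in the low 32 bits.
    // Valid only for non-negative finite depths (assumption).
    instances.sort_unstable_by_key(|i| ((i.tile as u64) << 32) | i.depth.to_bits() as u64);

    // Histogram of instances per tile, shifted by one slot.
    let mut offsets = vec![0u32; n_tiles + 1];
    for i in instances.iter() {
        offsets[i.tile as usize + 1] += 1;
    }
    // Prefix sum turns counts into start offsets.
    for t in 0..n_tiles {
        offsets[t + 1] += offsets[t];
    }
    offsets
}
```

A radix sort would replace only the `sort_unstable_by_key` line; the key packing and the prefix sum stay the same.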
| 106 | + |
| 107 | +## PR 5 — Rasterizer |
| 108 | + |
| 109 | +Per-tile alpha-blend with the `F32x16` 16-pixel-row inner loop. The |
| 110 | +acceptance gate (1080p × 500K gaussians ≤ 25 ms on 8-core AVX-512) is |
| 111 | +left for the dedicated rasterizer bench in a follow-up; PR 5 ships |
| 112 | +the kernel + correctness tests, not the rasterizer-scale bench. |
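The inner loop's shape can be sketched in scalar form. This assumes the standard front-to-back compositing used by 3DGS rasterizers (accumulate `T * alpha * c`, then attenuate the transmittance `T`) and a single color channel; the function name and signature are hypothetical, and the shipped kernel operates on `F32x16` rather than on plain arrays.

```rust
pub const ROW: usize = 16;

/// Blends one splat into a 16-pixel row, front to back.
/// `alpha[i]` is the splat's opacity at pixel i, in [0, 1];
/// `c` is the splat's color for one channel. Hypothetical helper.
pub fn blend_row(
    color: &mut [f32; ROW],
    transmittance: &mut [f32; ROW],
    alpha: &[f32; ROW],
    c: f32,
) {
    for i in 0..ROW {
        // Contribution is weighted by the light that still reaches this splat.
        color[i] += transmittance[i] * alpha[i] * c;
        // Whatever this splat absorbs is no longer available to splats behind it.
        transmittance[i] *= 1.0 - alpha[i];
    }
}
```

Each iteration is independent of its neighbors, so the 16 pixels map directly onto one 16-lane vector.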
| 113 | + |
| 114 | +## PR 6 — SplatFrame + SplatRenderer |
| 115 | + |
| 116 | +Double-buffer driver — no microbench; the full-pipeline rasterizer |
| 117 | +bench in a follow-up will exercise it under realistic load. |
| 118 | + |
| 119 | +## PR 7 — End-to-end demo |
| 120 | + |
| 121 | +The demo binary `examples/splat3d_flex.rs` and integration test |
| 122 | +`tests/splat3d_correctness.rs` ship as the e2e regression guards. |
| 123 | +Full-pipeline frame-time numbers (p50/p95/p99) await an Inria bicycle |
| 124 | +scene download — left as a follow-up for the dedicated benchmarking |
| 125 | +session against real-world data. |
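When those frame times are collected, the p50/p95/p99 figures could be computed with a nearest-rank percentile. The helper below is a sketch under that assumption; the benchmarking session may well use a different estimator (for example, linear interpolation).

```rust
/// Nearest-rank percentile of an ascending-sorted, non-empty slice.
/// `p` is in (0, 100]. Hypothetical helper.
pub fn percentile(sorted_ms: &[f64], p: f64) -> f64 {
    assert!(!sorted_ms.is_empty(), "need at least one sample");
    let n = sorted_ms.len();
    // 1-indexed rank = ceil(p * n / 100), clamped into [1, n].
    let rank = (p * n as f64 / 100.0).ceil() as usize;
    sorted_ms[rank.clamp(1, n) - 1]
}
```

Sort the per-frame times once, then query all three percentiles from the same slice.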