> Companion to: `pr-x12-codec-x265-design.md` (the as-shipped HEVC-analog spec) — this doc is the *generalisation* of that spec across the rest of the stack
>
> **Post-merge formalisation (2026-05-22):** the bench / cost / dep-direction claims below have been numbered and pinned in `pr-x12-canon-resolutions-delta.md`:
@@ -120,6 +130,8 @@ This is **what DeepSpeed-ZeRO does informally** with `bf16_compress`, `int8_comp
## 4. Palette / basin codebook — what HEVC SCC tried and missed
> [Codebook lifecycle pinned post-merge as **R-13**: the codec exposes the basin codebook as a swappable handle (LocalEphemeral | SharedClusterWide | SharedRegional | PretrainedStatic). The 4096-entry capacity claim below is unchanged; what's new is that the codebook is *not baked* into the codec — orchestration (q2 / woa-rs) picks the right one per request.]
### 4.1 The 12-bit basin = 4096-entry vocabulary
`MAX_BASIN_IDX = (1 << 12) - 1 = 4095` (`mode.rs:79`). The full 12-bit range addresses 4096 real basins — every `LeafCu` carries an index into a fully-populated per-Heel codebook. No slot is reserved as a sentinel: the HHTL ontology (`Heel > Hip > Twig > Leaf`, see `src/hpc/ogit_bridge/assets/cognitive/entities/Leaf.ttl`) defines the codebook as `16 Hips × 16 Twigs × 16 Leaves = 4096 Leaves per Heel`, every Leaf carrying a real `basinSignature`. Authoring-time uncertainty ("not yet decided") stays in the encoder's `Option<u16>` scratch state and never leaks onto the wire. For:
@@ -171,6 +183,8 @@ This is **the most underrated** of the four mappings. Optimizer research treats
> [Resolved post-merge as **R-5**: per-arch crossover constants, calibrated by Plan G's `codec-bench`. Concrete defaults landed in canon-resolutions-delta §R-5 — SPR=64, ICX=32, Zen4=96, Apple M=256, Graviton=128. See `pr-x12-x265-blasgraph-gemm.md` §2.2 for the full GEMM-form derivation.]
Single 32×32 DCT-II via butterflies: ~80 ops. Same via GEMM (`C = A @ DCT_BASIS`): ~32K ops. **Per-block, butterfly wins by 400×**. But:
- For a 4K frame with ~1024 CUs, batched GEMM amortises hardware fusion
@@ -496,6 +510,8 @@ Six places where blasgraph + MKL change the algorithmic complexity, not just con
### 13.1 Block-matched ME → batched i8gemm (E-7)
> [Pinned as **R-6**: SSD-via-GEMM identity is the canonical ME path; the API lives at `ndarray::hpc::blas_level2::batched_ssd_search`. The 50× win is reproduced in the GEMM-lens companion doc; the bench is asserted by Plan G video lane (R-4).]
Classical ME: SAD over 32×32 window. Reformulate as SSD via `||A||² - 2A·B + ||B||²` — middle term is a GEMM. AVX-512 VNNI `i8gemm_i32` does a whole CTU's motion candidates in one call. **~50× over hand-tuned NEON/AVX2 SAD.**
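
The identity above can be sketched in plain Rust. This is a minimal illustration, not the `batched_ssd_search` API: `gemm_i8_i32` is a hypothetical naive stand-in for the VNNI kernel, and all function names and the row-major layout are assumptions. Note that SSD and SAD are different metrics, so the best candidate under SSD need not be the best under SAD.

```rust
/// Squared L2 norm of each row of a row-major `rows x dim` i8 matrix.
fn row_norms_sq(m: &[i8], rows: usize, dim: usize) -> Vec<i32> {
    (0..rows)
        .map(|r| {
            m[r * dim..(r + 1) * dim]
                .iter()
                .map(|&x| i32::from(x) * i32::from(x))
                .sum::<i32>()
        })
        .collect()
}

/// Hypothetical stand-in for an i8 -> i32 GEMM kernel.
/// Computes out[i][j] = sum_k a[i][k] * b[j][k], i.e. A * B^T,
/// with both operands stored row-major.
fn gemm_i8_i32(a: &[i8], b: &[i8], n: usize, m: usize, dim: usize) -> Vec<i32> {
    let mut out = vec![0i32; n * m];
    for i in 0..n {
        for j in 0..m {
            let mut acc = 0i32;
            for k in 0..dim {
                acc += i32::from(a[i * dim + k]) * i32::from(b[j * dim + k]);
            }
            out[i * m + j] = acc;
        }
    }
    out
}

/// SSD between every source block (row of `a`) and every candidate (row of `b`),
/// using ||A||^2 - 2 A.B + ||B||^2. Only the middle term needs a GEMM.
fn ssd_via_gemm(a: &[i8], b: &[i8], n: usize, m: usize, dim: usize) -> Vec<i32> {
    let na = row_norms_sq(a, n, dim);
    let nb = row_norms_sq(b, m, dim);
    let cross = gemm_i8_i32(a, b, n, m, dim);
    let mut out = vec![0i32; n * m];
    for i in 0..n {
        for j in 0..m {
            out[i * m + j] = na[i] - 2 * cross[i * m + j] + nb[j];
        }
    }
    out
}

/// Direct per-element SSD, kept only as a reference for checking the identity.
fn ssd_direct(a: &[i8], b: &[i8], n: usize, m: usize, dim: usize) -> Vec<i32> {
    let mut out = vec![0i32; n * m];
    for i in 0..n {
        for j in 0..m {
            out[i * m + j] = (0..dim)
                .map(|k| {
                    let d = i32::from(a[i * dim + k]) - i32::from(b[j * dim + k]);
                    d * d
                })
                .sum();
        }
    }
    out
}

/// Index of the lowest-SSD candidate for each source block (first one on ties).
fn best_candidates(ssd: &[i32], n: usize, m: usize) -> Vec<usize> {
    (0..n)
        .map(|i| {
            ssd[i * m..(i + 1) * m]
                .iter()
                .enumerate()
                .min_by_key(|&(_, &v)| v)
                .map(|(j, _)| j)
                .unwrap_or(0)
        })
        .collect()
}
```

For a 32×32 block, `dim` is 1024 and the i32 accumulators stay well within range for i8 inputs. In a real kernel the two norm vectors are cheap to precompute once per frame, leaving the GEMM as the only heavy step.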
### 13.2 Batched DCT-II via MKL sgemm (E-7-variant)
@@ -504,6 +520,8 @@ Per-block butterfly wins for single 32×32. Per-frame batched `C = A_batch @ DCT
### 13.3 CTU partition mode-decision as tropical-GEMM (E-8)
> [Pinned as **R-7**: tropical-GEMM kernel lives in `lance-graph::blasgraph::tropical_gemm`; the codec calls into it. The `ndarray-codec → lance-graph` dep direction was confirmed *allowed* post-merge (both are sibling crates above `ndarray::hpc` and below `woa-rs`). See R-7 in the delta doc for the dep-graph audit.]
x265 spends ~30% CPU on recursive partition RDO. Reformulate: each partition is a node in an 85-node DAG (1 + 4 + 16 + 64 quadtree nodes), edges = split/merge transitions, weights = ΔRDO. Optimal partition = shortest path. blasgraph's tropical-semiring GEMM (`D ← min(D, D ⊗ W)`, where `⊗` is the min-plus product: `+` replaces multiply and `min` replaces add) solves all partitions in **one batched matrix-relax**. `O(4^d)` → `O(d²)` per CTU.
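
As a rough illustration of the min-plus relax, here is a minimal dense sketch in plain Rust. It is not the `lance-graph::blasgraph::tropical_gemm` kernel: the function names, the dense `n x n` layout, the `INF` sentinel, and the repeated-squaring driver are all assumptions made for the example, and it does not model the 85-node partition DAG or reproduce the complexity claim.

```rust
/// "No edge" sentinel. Kept far below i64::MAX so sums cannot overflow.
const INF: i64 = i64::MAX / 4;

/// Min-plus ("tropical") matrix product on row-major `n x n` matrices:
/// c[i][j] = min over k of (a[i][k] + b[k][j]).
fn tropical_gemm(a: &[i64], b: &[i64], n: usize) -> Vec<i64> {
    let mut c = vec![INF; n * n];
    for i in 0..n {
        for k in 0..n {
            let aik = a[i * n + k];
            if aik >= INF {
                continue; // no path i -> k
            }
            for j in 0..n {
                let bkj = b[k * n + j];
                if bkj >= INF {
                    continue; // no path k -> j
                }
                let cand = aik + bkj;
                if cand < c[i * n + j] {
                    c[i * n + j] = cand;
                }
            }
        }
    }
    c
}

/// All-pairs minimum path cost by repeated min-plus squaring.
/// `w[i * n + j]` is the edge weight i -> j, or `INF` if there is no edge.
/// Assumes the graph has no negative cycles (true for any DAG).
fn all_pairs_min_cost(w: &[i64], n: usize) -> Vec<i64> {
    let mut d = w.to_vec();
    // Staying put costs nothing; with a zero diagonal, D (x) D also keeps
    // every path already found, so each squaring doubles the edge budget.
    for i in 0..n {
        d[i * n + i] = d[i * n + i].min(0);
    }
    let mut max_edges = 1;
    while max_edges + 1 < n {
        d = tropical_gemm(&d, &d, n);
        max_edges *= 2;
    }
    d
}
```

With edge weights standing in for ΔRDO, the entry from the root node to a leaf configuration is the cheapest sequence of split/merge decisions. A batched variant would run the same relax over many CTUs' weight matrices at once.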
> · **`pr-x12-bgz-jc-substrate-synergies.md`** (PR-X12 grounded: bgz17/bgz-tensor/bgz-hhtl-d/jc already implement most of the substrate)
## TL;DR
@@ -186,6 +203,8 @@ literature snapshot I'm working from; **claim** is the right word, not
### E1. **`MergeDir` is a topology, not a direction.**
> [Resolved post-merge as **R-9**: the 4-way alphabet *stays* canonical on the wire — `{N, E, W, S}` discriminant is pinned for HEVC compatibility. Wider topologies (6-way 3D, 8-way diagonal-aware) layer *above* the codec via a `Topology<Mode>` trait, but the wire format does not extend. See `pr-x12-canon-resolutions-delta.md` §R-9 for the rationale: extending the wire alphabet to 6/8 ways would invalidate HEVC's 2-bit `header_kind` field and break the goal of being decodable by spec-conformant HEVC tooling.]
`{North, East, West, South}` happens to be a 2D Cartesian raster
mental model. The codec doesn't care. The discriminant alphabet just
needs to be a 4-way categorical over "which of 4 neighbours did I
@@ -271,6 +290,8 @@ The user's "Pertuberationslernen" instinct lands here.
### E9. **The `splat3d` PRs 1-7 (May sprint) and the `codec` PRs are the SAME pipeline shifted 90°.**
> [Formalised post-merge as **R-1**: the unified pipeline lives in `ndarray::hpc::LinearReduce<T>`, decomposing into `Basis<T>` (basis-as-data; DCT, EWA splat, wavelet, k-means prototype all are `Basis<T>` impls) and `Reducer<T>` (the reduction: rANS-encode, alpha-composite, sum-reduce, softmax). The codec body dispatches via the trait and *never imports a specific basis impl* — this is what makes the "same pipeline shifted 90°" claim mechanically real.]
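
To make the `Basis<T>` / `Reducer<T>` split concrete, here is a toy sketch in plain Rust. Every signature below is hypothetical: the real `ndarray::hpc::LinearReduce<T>` traits are not shown in this document, so the method names (`project`, `reduce`), the associated type `Out`, and the example impls (`PairSumDiff`, `SumReduce`, `CountNonZero`) are assumptions chosen only to show how a pipeline body can dispatch through the traits without naming a specific basis.

```rust
/// Hypothetical shape of a basis-as-data trait: maps a block to coefficients.
trait Basis<T> {
    fn project(&self, block: &[T]) -> Vec<T>;
}

/// Hypothetical shape of a reduction trait: folds coefficients into an output.
trait Reducer<T> {
    type Out;
    fn reduce(&self, coeffs: &[T]) -> Self::Out;
}

/// The pipeline body: generic over both traits, so it never names a concrete
/// basis or reducer.
fn linear_reduce<T, B: Basis<T>, R: Reducer<T>>(basis: &B, reducer: &R, block: &[T]) -> R::Out {
    reducer.reduce(&basis.project(block))
}

/// Toy basis: unnormalised pairwise sum/difference (one Haar-like step).
/// A trailing unpaired element, if any, is dropped.
struct PairSumDiff;

impl Basis<i32> for PairSumDiff {
    fn project(&self, block: &[i32]) -> Vec<i32> {
        block
            .chunks_exact(2)
            .flat_map(|p| [p[0] + p[1], p[0] - p[1]])
            .collect()
    }
}

/// Toy reducer: sum of all coefficients.
struct SumReduce;

impl Reducer<i32> for SumReduce {
    type Out = i32;
    fn reduce(&self, coeffs: &[i32]) -> i32 {
        coeffs.iter().sum()
    }
}

/// Toy reducer: number of non-zero coefficients, a crude proxy for coding cost.
struct CountNonZero;

impl Reducer<i32> for CountNonZero {
    type Out = usize;
    fn reduce(&self, coeffs: &[i32]) -> usize {
        coeffs.iter().filter(|&&c| c != 0).count()
    }
}
```

Swapping `SumReduce` for `CountNonZero` changes the output type and meaning without touching `linear_reduce`, which is the property the annotation above describes.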
The splat3d forward pipeline is: project → tile-bin → mode-decide
(which Gaussian contributes at which pixel) → alpha-composite. The
codec pipeline is: build codebook → block-partition → mode-decide
@@ -468,6 +489,8 @@ codec for the manifold of predictable codebook-coded signals."*
> [Committed post-merge as **R-10**: sub-1-bit-per-token where the source distribution supports it (heavy-tailed residual after basin lookup). The mechanism is basin codebook (12-bit fingerprint → 4096 entries) + Gaussian-tail rANS, both already in scope. Falsifier: Plan G entropy bench at < 1.0 bit-per-token on the held-out Bbb/3DGS test corpus. See R-10 in the delta doc and `pr-x12-anti-neural-lookup-inversion.md` §3.1 for why this lookup-table substrate hits the Shannon bound within ε ≤ 0.2 dB.]
Stock 3DGS: ~250 bytes/Gaussian raw, ~50 bytes after PLY-trim.
PR-X12 mode-coded + A7 rANS: ~3-8 bits/Gaussian for the dominant
modes. **30-60× over current state of the art.** A 1M-Gaussian