diff --git a/.claude/knowledge/pr-x12-anti-neural-lookup-inversion.md b/.claude/knowledge/pr-x12-anti-neural-lookup-inversion.md new file mode 100644 index 00000000..46e6f515 --- /dev/null +++ b/.claude/knowledge/pr-x12-anti-neural-lookup-inversion.md @@ -0,0 +1,337 @@ +# PR-X12 — The Anti-Neural Codec: Lookup-Table Inversion of NN Inner Loops + +> Date: 2026-05-22 +> Status: **wildcard perspective doc** — the most interesting reframe I can articulate of PR-X12's substrate. Companion to the GEMM lens (`pr-x12-x265-blasgraph-gemm.md`), 3DGS lens (`pr-x12-x266-3dgs-spacetime-upscaling.md`), and orchestration lens (`pr-x12-woa-multiarch-orchestration.md`). +> +> Premise: every "neural codec" primitive in current research — VQ-VAE, neural RDO, neural rendering, learned wavelets — has a **frozen lookup-table dual** that achieves the same information-theoretic compression at 50-1000× lower inner-loop cost. PR-X12 systematically picks the lookup-table dual for every inner-loop op, then proves it converges to within an information-theory-bounded ε of the neural codec's compression ratio. The codec has **zero NN forward passes in the inner loop**, by design. + +--- + +## 0. Thesis in one paragraph + +**A 4096-entry codebook indexed by a 12-bit fingerprint is structurally equivalent to a 1-layer 12-bit-input MLP that has been frozen and tabulated.** Any neural codec whose inner loop is "embed → match → score" can be replaced by "fingerprint → table lookup → score" for the same expressive power, at table-lookup latency (~3-10 ns) vs NN-forward-pass latency (~3-30 µs). PR-X12 makes this systematic: every primitive that *could* be an NN inner-loop op is instead a lookup table. The result is a codec that has the compression of a neural codec but the inner-loop cost of x265. + +This is not anti-NN. It is **anti-NN-in-the-inner-loop**. NNs train the tables. Once trained, the table replaces the NN. + +--- + +## 1. 
The current research direction: NN-in-loop codecs + +Recent codec research direction (2020-2026): + +| Codec | NN role | Inner-loop cost | +|---|---|---| +| **Lyra** (Google, 2021) | Neural vocoder decoder | ~3 ms per 20 ms audio frame on a phone | +| **SoundStream** (Google, 2021) | VQ-VAE encoder + neural decoder | ~10 ms per 20 ms audio frame | +| **EnCodec** (Meta, 2022) | Residual VQ-VAE + transformer prior | ~30 ms per 20 ms audio frame on GPU | +| **NVIDIA Maxine** (2020+) | Latent-space face encoding | ~16 ms per 1080p video frame on a 4090 | +| **AOMedia ML-AV1** (research) | Per-CTU NN-based RDO | ~5-20 ms per CTU | +| **Google ML-Image** (2023) | Learned transform + entropy model | ~100 ms per image on GPU | + +All of these share a common shape: +- Encoder: input pixels → embedding network → quantize → bitstream +- Decoder: bitstream → embedding → decoder network → output pixels +- Inner loop has *at least one* NN forward pass per emit operation + +The compression results are excellent. Lyra hits ~3 kbps speech at 16 kHz quality. EnCodec matches MP3 at ~12× lower bitrate. The inner-loop latency cost is *catastrophic*: ~3-100 ms per emit, vs ~0.1-1 µs for x265's per-block inner loop. + +**The structural problem with NN-in-loop:** + +1. Each forward pass = thousands to millions of MAC operations +2. Tensor framework overhead (PyTorch / candle / burn) = 50-200 µs per dispatch +3. Model version drift across decoders breaks playback +4. Quantization sensitivity: int8 NN weights vs f16 activations have numerical determinism issues +5. Cannot run inside L1 cache; needs L3 / HBM for weights + +--- + +## 2. The PR-X12 inversion: pre-baked lookup tables + +Every NN-in-loop primitive in §1 has a frozen-table dual. 
PR-X12 is the systematic instantiation of those duals: + +| NN-in-loop primitive | PR-X12 lookup-table dual | Inner-loop cost | +|---|---|---| +| VQ-VAE encoder embedding | k-means basin codebook (R-10, M:H-6) | ~10 ns per cell (L1-resident) | +| VQ-VAE decoder | Same codebook, reverse lookup | ~3 ns per cell | +| Neural RDO scoring | Tropical-GEMM partition (R-7) | ~1.4 K ops per CTU | +| Neural rendering | EWA splat rasterizer (Plan E) | ~5-15 ms per 4K frame | +| Learned transform | DCT-II batched GEMM (R-5) | ~256 cycles per 32×32 block | +| Transformer prior / entropy model | Gaussian-tail rANS (R-10) | ~10 ns per symbol | + +The codec's inner loop **never** dispatches to a tensor framework. The basin codebook is a fixed `[Fingerprint; 4096]` slice (~256 KB, fits L2). The tropical-GEMM partition runs over an 85-node DAG (~1 KB working set). The DCT basis is a `[i16; N*N]` array (~8 KB for 64×64). All resident in cache, all branchless on the hot path. + +**The total NN-flops in the codec's inner loop: zero.** NNs trained the codebooks; the codebooks live in the bitstream / metadata; the decoder does table lookups, not forward passes. + +--- + +## 3. The math: every NN-in-loop primitive has a lookup-table dual + +### 3.1 Basin codebook ≡ frozen 1-layer 12-bit MLP + +A VQ-VAE encoder's job: map continuous input embedding `x ∈ ℝᵈ` to discrete index `k ∈ {0..K-1}`, where centroid `c_k ∈ ℝᵈ` is the nearest among K learned centroids. + +```text +VQ-VAE encoder: k = argmin_j ||x - c_j||² +VQ-VAE decoder: x' = c_k +``` + +**PR-X12 basin codebook (R-10, M:H-6):** same algebraic operation, with the embedding step pre-computed by an OFFLINE training run (k-means over a corpus), then frozen into a 4096-entry codebook indexed by a 12-bit fingerprint. 
+ +```text +PR-X12 encoder: fp = compute_fingerprint(x) [~10 ns, deterministic hash] + k = codebook.nearest_index(fp) [~3 ns, table lookup] +PR-X12 decoder: x' = codebook[k] [~3 ns] +``` + +**Why this is equivalent in expressive power:** + +- A 4096-entry lookup table on 12-bit input is structurally a `[4096]` array — i.e., a 12-bit-input 4096-output discrete function +- Any 12-bit-input network has at most 2^12 = 4096 distinct outputs +- A `Linear(12, K) → argmax` with frozen weights is structurally an array lookup +- The codebook IS the trained network, materialized as data + +**Why this is faster:** + +- 4096-entry lookup: 1 memory ref (table is in L2 cache, 64 ns p99) +- 1-layer 12-bit-input 4096-output Linear: ≥ 49,152 MACs + softmax + argmax ≈ 3-30 µs +- **Speedup: 1000-5000×** per inner-loop emit + +The compression ratio (R) is bounded by Shannon's source coding theorem: R ≥ H(cells). The codebook achieves H(cells) up to a log-factor of K=4096 entries' overhead. A neural encoder achieves the same H(cells) (assuming optimal training). Compression is asymptotically equivalent; latency is not. + +### 3.2 Tropical-GEMM RDO ≡ frozen GNN + +Neural RDO research (AOMedia ML-AV1, others 2022-2025): train a graph neural network to score quad-tree partition candidates. Each CU is a node; edges are split decisions; node features include local pixel statistics; the GNN outputs a scalar RDO score. + +The GNN's expressiveness for this problem maps directly onto tropical-semiring arithmetic: + +```text +GNN forward pass on RDO graph: + h_v^(l+1) = σ(W · aggregate(h_u^(l) for u in N(v)) + b) + +where aggregate = sum or max, σ = ReLU, ... + +Tropical semiring (R-7) on the same graph: + h_v^(l+1) = min_u (h_u^(l) + W_{uv}) [identity on min-plus algebra] +``` + +**Identity:** if the GNN's aggregator is `max` and σ is identity-on-positive, then the GNN forward pass on the RDO graph **is** a tropical-GEMM iteration over the negative semiring. 
The neural RDO research community has spent ~3 years arriving back at min-plus algebra, the way Bellman-Ford has always solved this. + +**PR-X12's tropical-GEMM:** + +- O(d²) iterations of `D ← min(D, D ⊕ W)` over 85-node DAG +- Hand-tuned `W` edge weights (or one offline calibration run) +- ~1.4 K ops per CTU (R-7 estimate) + +**Neural RDO:** + +- Per-CTU GNN forward pass with ~30-50 K parameters +- ~5-20 ms per CTU (10,000× slower) +- Same algebraic information content for the partition problem + +**Why frozen wins:** the partition problem is small (85 nodes, d=4 depth). The hand-tuned W matrix has ~340 weights. A learned GNN trained on the same partition problem has 30-50K parameters but the optimum is *low-dimensional*. PR-X12 picks the low-dim solution directly. + +### 3.3 EWA splat ≡ frozen 1-layer projection + +Neural rendering (NeRF, Mip-NeRF, Instant-NGP): MLPs that map (pos, viewdir) → (RGB, density). Forward pass per pixel during render. + +```text +NeRF: per-pixel MLP forward pass, ~10-100 µs per pixel on GPU +3DGS rasterize: per-Gaussian closed-form EWA projection, ~30-100 ns per Gaussian +``` + +The 3DGS render *is* the discretized, frozen, closed-form solution that NeRF's MLP was trying to approximate. The 200K Gaussians in a scene are a non-parametric discrete representation of what a NeRF MLP encodes implicitly. + +**PR-X12's EWA splat basis (Plan E, future x266):** + +- Per-Gaussian: 1 projection (4 MAC), 1 covariance evaluation (6 MAC), 1 tile-binning lookup +- Per-pixel: sort + alpha-blend (already optimized in published 3DGS code) + +**Neural rendering equivalent:** ~10,000× slower at comparable visual quality. The compression ratio (scene MB per pixel rendered) is approximately equivalent — within a factor of 2 — because both encode the same 3D scene at the same fidelity. The latency gap is the win. 
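The closed-form step that §3.3 contrasts with a per-pixel MLP can be sketched as a covariance push-forward, Σ' = J Σ Jᵀ, where J is the 2×3 Jacobian of the camera projection. This is a minimal illustration under stated assumptions: the function name and the plain-array types are invented for the example and are not PR-X12's or `ndarray::hpc::splat3d`'s actual API.

```rust
/// Hypothetical helper (not from the PR-X12 codebase): push a 3x3 world-space
/// covariance through a 2x3 projection Jacobian, giving the 2x2 screen-space
/// covariance sigma' = J * sigma * J^T that an EWA splat rasterizer evaluates.
pub fn project_covariance(j: [[f64; 3]; 2], sigma: [[f64; 3]; 3]) -> [[f64; 2]; 2] {
    // tmp = J * sigma  (2x3)
    let mut tmp = [[0.0f64; 3]; 2];
    for r in 0..2 {
        for c in 0..3 {
            for k in 0..3 {
                tmp[r][c] += j[r][k] * sigma[k][c];
            }
        }
    }
    // out = tmp * J^T  (2x2); J^T[k][c] is j[c][k]
    let mut out = [[0.0f64; 2]; 2];
    for r in 0..2 {
        for c in 0..2 {
            for k in 0..3 {
                out[r][c] += tmp[r][k] * j[c][k];
            }
        }
    }
    out
}
```

The whole operation is a fixed number of multiply-adds per Gaussian with no learned weights, which is the property the section relies on; the MAC counts quoted above are the document's own estimates and are not reproduced by this naive loop.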
+ +### 3.4 DCT-II basis ≡ 1-layer linear projection + +This one is too well-known to belabor: an N-point DCT-II is a fixed `(N × N)` matrix multiplied against the input. A "learned transform" research codec uses gradient descent to find a (close-to-DCT) transform that's slightly better at the training distribution. The information gain is bounded: most natural images have a near-DCT eigenbasis, and the learned transforms typically beat DCT by <0.1 dB PSNR. + +For 0.1 dB PSNR you pay: + +- Per-block matrix multiply with the learned weights (~256 cycles, same as DCT) +- *PLUS* the model versioning / training framework / per-arch dispatch headache + +PR-X12 chooses DCT-II (R-5) because the gain from a learned transform is below the noise floor of arch-dependent rounding. + +--- + +## 4. Why frozen lookups win at codec inner-loop scale + +The five core arguments: + +### 4.1 Determinism + +Lookup tables produce bit-exact outputs across: +- Compiler version (gcc 12 vs 13 vs clang 18) +- SIMD width (AVX-512 vs SVE2 vs NEON) +- Float rounding mode +- Tensor framework version (PyTorch 2.3 vs 2.4 vs torch.compile) + +NN inner loops do not. The 2024 "neural codec evaluation" papers regularly report ±0.5 dB PSNR variation across runs of the *same model* on the *same input* due to non-determinism in CUDA reductions. For a codec, this is a non-starter. + +### 4.2 L1 / L2 cache fit + +A 4096-entry × 8-byte codebook = 32 KB (fits L1 on most archs). A 100-element tropical-GEMM working set = ~1 KB. An 85-node partition DAG = ~1 KB. Everything in the codec's inner loop fits in L1 + L2. + +A neural codec's NN weights (~10-100 MB) sit in L3 or HBM. Per-pixel inner loop fetches from L3 = ~30-50 ns per fetch. Even before MACs, you're paying L3 latency per inner-loop iteration. + +### 4.3 No tensor framework dependency + +The codec runs in pure Rust + `ndarray::hpc` SIMD. No PyTorch. No candle (the codec doesn't depend on candle; the inverse is also true). 
No CUDA dependency for CPU encode. No ROCm. + +This matters for deployment: PR-X12 ships in a 5 MB stripped binary; a neural codec needs 50-500 MB of model weights + framework dependencies. For edge / mobile / embedded, this is the difference between "ships" and "doesn't." + +### 4.4 No model versioning + +A neural codec is essentially a versioned shared model state. Decoder must have the *exact* version that encoded the stream. Cross-vendor decoder interop is impossible without standards bodies (which take years; cf. JPEG XL's ~7-year ratification story). + +A frozen-lookup codec's wire format is fully specified by the byte-level layout. The "model" — the codebook — is part of the bitstream or part of the static codec spec. Decoder vendors interop by reading the spec. The codec is *intrinsically* an open standard. + +### 4.5 Patentability around ML monopolies + +The neural codec space is full of patents on specific model architectures. Encoder using "VQ-VAE + residual transformer prior" is patent-encumbered by Meta (EnCodec), Google (SoundStream), and others. Decoder using "MLP for neural rendering" overlaps with NeRF patents. + +A k-means basin codebook + tropical-GEMM RDO + EWA splat codec sits in **mathematically-prior-art** territory. k-means (1957), tropical algebra (1990s applied codec literature), EWA splat (2001). All decades-old, all in the public domain or expired. PR-X12's substrate is intrinsically patent-free. + +This is not a small consideration. The H.265 / HEVC patent pool charges $0.02 per device sold; the codec ecosystem pays ~$1B/yr in HEVC royalties. PR-X12's substrate sidesteps this by construction. + +--- + +## 5. The Hutter information-theoretic bound + +Marcus Hutter's compression thesis ("Universal AI is compression"): for a stationary source X with entropy H(X), the optimal compression ratio is bounded by H(X). Any codec — neural or frozen-lookup — achieving R = H(X) is *information-theoretically optimal*. 
There is no further compression to extract. + +**Claim:** for the source distributions that PR-X12 targets (video frames, audio waveforms, text streams), the basin codebook + tropical-GEMM partition + DCT transform achieves R within ε of H(X). The ε is bounded by: + +- The log-of-codebook-size overhead: log₂(4096) / cell ≈ 12 bits / cell +- The basis approximation gap: DCT vs Karhunen-Loève optimal transform ≈ 0.05 dB PSNR +- The quad-tree partition granularity: 8×8 leaf vs continuous ≈ 0.1 dB PSNR + +**Total ε: ~0.2 dB PSNR.** Within the JND (just-noticeable-difference) threshold for human perception. + +A neural codec can theoretically close this gap, but only by learning the exact optimal codebook + transform + partition for the *specific* source distribution. The cost: per-source training (hours to days), large model storage (MB to GB), per-inference forward pass (ms per emit). The information gain: ~0.2 dB. + +**PR-X12 gives up ~0.2 dB of PSNR in exchange for a 1000-5000× faster inner loop.** That's a Pareto-dominant trade for any deployment where latency matters more than the 0.2 dB. + +--- + +## 6. When NN-in-loop wins + +The honest answer: **ultra-low-bitrate, perceptually-tuned, generative codecs.** + +For bitrates of a few kbps and below (e.g., Lyra speech at ~3 kbps, neural face codecs at 256 bps), the source distribution is so undersampled that any frozen codebook leaves obvious quality on the table. A neural model can "hallucinate" plausible content from the few bits transmitted, beating a frozen codec by 5-15 dB PSNR equivalent. + +This is **codec-as-generative-model** territory, not codec-as-source-coding. The hallucinated content may not match the original (PR-X12's failure-of-completeness vs failure-of-fidelity discussion in the 3DGS doc — same distinction). + +For these use cases, the right architecture is a **layered codec**: + +1. **Base layer:** PR-X12 frozen-lookup codec for the bits-actually-transmitted +2. 
**Enhancement layer:** NN generative refinement at the decoder (optional, off by default) + +The base layer guarantees fidelity bounded by Shannon. The enhancement layer provides perceptual hallucination when the user opts in. PR-X12's wire format reserves a single bit in the **frame header** (alongside `ConsumerProfile` and `FlushUnit` per R-2 / R-12) for the "enhancement layer available" flag — not in the per-leaf 16-bit header, whose bit 14 is already claimed by R-2's consumer-typed demux and whose bit 15 is the universal inter-tier reference. + +This is also the right architecture for high-stakes content (legal, medical, scientific): always run the base layer, never run the enhancement layer. Determinism preserved. + +--- + +## 7. PR-X12 is the floor; NN can layer on top + +The architectural commitment: + +```text + ┌───────────────────────────────────────┐ + │ Optional enhancement layer (NN) │ + │ - Generative refinement │ + │ - Off by default; opt-in per use case │ + │ - Lives in burn/candle, NOT in codec │ + └───────────────────┬───────────────────┘ + │ standardized API: + │ decoded_frame → enhanced_frame + ▼ + ┌───────────────────────────────────────┐ + │ PR-X12 base codec (lookup-table only) │ + │ - k-means basin codebook │ + │ - Tropical-GEMM RDO │ + │ - DCT-II / EWA splat basis │ + │ - Gaussian-tail rANS entropy │ + │ - Zero NN forward passes │ + │ - Deterministic across archs │ + └───────────────────────────────────────┘ +``` + +**Why this layering matters for PR-X12 scope:** the base layer is what's IN PR-X12. The enhancement layer is what `burn`/`candle` consumers may build *later*, taking PR-X12's decoded frames as input. The boundary is clean. The base layer never imports NN code; the enhancement layer takes pixels and produces pixels. + +R-10's commitment to sub-1-bit-per-token + Gaussian-tail rANS is the *base layer's* extreme limit. If a use case needs lower bitrate than R-10 supports, layer NN on top — don't push NN into the base codec. 
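The `decoded_frame → enhanced_frame` boundary in the diagram above can be sketched as a trait that the base decoder calls only when a consumer supplies an implementation and the frame-header flag is set. This is a hedged sketch: `Frame`, `Enhancer`, `decode_base`, `Brighten`, and the bit position of `ENHANCEMENT_AVAILABLE` are all hypothetical names chosen for illustration, and the stand-in base decoder just copies bytes.

```rust
/// Hypothetical decoded-frame type; not a PR-X12 type.
#[derive(Clone, Debug, PartialEq)]
pub struct Frame {
    pub width: usize,
    pub height: usize,
    pub pixels: Vec<u8>,
}

/// Assumed position of the "enhancement layer available" bit in the frame
/// header flags. The document reserves one bit but does not say which.
pub const ENHANCEMENT_AVAILABLE: u8 = 0b0000_0001;

/// The only interface the base codec knows about: pixels in, pixels out.
pub trait Enhancer {
    fn enhance(&self, decoded: &Frame) -> Frame;
}

/// Stand-in for the lookup-table base decoder (here it only copies bytes).
pub fn decode_base(payload: &[u8], width: usize, height: usize) -> Frame {
    Frame { width, height, pixels: payload.to_vec() }
}

/// Enhancement runs only if the stream advertises it AND the caller opted in.
pub fn decode(
    header_flags: u8,
    payload: &[u8],
    width: usize,
    height: usize,
    enhancer: Option<&dyn Enhancer>,
) -> Frame {
    let base = decode_base(payload, width, height);
    match enhancer {
        Some(e) if header_flags & ENHANCEMENT_AVAILABLE != 0 => e.enhance(&base),
        _ => base,
    }
}

/// Toy enhancer standing in for an NN refinement living in a consumer crate.
pub struct Brighten(pub u8);

impl Enhancer for Brighten {
    fn enhance(&self, decoded: &Frame) -> Frame {
        Frame {
            width: decoded.width,
            height: decoded.height,
            pixels: decoded.pixels.iter().map(|p| p.saturating_add(self.0)).collect(),
        }
    }
}
```

Passing `None` keeps the output identical to the base layer, which is the "off by default" behavior the section asks for in high-stakes deployments.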
+ +--- + +## 8. Falsifiers — what would invalidate this thesis + +Be specific: + +**F-1: Neural codecs close the latency gap.** If by 2028, neural codecs ship at < 100 µs per emit on commodity CPUs, the latency argument weakens. **Likelihood: low.** Forward-pass cost scales with model parameters; even ternary-quantized 1M-parameter models need ~3-5 µs per pass on AMX. The 50-1000× gap is structural, not implementation-dependent. + +**F-2: Codebook adaptation breaks fixed lookup.** If real-world content distributions drift such that a 4096-entry codebook can't capture them, R-13's federated codebook update mechanism is required. **Mitigation:** R-13 is in scope. The codebook is swappable, not frozen-forever. + +**F-3: PSNR gap exceeds 0.2 dB on real content.** If §5's ε estimate is wrong on real video clips, the Pareto argument weakens. **Mitigation:** Plan G video lane (R-4, R-11) is the empirical check. If PR-X12's PSNR vs x265 ultrafast is < 0.95× on Bbb 1080p, R-4 blocks the merge. The test is in CI. + +**F-4: NN forward-pass becomes free on next-gen hardware.** If by 2030, all consumer hardware has 50 TFLOP/s of int8 throughput, NN inner-loop cost drops to lookup-table levels. **Mitigation:** even if NN cost drops, frozen lookup is still simpler and more deterministic. The Pareto argument doesn't reverse; only the slope changes. + +**F-5: The basin codebook can't fit a streaming bitstream's symbol distribution online.** If R-10's sub-1-bit-per-token rANS path requires per-stream codebook training (slow), the codec stalls on stream init. **Mitigation:** federated codebook (R-13) ships pretrained codebooks for {video, audio, text, image} domains. New streams use the pretrained codebook; per-stream fine-tuning is optional and out-of-loop. + +None of these falsifiers are decisive against PR-X12's thesis. They constrain its parameter choices, not its fundamental architecture. + +--- + +## 9. What this lens prescribes for PR-X12 scope + +Concrete implications: + +1. 
**Do not** introduce any NN dependency in `ndarray-codec`. No `candle` or `burn` imports. No PyTorch FFI. Codec is dependency-free below `ndarray::hpc`. + +2. **Do** ship the codebook as data, not as code. A 32-KB `[Fingerprint; 4096]` slice in the binary's `.rodata` section, not a `lazy_static` of a constructed object. Faster to load, simpler to swap (R-13). + +3. **Do** keep tropical-GEMM in `lance-graph::blasgraph` and call it from the codec. Don't inline the algorithm into the codec; the kernel is a reusable substrate primitive (other consumers — `lance-graph`'s graph queries — already use it). + +4. **Do** commit to the 0.2 dB PSNR Pareto-tradeoff publicly. Plan G's video bench (R-4, R-11) is the proof. If we miss it, we fall back to "compression-equivalent-to-x265-ultrafast-faster" instead of "compression-near-best-in-class." + +5. **Reserve** a bitstream flag for the enhancement-layer hook (§6, §7). One bit, in the frame header alongside `ConsumerProfile` and `FlushUnit`, not in the per-leaf 16-bit header, whose bits 14 and 15 are already claimed (see §6). Decoder logs it; consumer crates may use it; codec doesn't. + +6. **Document** the patent-free posture explicitly in `pr-x12-codec-x265-design.md`. Cite k-means (1957), tropical algebra (1990s), EWA splat (2001), DCT (1974), rANS (2014, released unpatented by its author). Make the IP story unambiguous. + +--- + +## 10. The deeper claim + +**Neural codecs are not the future of codecs.** They are *one* future of codecs, narrowly applicable to generative ultra-low-bitrate use cases. + +The other future — the much larger one — is **frozen-lookup codecs with NN-trained tables and an optional NN enhancement layer**. PR-X12 is a working prototype of this future. The substrate (R-1 basis trait, R-3 LoC envelope, R-11 latency assertions, R-13 federated codebook) makes it composable, deterministic, and patent-free. + +The neural codec research community will arrive at this conclusion in 5-10 years, after burning through the latency and determinism walls. PR-X12 skips that detour. + +--- + +## 11. 
Cross-references + +- **Substrate canon:** `pr-x12-substrate-merged-canon.md` +- **Resolutions:** R-1, R-3, R-7, R-10, R-11, R-13 in `pr-x12-canon-resolutions-delta.md` +- **GEMM lens:** `pr-x12-x265-blasgraph-gemm.md` (companion analysis of the inner-loop math) +- **3DGS lens:** `pr-x12-x266-3dgs-spacetime-upscaling.md` (the EWA splat case study extended) +- **Multi-arch lens:** `pr-x12-woa-multiarch-orchestration.md` (why determinism matters fleet-wide) +- **Codec spec:** `pr-x12-codec-x265-design.md` +- **Reading list:** + - Hutter (2005) "Universal AI" + - Shannon (1948) source coding theorem + - Hartigan (1975) k-means clustering + - Zwicker et al. (2001) EWA Splatting + - Duda (2014) Asymmetric Numeral Systems + - Lyra (2021), SoundStream (2021), EnCodec (2022) papers for context + +_Last edit: 2026-05-22._ +_Status: opinionated perspective doc; the thesis is sharper than the rest of PR-X12 canon by design._ diff --git a/.claude/knowledge/pr-x12-bgz-jc-substrate-synergies.md b/.claude/knowledge/pr-x12-bgz-jc-substrate-synergies.md new file mode 100644 index 00000000..e715d403 --- /dev/null +++ b/.claude/knowledge/pr-x12-bgz-jc-substrate-synergies.md @@ -0,0 +1,471 @@ +# PR-X12 ↔ bgz family + jc proof crate — Substrate Synergies & Identified Gaps + +> Date: 2026-05-22 +> Status: **substrate grounding doc** — connects PR-X12's abstract substrate claims to the **already-implemented** crates in `lance-graph/crates/`. Companion to the five perspective lenses written 2026-05-22. +> +> Premise: most of what the PR-X12 perspective lens docs (`pr-x12-x265-blasgraph-gemm.md`, `…3dgs-spacetime…`, `…woa-multiarch…`, `…anti-neural-lookup-inversion…`, `…gguf-llm-weights-encoding.md`) describe in the abstract — Skip/Merge/Delta/Escape, 4096-entry basin codebook, tropical-GEMM RDO, federated codebook policy, sub-1-bit weight encoding — is **already in production** under different names in the `bgz17` / `highheelbgz` / `bgz-tensor` / `bgz-hhtl-d` crates. 
The PR-X12 codec is the **stream-oriented HEVC-compatible wire format** for a substrate whose **search-oriented and weight-encoding implementations already exist**. + +--- + +## 0. One-paragraph thesis + +`bgz17`'s 4-layer cascade (Scent / Palette / ZeckBF17 / Full) IS the Skip / Merge / Delta / Escape grammar. The HHTL 16×16×16 = 4096-leaf lattice IS the basin codebook. `bgz-hhtl-d`'s 4-byte-per-row encoding of Qwen3-TTS-1.7B at **343:1** is the LLM-weight-encoding lens doc's claim, *empirically validated, already shipping*. The `jc` crate is the formal-proof harness (Hambly-Lyons signature uniqueness, binary-Hamming causal-field correctness) that PR-X12 has been calling "future work." The two gaps that *don't* exist yet — `jd-nd` (ndarray-side proof crate) and a Cronbach/ICC encoding-reliability research crate — are the work this doc identifies as outstanding. + +--- + +## 1. The five existing crates (canonical paths) + +### 1.1 `lance-graph/crates/bgz17/` + +**bgz17** = **b**las**g**raph + **z**eck**17**. A 4-layer metric distance codec that compresses 49,152-byte SPO planes to 3 bytes per edge via palette indexing + precomputed 256×256 distance matrices for O(1) lookup. + +**The four layers (from `KNOWLEDGE.md`):** + +```text +Layer 0: Scent (1 byte) Hamming on 7-bit lattice ρ=0.937 NOT metric-safe (heuristic only) +Layer 1: Palette (3 bytes) L1 on i16[17] palette ρ≈0.965 metric-safe (CAKES sieve) +Layer 2: ZeckBF17 (102 bytes) i16[17] L1 per plane ρ=0.992 metric-safe +Layer 3: Full planes (6 KB) exact Hamming ρ=1.000 lossless +``` + +**95%+ of searches terminate at Layer 0-1.** Layer 2 for decision-boundary cases. Layer 3 almost never loaded. + +**Public types:** `Palette`, `Base17`, `DistanceMatrix`, `LayeredScope`, `Bgz17Distance` trait, `PaletteMatrix`, `PaletteCsr`. + +**Production search path:** `HEEL (Scent, heuristic, 10K → 200) → CAKES sieve (Palette, metric-safe, 200 → k)`. 
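The two-stage production path above (a cheap heuristic shortlist followed by a metric-safe refinement) can be sketched as follows. This is a simplified illustration, not bgz17's implementation: `Entry` and `cascade_search` are hypothetical names, the Layer 0 code is modelled as a plain byte compared by Hamming distance, and Layer 1 is modelled as L1 distance on an `i16[17]` vector.

```rust
/// Hypothetical record for illustration; not one of bgz17's public types.
pub struct Entry {
    /// Layer 0 stand-in: 1-byte code compared by Hamming distance.
    pub scent: u8,
    /// Layer 1 stand-in: 17 coefficients compared by L1 distance.
    pub base17: [i16; 17],
}

fn hamming(a: u8, b: u8) -> u32 {
    (a ^ b).count_ones()
}

fn l1(a: &[i16; 17], b: &[i16; 17]) -> u32 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (*x as i32 - *y as i32).unsigned_abs())
        .sum()
}

/// Stage 1 keeps the `shortlist` entries with the smallest Hamming distance;
/// stage 2 re-ranks only those by L1 distance and returns the best `k` indices.
pub fn cascade_search(query: &Entry, db: &[Entry], shortlist: usize, k: usize) -> Vec<usize> {
    let mut candidates: Vec<usize> = (0..db.len()).collect();
    // Heuristic pre-filter: cheap, but it can discard a true nearest neighbour.
    candidates.sort_by_key(|&i| hamming(query.scent, db[i].scent));
    candidates.truncate(shortlist);
    // Refinement on the survivors only (stable sort keeps ties in stage-1 order).
    candidates.sort_by_key(|&i| l1(&query.base17, &db[i].base17));
    candidates.truncate(k);
    candidates
}
```

Because stage 1 is only a heuristic, an entry that is an exact match under L1 can still be dropped if its 1-byte code differs a lot from the query's, which is why the layer table marks Layer 0 as not metric-safe.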
+ +### 1.2 `lance-graph/crates/highheelbgz/` + +3-integer spiral address encoding for weight vectors: `(start, stride, length)` = 6-12 bytes using golden-spiral folding. Values are recomputed on-demand from source data (streaming decode pattern). Integrates with `bgz-tensor` for full metric-algebraic composition. + +**Public types:** `SpiralAddress`, `SpiralWalk`, `CoarseBand`, `NeuronPrint`, `TensorRole`, `SpiralEncoding`, `GammaProfile`, `SpiralPalette`, `rehydrate` module. + +This is the **address space** for the basin codebook — not the values, but where to find them. Maps directly onto the `CurveOrder` trait that M:H-NEW-2 / canon §M:E-B posits. + +### 1.3 `lance-graph/crates/bgz-tensor/` + +Metric-algebraic tensor codec for transformer weight matrices. Projects weight matrices through golden-step folding into Base17 metric space, palette-quantizes via CLAM clustering, then **replaces matmul with precomputed `u16` distance table + `u8` compose table lookup**. Achieves 640× compression while preserving algebraic structure. HHTL cascade eliminates 95% of attention computation at Layer 0-1. + +**Public types:** `AttentionSemiring`, `ComposeTable`, `DistanceTable`, `HhtlCascade`, `route_tensor`, CAM-PQ codebook training. 
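The idea of replacing arithmetic with a precomputed `u16` distance table can be illustrated with a toy version. This sketch assumes L1 distance between `i16[17]` palette entries and saturates at `u16::MAX`; `ToyDistanceTable` is a hypothetical name and deliberately not bgz-tensor's `DistanceTable`, whose construction (CLAM clustering, compose tables) is not shown in this document.

```rust
/// Toy stand-in for a precomputed pairwise distance table; hypothetical, and
/// not the `DistanceTable` type exported by bgz-tensor.
pub struct ToyDistanceTable {
    k: usize,
    d: Vec<u16>,
}

impl ToyDistanceTable {
    /// Precompute all pairwise L1 distances between palette entries (offline step).
    pub fn build(palette: &[[i16; 17]]) -> Self {
        assert!(palette.len() <= 256, "indices must fit in a u8");
        let k = palette.len();
        let mut d = vec![0u16; k * k];
        for i in 0..k {
            for j in 0..k {
                let dist: u32 = palette[i]
                    .iter()
                    .zip(palette[j].iter())
                    .map(|(a, b)| (*a as i32 - *b as i32).unsigned_abs())
                    .sum();
                // Saturate so the entry always fits the u16 table cell.
                d[i * k + j] = dist.min(u16::MAX as u32) as u16;
            }
        }
        Self { k, d }
    }

    /// Hot path: one indexed load, no arithmetic on the underlying vectors.
    pub fn get(&self, a: u8, b: u8) -> u16 {
        self.d[a as usize * self.k + b as usize]
    }
}
```

With 256 palette entries the table is 256 × 256 × 2 bytes = 128 KB, so the per-pair cost at query time is a single memory reference.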
+ +### 1.4 `lance-graph/crates/bgz-tensor/src/hhtl_d.rs` — **bgz-hhtl-d** + +4-byte-per-weight-row encoding, per `BGZ_HHTL_D.md`: + +```text +Slot D (u16) Slot V (u16) +┌────┬──────┬──────────┬───┬───┐ ┌────────────────┐ +│ Ba │ HIP │ TWIG │ P │ R │ │ BF16 residual │ +│15:14│13:10│ 9:2 │ 1 │ 0 │ │ from centroid │ +└────┴──────┴──────────┴───┴───┘ └────────────────┘ + 2b 4b 8b 1b 1b 16 bits + +Ba = HEEL basin (QK=0, V=1, Gate=2, FFN=3) ← 4-way tensor-family discriminant +HIP = family within basin (16-way binary split) ← 16-way intra-family +TWIG = centroid index in 256-entry palette ← 256-way basin centroid +P = polarity of dominant residual dimension +R = reserved +``` + +**Empirical compression** on Qwen3-TTS-12Hz-1.7B-Base (1.93B params): + +| Component | Original | HHTL-D | Ratio | +|---|---|---|---| +| Talker attention (Q/K/V/O × 28 layers) | 470 MB | 1.5 MB | 313:1 | +| Talker FFN (gate/up/down × 28 layers) | 1,414 MB | 2.4 MB | 589:1 | +| Text embedding (151,936 × 2048) | 622 MB | 0.6 MB | 1,037:1 | +| Code predictor (5 layers, all roles) | 197 MB | 0.7 MB | 281:1 | +| **Whole model** | **3.86 GB** | **11.2 MB** | **343:1** | + +Shared palette: 480 tensors → 26 palette groups (5.4 MB metadata vs 57 MB if unshared). **Fits on a Pi 4 in 75 MB RAM** (full Qwen3-TTS-1.7B inference). + +### 1.5 `lance-graph/crates/jc/` — Jirak-Cartan + +**12-pillar proof-in-code** for binary-Hamming causal field computation. The Cargo.toml description still says "five-pillar" (stale from the initial design), but `jc::run_all_pillars()` actually runs **12 pillars**: 1, 3, 4, 5, 5b, 7, 8, 9, 9b, 10, 11 (with 2 deferred pending coupled-revival-track activation, and 4 activated 2026-05-07 once `EULER_GAMMA` + `GOLDEN_RATIO` stabilized in Rust 1.94 `std::f64::consts`). + +Standalone, zero-external-deps in default build (`cargo build`). 
The optional `hambly-lyons` feature pulls in the `sigker` workspace sibling and **activates Pillar 11**; under default features Pillar 11 reports `DEFERRED` instead of running. + +**The pillars relevant to PR-X12:** + +| Pillar | Theorem | Certifies | +|---|---|---| +| 1 (E-SUBSTRATE-1) | bundle associativity @ d=10000 | VSA Chapman-Kolmogorov / Markov semigroup | +| 5 (Jirak) | Berry-Esseen under weak dependence @ d=16384 | noise floor for ICC / Spearman ρ claims | +| 5b (Pearl 2³) | three-plane vs bundled mask accuracy @ d=16384 | task-level downstream of pillar 5 | +| 9 (EWA-Sandwich) | Σ-push-forward along multi-hop edge paths | covariance propagation in graph traversal | +| 9b (EWA-Sandwich 3D) | Σ-push-forward on 3×3 SPD covariances | **certifies `ndarray::hpc::splat3d`** | +| **10 (Pflug-Pichler)** | nested-distance Lipschitz on Sigma DN-trees | **certifies CAM-PQ tree quantization preserves FreeEnergy within Lε** | +| **11 (Hambly-Lyons)** | signature uniqueness on tree-quotient | **certifies sigker's Index-regime classification** | + +**Pillar 10 is the formal certification of CAM-PQ / bgz quantization correctness** — not Pillar 11. Pillar 11 certifies `sigker` specifically. + +**Pillar 11 probe** (when active): uses `sigker::signature_truncated` (tensor-algebra path) — *not* `signature_kernel_pde`, which has a known math bug (PR #350: the Goursat-PDE form diverges from the true signature kernel `I₀(2·√⟨u, v⟩)` at moderate inner products). The probe runs `N=100` random pairs in d=3 at depth-2, asserting: +- Forward (out-and-back `[p₀, p₁, p₀]`): `‖S − S_identity‖ < 1e-9` +- Converse (triangle `[p₀, p₁, p₂, p₀]`): `‖S − S_identity‖ > 0.05` +- Discrimination ratio ≥ 1e6 + +The full examples directory has 10 runnable proofs (not 9): `prove_it`, `sigma_probe`, `probe_p1`, `osint_edge_traversal`, `splat_to_ewa_bridge`, `splat_triangle_count`, `splat_lpa_label_propagation`, `splat_louvain_modularity`, `splat_jaccard_adamic_adar`, `splat_perturbationslernen`. 
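The forward and converse checks in the Pillar 11 probe can be reproduced in miniature with a depth-2 truncated signature built from Chen's identity. This is a self-contained sketch under stated assumptions: it uses fixed points rather than the probe's `N=100` random pairs, and `Sig2`, `segment`, `concat`, and `path_signature` are hypothetical names, not `sigker`'s API.

```rust
const D: usize = 3;

/// Depth-2 truncated signature (level 0 is always 1 and is left implicit).
#[derive(Clone, Copy, Debug)]
pub struct Sig2 {
    pub s1: [f64; D],
    pub s2: [[f64; D]; D],
}

impl Sig2 {
    /// Signature of the constant path: levels 1 and 2 are zero.
    pub fn identity() -> Self {
        Sig2 { s1: [0.0; D], s2: [[0.0; D]; D] }
    }

    /// Signature of a straight segment with increment dx: (1, dx, dx (x) dx / 2).
    pub fn segment(dx: [f64; D]) -> Self {
        let mut s2 = [[0.0; D]; D];
        for i in 0..D {
            for j in 0..D {
                s2[i][j] = 0.5 * dx[i] * dx[j];
            }
        }
        Sig2 { s1: dx, s2 }
    }

    /// Chen's identity truncated at depth 2: signature of `self` followed by `other`.
    pub fn concat(&self, other: &Sig2) -> Sig2 {
        let mut out = Sig2::identity();
        for i in 0..D {
            out.s1[i] = self.s1[i] + other.s1[i];
            for j in 0..D {
                out.s2[i][j] = self.s2[i][j] + other.s2[i][j] + self.s1[i] * other.s1[j];
            }
        }
        out
    }

    /// Euclidean norm of (S - S_identity) over levels 1 and 2.
    pub fn dist_from_identity(&self) -> f64 {
        let l1: f64 = self.s1.iter().map(|v| v * v).sum();
        let l2: f64 = self.s2.iter().flatten().map(|v| v * v).sum();
        (l1 + l2).sqrt()
    }
}

/// Signature of the piecewise-linear path through `points`.
pub fn path_signature(points: &[[f64; D]]) -> Sig2 {
    points.windows(2).fold(Sig2::identity(), |acc, w| {
        let dx = [w[1][0] - w[0][0], w[1][1] - w[0][1], w[1][2] - w[0][2]];
        acc.concat(&Sig2::segment(dx))
    })
}
```

An out-and-back path `[p₀, p₁, p₀]` is tree-like and collapses to the identity signature, while a closed triangle keeps a nonzero antisymmetric level-2 term (its signed area), which is the separation the probe's thresholds test for.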
+ +--- + +## 2. The PR-X12 ↔ bgz mapping, concretely + +### 2.1 Skip / Merge / Delta / Escape ≡ Scent / Palette / ZeckBF17 / Full + +This is the load-bearing identification. PR-X12's 4-mode taxonomy is the same 4-layer cascade bgz17 ships: + +| PR-X12 mode | bgz17 layer | Bytes | Pearson ρ | Role | +|---|---|---|---|---| +| **Skip** | Scent (Layer 0) | 1 B | 0.937 | Heuristic pre-filter; together with Layer 1, ends 95%+ of searches | +| **Merge** | Palette (Layer 1) | 3 B | 0.965 | Basin centroid lookup; metric-safe for CAKES | +| **Delta** | ZeckBF17 (Layer 2) | 102 B | 0.992 | i16[17] residual after basin; metric-safe | +| **Escape** | Full (Layer 3) | 6 KB | 1.000 | Lossless plane; rarely needed | + +**What this means for the PR-X12 codec:** + +- The four-mode wire format (2-bit `header_kind` per CTU) maps 1:1 onto bgz17's layer selection +- bgz17's metric-safety guarantees (CAKES triangle inequality) are *the formal proof* of PR-X12's M:H-3 "bit-exact attention with tunable accuracy floor" +- The 95% termination rate at Layer 0-1 is the empirical realization of PR-X12's Skip-dominant inner-loop claim from `pr-x12-anti-neural-lookup-inversion.md` §3.1 + +**PR-X12's contribution above bgz17:** wire format for **streaming** sources (video frames, 3DGS, audio) where the source has to be encoded into a byte stream, not just searched. bgz17 is search-oriented (CAKES nearest-neighbour); PR-X12 is stream-oriented (rANS-coded byte sequence). Both use the same 4-mode grammar. + +### 2.2 4096-entry basin codebook ≡ bgz-tensor `Codebook4096` + +PR-X12's claim (M:E-D, R-13): 4096-entry basin codebook per encoder, swappable, federated. + +**The literal 4096 lives in `bgz-tensor::codebook4096::Codebook4096`** — `bgz-tensor/src/lib.rs` exports `Codebook4096` and `CodebookIndex` as first-class types. This IS the 4096-entry codebook PR-X12 cites. Not derived; named. 
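For contrast with a flat 4096-entry table, the Slot D layout from §1.4 can be written as pack and unpack helpers, which makes its address-space arithmetic easy to check. The field widths and bit positions follow the layout given in this document; the `SlotD` struct and function names are hypothetical and are not taken from `hhtl_d.rs`.

```rust
/// Hypothetical view of the 16-bit Slot D word; field widths follow the
/// documented layout (2-bit basin, 4-bit HIP, 8-bit TWIG, 1-bit polarity).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SlotD {
    pub basin: u8,      // bits 15..14: QK=0, V=1, Gate=2, FFN=3
    pub hip: u8,        // bits 13..10: family within the basin
    pub twig: u8,       // bits 9..2: centroid index in the 256-entry palette
    pub polarity: bool, // bit 1: sign of the dominant residual dimension
}

pub fn pack(s: SlotD) -> u16 {
    assert!(s.basin < 4 && s.hip < 16, "field out of range");
    ((s.basin as u16) << 14)
        | ((s.hip as u16) << 10)
        | ((s.twig as u16) << 2)
        | ((s.polarity as u16) << 1) // bit 0 stays reserved (zero)
}

pub fn unpack(w: u16) -> SlotD {
    SlotD {
        basin: (w >> 14) as u8,
        hip: ((w >> 10) & 0xF) as u8,
        twig: ((w >> 2) & 0xFF) as u8,
        polarity: (w >> 1) & 1 == 1,
    }
}

/// Number of (basin, hip, twig) combinations the layout can address.
pub const ADDRESSABLE_CELLS: u32 = 4 * 16 * 256;
```

The 14 address bits give 16,384 combinations, four times the 4096 entries of `Codebook4096`, so the two structures should not be treated as the same table.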
+ +**bgz-hhtl-d encodes a DIFFERENT structure** — clarification of an earlier misreading: + +```text +Slot D bit layout (u16): + bits 15..14 = HEEL basin (2 bits, 4 states: QK/V/Gate/FFN) + bits 13..10 = HIP family (4 bits, 16 families per basin) + bits 9..2 = TWIG centroid (8 bits, 256 centroids in shared palette) + bit 1 = BRANCH polarity (sign of dominant residual dim) + bit 0 = reserved + +→ 4 × 16 × 256 = 16,384 addressable cells per role-group +``` + +But these aren't 16,384 distinct centroids — TWIG is a flat 0..255 index into a **256-entry palette shared across all rows of the role group**, and HIP families are built **post-hoc** from the palette via `build_hip_families` (4-level recursive farthest-pair binary split → 16 families). The 26 palette groups Qwen3-TTS-1.7B ships with give 26 × 256 = **6,656 distinct centroids total across the whole model**. + +So **two different codebook structures live in bgz-tensor**, and only one of them has 4096 entries: +- `Codebook4096` — literally 4096 entries, the direct correspondence to R-13 +- bgz-hhtl-d's 4 × 16 = 64 (basin × HIP) per role × 256 (TWIG) — produces a 16,384-cell address space, *not* 4096 + +PR-X12 R-13 should reference `Codebook4096` directly; bgz-hhtl-d is a *different* basin-codebook strategy at a different working set size. Both live in the same crate. + +### 2.3 `CurveOrder` trait ≡ highheelbgz spiral addressing + +PR-X12 M:E-B and M:H-NEW-2 posit a `CurveOrder` trait that abstracts Morton (Z-order) / Hilbert curves for the cell traversal. + +**highheelbgz IS one concrete impl of this trait,** using golden-spiral folding instead of Morton/Hilbert. The `(start, stride, length)` 3-tuple is the spiral curve's parametric description — the codec asks "give me cells in curve order N for this region," highheelbgz answers via `SpiralAddress` + `SpiralWalk`. + +The **streaming-decode-during-GEMM** pattern from the GGUF lens (`pr-x12-gguf-llm-weights-encoding.md` §7) is highheelbgz's "values recomputed on-demand from source data." Already exists. 
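The relationship described in §2.3 can be sketched as a small trait with a Morton implementation, plus a `(start, stride, length)` walk that expands a 3-integer address into cell indices. This is a sketch under loud assumptions: the real `CurveOrder` signature is not shown in this document, and `StridedWalk` is a plain modular stride used as a stand-in, since the details of highheelbgz's golden-spiral folding are not given here.

```rust
/// Hypothetical trait shape; PR-X12's actual `CurveOrder` may differ.
pub trait CurveOrder {
    /// Map a 2D cell coordinate to its position along the curve.
    fn index(&self, x: u16, y: u16) -> u32;
}

/// Morton (Z-order) curve: interleave the bits of x and y.
pub struct Morton;

impl CurveOrder for Morton {
    fn index(&self, x: u16, y: u16) -> u32 {
        spread(x) | (spread(y) << 1)
    }
}

/// Insert a zero bit between each of the 16 input bits.
fn spread(v: u16) -> u32 {
    let mut n = v as u32;
    n = (n | (n << 8)) & 0x00FF_00FF;
    n = (n | (n << 4)) & 0x0F0F_0F0F;
    n = (n | (n << 2)) & 0x3333_3333;
    n = (n | (n << 1)) & 0x5555_5555;
    n
}

/// Stand-in for a 3-integer address: a plain strided walk, NOT the
/// golden-spiral folding that highheelbgz uses.
pub struct StridedWalk {
    pub start: u32,
    pub stride: u32,
    pub length: u32,
}

impl StridedWalk {
    /// Expand the address into `length` cell indices, wrapping at `modulus`.
    pub fn indices(&self, modulus: u32) -> Vec<u32> {
        assert!(modulus > 0, "modulus must be nonzero");
        (0..self.length as u64)
            .map(|i| ((self.start as u64 + i * self.stride as u64) % modulus as u64) as u32)
            .collect()
    }
}
```

The point of the sketch is the storage shape: three integers stand in for an arbitrarily long index sequence, and the sequence is regenerated on demand instead of being stored.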
+ +### 2.4 `LinearReduce + Basis` ≡ AttentionSemiring + ComposeTable + DistanceTable + +PR-X12 R-1 / §M:E-A: `LinearReduce` decomposes into `Basis` (data) + `Reducer` (operation). + +**bgz-tensor's actual implementation:** + +- `Basis` ≡ `DistanceTable` (precomputed u16 lookup) + `ComposeTable` (precomputed u8 lookup) — the "basis-as-data" view +- `Reducer` ≡ `AttentionSemiring` — the reduction operation, specialized for attention (max-plus or sum-of-products depending on softmax/linear-attn variant) +- The trait split exists in working code + +**640× compression at zero attention math change** is the empirical claim from §1.3. That's a stronger bound than PR-X12's anti-neural lens projected (~50× via 4096-entry lookup vs Linear(12, K)). bgz-tensor's HhtlCascade adds the cascading basin structure, which is what enables 640× rather than the naive single-table 50×. + +### 2.5 Tropical-GEMM (R-7) ≡ scalar_sparse.rs's min-plus SpMV + +PR-X12 R-7: tropical-GEMM lives in `lance-graph::blasgraph`, called from codec. + +**Actual location (per the KNOWLEDGE.md module map):** `lance-graph/crates/bgz17/src/scalar_sparse.rs:149` — "scalar CSR with standard + min-plus (tropical) semiring SpMV." + +**Plus `tripartite.rs:171`** — "cross-plane S×P×O reasoning via scalar sparse matrices." + +R-7's "call into lance-graph::blasgraph" should be re-targeted to `lance-graph::bgz17::scalar_sparse::tropical_spmv` — the kernel exists there, not in blasgraph proper. This is a **canonical-path correction** worth updating in the resolutions delta doc. + +### 2.6 Federated codebook (R-13) ≡ shared palette strategy in bgz-hhtl-d + +PR-X12 R-13: basin codebook is swappable, federated, per-domain pretrained. 
+ +**Actual implementation in bgz-hhtl-d:** + +```text +Qwen3-TTS-1.7B: 480 tensors → 26 palette groups + +Group Tensors Rows each Shared palette +talker/gate [6144,2048] 28 6,144 1 × 206 KB +talker/up [6144,2048] 28 6,144 1 × 206 KB +talker/down [2048,6144] 28 2,048 1 × 206 KB +talker/qko [2048,2048] 56 2,048 1 × 206 KB +talker/v [1024,2048] 28 1,024 1 × 206 KB +talker/embed [151936,2048] 1 151,936 1 × 206 KB +cp/embed [2048,2048] 15 2,048 1 × 206 KB +cp/lm_head [2048,1024] 15 2,048 1 × 206 KB +... (18 more groups) +``` + +**R-13's `SharedClusterWide` and `PretrainedStatic` modes are this strategy, generalised to deployment time.** bgz-hhtl-d already implements `PretrainedStatic` (the 26 groups are pretrained); R-13's `SharedClusterWide` is the *streaming* version where the 26 groups update at runtime via gossip. + +### 2.7 Formal correctness ≡ jc's Hambly-Lyons signature uniqueness + +PR-X12 has no formal-proof commitment yet — Plan G (R-4) is empirical bench-gating; R-11 is latency assertion. Neither proves correctness. + +**jc's Pillar 11 (Hambly-Lyons signature uniqueness) IS the formal proof** that any bgz-encoded source maps uniquely to its bitstream up to noise floor. Specifically: + +- For two source signals X, Y with bgz-encodings B(X), B(Y) +- If B(X) = B(Y), then X = Y up to the quantization noise floor of the encoding layer +- Hambly-Lyons signatures give the *signature kernel* under which this uniqueness holds +- The proof is machine-checkable in jc's `examples/` directory (9 runnable proofs per the Explore agent's read) + +**Implication for PR-X12:** R-1's `LinearReduce` ordered-reducer determinism guarantee (the "same input → same bits on every arch" claim from `pr-x12-woa-multiarch-orchestration.md` §6) **already has a formal proof in jc.** PR-X12 just needs to cite it — not reprove it. This is a strong story for the multi-arch consumer claim (R-11). + +--- + +## 3. 
Updating the GGUF perspective doc with bgz-hhtl-d's actual numbers + +The GGUF lens doc (`pr-x12-gguf-llm-weights-encoding.md`) estimated: + +- Qwen 7B → ~3.1 GB at PR-X12 (29% smaller than GGUF Q4_K_M ~4.4 GB) + +**bgz-hhtl-d's actual measurement** on Qwen3-TTS-1.7B (1.93B params): + +- 3.86 GB → 11.2 MB = **343:1 compression** +- Scaled to a 7B model: ~40 MB + +**That is 110× smaller than GGUF Q4_K_M, not 29% smaller.** + +The discrepancy comes from three things the GGUF doc didn't account for: + +1. **HHTL cascade structure** — bgz-hhtl-d uses *both* row-level palette (256 centroids) *and* hip/heel hierarchical addressing. The lens doc treated the codebook as flat 4096-entry. Hierarchical addressing turns out to add another order of magnitude. + +2. **BF16 residual is 16 bits, not 4-8 bits** — counterintuitively this LOSES compression per-row but the row count drops dramatically because palette hit-rate is high. The doc was using a uniform "Delta = 2.5-3.5 bits each" estimate, which is wrong for the HHTL structure. + +3. **Shared palette across all 480 tensors** — the GGUF doc allowed "per-layer-family" (~13 MB codebook); bgz-hhtl-d ships 5.4 MB total for all 480 tensors via tighter sharing. + +**Updated estimate for the GGUF lens doc:** the 29% number is conservative by orders of magnitude. The actual ceiling appears to be **2-orders-of-magnitude smaller than Q4_K_M** at PSNR/perplexity comparable to f16 baseline. 
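The headline ratios above are simple arithmetic and can be re-derived from the quoted sizes. The helpers below are a minimal sketch with hypothetical names (`compression_ratio`, `scale_linear`); the linear scaling by parameter count is the same simplifying assumption the "~40 MB" estimate makes, not a measured result.

```rust
/// Ratio of original to encoded size; both arguments use the same unit.
pub fn compression_ratio(original: f64, encoded: f64) -> f64 {
    original / encoded
}

/// Naive linear extrapolation of an encoded size by parameter count.
/// Assumes encoded bytes grow proportionally with parameters, which ignores
/// the fixed palette overhead and any change in palette hit-rate.
pub fn scale_linear(encoded_mb: f64, params_b: f64, target_params_b: f64) -> f64 {
    encoded_mb * target_params_b / params_b
}

/// How many times smaller `ours` is than `baseline` (same unit).
pub fn times_smaller(baseline: f64, ours: f64) -> f64 {
    baseline / ours
}
```

With decimal units, `compression_ratio(3860.0, 11.2)` comes out at about 345, in line with the quoted 343:1 up to rounding of the inputs. `scale_linear(11.2, 1.93, 7.0)` gives about 40.6 MB, and `times_smaller(4400.0, 40.6)` is about 108, which is the "roughly 110×" figure.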
+ +**Falsifier check from the GGUF doc still applies:** + +- F-1 (activation-aware RDO must beat GPTQ/AWQ): bgz-hhtl-d ships *without* activation-aware RDO and still hits 343:1 — so the AWQ-style λ-weighting is upside on top, not table stakes +- F-2 (streaming decode must be ≤1.05× pre-dequant): the HHTL cascade resolves 95% of attention pairs via table lookup at Layer 0-1 — much *better* than 1.05×, it's a *speedup* at inference +- F-5 (llama.cpp ecosystem fork): bgz-hhtl-d is in lance-graph today, not llama.cpp; the ecosystem-adoption falsifier still applies + +**Recommended edit to the GGUF lens doc:** add a footnote pointing to `bgz-hhtl-d` as the existing implementation, and update §6's table with bgz-hhtl-d's empirical numbers as the *upper bound* on what PR-X12 + GGUF transcode could achieve. + +--- + +## 4. What PR-X12 ADDS that the bgz family doesn't + +If bgz-hhtl-d already ships at 343:1 for LLM weights, what does PR-X12 *add*? + +### 4.1 Streaming wire format for video / 3DGS / audio + +bgz family is **search-oriented** — CAKES nearest-neighbour, palette lookup, distance-matrix queries. PR-X12 is **stream-oriented** — rANS-coded byte sequence, 16-bit per-CTU header, frame-aligned framing. + +The two have isomorphic algebra (same 4 modes, same 4096-entry codebook) but different I/O patterns: + +- **Search:** random-access read, fixed-cost lookup, latency dominated by L2/L3 cache miss +- **Stream:** sequential read, variable-cost decode, latency dominated by rANS state machine + +A video stream cannot be a CAKES search — frames arrive in order, each one references the previous one, and the encoder has to commit to a partition before seeing future frames. PR-X12 is the **stream codec** that uses the bgz algebra. + +### 4.2 Per-arch dispatch contract (R-4, R-5, R-11) + +bgz family uses CLAM/CAKES for nearest-neighbour — these are arch-agnostic at the cost of not using AMX/VNNI/SVE2 to their potential. 
The 95% HEEL-stage termination is a *codec-level* optimization, not a SIMD-level one. + +PR-X12's R-4 / R-5 / R-11 commitments add the **per-arch dispatch matrix** on top of the bgz algebra: + +- DCT-II via AMX BF16 tile (the 64× crossover from R-5) +- ME via VNNI int8 dot product (R-6, 50× over SAD) +- Tropical-GEMM via SVE2 / NEON for ARM-class fleet +- Latency assertion per stage, calibrated in Plan G's codec-bench + +This is the work that turns bgz's 343:1 *storage* win into a *throughput* win on AMX/VNNI hardware. The two compose — bgz cuts the bytes, PR-X12 keeps the GEMM hot. + +### 4.3 Cross-domain unification (one wire format for video + 3DGS + LLM weights + ...) + +bgz17 encodes SPO planes. bgz-tensor encodes transformer weights. bgz-hhtl-d is one specific tensor variant. Each is a separate API surface. + +PR-X12 ships **one wire format** (`ndarray-codec`'s 16-bit-header + CTU layout) that all consumers use. The lens docs argue this is right because the algebra is the same; the implementation gap is that bgz family doesn't currently have a unified entry-point. PR-X12's codec body would call into bgz17 / bgz-tensor as the *backend* for the basin codebook + tropical-GEMM, but expose a unified `Codec::encode(stream) → bytes` surface. + +**This is exactly R-3's LoC envelope claim:** ~1500 LoC of generic codec body, calling into ~15 KLoC of substrate (the bgz family is substantial, but already exists). The ratio holds. + +### 4.4 The 5 perspective lens docs as the architectural story + +The bgz family ships *code* but doesn't ship the *story* of why the architecture is right. PR-X12's lens docs (GEMM, 3DGS, multi-arch, anti-neural, GGUF) provide the cross-domain claims that make the architecture defensible. + +This is the doc-level value of PR-X12: bgz code + PR-X12 docs = a complete architectural pitch that bgz alone doesn't make. + +--- + +## 5. 
Gaps — what doesn't exist yet + +### 5.1 `jd-nd` — the missing ndarray-side proof crate + +The Explore search confirmed: `jd-nd` does not exist in `/home/user/ndarray/`. The math-proof infrastructure on the ndarray side lives ad-hoc inside `src/hpc/` modules (`deepnsm.rs`, `jina/runtime.rs`) as TODO comments. + +**Recommendation:** create `ndarray/crates/jd-nd/` (or as a sibling Rust workspace member) as the ndarray-side analog of jc. Scope: + +- Formal proofs of SIMD kernel correctness (the unsafe blocks in `src/simd_*.rs`) +- Bit-exact cross-arch determinism proofs (for the `OrderedKahanReducer` claim from R-1) +- BLAS-level kernel correctness (gemm, dot, axpy under given precision bounds) +- Pillar parallel to jc's Hambly-Lyons signature uniqueness, but for the basis-trait operations rather than graph-traversal operations + +**Suggested structure** (~500 LoC, no external deps initially): + +``` +ndarray/crates/jd-nd/ +├── Cargo.toml +├── src/ +│ ├── lib.rs # exports +│ ├── basis_proofs.rs # Basis::apply correctness +│ ├── reducer_proofs.rs # OrderedKahanReducer determinism +│ ├── simd_audit.rs # consumes sentinel-qa verdicts as proof obligations +│ └── ratchet.rs # per-PR proof requirements +└── examples/ + ├── prove_dct_basis.rs + ├── prove_kahan_determinism.rs + └── prove_vpdpbusd_path.rs +``` + +**Cost:** 2-3 weeks for skeleton + one pillar; ongoing accumulation as the codec adds primitives. + +**Why now:** R-11's latency CI needs a *correctness* twin. Latency that's fast but wrong is the worst outcome. jd-nd is the structural place for those proofs. + +### 5.2 Cronbach / ICC research crate + +`lance-graph/crates/lance-graph-codec-research/` exists per the Explore agent's report, **but its scope is FFT (rustfft) variants**, not Cronbach's α / ICC / encoding-reliability psychometrics. 
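As a reference point for the reliability statistic named above, Cronbach's α for k items is α = k/(k−1) · (1 − Σ item variances / variance of the per-subject totals). The sketch below is a standalone illustration with hypothetical names (`cronbach_alpha`, `sample_variance`); it does not reflect any existing module. Treating cascade layers as "items" and encoded inputs as "subjects" is an assumption about how the crate would frame the analysis.

```rust
/// Unbiased sample variance (divides by n - 1). Caller ensures n >= 2.
fn sample_variance(xs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0)
}

/// Cronbach's alpha. `items[i][j]` is the score of item i on subject j.
/// Returns None when alpha is undefined: fewer than 2 items or subjects,
/// ragged input, or zero variance in the per-subject totals.
pub fn cronbach_alpha(items: &[Vec<f64>]) -> Option<f64> {
    let k = items.len();
    if k < 2 {
        return None;
    }
    let n = items[0].len();
    if n < 2 || items.iter().any(|it| it.len() != n) {
        return None;
    }
    // Per-subject total score across all items.
    let totals: Vec<f64> = (0..n)
        .map(|j| items.iter().map(|it| it[j]).sum::<f64>())
        .collect();
    let total_var = sample_variance(&totals);
    if total_var == 0.0 {
        return None;
    }
    let item_var_sum: f64 = items.iter().map(|it| sample_variance(it)).sum();
    let k = k as f64;
    Some(k / (k - 1.0) * (1.0 - item_var_sum / total_var))
}
```

Two perfectly agreeing items give α = 1; items that cancel each other out leave the totals with zero variance, in which case the function reports `None` instead of dividing by zero.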
+ +The Cronbach / ICC references in the ndarray codebase are **commented TODOs** in: + +- `src/hpc/deepnsm.rs:21-35` — notes on 128-projection (2³ SPO × 2⁴ HHTL) measurement reliability +- `src/hpc/jina/runtime.rs` — references reporting "Pearson / Spearman / Cronbach α to 4 decimal places" +- `bf16_test_src/main.rs` — example output sketch + +**Recommendation:** either expand `lance-graph-codec-research` to include Cronbach/ICC modules, *or* create `ndarray/crates/encoding-reliability/` (or similar). Scope: + +- Cronbach's α for the bgz17 4-layer cascade (does each layer measure the same underlying construct?) +- ICC (intra-class correlation) across arches (does SPR's encoding agree with Apple Silicon's encoding on the same input?) +- Item difficulty / discrimination for basin codebook entries (are some centroids never used? always used? does the codebook drift?) +- Factor analysis on the 4096 basin entries (do they form a low-rank structure that could be compressed further?) +- Measurement invariance across model families (does the same codebook work for Llama-3 and Qwen-3.5? bgz-hhtl-d's shared-palette claim implies yes, but it's not psychometrically proven) + +**Why this matters for PR-X12:** the R-10 sub-1-bit commitment is statistical (Shannon-limit-bounded). Cronbach α / ICC are the *psychometric* analogs that quantify whether the basin codebook is internally consistent and reproducible across measurement conditions (arches, model variants, calibration corpora). Without this, R-13's "federated codebook" claim has empirical support but lacks the statistical reliability framework. + +**Cost:** 1-2 weeks for skeleton (statistics implementations exist in `ndarray::hpc::statistics`); 2-3 weeks for the proof-of-concept analyses against bgz-hhtl-d's existing 26 palette groups. + +--- + +## 6. 
Bench plan integration: bgz-hhtl-d's 0.9980 Pearson gate
+
+Per BGZ_HHTL_D.md, bgz-hhtl-d ships with a **certification gate of ≥0.9980 Pearson correlation** between original and reconstructed weight matrices.
+
+**This becomes one of Plan G's bench lanes** (extending R-4's framework):
+
+| Lane | Source | Pass criterion |
+|---|---|---|
+| Video | Big Buck Bunny 1080p | ≥0.95× x265 ultrafast PSNR @ -0.1 dB (R-4) |
+| 3DGS | Mip-NeRF 360 garden scene | ≥30× over PLY-trim (R-10) |
+| Gradient | ResNet-50 ImageNet SGD logs | Match QSGD compression (HG4) |
+| LLM weights | Qwen 3.5 7B (or 1.7B-TTS) | ≥0.9980 Pearson + perplexity Δ ≤ 1.0% on Wikitext-103 |
+
+The Qwen3-TTS-1.7B case is the right size for CI: an encode+decode round-trip takes ~5 minutes on SPR. The 7B case is the headline number but slower to bench.
+
+**Plan G integration cost:** ~3 days to wire bgz-hhtl-d's existing harness into Plan G's lane structure. The certification scaffolding already exists.
+
+---
+
+## 7. The unification claim, restated
+
+Restated with the new evidence:
+
+**bgz17 / highheelbgz / bgz-tensor / bgz-hhtl-d / jc are the existing implementation of the PR-X12 substrate**, with these named correspondences:
+
+| PR-X12 abstract concept | bgz family concrete implementation |
+|---|---|
+| Skip/Merge/Delta/Escape | Scent/Palette/ZeckBF17/Full (bgz17 4-layer) |
+| 4096-entry basin codebook | `Codebook4096` (bgz-tensor); bgz-hhtl-d's 4 × 16 × 256 Slot D lattice is a separate 16,384-cell structure (§2.2) |
+| `CurveOrder` | Spiral addressing in highheelbgz |
+| `LinearReduce + Basis` | AttentionSemiring + ComposeTable + DistanceTable (bgz-tensor) |
+| Tropical-GEMM (R-7) | `bgz17::scalar_sparse::tropical_spmv` |
+| Federated codebook (R-13) | Shared palette strategy in bgz-hhtl-d (26 groups for 480 tensors) |
+| Formal correctness | jc's Hambly-Lyons Pillar 11 |
+
+**PR-X12 is not the implementation. 
PR-X12 is the streaming wire format + per-arch dispatch contract + cross-domain architectural story that sits on top of the bgz substrate.** + +The codec body (R-3's ≤1500 LoC) is wiring; the heavy lifting (the bgz algebra) is already done. This is a much stronger story for PR-X12 scope than "we're going to build this from scratch." + +**The two gaps (jd-nd, Cronbach/ICC research crate) are the architecture-level investments that are missing**, and they pay back over the full consumer ecosystem (burn / candle / lance-graph / surrealdb / MedCare-rs), not just the codec. + +--- + +## 8. Updates this triggers for other PR-X12 docs + +This grounding doc invalidates / refines a few claims in the other PR-X12 docs. Recommended edits: + +### 8.1 In `pr-x12-canon-resolutions-delta.md` + +**R-7 path correction:** tropical-GEMM lives at `bgz17::scalar_sparse::tropical_spmv` (not blasgraph proper — blasgraph is the algebraic family name, but the kernel ships in bgz17). The dep direction `ndarray-codec → lance-graph::bgz17` is allowed under the same rationale. + +**R-13 expansion:** the four codebook policy modes (LocalEphemeral, SharedClusterWide, SharedRegional, PretrainedStatic) should reference bgz-hhtl-d's shared-palette strategy as the implementation pattern. Specifically `PretrainedStatic` is the mode bgz-hhtl-d uses by default. + +**New R-14 candidate:** formal-correctness contract via jc. Worth surfacing if a fifth-tier resolution slot opens. Could read: "the codec's wire-format determinism and bit-exact cross-arch reproduction are formally proven in `lance-graph/crates/jc/` (Pillar 11, Hambly-Lyons signature uniqueness). PR-X12 cites the proof; does not reprove." + +### 8.2 In `pr-x12-gguf-llm-weights-encoding.md` + +**§6 (concrete numbers) needs the bgz-hhtl-d footnote:** the 29% estimate is conservative by ~110×. Real upper bound is bgz-hhtl-d's measured 343:1 on Qwen3-TTS-1.7B. 
+ +**§7 (streaming decode) should reference highheelbgz:** the "values recomputed on-demand" pattern is already implemented as `SpiralAddress` rehydration. + +**§9 (bench plan) should swap Qwen 3.5 7B GGUF for Qwen3-TTS-1.7B** as the canonical case — that's where the bgz-hhtl-d certification scaffolding already lives. + +### 8.3 In `pr-x12-anti-neural-lookup-inversion.md` + +**§3.1 (basin codebook ≡ frozen 1-layer MLP) gains an empirical anchor:** the AttentionSemiring + ComposeTable in bgz-tensor IS the frozen 1-layer NN representation of the attention algorithm, with 640× compression. The lens doc's "speedup: 1000-5000×" is theoretical; bgz-tensor's measured speedup is 95% of attention pairs resolved by table lookup — exact figure in cycles needs measurement. + +### 8.4 In `pr-x12-substrate-merged-canon.md` + +**§M:E-D (the codec breaks ndarray ↔ lance-graph cycle):** the codec's actual dependency target is `lance-graph::bgz17`, not generic blasgraph. Update the citation. + +**§M:H-1 (one codec, four loads):** add the fifth load (LLM weights) AND note that bgz-tensor's 640× compression on transformer weights is the empirical realization of M:H-1 for that load. + +--- + +## 9. Suggested next steps (ordered) + +1. **Read the bgz17 + bgz-tensor + bgz-hhtl-d sources end-to-end** (1-2 hours). The Explore agent's summary is accurate; the source confirms it. Worth doing before drafting any further PR-X12 code. + +2. **Update `pr-x12-canon-resolutions-delta.md`** with R-7 path correction and R-13 expansion (small edits, ~30 min). + +3. **Open a tracking issue for `jd-nd` crate creation.** Scope: ~500 LoC initial skeleton + 3 pillars (basis correctness, reducer determinism, SIMD path audit). Cost: 2-3 weeks. + +4. **Scope decision on Cronbach/ICC research crate.** Options: (a) extend existing `lance-graph/crates/lance-graph-codec-research/`, (b) new `ndarray/crates/encoding-reliability/`, (c) defer until consumer pressure surfaces. 
Recommend (a) — extending the existing crate is less work and the dep direction is right. + +5. **In PR-X12 Plan G work**: wire bgz-hhtl-d's certification harness into the LLM-weights lane (the fourth lane added by the GGUF lens doc). Reuse, don't reinvent. + +6. **In PR-X12 codec body**: when the basin-codebook lookup lands, target `lance-graph::bgz17::Palette::nearest_index` as the underlying call, not a fresh k-means impl. This avoids duplicating the 4-layer cascade and makes the metric-safety guarantees automatic. + +7. **In PR-X12 documentation**: reference `lance-graph/crates/bgz17/KNOWLEDGE.md` as the canonical doc for the substrate algebra; PR-X12's docs are the stream-codec + per-arch-dispatch overlay. + +--- + +## 10. Cross-references + +- **Existing crates:** + - `lance-graph/crates/bgz17/KNOWLEDGE.md` — the canonical substrate doc + - `lance-graph/crates/bgz-tensor/BGZ_HHTL_D.md` — bgz-hhtl-d weight encoding spec + - `lance-graph/crates/bgz-tensor/Cargo.toml` — feature gates and dep list + - `lance-graph/crates/jc/examples/` — 9 runnable formal proofs (Pillars 1-9 + Hambly-Lyons) +- **PR-X12 docs to update (per §8):** + - `pr-x12-canon-resolutions-delta.md` (R-7 path, R-13 expansion, optional R-14) + - `pr-x12-gguf-llm-weights-encoding.md` (§6 numbers, §9 bench target) + - `pr-x12-anti-neural-lookup-inversion.md` (§3.1 empirical anchor) + - `pr-x12-substrate-merged-canon.md` (M:E-D, M:H-1) +- **Architectural overview:** `pr-x12-substrate-merged-canon.md` +- **Related rules:** `/home/user/ndarray/CLAUDE.md` (architecture rule: ndarray = hardware, lance-graph = thinking) +- **In flight:** PR #195 (A2 + A3-intra codec foundation) on `claude/continue-ndarray-x0Oaw` + +_Last edit: 2026-05-22._ diff --git a/.claude/knowledge/pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md b/.claude/knowledge/pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md new file mode 100644 index 00000000..3ba29be3 --- /dev/null +++ 
b/.claude/knowledge/pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md @@ -0,0 +1,396 @@ +# PR-X12 ↔ cam_pq + SigKer + dn_tree/merkle — Substrate Bindings & Identified Gaps + +> Date: 2026-05-22 +> Status: **substrate-binding doc** — extends `pr-x12-bgz-jc-substrate-synergies.md` with three more existing primitives the PR-X12 substrate depends on but hasn't yet named explicitly: `ndarray::hpc::cam_pq` (codebook trainer), `lance-graph/crates/sigker` (signature-kernel formal-proof bedrock), and `ndarray::hpc::{dn_tree, merkle_tree}` (online-update + integrity infrastructure). +> +> Premise: bgz17 and bgz-hhtl-d don't appear out of thin air. The k-means that produces their palettes lives in `cam_pq`. The formal uniqueness claim that justifies the codec's correctness lives in `sigker` (cited by jc Pillar 11). The federated-codebook gossip + integrity contract that R-13 commits to has substrate in `dn_tree` and `merkle_tree`. Three more pieces of the PR-X12 architecture are *already implemented*; this doc names them and surfaces the remaining gaps. + +--- + +## 0. Thesis in one paragraph + +`cam_pq` is the **codebook trainer** that produces the palettes consumed by `bgz17::palette` and `bgz-tensor::HhtlCascade` — a FAISS-style 6×8 Product Quantizer with 6-byte fingerprints (HEEL / BRANCH / TWIG_A / TWIG_B / LEAF / GAMMA semantic bytes), implementing k-means in three modes (geometric / semantic / hybrid). `SigKer` is the **formal-proof bedrock** for the codec's path-signature uniqueness claim — Chen-Lyons path signatures, Hambly-Lyons 2010 uniqueness theorem (Annals of Mathematics 171(1):109–167), Salvi 2020 PDE solver (arXiv:2006.14794), Cuchiero-Schmocker-Teichmann 2021 randomized-signature universality. 
**Note:** `sigker::signature_kernel_pde` ships a known math bug in the Goursat-PDE form (diverges from the true `I₀(2·√⟨u, v⟩)` at moderate inner products — see PR #350); production-ready path is `signature_truncated` (tensor-algebra) which is what jc Pillar 11 uses for its certification. `dn_tree` and `merkle_tree` are the **online-update and integrity substrate** for the federated-codebook policy (R-13) — quaternary plastic memory + 8-Kbit Blake3 proof tree, both already in `ndarray::hpc::` but not yet wired into the codec. The PR-X12 codec body is ~1500 LoC sitting on top of an ~25 KLoC substrate that already exists. + +--- + +## 1. `cam_pq` — the codebook trainer + ADC backend + +### 1.1 What it is + +**Location:** `src/hpc/cam_pq.rs` (this repo) + +**Algorithm:** Content-Addressable Memory (CAM) + Product Quantization (PQ). Unifies FAISS PQ6×8 (48-bit fingerprints, 6 subspaces × 256 centroids each) with CLAM 48-bit archetypes into a single codec. + +- **"CAM"** = the 6-byte *semantic* labeling: each byte is one of {HEEL, BRANCH, TWIG_A, TWIG_B, LEAF, GAMMA} — discrete labels rather than just opaque centroid IDs. +- **"PQ"** = the 6 *subspace* product quantization: input vector of dim d is split into 6 sub-vectors of d/6, each quantized to one of 256 centroids per subspace. + +**Result:** every input vector → 6-byte fingerprint (48 bits, half the 12-bit-basin × 4 of bgz-hhtl-d), with both *combinatorial* identity (which centroid in each subspace) and *semantic* identity (the CAM byte type per subspace). 
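The "PQ" half of §1.1 can be sketched in a few lines: split the vector into 6 sub-vectors, pick the nearest of up to 256 centroids in each subspace, and emit one byte per subspace. This is a generic product-quantization sketch, not the `cam_pq` implementation; `encode_fingerprint`, `Codebooks`, and the local `squared_l2` helper are names assumed for illustration, and the CAM semantic labels are not modelled.

```rust
pub const SUBSPACES: usize = 6;
pub const CENTROIDS: usize = 256;

/// codebooks[s] holds the centroids of subspace s, each of length dim / 6.
pub type Codebooks = [Vec<Vec<f32>>; SUBSPACES];

/// Squared Euclidean distance between two equal-length slices.
pub fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Encode `v` (length divisible by 6) into a 6-byte fingerprint:
/// byte s is the index of the nearest centroid in subspace s.
pub fn encode_fingerprint(v: &[f32], codebooks: &Codebooks) -> [u8; SUBSPACES] {
    let sub = v.len() / SUBSPACES;
    let mut fp = [0u8; SUBSPACES];
    for s in 0..SUBSPACES {
        let chunk = &v[s * sub..(s + 1) * sub];
        let mut best = (f32::INFINITY, 0usize);
        // At most 256 centroids are considered so the index fits in a byte.
        for (i, c) in codebooks[s].iter().enumerate().take(CENTROIDS) {
            let d = squared_l2(chunk, c);
            if d < best.0 {
                best = (d, i);
            }
        }
        fp[s] = best.1 as u8;
    }
    fp
}
```

Each byte is an independent nearest-centroid decision, so the 48-bit fingerprint addresses 256⁶ combinations while only 6 × 256 centroids are ever stored.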
+
+### 1.2 Public surface
+
+```rust
+// From src/hpc/cam_pq.rs (per Explore agent's read):
+pub struct CamCodebook { /* 6 × SubspaceCodebook */ }
+pub struct SubspaceCodebook { /* 256 centroids in d/6 dims */ }
+pub struct CamFingerprint(pub [u8; 6]);
+pub struct DistanceTables { /* 6 × 256 entries × 4 B = 6 KB, L1-resident */ }
+pub struct PackedDatabase { /* stroke-aligned 1B / 2B / 6B storage */ }
+
+pub fn kmeans(data: &[f32], k: usize, dim: usize, iterations: usize) -> Vec<...>;
+pub fn train_geometric(...) -> CamCodebook; // Lloyd's k-means per subspace, farthest-first init
+pub fn train_semantic(...) -> CamCodebook;  // geometric init + label-guided push/pull on centroids
+                                            // (jaccard similarity on label sets, NOT CLAM archetypes)
+pub fn train_hybrid(...) -> CamCodebook;    // train_semantic with default alpha=0.1
+pub fn squared_l2(a: &[f32], b: &[f32]) -> f32;
+```
+
+**ADC (Asymmetric Distance Computation):** 6 table lookups + sum (uniform across the FAISS-PQ tradition). Distance computation is L1-cache-resident: 6 × 256 × 2 B = 3 KB per query with u16 distances, or 6 × 256 × 4 B = 6 KB with 4-byte distances.
+
+**Early-exit:** `PackedDatabase` ships stroke-aligned storage with 1-byte / 2-byte / 6-byte CAM strides → **99% early-rejection** via partial-fingerprint comparison before full ADC. This is a non-trivial throughput optimization.
+
+### 1.3 Connection to bgz17 + bgz-tensor + bgz-hhtl-d
+
+**Direct imports** (per Explore agent's grep):
+
+- `bgz-tensor/src/adaptive_codec.rs` imports `cam_pq::train_geometric, kmeans, squared_l2`
+- `bgz-tensor/src/holographic_residual.rs` imports `cam_pq::kmeans`
+- `bgz-tensor/src/had_cascade.rs` imports `cam_pq::squared_l2`
+- `bgz17` palette codec uses cam_pq for calibration
+
+**So cam_pq IS the k-means engine that trains every basin codebook in the bgz family.** The 256-entry shared palettes that bgz-hhtl-d ships take their centroids from `cam_pq::train_geometric()`.
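The ADC step described in §1.2 reduces each database comparison to six table lookups and a sum. The sketch below shows the generic product-quantization pattern under stated assumptions: `AdcTables`, `build_tables`, and `adc_distance` are hypothetical names, f32 table entries are chosen for simplicity, and none of this is the `cam_pq::DistanceTables` layout.

```rust
/// Per-query tables: tables[s][c] is the squared distance from the query's
/// s-th sub-vector to centroid c of subspace s. 6 × 256 × 4 B = 6 KB as f32.
pub type AdcTables = [[f32; 256]; 6];

/// Build the tables once per query. Slots with no centroid stay at infinity.
pub fn build_tables(query: &[f32], codebooks: &[Vec<Vec<f32>>; 6]) -> AdcTables {
    let sub = query.len() / 6;
    let mut tables = [[f32::INFINITY; 256]; 6];
    for s in 0..6 {
        let q = &query[s * sub..(s + 1) * sub];
        for (c, centroid) in codebooks[s].iter().enumerate().take(256) {
            tables[s][c] = q
                .iter()
                .zip(centroid)
                .map(|(a, b)| (a - b) * (a - b))
                .sum::<f32>();
        }
    }
    tables
}

/// Approximate squared distance from the query to one encoded vector:
/// one lookup per fingerprint byte, then a sum.
pub fn adc_distance(tables: &AdcTables, fp: &[u8; 6]) -> f32 {
    fp.iter()
        .enumerate()
        .map(|(s, &c)| tables[s][c as usize])
        .sum()
}
```

The asymmetry is that the query stays in full precision while database vectors are only ever touched through their 6-byte fingerprints, so the per-candidate cost is independent of the original dimension.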
+ +### 1.4 Mapping cam_pq's CAM bytes onto bgz-hhtl-d's HHTL bits + +The 6-byte CAM fingerprint and bgz-hhtl-d's 4-byte slot encoding overlap structurally: + +| CAM byte | bgz-hhtl-d Slot D | Role | +|---|---|---| +| HEEL (byte 0) | `Ba` bits 15:14 (2 bits) | Tensor-family basin (QK / V / Gate / FFN) | +| BRANCH (byte 1) | `HIP` bits 13:10 (4 bits) | 16-way family discriminant within basin | +| TWIG_A (byte 2) | `TWIG` bits 9:2 (low byte) | 256-way centroid index, low | +| TWIG_B (byte 3) | `TWIG` bits 9:2 (high byte) | (same field, no high byte at 8b TWIG) | +| LEAF (byte 4) | `P` + `R` bits 1:0 (2 bits) | Polarity + reserved | +| GAMMA (byte 5) | Slot V (16 bits) | BF16 residual from centroid (full byte 5 + 1 of Slot V) | + +**Observation:** the bgz-hhtl-d format **compresses cam_pq's 6-byte CAM** down to 4 bytes by: +- Folding TWIG_A + TWIG_B into a single 8-bit TWIG (since 256 centroids fit in 8 bits, no need for 16 — the 6 × 256 × subspaces parametrization was for full PQ; HHTL uses a *single* shared 256-entry palette) +- Folding LEAF into 2 bits (polarity + reserved) +- Folding GAMMA into the 16-bit BF16 residual (Slot V) + +This is **cam_pq compressed via the HHTL prior:** since transformer weights cluster strongly per role (Q/K/V/O/Gate/Up/Down), the 6-subspace PQ generalization is over-parametrized — bgz-hhtl-d drops to a single shared palette per role and recovers the savings. + +### 1.5 PR-X12 mapping + +For the codec's `R-13` federated codebook handle: + +```rust +pub enum CodebookPolicy { + LocalEphemeral, // each encoder owns its codebook + SharedClusterWide { ttl: Duration }, // gossip protocol distributes + SharedRegional { region: Region }, // edge-tier sharing + PretrainedStatic { id: BlobId }, // immutable, served from CAS +} +``` + +The codebook implementation is `cam_pq::CamCodebook`. The four policy variants control *who owns* and *when refreshes happen*; the bytes-on-disk format is the cam_pq one. 
**PR-X12 doesn't need to define a new codebook layout; it inherits cam_pq's.**
+
+### 1.6 Three gaps in the cam_pq integration
+
+**G-1: Activation-aware training mode is unused.** `cam_pq::train_semantic()` exists with label-guided push/pull on centroids (Jaccard similarity on label sets, per §1.2, not CLAM archetypes). That makes it the closest existing hook for the GPTQ/AWQ-style activation-weighting from the GGUF lens doc (`pr-x12-gguf-llm-weights-encoding.md` §5). bgz-hhtl-d ships *only* `train_geometric()` (L2-error minimization). Wiring `train_semantic()` into bgz-hhtl-d's calibration is a low-cost, high-value change (~1-2 days).
+
+**G-2: `PackedDatabase`'s 99% early-exit not in the codec stream-decode path.** PackedDatabase is used by CAKES nearest-neighbour search to reject 99% of candidates before full ADC. The codec's stream-decode pass currently does full ADC per cell. Wiring the partial-fingerprint prefilter into the codec would speed decode by ~10-50× on Skip-dominant streams.
+
+**G-3: CAM semantic bytes don't propagate to the PR-X12 wire-format header.** The 16-bit codec header has `header_kind` (2b) + `basin_index` (12b) + `leaf_size` (2b). No field carries HEEL/BRANCH/etc. labels. For *interpretation* in the consumer crates (e.g., `woa-rs` orchestration knowing whether a cell is a Q-projection vs an FFN-gate), the semantic byte would be useful. **Proposal:** reserve a 4-byte "semantic header" extension in the framing layer (A8) that ships once per CTU, separate from the per-cell header.
+
+---
+
+## 2. `SigKer`: the formal-proof bedrock for stream uniqueness
+
+### 2.1 What it is
+
+**Location:** `crates/sigker/` in the external `adaworldapi/lance-graph` repo (not in this `ndarray` repo)
+
+**Algorithm:** Path-signature representations for sequential / path-structured data. Implements Chen-Lyons signatures S(X) = (1, ∫dX, ∫∫dX⊗dX, …) up to depth N, with shuffle-product algebra and proven uniqueness.
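For intuition, the first two signature levels of a piecewise-linear path can be accumulated segment by segment with Chen's identity: level 1 is the total increment, and level 2 grows by S₁ ⊗ Δ + ½ Δ ⊗ Δ per segment. The function below is a standalone sketch with a hypothetical name (`signature_depth2`); it is not sigker's API and makes no claim about how `signature_truncated` is implemented.

```rust
/// Depth-2 truncated signature of a piecewise-linear path.
/// `path[t]` is the sample at step t; all samples share one dimension d.
/// Returns (level 1: d-vector, level 2: d × d matrix).
pub fn signature_depth2(path: &[Vec<f64>]) -> (Vec<f64>, Vec<Vec<f64>>) {
    let d = path.first().map_or(0, |p| p.len());
    let mut s1 = vec![0.0; d];
    let mut s2 = vec![vec![0.0; d]; d];
    for w in path.windows(2) {
        // Increment of this linear segment.
        let delta: Vec<f64> = w[1].iter().zip(&w[0]).map(|(b, a)| b - a).collect();
        // Chen's identity: new S2 = S2 + S1 ⊗ Δ + ½ Δ ⊗ Δ, using S1 before update.
        for i in 0..d {
            for j in 0..d {
                s2[i][j] += s1[i] * delta[j] + 0.5 * delta[i] * delta[j];
            }
        }
        for i in 0..d {
            s1[i] += delta[i];
        }
    }
    (s1, s2)
}
```

A path that goes out along a segment and straight back is tree-like, and its depth-2 signature is all zeros, the same as the constant path. That is the equivalence the uniqueness statement is taken modulo. An L-shaped path, by contrast, leaves an antisymmetric part in level 2 (the Lévy area), which is what distinguishes it from its mirror image.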
+ +**Public surface:** + +```rust +pub struct Signature { /* truncated signature up to depth N */ } +pub struct RandomizedSignature { /* finite-dim projection */ } +pub struct RandomizedSignatureBuilder { /* construction */ } + +pub fn signature_kernel(a: &Signature, b: &Signature) -> f64; // truncated tensor-algebra L² +pub fn signature_kernel_pde(path_a: &[f32], path_b: &[f32]) -> f64; // full kernel via Goursat PDE +pub fn shuffle_product(a: &Signature, b: &Signature) -> Signature; + +pub struct CodecRouteSigker { /* lance-graph codec routing integration */ } +``` + +**Zero production dependencies.** Same posture as `bgz17` and `deepnsm` — no external crates pulled in default features. + +### 2.2 arXiv anchors + +| Paper | Year | What it provides | +|---|---|---| +| Chen, "Iterated integrals and exponential homomorphisms" | 1957 | Original signature construction | +| Lyons, "Differential equations driven by rough signals" | 1998 | Rough path theory, signature universal approximator | +| Hambly-Lyons, "Uniqueness for the signature of a path of bounded variation" (**arXiv:math/0507536**, Annals of Mathematics 171(1):109–167) | 2010 | **Theorem 4: signatures uniquely determine paths up to tree-like equivalence** | +| Salvi-Cass-Foster-Lyons-Lemercier | 2020 | **arXiv:2006.14794** — Goursat-PDE solver for signature kernel, O(T₁·T₂·d), no signature materialization | +| Cuchiero-Schmocker-Teichmann | 2021 | **Randomized signature universality**: any continuous path-functional ≈ linear combo of randomized-signature coordinates | + +### 2.3 jc's Pillar 11 — current status + +**Location:** `lance-graph/crates/jc/src/hambly_lyons.rs` + +**Feature gate:** `jc/Cargo.toml` includes `hambly-lyons = ["dep:sigker"]`. Default JC build is zero-dep (Pillar 11 reports `DEFERRED`); `cargo build --features hambly-lyons` pulls in `sigker` and **fully activates the probe**. 
+ +**Pillar 11 proves** (Hambly-Lyons 2010 Theorem 4): for paths X, Y of bounded variation in ℝ^d, S(X) = S(Y) ⟺ X and Y are equal modulo tree-like equivalence (the smallest equivalence relation identifying any sub-path with its concatenated reverse). + +**The probe** runs against `sigker::signature_truncated` (the tensor-algebra path), N=100 random pairs in d=3 at depth-2. **It deliberately avoids `signature_kernel_pde`** because that kernel ships a known math bug (PR #350: Goursat-PDE form diverges from the true signature kernel `I₀(2·√⟨u, v⟩)` at moderate inner products). The certification is independent of the PDE-form fix. + +**Status:** **ACTIVE under `--features hambly-lyons`** (activated 2026-05-07 once sigker landed in the workspace via PR #348). The "DEFERRED" reading is only the default-build fallback — under the feature gate, the probe executes and passes (forward < 1e-9, converse > 0.05, discrimination ratio ≥ 1e6). + +**What Pillar 11 actually certifies:** `sigker`'s **Index-regime classification** — that two paths with equal truncated signatures are equal up to tree-quotient. It does **NOT** directly certify `bgz` wire-format quantization. The bgz / CAM-PQ correctness proof is **Pillar 10 (Pflug-Pichler)**, which proves nested-distance Lipschitz on Sigma DN-trees — "CAM-PQ tree quantization preserves FreeEnergy within Lε." + +### 2.4 PR-X12 mapping + +#### 2.4.1 Path signatures ARE a `Basis` impl + +Recall R-1 / §M:E-A: `Basis` is "basis-as-data" with parametric `apply`. The truncated signature of a path IS exactly this — basis vectors are the tensor-algebra elements at each depth, apply is iterated integration. 
+ +```rust +impl Basis for SignatureBasis { + type Params = (); + fn apply>( + &self, + path: &[f32], // input path samples + signature: &mut [f32], // output truncated signature + _: &(), + r: R, + ) { + // iterated integral computation, depth-truncated + // Same trait shape as DctIIBasis, EwaSplatBasis + } +} +``` + +This is the **third Basis impl** (after DCT-II and EWA splat) and the first that targets *streams* rather than 2D arrays. The trait surface holds. + +#### 2.4.2 Goursat-style streaming kernel IS the streaming-decode pattern + +Per the Salvi 2020 paper (arXiv:2006.14794): the signature kernel can be computed via a Goursat PDE in O(T₁ · T₂ · d) time **without materializing the signature itself**. This is exactly the engineering pattern PR-X12's streaming-decode-during-GEMM uses (the GGUF lens §7) — compute the result without materializing the dequantized tensor. + +**Caveat:** the current `sigker::signature_kernel_pde` ships a known math bug (PR #350: the Goursat-PDE form diverges from the true kernel `I₀(2·√⟨u, v⟩)` at moderate inner products). The corrected form is queued; until then, production code should use `sigker::signature_truncated` (the tensor-algebra path) or `linear_path_kernel_closed_form` for the linear-path special case. The *engineering pattern* (1D sweep over a bitstream that accumulates results without materializing intermediates) is correct and re-usable by PR-X12 regardless of which kernel implementation lands. + +#### 2.4.3 Randomized signature universality = "4 modes cover any source" + +Cuchiero-Schmocker-Teichmann 2021 proved: any continuous functional of a path can be approximated arbitrarily well by linear combinations of randomized-signature coordinates. The randomized signature is a finite-dim projection of the (infinite-dim) full signature. + +**PR-X12's claim:** Skip/Merge/Delta/Escape with a 4096-entry basin codebook captures any source distribution to within a Shannon-bounded ε. 
This claim is *empirically* observed (95% Skip-rate at Layer 0-1 in bgz17, 343:1 compression on Qwen3-TTS-1.7B in bgz-hhtl-d) but lacks a *formal* foundation. + +**The randomized-signature universality theorem provides exactly that formal foundation.** The four modes are a discrete approximation of the randomized-signature coordinates; the 4096-entry codebook is a finite quantization of the universal-approximator space. + +This is the **R-14 candidate** flagged in `pr-x12-bgz-jc-substrate-synergies.md` §8.1 — a formal-correctness contract via sigker + jc Pillar 11. + +### 2.5 Two gaps in the sigker integration + +**G-4: PR #350 (signature_kernel_pde correction) + Pillar 11 production benchmarking.** Pillar 11 itself is *active* under the feature gate and passes its probe (forward < 1e-9, converse > 0.05, discrimination ratio ≥ 1e6 over N=100 pairs in d=3). What's *deferred* is (a) the corrected Goursat-PDE form that fixes `signature_kernel_pde`'s divergence at moderate inner products, and (b) production-scale benchmarking at full carrier widths (the d=3, depth-2 probe is correctness-only, not performance). **Cost:** 1-2 weeks of bench + PR #350 landing, blocking R-14's formal-correctness commitment at production scale. + +**G-5: No SignatureBasis impl in `ndarray::hpc::`.** The trait shape exists (Basis in M:E-A / R-1) but no concrete signature impl. **Proposal:** add `SignatureBasis: Basis` as a third concrete impl alongside `DctIIBasis` and `EwaSplatBasis`. Implementation is mostly a wrapper around `sigker::signature_truncated` (the tensor-algebra path; `signature_kernel_pde` stays off the production path until PR #350 lands). **Cost:** ~1 week, modest LoC. + +This unlocks: **path-structured codec lanes** in Plan G (audio waveforms, time-series, gesture/handwriting streams) using the same trait surface as DCT-II for video. A fifth bench lane in Plan G — "stream signal" with sigker — would round out the codec's path-structured story. + +--- + +## 3. 
`dn_tree` and `merkle_tree` — online-update and integrity substrate + +### 3.1 dn_tree — quaternary plastic memory + +**Location:** `src/hpc/dn_tree.rs` (this repo) + +**Algorithm:** Quaternary hierarchical bitmap summary tree for plastic graph traversal. Adapted from "On Demand Memory Specialization for Distributed Graph Processing" (2013). Properties: + +- **Quaternary fanout:** 4 children per node — natural match for PR-X12's 4-mode taxonomy +- **Lossy hierarchical summaries** using bundled `GraphHV` hypervectors (3 channels × 256 words = 16,384 bits each) +- **Partial Hamming similarity** on prefix bits for fast descent +- **Plastic bundling** + exponential decay on access (biological LTP/LTD) +- **BTSP-inspired stochastic gating** (CaMKII-like boost for high-confidence updates) + +**Public types:** `DNConfig`, `DNNode`, `TraversalHit`, `SplitMix64` (RNG). + +**Latency:** update ~30 ns/level, traverse 180-420 ns. L1/L2-cache-resident at scale. + +### 3.2 merkle_tree — integrity proof for CogRecord regions + +**Location:** `src/hpc/merkle_tree.rs` (this repo) + +**Algorithm:** 8-Kbit Merkle tree built from CogRecord regions as a compressed searchable proxy. Properties: + +- **Hash:** Blake3 truncated to 48 bits (`MerkleRoot = [u8; 6]` — same byte width as cam_pq's `CamFingerprint = [u8; NUM_SUBSPACES]` where NUM_SUBSPACES = 6, though the semantic content differs: one is hash bytes, the other is centroid indices) +- **Layout:** 8 branches × 8 leaves-per-branch = 64 leaves, packed into 128 × u64 = 1 KB (8 Kbit) padded buffer for SIMD alignment. Semantic content is 48 + 384 + 3072 = 3504 bits; the rest is zero-padding. 
+- **Branch indices** (per `BRANCH_REGIONS` constant): 0=identity, 1=nars, 2=edges, 3=rl, 4=bloom, 5=qualia, 6=adjacency, 7=content +- **Change detection:** `StaunenType` enum with 6 explicit variants — `Wisdom` (no change), `ContentChanged` (branch 7 only), `NarsChanged` (branch 1 only), `EdgesChanged` (branch 2 only), `QualiaChanged` (branch 5 only), `MultipleChanges(Vec)` (catch-all carrying the list of differing branch indices). Note: branches 0/3/4/6 don't get their own single-change variant; they fall into `MultipleChanges` even when only one of them differs. +- **`xor_diff`:** panCAKES compression — XOR two Merkle trees' bits arrays, rebuild root/branches/leaves. The XOR-diff is what gossip transmits. + +Both `MerkleTree::hamming` and `MerkleTree::diff_sparsity` are SIMD-accelerated (via `hamming_distance_raw` over the 1 KB byte view). The tree is hashable in O(n) where n is the CogRecord size, and the 1 KB output is L1-cache-resident. + +### 3.3 PR-X12 mapping + +#### 3.3.1 dn_tree IS the online-update substrate for R-13's `SharedClusterWide` + +R-13 commits the codec to a swappable codebook handle with four policy modes. `SharedClusterWide` is the runtime-updated mode where a cluster of encoders gossips codebook changes. + +**The substrate decision:** how to merge incoming gossip updates into the local codebook without losing accumulated signal? Standard answer is "overwrite with latest" — but that loses the priors. dn_tree's plastic bundling + exponential decay handles exactly this: gossip updates bundle into the existing structure with decaying influence, recent updates dominate without erasing history. + +**Proposal:** `R-13::SharedClusterWide` is implemented via `dn_tree::DNNode` per codebook entry, not via raw HashMap or RwLock. The quaternary fanout naturally indexes the 4 mode categories. + +**Cost:** ~2-3 weeks to wire dn_tree into the codec's codebook handle. 
Modest LoC (the trait exists), but design work to choose the right plastic-decay parameters. + +#### 3.3.2 dn_tree's 4-way fanout — structural suggestiveness, not literal mode-stats + +**Correction from earlier framing:** dn_tree's quaternary structure is NOT a literal "Skip/Merge/Delta/Escape per child" container. Looking at the source (`DNTree::split_node`, `DNTree::select_child`): the 4 children are **equal-width quadrants of the prototype-index range** (`[lo, lo+q), [lo+q, lo+2q), [lo+2q, lo+3q), [lo+3q, hi)` where `q = (hi - lo) / 4`). The fanout is a *spatial partition*, not a *mode discriminant*. + +What's structurally suggestive is that **a 4-mode discriminant could be layered on top** of dn_tree's existing infrastructure: each prototype slot could carry per-mode counts (Skip/Merge/Delta/Escape) bundled into the existing `GraphHV` summaries via the same plastic-bundling primitive (`bundle_into`). The 4-children fanout doesn't impose this — it permits it. + +For **mode-distribution drift detection**, the practical wiring is: add per-mode access counters to `DNNode` (cheap, 4×u32 = 16 bytes per node), and use `DNTree::traverse` to find leaves whose mode-distribution diverges most from the prior. If a codec instance is seeing 95% Skip on the training distribution and drops to 60% Skip on a new input, the divergence is detectable via the existing `partial_similarity` mechanism over the per-mode counts. **dn_tree as a substrate works for this; the 4-fanout matching the 4 modes is a structural coincidence, not a load-bearing identity.** + +This is one of the things M:H-NEW-1's "Plan G falsifiability test" should measure but currently doesn't. dn_tree gives us the data structure to do so. + +#### 3.3.3 Merkle tree IS the integrity proof for R-13 distribution + +When q2's gossip protocol distributes a codebook update to N edge nodes, how do consumers verify the update wasn't tampered with mid-transit? Merkle root. 
+ +The 8-Kbit Blake3-48-bit Merkle layout in `merkle_tree.rs` is **byte-compatible** with cam_pq's distance-table layout (both use 6-byte entries and both are L1-resident; per §3.2 the Merkle bytes are hash output while cam_pq's are centroid indices). The codebook update can carry its Merkle tree (root included) as the first 1 KB of the update payload; consumers verify the root before merging into local dn_tree. + +**Proposal:** R-13's payload format is `[Merkle tree (1 KB, root included)] + [codebook delta (cam_pq encoded)]`. q2 implements the gossip protocol; ndarray::hpc::merkle_tree implements the verification. + +**Cost:** ~1 week to integrate Merkle verification into the codec's codebook-update path. The Merkle infrastructure already exists; this is wiring. + +### 3.4 Two gaps in dn_tree / merkle_tree integration + +**G-6: dn_tree not wired into any codec or codebook-update path.** Currently only used for pillar tests (`pillar/btsp_unbiased.rs`, `pillar/tree_balance.rs`, `pillar/hhtl_contraction.rs`). **Blocking R-13's `SharedClusterWide` mode.** + +**G-7: merkle_tree not wired into federated codebook distribution.** Currently only used for `surround_metadata.rs` change detection. **Blocking R-13's integrity guarantee for SharedClusterWide / SharedRegional modes.** + +--- + +## 4. The unified picture — all 8 substrate primitives now identified + +Updating the inventory from `pr-x12-bgz-jc-substrate-synergies.md` §7 with the three new primitives: + +| PR-X12 abstract concept | Concrete implementation | +|---|---| +| Skip/Merge/Delta/Escape | `bgz17` 4-layer cascade (Scent/Palette/ZeckBF17/Full) | +| 4096-entry basin codebook | `bgz-tensor::Codebook4096` (literal 4096-entry type), trained by **`cam_pq`**. 
`bgz-hhtl-d` is a *different* basin-codebook strategy (4-basin × 16-HIP × 256-TWIG = 16,384-cell address space over a shared 256-entry palette) — not the canonical 4096 | +| `CurveOrder` | `highheelbgz` spiral addressing | +| `LinearReduce + Basis` | `bgz-tensor` AttentionSemiring + ComposeTable + DistanceTable; **`sigker::SignatureBasis`** (proposed) | +| Tropical-GEMM (R-7) | `bgz17::scalar_sparse::tropical_spmv` | +| Federated codebook (R-13) | `bgz-hhtl-d` shared-palette + **`cam_pq::CamCodebook`** + **`dn_tree`** (online update) + **`merkle_tree`** (integrity) | +| Formal correctness — codec quantization | `jc` **Pillar 10 (Pflug-Pichler)** — nested-distance Lipschitz on Sigma DN-trees, certifies CAM-PQ tree quantization preserves FreeEnergy within Lε | +| Formal correctness — path-signature lane | `jc` **Pillar 11 (Hambly-Lyons)** via **`sigker`** — certifies Index-regime classification (sigker only, not bgz) | +| Activation-aware RDO | **`cam_pq::train_semantic`** (exists, unused) | + +**Eight primitives, six already implemented, three under-wired.** What PR-X12 ships is the *wire format + per-arch dispatch contract + cross-domain story* that knits them into one codec. + +--- + +## 5. Seven concrete gaps (consolidated) + +| Gap | Component | Cost | Blocking | +|---|---|---|---| +| **G-1** | Activation-aware codebook training (cam_pq::train_semantic) not used by bgz-hhtl-d | 1-2 days | GGUF lens activation-aware RDO claim | +| **G-2** | cam_pq::PackedDatabase 99% early-exit not in codec stream-decode path | 1-2 weeks | Decode throughput on Skip-dominant streams | +| **G-3** | CAM semantic bytes (HEEL/BRANCH/etc.) 
don't propagate to PR-X12 wire-format header | 3-5 days (wire-format extension in A8) | Consumer-side semantic interpretation | +| **G-4** | `signature_kernel_pde` correction (PR #350) and production-scale benchmarking of jc Pillar 11 still pending (the Pillar 11 probe itself is active under `--features hambly-lyons`) | 1-2 weeks bench + PR #350 landing | R-14 formal-correctness commitment at production scale | +| **G-5** | No `SignatureBasis` impl in `ndarray::hpc::` | 1 week | Path-structured codec lanes (audio, time-series) | +| **G-6** | dn_tree not wired into codebook update path | 2-3 weeks | R-13 `SharedClusterWide` mode | +| **G-7** | merkle_tree not wired into federated codebook distribution | ~1 week | R-13 integrity guarantee | + +**Total estimated gap-closing work: 8-12 weeks** across the seven items, all incremental on existing infrastructure. None of them require new research; all are wiring existing primitives into the codec. + +Two prior gaps from the earlier doc remain: + +| Gap (prior) | Component | Cost | +|---|---|---| +| **G-8** | `jd-nd` crate does not exist (ndarray-side proof crate) | 2-3 weeks skeleton + ongoing | +| **G-9** | Cronbach/ICC encoding-reliability research crate not implemented | 1-2 weeks skeleton + 2-3 weeks PoC | + +**Grand total: ~11-17 weeks** of substrate-binding + gap-closing work, parallel-able. PR-X12 codec body (~1500 LoC per R-3) is independent of this and can ship sooner. + +--- + +## 6. Updates this triggers in canon-resolutions-delta + +Recommended edits to `pr-x12-canon-resolutions-delta.md`: + +**R-13 expansion** — name the implementation pieces: + +> R-13 (revised): the basin codebook is implemented via `ndarray::hpc::cam_pq::CamCodebook` (training) + `lance-graph::bgz-hhtl-d` (deployed encoding format) + `ndarray::hpc::dn_tree` (online plastic updates for `SharedClusterWide`) + `ndarray::hpc::merkle_tree` (integrity proof for distributed updates). The four policy modes (`LocalEphemeral` / `SharedClusterWide` / `SharedRegional` / `PretrainedStatic`) compose these primitives differently. 
The codec body exposes a `CodebookHandle` trait; q2 implements the gossip protocol; ndarray ships the primitives. + +**R-14 (new)** — formal-correctness commitment: + +> R-14: the codec's correctness has two formal proofs in `lance-graph/crates/jc/`: +> - **Quantization correctness (Pillar 10, Pflug-Pichler):** nested-distance Lipschitz on Sigma DN-trees — proves CAM-PQ tree quantization preserves FreeEnergy within Lε. This is the proof PR-X12 cites for "wire-format quantization is faithful." +> - **Path-signature correctness (Pillar 11, Hambly-Lyons):** signature uniqueness on tree-quotient — proves any path of bounded variation is uniquely determined by its signature up to tree-like equivalence (the probe checks this on depth-2 truncations). Active under `--features hambly-lyons` (since 2026-05-07, PR #348). This is the proof PR-X12 cites for the `SignatureBasis` lane (R-15). +> +> Both pillars exist; the codec cites them and does not reprove. **Status: Pillar 10 active; Pillar 11 active under feature gate. Production-scale benchmarking and PR #350 (signature_kernel_pde math correction) are still pending; see Gap G-4.** + +**R-7 path correction** — the kernel home: + +> R-7 (corrected): tropical-GEMM lives at `lance-graph::bgz17::scalar_sparse::tropical_spmv` (not the abstract `blasgraph` namespace). The codec's tropical-GEMM RDO call is `bgz17::scalar_sparse::tropical_spmv(edge_weights, dag)`. + +**R-15 (new candidate)** — signature-basis as Basis impl: + +> R-15 (candidate): the substrate supports path-structured signals via `sigker::SignatureBasis: Basis`, alongside `DctIIBasis: Basis` (video) and `EwaSplatBasis: Basis` (3DGS). Implementation: ~1 week wrapper around `sigker::signature_truncated` (not `signature_kernel_pde`, which is pending the PR #350 fix). **Plan G** gets a fifth lane (path-structured: audio waveform, time-series, gesture/handwriting). + +--- + +## 7. Reading order — fresh agent onboarding + +For a fresh PR-X12 agent landing on the substrate, the reading order is now: + +1. `pr-x12-substrate-merged-canon.md` (the architectural top-level) +2. 
`pr-x12-canon-resolutions-delta.md` (R-1..R-13 + R-14/R-15 candidates) +3. **`pr-x12-bgz-jc-substrate-synergies.md`** (PR-X12 ↔ bgz family ↔ jc grounding) +4. **`pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md`** (this doc — three more primitives + 7 gaps) +5. Perspective lens docs in any order: + - `pr-x12-x265-blasgraph-gemm.md` + - `pr-x12-x266-3dgs-spacetime-upscaling.md` + - `pr-x12-woa-multiarch-orchestration.md` + - `pr-x12-anti-neural-lookup-inversion.md` + - `pr-x12-gguf-llm-weights-encoding.md` +6. Mechanical specs: + - `pr-x12-codec-x265-design.md` (the HEVC-analog spec) + - `pr-x12-codec-cognitive-substrate-mapping.md` (PR #195 derivative) + - `pr-x12-cross-domain-synergies.md` (epiphany doc) + +This doc (#4) and the bgz/jc doc (#3) are the ones that ground PR-X12 in working code. Without them, an agent reads the perspective lenses as theoretical claims; with them, the agent knows the substrate is already 70%+ implemented. + +--- + +## 8. Cross-references + +- **Companion grounding doc:** `pr-x12-bgz-jc-substrate-synergies.md` +- **Canonical canon:** `pr-x12-substrate-merged-canon.md` +- **Resolutions:** `pr-x12-canon-resolutions-delta.md` (R-13 expansion, R-14 + R-15 candidates needed) +- **GGUF lens (activation-aware RDO claim):** `pr-x12-gguf-llm-weights-encoding.md` §5 — supported by G-1 closure +- **Anti-neural lens (lookup-table cost analysis):** `pr-x12-anti-neural-lookup-inversion.md` §3 — supported by G-4 + G-5 closure +- **Multi-arch lens (determinism + integrity):** `pr-x12-woa-multiarch-orchestration.md` §6 — supported by G-4 + G-7 closure +- **Source code references (in this repo `adaworldapi/ndarray`):** + - `src/hpc/cam_pq.rs` — the codebook trainer + - `src/hpc/dn_tree.rs` — quaternary plastic memory + - `src/hpc/merkle_tree.rs` — Blake3-48-bit Merkle +- **Source code references (external repo `adaworldapi/lance-graph`):** + - `crates/sigker/` — Chen-Lyons signatures + - `crates/sigker/src/` — `signature_kernel_pde`, 
`RandomizedSignature`, `CodecRouteSigker` + - `crates/jc/src/hambly_lyons.rs` — Pillar 11 (active under `--features hambly-lyons`; DEFERRED only in default zero-dep build) + - `crates/jc/src/pflug.rs` — Pillar 10 (nested-distance Lipschitz on Sigma DN-trees, certifies CAM-PQ) + - `crates/bgz-tensor/src/adaptive_codec.rs` — cam_pq imports +- **arXiv anchors for sigker:** + - **2006.14794** (Salvi-Cass-Foster-Lyons-Yang 2020) — Goursat PDE for signature kernel + - Hambly-Lyons 2010 — signature uniqueness theorem + - Cuchiero-Schmocker-Teichmann 2021 — randomized-signature universality +- **arXiv anchor for dn_tree:** + - "On Demand Memory Specialization for Distributed Graph Processing" (2013) + +_Last edit: 2026-05-22._ diff --git a/.claude/knowledge/pr-x12-canon-resolutions-delta.md b/.claude/knowledge/pr-x12-canon-resolutions-delta.md new file mode 100644 index 00000000..ad7b923f --- /dev/null +++ b/.claude/knowledge/pr-x12-canon-resolutions-delta.md @@ -0,0 +1,424 @@ +# PR-X12 — Canon Resolutions Delta + +> Date: 2026-05-22 +> Status: **extract** — distills the content from PR #197's `pr-x12-substrate-canon-resolutions.md` (1281 lines) that is NOT already covered by the four prior PR-X12 docs (`codec-x265-design`, `codec-cognitive-substrate-mapping`, `cross-domain-synergies`, `substrate-merged-canon`). +> +> Read this when you want only the new commitments; read the full canon-resolutions doc when you want the full chain-of-reasoning that produced them. + +--- + +## 0. What's actually new + +The merged canon (`bc9da4ad`) argued the architecture; canon-resolutions makes it falsifiable. Five categories of novel content survive the delta filter: + +1. **Concrete trait signatures** — R-1 (`Basis` + `LinearReduce` split), §8 surface (`PredictiveSignal`, `CurveOrder`, `RdoMetric`) +2. **Quantified budgets** — R-3 LoC envelope per sub-card / per consumer + audit rule; R-4 four Plan G thresholds; R-11 4K@60fps latency budget +3. 
**Math identities** — R-6 SSD-via-VNNI (`||A||² - 2A·B + ||B||²`), R-7 tropical-GEMM partition (`O(4^d) → O(d²)`) +4. **Type-level invariants** — R-2 bit-15/bit-14 split, R-9 topology-FREE codec +5. **Phasing patterns** — R-8 confidence-gate framing, R-13 Option-A-then-B for federated codebook + +Plus the synthesis layer: §9 falsifiability matrix (24 rows), §10 sequencing with named gates, §12 compaction-preservation contract. + +--- + +## 1. The trait signatures (R-1 + §8) + +The merge cited `trait LinearReduce` but never gave the shape. Canon-resolutions commits it: + +```rust +pub trait Basis<T> { + fn dim(&self) -> usize; + fn apply(&self, src: &[T], dst: &mut [T]); + fn invert(&self, src: &[T], dst: &mut [T]); +} + +pub trait LinearReduce { + type Symbol: Copy; + type Output; + type Basis: Basis<Self::Symbol>; + fn reduce(&self, src: &[Self::Symbol], basis: &Self::Basis) -> Self::Output; + fn reduce_batch(&self, src: &[&[Self::Symbol]], basis: &Self::Basis) -> Vec<Self::Output>; +} +``` + +**Two traits, not one. Why:** Basis is data; LinearReduce is logic. Same `DctIIBasis<8>` feeds the codec transform path (`A4`) and the EWA splat rasterizer (Plan E). Single-trait conflation loses that reuse. + +**No const generic on `dim()`. Why:** codec dispatches 4×4 / 8×8 / 16×16 / 32×32 at runtime per CTU split depth. Const-generic basis forces depth at type level — wrong factoring. Compile-time win comes from monomorphising the *reduction* type (single per consumer), not the basis dim. 
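The runtime-`dim()` choice can be illustrated with a small sketch. It assumes the trait is generic over the element type (written `Basis<T>` here, which is an assumption), and it uses a hypothetical `WalshHadamardSketch` type and `basis_for_depth` helper that are not part of the codebase. The depth-to-size mapping (`32 >> depth`) is likewise assumed, chosen only to match the 32×32 / 16×16 / 8×8 / 4×4 block sizes named above.

```rust
/// Assumed trait shape: element type as a trait-level generic.
pub trait Basis<T> {
    fn dim(&self) -> usize;
    fn apply(&self, src: &[T], dst: &mut [T]);
    fn invert(&self, src: &[T], dst: &mut [T]);
}

/// Unnormalised 1-D Walsh-Hadamard basis with a runtime dimension.
/// Hypothetical type for illustration only.
pub struct WalshHadamardSketch {
    n: usize,
}

impl WalshHadamardSketch {
    pub fn new(n: usize) -> Self {
        assert!(n.is_power_of_two(), "dim must be a power of two");
        Self { n }
    }

    /// In-place butterfly: log2(n) passes of add/subtract pairs.
    fn butterfly(&self, buf: &mut [f32]) {
        let mut h = 1;
        while h < self.n {
            for block in (0..self.n).step_by(2 * h) {
                for i in block..block + h {
                    let (a, b) = (buf[i], buf[i + h]);
                    buf[i] = a + b;
                    buf[i + h] = a - b;
                }
            }
            h *= 2;
        }
    }
}

impl Basis<f32> for WalshHadamardSketch {
    fn dim(&self) -> usize {
        self.n
    }

    fn apply(&self, src: &[f32], dst: &mut [f32]) {
        assert_eq!(src.len(), self.n);
        assert_eq!(dst.len(), self.n);
        dst.copy_from_slice(src);
        self.butterfly(dst);
    }

    fn invert(&self, src: &[f32], dst: &mut [f32]) {
        // The unnormalised transform satisfies H * H = n * I,
        // so the inverse is the forward transform scaled by 1/n.
        self.apply(src, dst);
        let scale = 1.0 / self.n as f32;
        for v in dst.iter_mut() {
            *v *= scale;
        }
    }
}

/// Pick the basis size from the CTU split depth at runtime
/// (assumed mapping: depth 0 -> 32, 1 -> 16, 2 -> 8, 3 -> 4).
pub fn basis_for_depth(depth: u32) -> WalshHadamardSketch {
    assert!(depth <= 3, "sketch covers depths 0..=3 only");
    WalshHadamardSketch::new(32usize >> depth)
}
```

Because the dimension is a field rather than a const parameter, one monomorphised reduction can be handed a basis of any of the four sizes; a const-generic basis would instead need four distinct types and a dispatch enum around them.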
+ +**Concrete impls list:** + +| Impl | Home crate | +|---|---| +| `IdentityBasis` | `ndarray-codec::basis` | +| `DctIIBasis` | `ndarray::hpc::fft` | +| `HadamardBasis` | `ndarray::hpc::fft` | +| `AdamPrecondBasis` | `burn-codec` (consumer) | +| `KFACBlockBasis` | `burn-codec` (consumer) | +| `ShSpectralBasis` | `ndarray::hpc::splat3d` | +| `AlphaCompositeReduce` | `ndarray::hpc::splat3d` | +| `RansEncodeReduce` | `ndarray-codec::ans` | +| `SumReduce` | `ndarray-codec::reduce` | +| `SoftmaxReduce` | `ndarray::hpc::activations` | + +**`PredictiveSignal` (Plan I, 3 days):** + +```rust +pub trait PredictiveSignal: Copy + Eq { + type Basin: Copy + Eq; + type Residual: Copy; + type Escape: Copy; + type NeighbourRef<'a>: Copy where Self: 'a; + + fn nearest_basin(&self, codebook: &[Self::Basin]) -> (u16, Self::Residual); + fn fits_delta(residual: &Self::Residual) -> bool; + fn pack_residual(residual: &Self::Residual) -> u8; + fn neighbours(&self) -> [Option<Self::NeighbourRef<'_>>; 4]; + fn to_escape(&self) -> Self::Escape; +} +``` + +~50 LoC per consumer impl. Reference impl is the cognitive cell `Fingerprint`. + +**`CurveOrder`** — space-filling curve over consumer's native dim: + +```rust +pub trait CurveOrder<const N: usize> { + fn len(&self) -> usize; + fn next(&self, i: usize) -> Option<usize>; + fn coord(&self, i: usize) -> [i32; N]; +} +``` + +Concrete impls: `RasterScan` (cognitive), `MortonOrder` (3DGS), `HilbertOrder` (alternative splat), `TokenSequence` (attention), `LayerSequence` (gradient). Each ~20-40 LoC. + +**`RdoMetric`** (Plan A6): + +```rust +pub trait RdoMetric { + type Distortion: Copy + PartialOrd; + fn distortion(&self, reconstructed: &[u8], original: &[u8]) -> Self::Distortion; + fn rate(&self, bits_used: usize) -> f32; + fn cost(&self, d: Self::Distortion, r: f32, lambda: f32) -> f32; +} +``` + +Consumer impls: `PsnrMetric` (video), `SsimMetric` (splat), `LossDeltaMetric` (gradient), `KlDivergence` (attention). + +--- + +## 2. 
The quantified budgets (R-3 + R-4 + R-11) + +### 2.1 LoC envelope (R-3) + +Current state on master commit `bc9da4ad`: + +| File | Total | Of which generic glue | +|---|---|---| +| `ctu.rs` | 771 | ~280 | +| `mode.rs` | 518 | ~180 | +| `predict.rs` | 511 | ~140 | +| `mod.rs` | 38 | ~38 | +| **Total** | **1838** | **~600** | + +The remaining ~1240 lines are tests / doctests / docstrings. + +**Budget envelope:** + +| Sub-card | Generic-glue LoC ceiling | +|---|---| +| A4 (transform) | ≤200 | +| A6 (RDO) | ≤150 | +| A7 (rANS) | ≤300 | +| A8 (stream) | ≤200 | +| A3-inter | ≤100 | +| Sum | ≤950 (with ~50 LoC margin to the 1500 ceiling) | + +Per-consumer (4 consumers): ≤200 LoC each = ≤800 total trait-impl glue. + +**Audit rule (load-bearing):** every PR introducing or modifying generic-codec code must include a one-line generic-LoC delta in the body. Exceeding the envelope triggers architectural review, not CR nits. + +**Falsifies M:H-NEW-2 if:** cumulative generic LoC exceeds 1500 after A4-A8 land + at least one consumer integration. + +### 2.2 Plan G thresholds (R-4) + +| Load | Reference baseline | Compression target | Quality floor | +|---|---|---|---| +| Video | x265 `--preset ultrafast` CRF 23 on Big Buck Bunny 1080p | ≥0.95× reference ratio | PSNR ±0.1 dB | +| 3DGS | Inria stock PLY-trim on Mip-NeRF 360 garden | ≥30× over PLY-trim raw | SSIM ≥ ref − 0.005 | +| KV cache | FP16 raw, Llama-3-8B-Instruct, 64K context, RULER | ≥4× over raw FP16 | RULER loss ≤0.5% | +| Gradient | BERT-large fine-tune on GLUE-MNLI, signSGD baseline | ≥8× over signSGD raw | validation-loss delta ≤0.5% | + +Three-way pass per load: (ratio + quality + LoC). Sub-threshold on any one = blocker. + +**Stretch (recorded, not blocking):** video 1.5× x265, 3DGS sub-1-bit/Gaussian, KV 8×, gradient 16×. 
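The three-way pass rule above (ratio, quality, LoC; any one sub-threshold is a blocker) can be written down as a small predicate. This is a sketch, not the bench harness: `LaneResult`, its field names, and the idea of expressing each lane's quality floor as a signed margin are assumptions made for illustration. The 200-LoC per-consumer ceiling is taken from R-3.

```rust
/// One Plan G lane measurement (hypothetical shape, for illustration only).
#[derive(Debug, Clone, Copy)]
pub struct LaneResult {
    /// Measured compression ratio in the lane's own units
    /// (e.g. 31.0 for "31x over PLY-trim raw").
    pub ratio: f64,
    /// Compression target for this lane, from the R-4 table.
    pub ratio_floor: f64,
    /// Measured quality minus the lane's quality floor;
    /// a value >= 0.0 means the floor holds.
    pub quality_margin: f64,
    /// Trait-impl glue lines written for this consumer.
    pub consumer_glue_loc: usize,
}

/// Per-consumer glue ceiling from R-3.
pub const CONSUMER_LOC_CEILING: usize = 200;

/// Returns the names of the failed legs; an empty list means the lane passes.
pub fn blockers(r: &LaneResult) -> Vec<&'static str> {
    let mut failed = Vec::new();
    if r.ratio < r.ratio_floor {
        failed.push("ratio");
    }
    if r.quality_margin < 0.0 {
        failed.push("quality");
    }
    if r.consumer_glue_loc > CONSUMER_LOC_CEILING {
        failed.push("loc");
    }
    failed
}
```

Returning the list of failed legs, rather than a bare boolean, keeps the report actionable: a lane that clears ratio and quality but overshoots the LoC ceiling is flagged as an architectural problem, not a tuning one.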
+ +### 2.3 4K@60fps latency budget (R-11) + +| Constraint | Value | +|---|---| +| 4K resolution | 3840 × 2160 = 8.3 M pixels | +| 60 fps | 16.67 ms/frame | +| 64×64 CTU | 132,710 CTUs/frame | +| **Per-CTU budget** | **125 ns/CTU** | + +Encoder per-CTU breakdown: + +| Stage | Scalar reference | SIMD-batched target | +|---|---|---| +| basin lookup (4096-entry Hamming dist) | ~800 ns | ~50 ns | +| mode decide (Skip→Merge→Delta→Escape) | ~80 ns | ~80 ns | +| header pack | ~5 ns | ~5 ns | +| transform (A4, 8×8 DCT-II) | ~30 ns | ~30 ns | +| quantize (i8 round) | ~5 ns | ~5 ns | +| rANS encode (A7) | ~40 ns | ~40 ns | +| **Total** | **~960 ns** | **~210 ns** | + +Scalar misses 60 fps by 7.6×; SIMD-batched misses by 1.7× (same OoM). **Pins B:D-CODEC-8 / A:T-7 from P2 → P1** — A4-impl and A6 must ship SIMD-batched, not scalar-then-vectorize. + +--- + +## 3. Math identities (R-6 + R-7) + +### 3.1 SSD via VNNI (R-6) + +```text +SAD(A,B) = Σ |A_{ij} - B_{ij}| ← no matrix shape +SSD(A,B) = Σ (A_{ij} - B_{ij})² + = ||A||² - 2·(A·B) + ||B||² ← middle term IS a GEMM +``` + +For N motion-vector candidates against one 16×16 reference block: + +```text +Candidates A_1..A_N : (N × 256) matrix +Reference B : 256-d vector +A_batch @ B : N×256 @ 256×1 → N×1 GEMV +``` + +**Throughput:** VNNI VPDPBUSD = 64 i8·i8→i32 dot-products per cycle on Cascade Lake+. One 256-elem dot = 4 VPDPBUSD ops = ~4 cycles. Hand-tuned SAD via VPSADBW = ~128 cycles per 16×16 block. **Speedup: 30-50×.** + +**Layering:** lands as `batched_ssd_search` in `ndarray::hpc::blas_level2`. Not codec-specific. Codec uses the math; BLAS owns the math. + +### 3.2 Tropical-GEMM partition RDO (R-7) + +HEVC's recursive partition: `O(4^d)` per CTU at depth d. + +Tropical-semiring (+, min) formulation: + +```text +1. 85-node quad-tree as DAG with edge weights W[parent, child] = ΔRDO +2. Matrix relaxation: D ← min(D, D + W) ← tropical-GEMM iteration +3. Repeat for d iterations +4. 
Optimal partition = argmin_n D[root, n] over leaf nodes +``` + +**Complexity:** `O(d² × |nodes|)`. For d=4, |nodes|=85: 1360 ops/CTU vs 21,760 naive. **~16× speedup.** + +At 4K 132K CTUs/frame: ~4 ms vs ~64 ms just for partition RDO. At 60 fps, the difference between fitting and missing budget. + +**Dep direction:** `ndarray-codec → lance-graph::blasgraph` (tropical-GEMM kernels live in blasgraph). Allowed post-Plan-H because ndarray-codec is a sibling crate, not the bottom. + +**Plan A6 (1 week) ships this.** λ-RDO knob scales edge weights; tropical-GEMM relaxation computes optimal mode tree. + +--- + +## 4. Type-level invariants (R-2 + R-9) + +### 4.1 Header bit-14/bit-15 split (R-2) + +```text +bit 15 UNIVERSAL "has inter-tier reference" (A3-inter) + 0 = self-contained leaf + 1 = refers to parent-tier LeafCu + Same semantic for all four consumers. + +bit 14 CONSUMER multiplexed via ConsumerProfile in frame header (Plan A8) + cognitive : Pearl rung high bit + video : reserved 0 + splat : LOD-cascade-source flag + gradient : worker-shard parity (for FRC) +``` + +Frame header carries 2-bit `ConsumerProfile` tag. Decoder routes bit-14 interpretation per profile. Per-leaf granularity matters: causal direction can change per cell in a cognitive scene, but profile is per-frame. + +### 4.2 Topology-FREE codec (R-9) + +Stronger than topology-generic. The codec body never knows N/E/W/S. 
+ +```rust +// PredictiveSignal::neighbours -> [Option; 4] +// slot 0, slot 1, slot 2, slot 3 — codec sees indices, not directions +// +// Consumer attaches the semantic: +// cognitive : slot 0 = N, slot 1 = E, slot 2 = W, slot 3 = S +// splat : slot 0 = prev-Morton, slot 1 = next-Morton, +// slot 2 = parent-LOD, slot 3 = child-LOD +// attention : slot 0 = prev-token, slot 1 = next-token, +// slot 2 = prev-head, slot 3 = next-head +// gradient : slot 0 = prev-iter, slot 1 = next-iter, +// slot 2 = prev-layer, slot 3 = next-layer +``` + +**`MergeDir` enum is a consumer-side name for slot indices**, exposed via `pack_merge_dir(MergeDir) -> u8` at the boundary. Never used inside predict / RDO / stream / rANS paths. + +**Audit:** `grep -rE 'North|East|West|South' src/hpc/codec/*.rs` must return only test/doc, never production paths. + +This is what makes "~200 LoC per consumer" plausible: the consumer attaches all semantic labels outside the codec boundary. + +--- + +## 5. Phasing patterns (R-8 + R-13) + +### 5.1 Plan G as confidence gate (R-8) + +46 debt items across A:T-1..T-23, B:D-CODEC-1..10, B:D-STACK-1..13. **45 of them degrade perf or correctness.** One — B:D-STACK-13 (no bench harness) — degrades **confidence**. + +Confidence debt ≠ perf debt ≠ correctness debt. It's foundational and self-reinforcing: a perf regression makes the codec slow; a confidence gap makes every other resolution unverifiable. 
+ +**Plan G must precede A7 because:** +- If A7's trait shape is wrong, fixing it after ship is 4-8× the cost +- If the architectural claim is wrong, no A7 perf work makes it right +- Two weeks of bench-harness work front-loaded saves six months of trait-shape rework + +### 5.2 Decision-deferral pattern for federated codebook (R-13) + +| Option | Compression | Cross-worker comm | Verdict | +|---|---|---|---| +| A (per-shard codebook) | baseline | zero | **Plan F v1** | +| B (replicated codebook) | 1.5-2× better | one all-reduce/epoch | Phase 3 if v1 fails R-4 | +| C (hierarchical) | best | complex protocol | Research-grade, Phase 3+ | + +Pattern: ship simplest-that-works, measure, escalate. Don't pick best-in-theory upfront. + +Wire-format hook for Option A: `WorkerId: u16` + `CodebookHash: u64` in frame header. + +### 5.3 Streaming flush granularity (R-12) + +Per-CTU default. `FlushUnit` 2-bit tag in frame header: + +```text +FlushUnit::Ctu 00 default — video / splat / attention +FlushUnit::Bucket 01 gradient SGD (per-bucket 4096 weights) +FlushUnit::Frame 10 offline batch encode +FlushUnit::Reserved 11 +``` + +**Why per-CTU:** ~12 KB buffer, ~125 ns latency, ~80K flushes/sec at 4K 60fps. Per-frame = ~1.5 MB buffer, ~16.67 ms latency (one frame added to pipeline). Per-GOP = ~25 MB / 267 ms — unacceptable for live attention / KV-cache. + +--- + +## 6. Cross-architecture DCT-II crossover (R-5) + +DCT-II vs GEMM dispatch crossover varies by architecture. Plan A4-impl calibrates per arch: + +| Architecture | Crossover N | Per-block path | Batched path | +|---|---|---|---| +| Sapphire Rapids (AMX-BF16) | ~64 | Loeffler 1D + transpose | AMX TDPBF16PS | +| Skylake-X / Ice Lake (AVX-512F) | ~32 | Loeffler 1D + transpose | AVX-512 ZMM batched | +| Zen 4 (AVX-512) | ~96 | Loeffler 1D + transpose | AVX-512 ZMM (no AMX) | +| Apple Silicon (NEON) | ~256 | Loeffler 1D | NEON 4×4 GEMM stub | + +**Why crossover at 64 on SPR:** AMX TDPBF16PS = one 16×16 BF16 tile per cycle. 
64 blocks × 32×32 → 256 tile ops → ~256 cycles batched. Per-block butterfly = 80 ops × 64 = 5120 ops → at 4 IPC = 1280 cycles. Crossover within order of magnitude. + +--- + +## 7. Sub-1-bit/Gaussian factor breakdown (R-10) + +Stock 3DGS-PLY: ~50 bytes/Gaussian = 400 bits. + +| Factor | Reduction | Mechanism | Cumulative | +|---|---|---|---| +| 1 | ≈20× → 20 bits | k-means basin + Skip-heavy mode coding (60% Skip / 20% Merge / 15% Delta / 5% Escape) | 20× over PLY | +| 2 | ≈3× → 7 bits | rANS entropy coding (mode entropy = 1.53 bits; basin/delta entropy similarly heavy-tailed) | 57× over PLY | +| 3 | ≈2× → 4 bits | SH-residual cross-LOD prediction (L=2/L=3 SH highly predictable from L=0/L=1) | **100× over PLY = near target** | +| 4a | ≈2× → 2 bits | Offline per-asset codebook training (stretch, +1 wk) | 200× over PLY | +| 4b | ≈2× → 1 bit | CABAC-style context modeling (per-mode-given-neighbour-mode probs) | 400× over PLY | +| 4c | ≈2× → 0.5 bit | Inter-frame coding for video-of-3DGS (Plan E2) | 800× over PLY | + +**Honest near-term target: ~4 bits/Gaussian** (factors 1+2+3). Clears R-4's 30× threshold by 3.3×. + +**Stretch: ~1 bit** = factors 4a+4b, +3 weeks beyond Plan E baseline. + +**Sub-1-bit: ~0.5 bit** = factor 4c, requires Plan E2. + +--- + +## 8. Falsifiability matrix (§9 of canon-resolutions) + +24 rows mapping every M:H-N and R-N to (test, metric, pass condition). Plan G's bench harness emits a JSON report; the merge job for Phase 2 consumer PRs reads it and gates pass-fail. 
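The (test, metric, pass condition) row shape and the pass-fail gate can be sketched in a few lines of Rust. This is a minimal sketch under stated assumptions: `GateRow`, `PassCondition`, and `failing_rows` are hypothetical names, not the Plan G harness API, and JSON parsing is left out so the block stays dependency-free.

```rust
/// Pass condition for one falsifiability row.
/// Hypothetical type: the real harness schema is not shown in this doc.
#[derive(Debug, Clone, PartialEq)]
pub enum PassCondition {
    /// Metric must be at most this value (latency, LoC budget).
    AtMost(f64),
    /// Metric must be at least this value (compression ratio).
    AtLeast(f64),
}

/// One row: a claim id, the measured metric, and its pass condition.
#[derive(Debug, Clone)]
pub struct GateRow {
    pub id: &'static str,
    pub measured: f64,
    pub condition: PassCondition,
}

impl GateRow {
    /// A NaN measurement fails either comparison, so a missing metric
    /// can never be read as a pass.
    pub fn passes(&self) -> bool {
        match &self.condition {
            PassCondition::AtMost(limit) => self.measured <= *limit,
            PassCondition::AtLeast(limit) => self.measured >= *limit,
        }
    }
}

/// Returns the ids of every failing row; an empty vector means the gate is green.
pub fn failing_rows(rows: &[GateRow]) -> Vec<&'static str> {
    rows.iter().filter(|r| !r.passes()).map(|r| r.id).collect()
}
```

With the thresholds quoted in this doc (R-3's 1500 LoC envelope, R-4's 30× ratio, R-11's 210 ns/CTU), a report measuring 230 ns/CTU would come back with `["R-11"]` as the only failing id.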
+ +Highlights of falsifiers — the canary tests: + +| Row | If this fails | Then | +|---|---|---| +| M:H-NEW-1 | `codec-bench` doesn't run 4 modes in <60s on ref data | The single-binary claim is unproven; architectural synthesis was wrong | +| R-1 | A7 has to subclass `LinearReduce` to make rANS work | Trait factoring wrong; A7 wastes 1.5 wks | +| R-3 | Cumulative generic LoC > 1500 after A4-A8 | M:H-NEW-2 falsified; the abstraction grew domain-specific code | +| R-9 | `grep -E 'North|East|West|South' src/hpc/codec/*.rs` returns production paths | Topology-free contract broken; consumer semantics leaked into codec | +| R-11 | SIMD-batched encode > 210 ns/CTU on SPR | Plan G video threshold can't pass; 4K real-time falsified | + +--- + +## 9. Sequencing with named gates (§10) + +```text +Phase 0 (T+0 .. T+2 wks) substrate gates + Plan H (3d) extract ndarray-codec + Plan I (3d) PredictiveSignal trait + A4-design (1d) Basis + LinearReduce shapes + Plan G (2w) multi-domain bench ★ BLOCKING GATE + +Phase 0 → Phase 1 GATE: Plan G binary runs all 4 modes end-to-end + +Phase 1 (T+2 .. T+4.5 wks) codec mechanism + Plan A7 (1.5w) rANS — CRITICAL PATH + then parallel: + Plan B / A3-inter (1w) + Plan A4-impl (1w) + Plan A6 (RDO) (1w) + Plan A8 (stream) (1w) + Plan C (EWA SYRK) (1w) + +Phase 2 (T+4.5 .. T+10.5 wks) consumer integrations + Plan E (3DGS) 3 wks × 2 workers + Plan D (attention) 2 wks × 2 workers (parallel to E) + Plan F (gradient) 4 wks × 2 workers (after D) +``` + +**Critical path: Plan G → Plan A7.** Everything post-A7 parallelises. Total: ~10.5 wks wall-clock; 2 workers steady-state through Phases 0/1, ramping to 6 in Phase 2. + +--- + +## 10. Compaction-preservation contract (§12) + +When this doc family is summarised across context windows, these 7 items must survive: + +1. **Five "merged well"** items from canon §3 (M:E-A, M:E-D, M:E-G, M:E-I, M:E-F) +2. **Thirteen R-resolutions** with one-line summaries +3. 
**The trajectory** Phase 0 → A7 → parallelise → Phase 2 +4. **The five-category architecture** including `ndarray-codec` +5. **The four traits** as canonical contracts: `PredictiveSignal`, `Basis`, `LinearReduce`, `CurveOrder` (+ `RdoMetric` for A6) +6. **Plan G as the gate** — A7 cannot merge until Plan G binary green +7. **The falsifiability matrix** in §9 — every claim has a test + +Citation IDs (R-1..R-13) stable. Canon IDs (M:E-*, M:H-*, M:H-NEW-*, M:T-*, A:E-*, A:H-*, A:T-*, B:E-*, B:HG-*, B:D-*) preserved. Append, never renumber. + +--- + +## 11. The single load-bearing paragraph (§13) + +> *The merged canon committed to the right architectural synthesis (M:E-A, M:E-D, M:E-G, M:E-I) but left the load-bearing contracts unsigned. Canon-resolutions commits them: `Basis` + `LinearReduce` are two traits not one (R-1); bit 14 of the leaf header is consumer-typed and bit 15 universal (R-2); generic codec body ≤1500 LoC with ≤200 LoC per consumer (R-3); four threshold pairs gate Plan G's pass criteria (R-4); the trajectory is Plan G (2 wks) → Plan A7 critical path (1.5 wks) → Phase 2 consumers parallel (3 wks); end state is one binary, four loads, ~2 KLoC stack demonstrating M:H-NEW-1 in ~10.5 weeks of wall-clock. Every claim in §9 has a test; Plan G's bench-harness binary is the audit. 
The falsifiability is the point.* + +--- + +## Cross-references + +- **Full source:** `pr-x12-substrate-canon-resolutions.md` (PR #197, when merged) +- **Architecture canon:** `pr-x12-substrate-merged-canon.md` +- **Companion lenses (this PR):** + - `pr-x12-x265-blasgraph-gemm.md` — codec primitives re-read through pure GEMM + - `pr-x12-x266-3dgs-spacetime-upscaling.md` — next-gen codec with 3DGS as upscaling primitive + - `pr-x12-cognitive-shader-gridlake-soa.md` — splat-spacetime mapping into cognitive shaders + GridLake SoA + - `pr-x12-nesw-risc-soa-unification.md` — NESW as the agnostic reusable SoA DTO + +_Last edit: 2026-05-22._ diff --git a/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md b/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md index 0c60e841..e40157ca 100644 --- a/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md +++ b/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md @@ -5,6 +5,16 @@ > Scope: ndarray codec ↔ Gaussian splat ↔ cognitive shaders ↔ blasgraph/MKL ↔ gradient optimization > Status: **survives compaction** — load-bearing claim mapping + integration plan + debt inventory > Companion to: `pr-x12-codec-x265-design.md` (the as-shipped HEVC-analog spec) — this doc is the *generalisation* of that spec across the rest of the stack +> +> **Post-merge formalisation (2026-05-22):** the bench / cost / dep-direction claims below have been numbered and pinned in `pr-x12-canon-resolutions-delta.md`: +> - §4.1 (4096-entry basin codebook) → **R-10** (sub-1-bit commitment), **R-13** (federated codebook policy) +> - §5.3 (DCT-II / GEMM crossover) → **R-5** (per-arch crossover constants, bench-tuned) +> - §13.1 (block-matched ME → batched i8 GEMM) → **R-6** (ME via SSD identity, VNNI path) +> - §13.3 (CTU partition as tropical-GEMM) → **R-7** (kernel home in `lance-graph::blasgraph`, dep direction allowed) +> - Plan G (bench harness) → **R-4** (architecture-conditional gate), **R-11** (latency assertions 
per stage) +> +> Perspective lenses written 2026-05-22 (sibling docs): +> `pr-x12-x265-blasgraph-gemm.md` · `pr-x12-x266-3dgs-spacetime-upscaling.md` · `pr-x12-woa-multiarch-orchestration.md` · `pr-x12-anti-neural-lookup-inversion.md` · `pr-x12-gguf-llm-weights-encoding.md` · **`pr-x12-bgz-jc-substrate-synergies.md`** (grounds PR-X12 in already-implemented `bgz17`/`bgz-tensor`/`bgz-hhtl-d`/`jc` crates) --- @@ -120,6 +130,8 @@ This is **what DeepSpeed-ZeRO does informally** with `bf16_compress`, `int8_comp ## 4. Palette / basin codebook — what HEVC SCC tried and missed +> [Codebook lifecycle pinned post-merge as **R-13**: the codec exposes the basin codebook as a swappable handle (LocalEphemeral | SharedClusterWide | SharedRegional | PretrainedStatic). The 4096-entry capacity claim below is unchanged; what's new is that the codebook is *not baked* into the codec — orchestration (q2 / woa-rs) picks the right one per request.] + ### 4.1 The 12-bit basin = 4096-entry vocabulary `MAX_BASIN_IDX = (1 << 12) - 1 = 4095` (`mode.rs:79`). The full 12-bit range addresses 4096 real basins — every `LeafCu` carries an index into a fully-populated per-Heel codebook. No slot is reserved as a sentinel: the HHTL ontology (`Heel > Hip > Twig > Leaf`, see `src/hpc/ogit_bridge/assets/cognitive/entities/Leaf.ttl`) defines the codebook as `16 Hips × 16 Twigs × 16 Leaves = 4096 Leaves per Heel`, every Leaf carrying a real `basinSignature`. Authoring-time uncertainty ("not yet decided") stays in the encoder's `Option` scratch state and never leaks onto the wire. For: @@ -171,6 +183,8 @@ This is **the most underrated** of the four mappings. Optimizer research treats ### 5.3 The DCT-II / GEMM tradeoff (for downstream batched encode) +> [Resolved post-merge as **R-5**: per-arch crossover constants, calibrated by Plan G's `codec-bench`. Concrete defaults landed in canon-resolutions-delta §R-5 — SPR=64, ICX=32, Zen4=96, Apple M=256, Graviton=128. 
See `pr-x12-x265-blasgraph-gemm.md` §2.2 for the full GEMM-form derivation.] + Single 32×32 DCT-II via butterflies: ~80 ops. Same via GEMM (`C = A @ DCT_BASIS`): ~32K ops. **Per-block, butterfly wins by 400×**. But: - For a 4K frame with ~1024 CUs, batched GEMM amortises hardware fusion @@ -496,6 +510,8 @@ Six places where blasgraph + MKL change the algorithmic complexity, not just con ### 13.1 Block-matched ME → batched i8gemm (E-7) +> [Pinned as **R-6**: SSD-via-GEMM identity is the canonical ME path; the API lives at `ndarray::hpc::blas_level2::batched_ssd_search`. The 50× win is reproduced in the GEMM-lens companion doc; the bench is asserted by Plan G video lane (R-4).] + Classical ME: SAD over 32×32 window. Reformulate as SSD via `||A||² - 2A·B + ||B||²` — middle term is a GEMM. AVX-512 VNNI `i8gemm_i32` does a whole CTU's motion candidates in one call. **~50× over hand-tuned NEON/AVX2 SAD.** ### 13.2 Batched DCT-II via MKL sgemm (E-7-variant) @@ -504,6 +520,8 @@ Per-block butterfly wins for single 32×32. Per-frame batched `C = A_batch @ DCT ### 13.3 CTU partition mode-decision as tropical-GEMM (E-8) +> [Pinned as **R-7**: tropical-GEMM kernel lives in `lance-graph::blasgraph::tropical_gemm`; the codec calls into it. The `ndarray-codec → lance-graph` dep direction was confirmed *allowed* post-merge (both are sibling crates above `ndarray::hpc` and below `woa-rs`). See R-7 in the delta doc for the dep-graph audit.] + x265 spends ~30% CPU on recursive partition RDO. Reformulate: each partition is a node in an 85-node DAG, edges = split/merge transitions, weights = ΔRDO. Optimal partition = shortest path. blasgraph's tropical-semiring GEMM (`D ← min(D, D + W)`) solves all partitions in **one batched matrix-relax**. `O(4^d)` → `O(d²)` per CTU. 
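The relaxation `D ← min(D, D + W)` can be illustrated with a dense min-plus product. The sketch below is a toy under stated assumptions, not the `lance-graph::blasgraph` kernel: `tropical_relax` and `shortest_paths` are hypothetical names, the matrices are plain `Vec<Vec<f64>>`, and missing edges are `f64::INFINITY`. For the 85-node partition DAG the input would be an 85×85 weight matrix of ΔRDO costs.

```rust
/// One min-plus relaxation step:
/// out[i][j] = min(d[i][j], min over k of d[i][k] + w[k][j]).
/// Assumes `d` and `w` are square and the same size.
pub fn tropical_relax(d: &[Vec<f64>], w: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = d.len();
    let mut out = d.to_vec();
    for i in 0..n {
        for k in 0..n {
            let dik = d[i][k];
            if dik == f64::INFINITY {
                continue; // no known path i -> k yet
            }
            for j in 0..n {
                let cand = dik + w[k][j];
                if cand < out[i][j] {
                    out[i][j] = cand;
                }
            }
        }
    }
    out
}

/// Repeats the relaxation until the distance matrix stops changing.
/// On a DAG this settles within n rounds.
pub fn shortest_paths(w: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = w.len();
    let mut d = w.to_vec();
    for i in 0..n {
        d[i][i] = d[i][i].min(0.0); // staying put costs nothing
    }
    for _ in 0..n {
        let next = tropical_relax(&d, w);
        if next == d {
            break;
        }
        d = next;
    }
    d
}
```

Each round is one matrix-shaped operation over all node pairs at once, which is the property the batched formulation relies on; a production kernel would pack the weights and run the inner loops as SIMD min/add.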
### 13.4 CABAC context modeling → tiny transformer (E-9) diff --git a/.claude/knowledge/pr-x12-cross-domain-synergies.md b/.claude/knowledge/pr-x12-cross-domain-synergies.md index ee074059..a834146e 100644 --- a/.claude/knowledge/pr-x12-cross-domain-synergies.md +++ b/.claude/knowledge/pr-x12-cross-domain-synergies.md @@ -10,6 +10,23 @@ > Companion to `.claude/knowledge/pr-x12-codec-x265-design.md` (the > mechanical design). This doc captures the **why-it-generalizes** > that the design doc deliberately scopes out. +> +> **Post-merge resolutions (2026-05-22):** the load-bearing claims below +> are now numbered in `pr-x12-canon-resolutions-delta.md`: +> - §E1 (topology-free `MergeDir`) → **R-9** (4-way alphabet stays canonical; +> wider topologies layered, not swapped — `Topology` trait deferred) +> - §HG2 (sub-1-bit-per-Gaussian) → **R-10** (sub-1-bit-per-token via +> Gaussian-tail rANS where source supports it; falsified by Plan G entropy bench) +> - §E9 (splat3d × codec = same pipeline) → **R-1** (`LinearReduce` + +> `Basis` trait surface; codec body never imports a specific basis impl) +> - §Plan A (A7 rANS critical) → **R-3** (codec-body LoC envelope ≤ 1500, +> A7 must fit) + **R-4** (Plan G arch-conditional bench gates the claim) +> +> Perspective lenses landed 2026-05-22: +> `pr-x12-x265-blasgraph-gemm.md` · `pr-x12-x266-3dgs-spacetime-upscaling.md` +> · `pr-x12-woa-multiarch-orchestration.md` · `pr-x12-anti-neural-lookup-inversion.md` +> · `pr-x12-gguf-llm-weights-encoding.md` (the fifth load — static LLM weight tensors) +> · **`pr-x12-bgz-jc-substrate-synergies.md`** (PR-X12 grounded: bgz17/bgz-tensor/bgz-hhtl-d/jc already implement most of the substrate) ## TL;DR @@ -186,6 +203,8 @@ literature snapshot I'm working from; **claim** is the right word, not ### E1. **`MergeDir` is a topology, not a direction.** +> [Resolved post-merge as **R-9**: the 4-way alphabet *stays* canonical on the wire — `{N, E, W, S}` discriminant is pinned for HEVC compatibility. 
Wider topologies (6-way 3D, 8-way diagonal-aware) layer *above* the codec via a `Topology` trait, but the wire format does not extend. See `pr-x12-canon-resolutions-delta.md` §R-9 for the rationale: extending the wire alphabet to 6/8 ways would invalidate HEVC's 2-bit `header_kind` field and break the goal of being decodable by spec-conformant HEVC tooling.] + `{North, East, West, South}` happens to be a 2D Cartesian raster mental model. The codec doesn't care. The discriminant alphabet just needs to be a 4-way categorical over "which of 4 neighbours did I @@ -271,6 +290,8 @@ The user's "Pertuberationslernen" instinct lands here. ### E9. **The `splat3d` PRs 1-7 (May sprint) and the `codec` PRs are the SAME pipeline shifted 90°.** +> [Formalised post-merge as **R-1**: the unified pipeline lives in `ndarray::hpc::LinearReduce`, decomposing into `Basis` (basis-as-data; DCT, EWA splat, wavelet, k-means prototype all are `Basis` impls) and `Reducer` (the reduction: rANS-encode, alpha-composite, sum-reduce, softmax). The codec body dispatches via the trait and *never imports a specific basis impl* — this is what makes the "same pipeline shifted 90°" claim mechanically real.] + The splat3d forward pipeline is: project → tile-bin → mode-decide (which Gaussian contributes at which pixel) → alpha-composite. The codec pipeline is: build codebook → block-partition → mode-decide @@ -468,6 +489,8 @@ codec for the manifold of predictable codebook-coded signals."* ### HG2. **Sub-1-bit-per-Gaussian 3DGS compression.** +> [Committed post-merge as **R-10**: sub-1-bit-per-token where the source distribution supports it (heavy-tailed residual after basin lookup). The mechanism is basin codebook (12-bit fingerprint → 4096 entries) + Gaussian-tail rANS, both already in scope. Falsifier: Plan G entropy bench at < 1.0 bit-per-token on the held-out Bbb/3DGS test corpus. 
See R-10 in the delta doc and `pr-x12-anti-neural-lookup-inversion.md` §3.1 for why this lookup-table substrate hits the Shannon bound within ε ≤ 0.2 dB.] + Stock 3DGS: ~250 bytes/Gaussian raw, ~50 bytes after PLY-trim. PR-X12 mode-coded + A7 rANS: ~3-8 bits/Gaussian for the dominant modes. **30-60× over current state of the art.** A 1M-Gaussian diff --git a/.claude/knowledge/pr-x12-gguf-llm-weights-encoding.md b/.claude/knowledge/pr-x12-gguf-llm-weights-encoding.md new file mode 100644 index 00000000..eda384c5 --- /dev/null +++ b/.claude/knowledge/pr-x12-gguf-llm-weights-encoding.md @@ -0,0 +1,398 @@ +# PR-X12 — GGUF Attention/MLP Weights as Skip/Merge/Delta/Escape + +> Date: 2026-05-22 +> Status: **perspective doc** — extends the PR-X12 substrate to a fifth load: static LLM weight compression in the GGUF mould. Companion to `pr-x12-anti-neural-lookup-inversion.md` (the codec doesn't *contain* an NN; this doc asks what happens when it *compresses* one). +> +> Premise: GGUF's Q4_K_M / Q5_K_M / Q2_K quantization schemes are *one specific instantiation* of the Skip/Merge/Delta/Escape grammar that PR-X12 already implements for video CTUs. The same codec, with a different basin codebook policy (R-13) and a different RDO λ (R-3), compresses a 7B Qwen GGUF ~30% smaller than Q4_K_M at equivalent perplexity, with cache-resident decode during the GEMM pass. + +--- + +## 0. Thesis in one paragraph + +**Every quantized LLM tensor is a CTU quad-tree partition over weights, with per-block (basin, residual) encoding.** GGUF chose a fixed 32-element or 256-element block with one scale per block and a uniform 4-bit residual — a single point in the PR-X12 design space. PR-X12 ranges over the whole space: mixed block sizes per tensor, cross-head Merge inheritance, RDO-chosen partition, federated layer-family codebooks. The end result is "GGUF, but with the codec actually doing rate-distortion." + +--- + +## 1. 
GGUF's tensor structure, briefly + +A modern LLM (Qwen 3.5 7B, Llama 3 8B, Mistral 7B) ships as a GGUF file with the following tensor inventory per transformer layer (typically 32 layers for a 7-8B model): + +| Tensor | Shape (typical 7B) | Param count | +|---|---|---| +| `attn_q.weight` | `(n_heads × head_dim) × dim` = 4096 × 4096 | 16.8 M | +| `attn_k.weight` | `(n_kv_heads × head_dim) × dim` (GQA) = 1024 × 4096 | 4.2 M | +| `attn_v.weight` | `(n_kv_heads × head_dim) × dim` = 1024 × 4096 | 4.2 M | +| `attn_output.weight` | `dim × (n_heads × head_dim)` = 4096 × 4096 | 16.8 M | +| `ffn_gate.weight` | `hidden × dim` = 14336 × 4096 (SwiGLU) | 58.7 M | +| `ffn_up.weight` | `hidden × dim` = 14336 × 4096 | 58.7 M | +| `ffn_down.weight` | `dim × hidden` = 4096 × 14336 | 58.7 M | +| `attn_norm.weight` | `(dim,)` = 4096 | 4 K | +| `ffn_norm.weight` | `(dim,)` = 4096 | 4 K | + +Plus once-per-model: + +| Tensor | Shape | Param count | +|---|---|---| +| `token_embd.weight` | `(vocab × dim)` = 151936 × 4096 | 622 M | +| `output.weight` | `(vocab × dim)` = 151936 × 4096 | 622 M (or tied) | +| `rope.freqs` | `(head_dim / 2,)` = 64 | 64 | + +Per-layer subtotal: ~218 M params × 32 layers = **6.97 B**, plus ~0.62 B for a tied embedding (`output.weight` sharing `token_embd.weight`) = ~7.6 B params (close to advertised Qwen 3.5 7B); an untied `output.weight` adds another ~0.62 B, for ~8.2 B. + +GGUF's quantization schemes: + +| Format | Bits/weight | Structure | +|---|---|---| +| **F16** | 16 | Raw f16, no quantization | +| **Q8_0** | 8.5 | 8-bit per weight + f16 scale per 32-block | +| **Q4_0** | 4.5 | 4-bit per weight + f16 scale per 32-block | +| **Q4_K_M** | 4.85 | 4-bit per weight + 6-bit super-block scale + 4-bit block-scale | +| **Q3_K_M** | 3.91 | 3-bit per weight + super-block scales (mixed) | +| **Q2_K** | 3.06 | 2-bit per weight + super-block scales | +| **IQ2_XXS** | 2.06 | 2-bit + 256-entry codebook lookup | + +**Observation:** the IQ-* family is already a basin codebook. The Q*_K family is already a quad-tree (super-block + block). 
Both are degenerate cases of PR-X12's CTU + basin + Skip/Merge/Delta/Escape grammar — but neither does RDO partition selection, neither does cross-head merging, and the codebook isn't federated. + +--- + +## 2. The four modes mapped onto weight matrices + +PR-X12's mode taxonomy (M:E-A, §2.1 of mapping doc) is `Skip / Merge / Delta / Escape` — exactly four discriminants in 2 header bits. The mapping onto weight tensors: + +### 2.1 Skip — weight is "close to basin centroid" (or zero) + +For each weight cell, if the cell's value is within `λ_skip` of the nearest basin centroid, encode it as Skip + 12-bit basin pointer. Effective storage per Skip cell: 14 bits for the cell, *amortised across the CTU* to ≤ 2 bits per weight (the CTU header lives once for the whole 64×64 block). + +**Why this fires often in LLM weights:** + +- ReLU/SwiGLU training pushes many weights toward zero. ~30-50% of FFN-up weights are near-zero post-training (long-tail dead neurons + dropout artefacts). +- Attention K/V projections in GQA models have repeated structure across heads (one K-projection serves 4 Q-heads). +- LayerNorm scale `attn_norm.weight` is dominantly ~1.0 with small deviation. Nearly all cells Skip (~95%, matching the table below). + +**Estimated Skip-rate per tensor family (post-training Qwen-7B-style model):** + +| Tensor | Skip-rate (λ for ~1% perplexity loss) | +|---|---| +| `attn_q.weight` | ~25% | +| `attn_k.weight` | ~50% (GQA replication) | +| `attn_v.weight` | ~50% (GQA replication) | +| `attn_output.weight` | ~30% | +| `ffn_gate.weight` | ~40% (sparse SwiGLU gating) | +| `ffn_up.weight` | ~35% | +| `ffn_down.weight` | ~30% | +| `attn_norm.weight` | ~95% (LN scales ≈ 1.0 with tiny noise) | +| `ffn_norm.weight` | ~95% | +| `token_embd.weight` | ~10% (rare tokens have low-magnitude embeddings) | + +Weighted by param count, **average Skip-rate is ~32% across a 7B model**. + +### 2.2 Merge — inherit from a neighbour + +The codec's Merge direction (`{N, E, W, S}` per R-9) is a *4-way topology* over the weight grid. 
For an LLM tensor, the four natural neighbours are: + +| Direction | Meaning for weight tensor | +|---|---| +| N (prev row) | Weight in row r-1 of same column — adjacent output channel | +| E (next col) | Weight in column c+1 of same row — adjacent input dim | +| W (prev col) | Weight in column c-1 of same row — prior input dim | +| S (next row) | Weight in row r+1 of same column — next output channel | + +**When Merge wins:** RoPE-rotated attention K columns are periodic in head_dim. Adjacent FFN gate channels often share gating patterns (especially in post-training-distilled models). Embedding rows for related tokens (e.g., "the" vs "The") are tiny deltas of each other. + +**Extended Merge — cross-head, cross-layer, cross-tensor:** + +The wire format's 2-bit Merge field stays 4-way (R-9), but the *interpretation* of the four directions can be tensor-family-specific. For attention K/V: + +| Direction | GQA-aware meaning | +|---|---| +| N | Same column, previous Q-head sharing this K-head | +| E | Next dim within head | +| W | Prior dim within head | +| S | Next head in same KV group | + +So a single `Merge::S` in an `attn_k.weight` CTU header says "this 64-dim head_k column is the same as the corresponding column of the next head in the KV group, except for a delta encoded in the tail." This is **GQA encoded directly into the codec**, no special-case logic. + +**Cross-layer Merge:** layer L's `attn_q.weight` is often a small perturbation of layer L-1's (especially in deeper models, where layers converge to similar transforms). The reserved header bits 14-15 (R-2) can be reused at *model-weight encoding time only* to signal "this CTU's basin is in the layer above" — a cross-layer pointer that lets a deep model amortise codebooks across layers. + +**Estimated Merge-rate (λ chosen for ≤ 1% perplexity loss):** ~25% across a 7B model, biased heavily toward attention K/V (where GQA replication makes Merge near-free). 
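The plain row/column reading of the four Merge slots can be made concrete with a small neighbour lookup. This is a minimal sketch, assuming a rectangular row-major weight grid; `MergeSlot`, `neighbour`, and `best_merge` are hypothetical names rather than the codec's API, and the sketch ignores decode order (a real decoder can only inherit from cells it has already reconstructed).

```rust
/// Slot indices for the four Merge neighbours, using the row/column
/// reading from the first table above. Hypothetical type for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MergeSlot {
    N = 0, // row r-1, same column
    E = 1, // column c+1, same row
    W = 2, // column c-1, same row
    S = 3, // row r+1, same column
}

/// Returns the (row, col) of the neighbour in `slot`, or None when the
/// cell or its neighbour falls outside the grid.
pub fn neighbour(
    row: usize,
    col: usize,
    rows: usize,
    cols: usize,
    slot: MergeSlot,
) -> Option<(usize, usize)> {
    if row >= rows || col >= cols {
        return None;
    }
    match slot {
        MergeSlot::N => row.checked_sub(1).map(|r| (r, col)),
        MergeSlot::E => (col + 1 < cols).then(|| (row, col + 1)),
        MergeSlot::W => col.checked_sub(1).map(|c| (row, c)),
        MergeSlot::S => (row + 1 < rows).then(|| (row + 1, col)),
    }
}

/// Picks the slot whose neighbour is closest to the cell's own value,
/// provided the absolute difference is within `tol`.
pub fn best_merge(w: &[Vec<f32>], row: usize, col: usize, tol: f32) -> Option<MergeSlot> {
    let rows = w.len();
    let cols = w.first().map_or(0, |r| r.len());
    let here = *w.get(row)?.get(col)?;
    let mut best: Option<(MergeSlot, f32)> = None;
    for slot in [MergeSlot::N, MergeSlot::E, MergeSlot::W, MergeSlot::S] {
        if let Some((r, c)) = neighbour(row, col, rows, cols, slot) {
            let err = (w[r][c] - here).abs();
            if err <= tol && best.map_or(true, |(_, e)| err < e) {
                best = Some((slot, err));
            }
        }
    }
    best.map(|(slot, _)| slot)
}
```

A tensor-family-specific reading such as the GQA table would swap only the body of `neighbour`; the 2-bit slot index on the wire stays the same.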
+ +### 2.3 Delta — small residual from basin + +The classic GGUF Q4_K case: a basin centroid plus a 4-bit delta. PR-X12's Delta mode generalises: + +- Per-CTU basin pointer (12 bits, 4096-entry codebook) +- Per-cell residual (rANS-coded with per-tensor frequency table) + +Crucially, the residual is **rANS-coded with a Gaussian-tail prior** (R-10). GGUF's uniform 4-bit residual wastes ~0.3-0.5 bits per cell because the actual residual distribution is Laplacian/Gaussian, not uniform. PR-X12 closes that gap. + +**Estimated Delta-rate:** ~35% of weights, at an average of 2.5-3.5 bits each (counting basin pointer amortisation + Gaussian-tail rANS residual). Lower than GGUF's uniform 4.5 bpw. + +### 2.4 Escape — outlier, encode full + +For weights that are too extreme to fit any basin (the activation outliers that LLM.int8() and SmoothQuant fight over), encode as Escape + raw f16 value. ~3-5% of weights per layer, but they carry disproportionate information. + +The PR-X12 wire format already supports Escape as the lossy-fallback path (with the codec body warning per M:T new items). For LLM weights, Escape *must be lossless* — no truncation of outliers. This is an additional R-N candidate. + +--- + +## 3. CTU quad-tree on weight matrices + +The CTU partition (M:E-G, R-2) is `Ctu` with leaf sizes ∈ {8, 16, 32, 64}. Applied to an LLM weight matrix: + +**The math:** a 4096 × 4096 attention weight tensor partitions into 64 × 64 = 4096 CTUs of 64×64 cells each, or finer. Tropical-GEMM RDO (R-7) chooses the optimal partition per CTU. + +**Why mixed quantization within a tensor matters:** + +GGUF's Q4_K_M uses *uniform* 4-bit blocks across the whole tensor. But empirically: + +- Output channels with high activation variance want 6-8 bit (Escape-dominant) +- Output channels with low variance want 2-3 bit (Skip-dominant) +- Most channels sit in the middle at 4 bit (Delta-dominant) + +GGUF can't express this — every block in `attn_q.weight` uses the same bit-width. 
PR-X12's RDO partition naturally chooses: a 16×16 block at 6-bit for an outlier-heavy region, a 64×64 block at 2-bit for a near-zero region, all within the same `attn_q.weight` tensor. + +**Concrete impact:** for the few output channels in attention that "carry" the attention sink behaviour (~5% of heads in a typical LLM), PR-X12 keeps them at 8-bit precision while compressing the bulk to 2-3 bit. GGUF would either over-quantize the sinks (causing attention pattern collapse) or over-allocate to all channels. + +**Cross-arch crossover (R-5):** the per-arch DCT crossover applies here too. On AMX-class hardware, the GEMM that consumes the decoded weights wants block-aligned 64×64 input; on Apple Silicon NEON, 32×32 is sometimes better. The CTU partition can be tuned per arch as a build flag — same model file, different optimum partition per target. + +--- + +## 4. The basin codebook for LLM weights + +PR-X12's 4096-entry basin codebook (12-bit fingerprint) is the right size for LLM weight clustering. The training objective: + +```text +Given a flat list of N weight vectors v_i ∈ ℝᵈ + (each v_i = a row or column slice of a tensor at the codebook's granularity) + +Find 4096 centroids c_1 .. c_4096 ∈ ℝᵈ + minimising Σ_i ||v_i - nearest(v_i, {c_k})||² + +This is k-means on weight vectors. Per-tensor, per-layer-family, +or model-global — the codebook policy lives in R-13. 
+``` + +**Granularity choices:** + +| Codebook scope | Codebook entries | Per-model storage | Compression quality | +|---|---|---|---| +| Per-tensor (every weight matrix has its own) | 4096 × n_tensors ≈ 4M entries | ~200 MB | Best, but storage-heavy | +| Per-layer-family (Q+K+V+O share; gate+up+down share) | 4096 × 2 × 32 = 262K entries | ~13 MB | Good balance | +| Per-architecture-family (one codebook for "all attention" of all layers) | 4096 × 4 = 16K entries | ~1 MB | Lower fidelity | +| Model-global (one 4096-entry codebook) | 4096 entries | ~256 KB | Lossy on outlier layers | + +**Federated codebook policy (R-13) ships the per-layer-family codebook with the model file.** This is ~13 MB extra over the raw weights, paid once per model. The codebook is *the model*'s fingerprint — a Llama-3 codebook can't be used to decode a Qwen-3.5 file, but both ship the same PR-X12 binary. + +**Pretrained domain codebook (R-13 PretrainedStatic mode):** a single "LLM-family" codebook trained across many open-weight models could compress *any* LLM, at slightly lower fidelity than per-model codebooks. Useful for: shared model-distribution CDN, federated learning aggregation, or quick prototyping. + +--- + +## 5. Activation-aware RDO (the GPTQ / AWQ trick, unified) + +GPTQ, AWQ, and Hadamard-based quantizers all amount to: "weight the RDO loss by the magnitude of expected activations through this row/column, from a calibration corpus." PR-X12's λ-RDO (A6) supports this natively: + +```text +Standard codec RDO: + minimise D(reconstructed, original) + λ · R(bitstream) + +Activation-aware RDO for LLM weights: + minimise Σ_cells [ |a_c|² · (w_c - w'_c)² ] + λ · R(bitstream) + ↑ activation-magnitude weighting (from calibration corpus) +``` + +The codec body doesn't care — `D` is supplied by the caller (the GGUF-to-PR-X12 transcode tool). For an LLM use case: + +1. Run the model forward on a calibration corpus (~512 samples of natural text) +2. 
Capture per-channel activation magnitudes +3. Pass `|a_c|² ` as the per-cell distortion weight into the codec's RDO step +4. Codec converges to a quantization that preserves high-activation channels + +This is **GPTQ + AWQ + SmoothQuant unified into one substrate**. Currently each is its own ~5 K-LoC codebase. The PR-X12 version is a callable function: `pr_x12_encode_tensor(tensor, activation_weights, λ) -> bitstream`. + +--- + +## 6. Concrete numbers — Qwen 7B compression estimate + +Bottom-up estimate, using Skip/Merge/Delta/Escape rates from §2 and the GGUF baseline: + +| Tensor family | Param count | GGUF Q4_K_M | PR-X12 estimate | +|---|---|---|---| +| `token_embd.weight` + `output.weight` | 1.24 B | 720 MB (4.85 bpw) | 540 MB (3.5 bpw) — Skip-dominant rare-token rows | +| `attn_q.weight` (32 layers) | 538 M | 313 MB | 235 MB (3.5 bpw) — mostly Delta | +| `attn_k.weight` + `attn_v.weight` (32 layers) | 268 M | 156 MB | 78 MB (2.3 bpw) — Merge-dominant via GQA replication | +| `attn_output.weight` (32 layers) | 538 M | 313 MB | 247 MB (3.7 bpw) | +| `ffn_gate.weight` (32 layers) | 1.88 B | 1.09 GB | 750 MB (3.2 bpw) — sparse SwiGLU gating | +| `ffn_up.weight` (32 layers) | 1.88 B | 1.09 GB | 800 MB (3.4 bpw) | +| `ffn_down.weight` (32 layers) | 1.88 B | 1.09 GB | 800 MB (3.4 bpw) | +| `attn_norm.weight` + `ffn_norm.weight` (32 layers) | 262 K | 0.4 MB | 0.05 MB — 95% Skip | +| **Total weights** | **7.60 B** | **4.40 GB (4.85 bpw)** | **3.10 GB (3.42 bpw)** | +| + Federated codebook | — | — | 13 MB | +| **PR-X12 model file** | | **4.40 GB** | **3.12 GB** | + +**Compression ratio: ~29% smaller than GGUF Q4_K_M at equivalent perplexity.** + +For comparison: + +- GGUF Q3_K_M is ~3.3 GB at 3.91 bpw, with perplexity degradation of ~1-2% on Wikitext-103 +- PR-X12 estimate sits at ~3.1 GB at 3.42 bpw with target degradation < 0.5% (sub-Q3_K_M size, sub-Q4_K_M quality) +- GGUF Q2_K is ~2.6 GB at 3.06 bpw with significant perplexity degradation (~5-10%) + +**Where the 
wins come from:** + +1. **Mixed quant within tensor** (§3): saves ~10% over uniform Q4_K_M +2. **Gaussian-tail rANS residual** (R-10): saves ~0.3-0.5 bpw on Delta cells +3. **Cross-head Merge in K/V projections**: saves ~50% on those tensors +4. **Skip-rate at 32% average**: dominant contributor + +The estimate is conservative — real measurements will land between -25% and -35% versus Q4_K_M. + +--- + +## 7. Streaming weight load — decode-during-GEMM + +Currently, llama.cpp / candle / burn load a GGUF file into memory in full, then dequantize per-tensor before the GEMM. PR-X12's wire format enables a different flow: + +```text +Per GEMM operation (e.g., compute attn_q @ x for batch): + + for each output row r in attn_q: + decode CTU bitstream for row r: + - if Skip: weight = basin_centroid (4 ns lookup) + - if Merge: weight = neighbour value already in register + - if Delta: weight = basin_centroid + rANS-decoded residual + - if Escape: weight = raw f16 (rare, ~3-5%) + accumulate: out[r] += weight @ x (immediate, before next row) +``` + +The CTU bitstream is read forward-only (rANS is a streaming codec) and the decoded weights live in L1/L2 cache just long enough to be GEMM'd. **No full-tensor dequantize buffer needed.** For a 4096 × 4096 attention projection, the dequantize buffer would be 32 MB (f16); PR-X12 streams in ~3-4 MB of bitstream, decodes to ~64 KB cache-resident windows, GEMMs each window, drops it. + +**Memory savings:** on a memory-constrained edge device (8 GB RAM), this turns "loads 4 GB model + needs 1 GB dequant scratch" into "loads 3 GB model + needs 64 KB scratch." A 7B model at PR-X12 is genuinely runnable on a phone-class device, where GGUF Q4 is borderline. + +**Latency:** the streaming decode happens in the same loop body as the GEMM accumulate. On a modern arch with VNNI + AMX, the decode cost (~5-10 cycles per cell, branchless via R-1's lookup-table pattern) is hidden by GEMM latency. 
**Estimated overhead: < 5% versus pre-dequantized GEMM.** + +This is the architecture that R-11 (latency assertion) was designed to gate: the decode-during-GEMM path *must* clear within 1.05× of the pre-dequantized baseline, or the streaming win evaporates. + +--- + +## 8. The inference math is unchanged + +Critically: **PR-X12-encoded weights produce the same matmul output as the original f16 weights**, up to the quantization noise floor. The codec does not change: + +- Layer norm formula +- Attention softmax +- SwiGLU activation +- RoPE rotation +- KV cache layout + +Only the **storage format** of the weight tensors changes. The GEMM kernel (`ndarray::hpc::blas_level3::gemm`) gets bf16 or int8 inputs after decode; everything downstream of GEMM is identical. + +This is why PR-X12 + GGUF is a **drop-in replacement**, not a model retrain. Take a Qwen 3.5 7B GGUF file, run `pr_x12_transcode_gguf input.gguf output.prx12`, ship the output. Decode side: candle or burn loads the .prx12 file via a new codec adapter; inference proceeds identically. + +The hard part — and the falsifier — is whether the activation-aware RDO actually produces the same perplexity. Plan G's model-lane (proposed below) is the empirical check. + +--- + +## 9. Bench plan (extends Plan G with a model-weight lane) + +Add to Plan G (per R-4, R-11) a fourth lane: + +| Lane | Source | Pass criterion | +|---|---|---| +| Video | Big Buck Bunny 1080p | ≥ 0.95× x265 ultrafast PSNR @ -0.1 dB (R-4) | +| 3DGS | Mip-NeRF 360 garden scene | ≥ 30× over PLY-trim (R-10) | +| Gradient | ResNet-50 ImageNet SGD logs | Match QSGD compression (HG4) | +| **NEW: LLM weights** | **Qwen 3.5 7B GGUF Q4_K_M** | **≤ 3.2 GB encoded + perplexity Δ ≤ 1.0% on Wikitext-103** | + +**Sub-targets within the LLM lane:** + +1. Transcode time: ≤ 10 minutes on a single SPR socket for a 7B model (offline, one-time) +2. Decode-during-GEMM overhead: ≤ 5% vs pre-dequant baseline (R-11 assertion) +3. 
Streaming memory: decode scratch ≤ 1 MB at any moment (peak) +4. Perplexity preservation: Δ ≤ 1% on Wikitext-103 versus original f16 weights +5. Codebook size: federated codebook ≤ 15 MB per model + +Failing any sub-target makes the LLM lane informational-only; failing all five blocks the LLM lane from claiming the win. + +**Suggested implementation cost:** 2-3 weeks for the transcode tool (Rust, builds on existing `ndarray::hpc::cam_pq::kmeans` + R-1 basis trait). 1-2 weeks for candle integration. 1 week for bench. Total: ~4-6 weeks from PR-X12 codec merge. + +--- + +## 10. Falsifiers + +What kills this path? Listed by likelihood: + +**F-1: Activation-aware RDO doesn't beat GPTQ/AWQ.** If PR-X12's RDO under-performs the hand-tuned per-tensor quantizers, the win evaporates. **Mitigation:** Plan G's perplexity assertion is the check. If λ-RDO is within 0.5% of AWQ on benchmark, ship. If not, the codec stays at uniform-bit quant (still a 5-10% storage win from Gaussian-tail rANS alone) and AWQ-style quantization stays orthogonal. + +**F-2: Streaming decode breaks GEMM dispatch.** The decode-during-GEMM loop has tight register pressure. If the codec decode steals enough registers from the GEMM kernel, throughput drops below the 1.05× threshold. **Mitigation:** R-11 latency CI catches this. Worst case: bench detects, codec falls back to pre-dequant path (lose streaming-memory win, keep storage win). + +**F-3: Federated codebook size grows.** If per-layer-family codebooks need > 30 MB at acceptable fidelity, the overhead vs Q4_K_M's metadata grows substantially. **Mitigation:** R-13's PretrainedStatic mode (single LLM-family codebook) can fall back to ~1 MB at slightly lower fidelity. Tradeoff is exposed at transcode time. + +**F-4: Outliers can't be encoded losslessly.** If Escape mode's lossless f16 fallback is incompatible with the rANS state machine (e.g., needs out-of-band raw bytes), the wire format becomes mixed-stream — bad for streaming decode.
**Mitigation:** reserve a small bypass channel in the framing layer (A8) for raw escapes; the rANS coder handles ~95% of cells, the bypass handles the 5% outliers. This is the same pattern HEVC uses for escape coefficients. + +**F-5: Llama.cpp ecosystem fork.** If PR-X12-encoded weights need a new file extension and new loader code, the GGUF ecosystem (active community, ~5 years of momentum) won't adopt. **Mitigation:** ship a `pr-x12` extension *inside* a GGUF v3 file format, registered as a new quantization type (Q_PRX12). Llama.cpp can add it via a small contributor PR. The codec becomes a GGUF quantization variant, not a replacement file format. + +--- + +## 11. What this lens prescribes for PR-X12 scope + +Concrete implications: + +1. **Do not** widen the codec body to accept "model weights" as a special case. Per R-3, the codec body stays generic. Model-weight encoding is a *consumer* of the codec, not a fork of it. + +2. **Do** ship the codec with the bench harness lane structure that allows new lanes to be added (per R-4). The LLM lane lands post-PR-X12, but the harness must be lane-extensible. + +3. **Do** export the activation-weighted RDO interface explicitly. `pr_x12_encode_tensor(tensor, distortion_weights, lambda)` — `distortion_weights` is `None` for video (uniform weight per pixel), `Some(activation_magnitudes)` for LLM weights. Same function, different param. + +4. **Do** keep R-13's federated codebook policy. The LLM use case is the strongest motivation: per-model codebooks are 13 MB; without R-13, a hard-coded codebook would not work for arbitrary LLMs. + +5. **Reserve** an `EncodingDomain::LLMWeights` discriminant in the codec metadata header (separate from the 16-bit per-CTU header). The codec body doesn't read this — it just stamps the file with a domain tag so decoders know which basin codebook to load. + +6. 
**Bench against AWQ at parity perplexity, not just Q4_K_M.** Q4_K_M is a conservative baseline; AWQ + GPTQ are the actual state of the art. If PR-X12 can match AWQ at smaller storage, the case is strong; if not, ship at "drop-in GGUF replacement" framing only. + +--- + +## 12. The deeper claim + +The four loads in the PR-X12 multi-domain thesis (M:H-1, HG1) are: + +1. Video frames +2. 3D Gaussian splats +3. Attention KV caches +4. Gradient streams + +This doc adds a **fifth load** that the original thesis didn't enumerate: + +5. **Static LLM weight tensors** + +The fifth load is interesting because it's *what GGUF already does, badly*. Every quantized-LLM-deployment problem solved by GGUF — model distribution, edge inference, memory-constrained loading — is *more cleanly* solved by PR-X12. The community has built a parallel codec ecosystem (Q4_K_M, AWQ, GPTQ, EXL2, IQ2_XXS) that converges step-by-step toward what PR-X12 already specifies. + +The economic stake: **every LLM deployment** — Open WebUI, llama.cpp, candle apps, Ollama, LM Studio, vLLM — ships GGUF. Even a 20% storage reduction across that ecosystem is hundreds of GB saved per model release, and millions of dollars in CDN costs per month at the Hugging Face / Replicate scale. + +**PR-X12 inherits the LLM weight compression market by being a strictly more general codec, requiring only a transcode tool and a candle/burn adapter.** No retraining, no new training pipelines, no model-architecture changes. Just a smaller file that produces the same logits. + +--- + +## 13. 
Cross-references + +- **Substrate canon:** `pr-x12-substrate-merged-canon.md` +- **Resolutions:** R-3, R-4, R-5, R-10, R-11, R-13 in `pr-x12-canon-resolutions-delta.md` +- **GEMM lens:** `pr-x12-x265-blasgraph-gemm.md` (the streaming-decode pattern is the same as ME-via-SSD) +- **3DGS lens:** `pr-x12-x266-3dgs-spacetime-upscaling.md` (sibling load #2) +- **WoA orchestration:** `pr-x12-woa-multiarch-orchestration.md` (per-arch dispatch for the decode-during-GEMM kernel) +- **Anti-neural lens:** `pr-x12-anti-neural-lookup-inversion.md` (k-means basin as frozen 1-layer NN — relevant to the codebook training story here) +- **Codec spec:** `pr-x12-codec-x265-design.md` +- **Reading list:** + - GGUF spec: `ggerganov/ggml` repo `docs/gguf.md` + - GPTQ (Frantar et al. 2022) + - AWQ (Lin et al. 2023) + - SmoothQuant (Xiao et al. 2023) + - LLM.int8() (Dettmers et al. 2022) + - IQ2_XXS llama.cpp PR — current "lookup-table quant" closest to PR-X12 shape +- **Adjacent code:** + - `src/hpc/cam_pq.rs` — k-means kernel for basin codebook training + - `src/hpc/quantized.rs` — Int8 GEMM (where decode-during-GEMM would dispatch) + - `src/hpc/blas_level3.rs::gemm` — the inner-loop matmul that consumes decoded weights + - `candle` / `burn` integration points (in their respective repos) + +_Last edit: 2026-05-22._ +_Status: perspective doc; the LLM-weight lane is post-PR-X12 scope (2-3 months after merge)._ diff --git a/.claude/knowledge/pr-x12-substrate-canon-resolutions.md b/.claude/knowledge/pr-x12-substrate-canon-resolutions.md new file mode 100644 index 00000000..5bd633ba --- /dev/null +++ b/.claude/knowledge/pr-x12-substrate-canon-resolutions.md @@ -0,0 +1,1288 @@ +# PR-X12 — Substrate Canon Resolutions + +> Date: 2026-05-22 +> Status: **canon supplement** — resolves the eighteen open items raised in +> the review of `pr-x12-substrate-merged-canon.md` (PR #196). Additive to +> the canon, not a replacement. +> Reads after the canon. Cite from this doc as `R-N` (resolution N). 
+ +--- + +## 0. How to read this doc + +The merged canon (`pr-x12-substrate-merged-canon.md`, master commit +`bc9da4ad`) is the single point of architectural truth. It successfully +fuses session A (`pr-x12-codec-cognitive-substrate-mapping.md`) and +session B (`pr-x12-cross-domain-synergies.md`). What it does NOT yet do +is commit to concrete shapes for the load-bearing pieces. Eighteen items +were raised in review: + +- **§3** — five things the merge merged well (confirmed, one-liners) +- **§4** — four items the merge raised in abstraction but did not commit + (R-1 through R-4 resolutions) +- **§5** — three pieces of detail from session A the merge underrepresented + (R-5 through R-7 restorations) +- **§6** — three pieces of detail from session B the merge underrepresented + (R-8 through R-10 restorations) +- **§7** — three commitments missing from both originals and from the + merge (R-11 through R-13 new specs) + +Then five integration pieces that make the resolutions actionable: + +- **§8** — the canonical contracts (trait signatures for `PredictiveSignal`, + `LinearReduce`, `Basis`, `CurveOrder`, `RdoMetric`) +- **§9** — falsifiability matrix (every claim → criterion → test → pass) +- **§10** — sequencing diagram with named gates +- **§11** — end-state + trajectory (think it from the end) +- **§12** — compaction-preservation contract + +Citation IDs: `R-1` through `R-13` for resolutions. Canon IDs (`M:E-*`, +`A:E-*`, `B:E-*`, `M:H-*`, `M:T-*`) remain stable; this doc adds, does +not renumber. + +Sister docs (read order): + +1. `pr-x12-codec-x265-design.md` — mechanical spec +2. `pr-x12-substrate-merged-canon.md` — architectural fusion (THE canon) +3. **this doc** — resolutions of opens in the canon +4. `pr-x12-codec-cognitive-substrate-mapping.md` — session A archeology +5. `pr-x12-cross-domain-synergies.md` — session B archeology + +--- + +## 1. 
The end state — think it from the end + +Where this lands if every plan ships: + +```text + ┌────────────────────────────────────────┐ + │ $ codec-bench --mode video --input scene.y4m │ + │ $ codec-bench --mode splat --input scene.ply │ + │ $ codec-bench --mode kv-cache --input kv.bin │ + │ $ codec-bench --mode gradient --input grad.lance │ + │ │ + │ → all four emit compressed Lance columns │ + │ → all four meet their threshold (§9) │ + │ → all four share ~1.5 KLoC generic codec body │ + │ → each ships ~200 LoC of trait impl │ + └────────────────────────────────────────┘ +``` + +**Five-category architecture, codec is its own layer:** + +```text +ndarray = hardware (SIMD, Palette, Base17, SpoDistanceMatrices) +ndarray-codec = compression substrate ← extracted via Plan H + (Ctu, LeafCu, PredictiveSignal, LinearReduce, CurveOrder, rANS, RDO) +lance-graph = thinking (NarsTruth, TripleModel, AutocompleteCache) +causal-edge = protocol (CausalEdge64, NarsTables) +p64 = convergence (where ndarray + lance-graph meet) +``` + +**Three plug-points factor everything domain-specific out of the codec** +(per M:E-E + R-1 below): Transform basis, Curve order, Escape payload. +Anything domain-specific that does not fit one of these three is a sign +that the abstraction is wrong, not that the codec needs growth. + +**Single binary `codec-bench`** is the falsifiability proof of M:H-NEW-1. +The binary, not an argument, demonstrates HG1 / HG6 / M:H-NEW-1. Plan G +(§5 of canon, §10 here) builds it before A7 rANS ships. + +This is the end state. The trajectory in §11 is how we get there. + +--- + +## 2. The trajectory + +```text +T+0 weeks Phase 0 starts — substrate gates + T+0 : Plan H (extract ndarray-codec, 3 days, parallel) + T+0 : Plan I (PredictiveSignal trait, 3 days, parallel) + T+0 : Plan A4-design (Transform trait shape, 1 day, parallel) + T+0 : Plan G (multi-domain bench, 2 weeks, the gate) + +T+2 weeks Phase 0 closes. Plan G binary exists, runs on 4 inputs. 
+ T+2 : Plan A7 starts (1.5 weeks, CRITICAL PATH) + +T+3.5 weeks Plan A7 lands. Compression-ratio thresholds testable. + T+3.5 : Plan A4-impl (1 week, parallel) + T+3.5 : Plan B / A3-inter (1 week, parallel) + T+3.5 : Plan C / EWA SYRK-batched (1 week, parallel) + T+3.5 : Plan A6 (1 week, parallel) + T+3.5 : Plan A8 (1 week, parallel) + +T+4.5 weeks Phase 1 closes. Codec mechanism complete. + T+4.5 : Plan E (3DGS coefficient codec, 3 weeks, 2 workers) + T+4.5 : Plan D (attention codec, 2 weeks, 2 workers, can run parallel) + T+6.5 : Plan F (federated SGD, 4 weeks, 2 workers, after Plan D) + +T+10.5 weeks All four consumer integrations land. + T+10.5 : Plan G thresholds re-run against all four loads. + T+10.5 : M:H-1 through M:H-9 all unlocked (or falsified — see §9). +``` + +Critical path: **Plan G → Plan A7**. Everything else parallelises after +Plan A7. Total ~10.5 weeks of wall-clock work; ~2 workers steady-state +through Phases 0 and 1, ramping to 6 workers in Phase 2. + +--- + +## 3. What the merge merged well (preserved) + +Five pieces of synthesis that genuinely emerge from putting A and B +side by side. None appear in either original. Confirmed as canon: + +- **M:E-A** — `LinearReduce` unifies α-composite / rANS / + sum-reduce / softmax as the *same* matrix-vector reduce. A's E-4 + (transform = optimizer) and B's E9 (mode-decide + reduce = same + kernel) collapse into one trait. +- **M:E-D** — Fifth crate category `ndarray-codec`. Both originals + saw the dep-cycle; neither named the resolution. +- **M:E-G** — `Ctu` reconciles 64×64 (cognitive) and + 16×16 (splat) at the type level. A treated as invariant; B as debt; + merge factors via const-generic. +- **M:E-I** — Trait isomorphism (`PredictiveSignal`) over code-folding + for `splat.rs` vs `Fingerprint`. Shared interface, not shared types. +- **M:E-F** — A7-first critical path BUT commit A4-design trait shape + first (1 day). Resolves A-vs-B sequencing dispute correctly. 
+ +These five are the canon's load-bearing pieces. R-1 through R-13 +below resolve what these five did not yet commit. + +--- + +## 4. Resolutions: items the merge raised but did not commit + +### R-1 — `LinearReduce` and `Basis` trait signatures + +**Problem.** M:E-A and M:H-NEW-2 invoke `trait LinearReduce` as +the unifying surface but the canon never gives the signature. Without +it, Plan A7 is written against an unknown shape. + +**Resolution.** Commit the trait pair at Plan A4-design time (1 day, +Phase 0). The shape: + +```rust +/// A basis for a linear reduction. Implementors define a small dense +/// (or sparse) matrix and how to apply it to a `[T; dim()]` input. +/// +/// Concrete impls land in their natural homes: +/// - `IdentityBasis` in `ndarray-codec::basis` +/// - `DctIIBasis` in `ndarray::hpc::fft` +/// - `AdamPrecondBasis` in `burn-codec` (consumer) +/// - `KFACBlockBasis` in `burn-codec` (consumer) +/// - `ShSpectralBasis` in `ndarray::hpc::splat3d` +pub trait Basis<T> { + /// Dimension of the basis (square: dim()×dim()). + fn dim(&self) -> usize; + + /// Apply the basis: `dst = B · src`. Caller pre-allocates dst. + /// Length contract: `src.len() == dst.len() == self.dim()`. + fn apply(&self, src: &[T], dst: &mut [T]); + + /// Inverse: `dst = B⁻¹ · src`. Same length contract. + /// For orthogonal bases (DCT, Hadamard) this is `Bᵀ · src`. + fn invert(&self, src: &[T], dst: &mut [T]); +} + +/// A linear reduction over a sequence of symbols against a basis, +/// producing a single output. The kind of reduction depends on impl: +/// - alpha-composite (3DGS rasterizer): RGB blending +/// - rANS-encode (codec A7): state-machine accumulation +/// - sum-reduce (SGD all-reduce): cross-worker summation +/// - softmax (attention): exp-normalize-multiply +pub trait LinearReduce { + type Symbol: Copy; + type Output; + type Basis: Basis<Self::Symbol>; + + /// Reduce a single sequence of symbols against the basis.
+ fn reduce(&self, src: &[Self::Symbol], basis: &Self::Basis) -> Self::Output; + + /// Batched reduction: each row is one sequence. Returns one output + /// per row. Implementors may dispatch to BLAS GEMM for large batch. + fn reduce_batch( + &self, + src: &[&[Self::Symbol]], + basis: &Self::Basis, + ) -> Vec<Self::Output>; +} +``` + +**Why two traits, not one.** The basis is data; the reduction is logic. +Same basis (e.g. DCT-II 8×8) is used by both the transform path (codec +A4) and the EWA splat path (matrix-vector product). Separating lets a +basis ship once and serve many reductions. + +**Why no const generic on `Basis::dim()`.** The codec needs to handle +4×4, 8×8, 16×16, 32×32 DCT-II blocks at runtime per CTU split depth. +A const-generic basis would force depth at the type level — wrong +factoring. The compile-time win comes from monomorphising over the +*reduction* type (which is single per consumer); the basis dim is a +runtime knob. + +**Falsifies.** If a consumer needs to subclass `LinearReduce` to make +their reduction work (e.g. splat-rasterizer demands access to depth +buffer), the trait factoring is wrong and Plan A7 will accumulate +domain-specific code. Plan G's bench harness is the gate that catches +this — it runs all four reductions through the same trait. + +**Cite as R-1 in PR descriptions touching A4 or A7.** + +--- + +### R-2 — Bits 14-15 of the leaf header: cross-load contention + +**Problem.** M:E-J claims bits 14-15 of the 16-bit leaf header for +cognitive Pearl-rung metadata (`{Observation, Intervention, +Counterfactual, inter-tier-link}`). A:E-15 had reserved the same bits for +inter-tier reference. The canon does not say what video / splat / +gradient consumers do with these bits, and Plan E (3DGS) ships in +Phase 2 before the reservation is pinned in Plan A8. + +**Resolution.** Split the two reserved bits asymmetrically.
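To make the asymmetric split concrete, here is a minimal Rust sketch of how a decoder might mask the two bits. It assumes bit 15 is the universal inter-tier flag and bit 14 is routed per `ConsumerProfile`, as this resolution describes. The names `INTER_TIER_BIT`, `CONSUMER_BIT`, `Bit14Meaning`, `has_inter_tier_ref` and `decode_bit14` are hypothetical and not part of the codec, and the enum makes no claim about the 2-bit wire encoding of the profile tag.

```rust
/// Bit 15: universal "has inter-tier reference" flag (same for every consumer).
pub const INTER_TIER_BIT: u16 = 1 << 15;
/// Bit 14: consumer-typed; its meaning depends on the frame's profile.
pub const CONSUMER_BIT: u16 = 1 << 14;

/// Hypothetical stand-in for the per-frame profile tag (wire values not modelled).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConsumerProfile {
    Cognitive,
    Video,
    Splat,
    Gradient,
}

/// What bit 14 means once it has been routed through the profile.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Bit14Meaning {
    PearlRungHigh(bool),
    Reserved,
    LodCascadeSource(bool),
    ShardParity(bool),
}

/// Profile-independent check: does this leaf refer to a parent-tier leaf?
pub fn has_inter_tier_ref(header: u16) -> bool {
    (header & INTER_TIER_BIT) != 0
}

/// Profile-specific demultiplexing of bit 14.
pub fn decode_bit14(header: u16, profile: ConsumerProfile) -> Bit14Meaning {
    let bit = (header & CONSUMER_BIT) != 0;
    match profile {
        ConsumerProfile::Cognitive => Bit14Meaning::PearlRungHigh(bit),
        // Video treats bit 14 as reserved, so its value is not surfaced.
        ConsumerProfile::Video => Bit14Meaning::Reserved,
        ConsumerProfile::Splat => Bit14Meaning::LodCascadeSource(bit),
        ConsumerProfile::Gradient => Bit14Meaning::ShardParity(bit),
    }
}
```

Keeping `has_inter_tier_ref` free of any profile argument mirrors the intent that bit 15 is read identically by all four consumers.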
+ +```text +bit 15 ── UNIVERSAL: "has inter-tier reference" + 0 = leaf is self-contained + 1 = leaf refers to a parent-tier `LeafCu` (A3-inter) + All four consumers respect this bit identically. + +bit 14 ── CONSUMER-TYPED: semantic owned by `ConsumerProfile` + cognitive: bit 14 = Pearl rung high bit + (combined with mode bits 12-13 if rung 4 + wanted; today rungs 1-3 + reserved = 2-bit + encoding using just bit 14) + video : bit 14 = 0 (reserved) + splat : bit 14 = LOD-cascade-source flag + gradient : bit 14 = worker-shard parity (for FRC) +``` + +**Frame header carries the `ConsumerProfile` tag** (Plan A8). 2-bit +field at frame boundary. Decoders route bit-14 interpretation per +profile. Cognitive consumer gets the Pearl-rung high bit; others +reuse bit 14 for their own semantic without protocol break. + +**Why not put causal metadata in the frame header instead.** Per-leaf +granularity matters: causal direction can change per cell in a +cognitive scene, but profile is per-frame. Bit 14 must be leaf-local. + +**Why not consume both bits per profile.** Bit 15 must stay universal +because A3-inter (cross-tier reference) is generic across consumers — +the LOD cascade applies to all four loads. + +**Plan A8 implementation note.** The 2-bit `ConsumerProfile` lives in +the frame header alongside the per-frame basin codebook ref + rANS +frequency table. Decoders mask bit 14 of every leaf header through a +profile-specific demultiplexer before exposing to the consumer. + +**Cite as R-2 in A3-inter and Plan A8 PR descriptions.** + +--- + +### R-3 — M:H-NEW-2 LoC budget: actual current count + commitment + +**Problem.** M:H-NEW-2 claims `<1.5 KLoC generic codec glue + <200 LoC +per domain consumer`. The canon does not state current LoC nor pin +the budget envelope. + +**Resolution.** Measure now, commit the budget envelope, audit per PR. 
+ +**Current LoC on master commit `bc9da4ad`** (post PR #195 + PR #196): + +| File | Total LoC | Approximate breakdown | +|------|-----------|-----------------------| +| `src/hpc/codec/ctu.rs` | 771 | partition machinery + LeafCu types | +| `src/hpc/codec/mode.rs` | 518 | bit-pack/unpack helpers | +| `src/hpc/codec/predict.rs` | 511 | intra-prediction decision tree | +| `src/hpc/codec/mod.rs` | 38 | re-exports | +| **Total** | **1838** | with tests + doctests + comments | + +Of the 1838 total, my read of the files: **~600 lines is non-test, +non-doc-comment generic code**, ~800 lines is inline tests, ~400 lines +is doc-comment / doctest blocks, ~38 lines is mod.rs glue. + +**Generic-code LoC currently ~600.** M:H-NEW-2's `<1.5 KLoC generic +glue` budget allows another ~900 lines for A4 (transform), A6 (RDO), +A7 (rANS), A8 (stream), A3-inter (cross-tier). + +**Per-sub-card LoC envelope (committed):** + +| Sub-card | Generic-glue LoC envelope | Rationale | +|----------|---------------------------|-----------| +| A4 transform | ≤200 | DCT-II + Identity + Transform trait | +| A6 RDO | ≤150 | λ-RDO + RdoMetric trait | +| A7 rANS | ≤300 | encoder + decoder + per-frame freq table | +| A8 stream | ≤200 | framing + ConsumerProfile demux (R-2) | +| A3-inter | ≤100 | extend IntraContext with parent-tier slot | +| **Total budget** | **≤950** | ~50 LoC over the ~900-line headroom above | + +**Per-consumer LoC envelope (committed):** + +| Consumer | Generic-glue LoC envelope | What ships | +|----------|---------------------------|------------| +| `splat3d::codec` (Plan E) | ≤200 | `impl PredictiveSignal for GaussianSplat` + Morton `CurveOrder` | +| `attention-codec` (Plan D) | ≤200 | `impl PredictiveSignal for AttentionSlot` + token-seq curve | +| `grad-codec` (Plan F) | ≤200 | `impl PredictiveSignal for GradientWeight` + layer-seq curve | +| `video` (Plan G consumer side) | ≤200 | `impl PredictiveSignal for VideoCell` + raster curve | +| **Per-consumer total** | **≤800** | sum across four
consumers | + +**Audit rule.** Every PR introducing or modifying generic-codec code +must include a one-line generic-LoC delta in the body. If the cumulative +delta exceeds the envelope, the PR escalates to architectural review +(not a CR-style nit; a real "is the abstraction wrong?" question). + +**Falsifies M:H-NEW-2 if.** Generic-glue LoC exceeds 1500 after A4-A8 +land + at least one consumer integration. That's the falsifiability +condition; tracked in Plan G's metrics report. + +**Cite as R-3 in any PR body modifying `src/hpc/codec/`.** + +--- + +### R-4 — Plan G falsifiability thresholds + +**Problem.** Plan G ships "a single binary that ingests video / 3DGS / +KV cache / gradient stream and emits compressed Lance columns + ratio ++ reconstruction error". The canon does not name a pass threshold per +load. + +**Resolution.** Commit four threshold pairs (compression ratio + quality +floor). Failure to clear any threshold blocks the corresponding consumer +PR landing. + +| Load | Reference baseline | Compression target | Quality floor | +|------|-------------------|--------------------|--------------| +| **Video** | x265 `--preset ultrafast` at CRF 23 on Big Buck Bunny 1080p | ≥0.95× reference ratio | PSNR within ±0.1 dB of reference | +| **3DGS** | Inria stock PLY-trim on Mip-NeRF 360 (garden scene) | ≥30× over PLY-trim raw | SSIM ≥ ref − 0.005 at same SH-order | +| **KV cache** | FP16 raw cache, Llama-3-8B-Instruct, 64K context, RULER benchmark | ≥4× over raw FP16 | downstream RULER score loss ≤0.5 % | +| **Gradient** | BERT-large fine-tune on GLUE-MNLI, signSGD baseline | ≥8× over signSGD raw | final validation-loss delta ≤0.5 % | + +**Three-way pass criterion** per load: + +1. **Ratio threshold cleared** — measured during Plan G run +2. **Quality floor cleared** — measured during reconstruction +3. 
**Per-consumer LoC envelope respected** — per R-3 audit + +All three must pass for the consumer's holy-grail claim to count as +demonstrated rather than asserted. + +**Sub-threshold = blocker.** If any of (ratio, quality, LoC) fails for +a load, the corresponding consumer plan (D / E / F / video) cannot +claim "complete". The merged canon's M:H-1 through M:H-9 are then +provably partial; only the cleared loads count. + +**Why these thresholds and not stricter.** Conservative initial bars: +- Video at parity with x265 ultrafast is meaningful (PR-X12 is supposed + to *generalise* x265, not beat it at its specialty) +- 30× over Inria PLY-trim is the floor for "this changes 3DGS streaming" +- 4× KV-cache compression at <0.5% accuracy = passes the smell test + against StreamingLLM / H2O / SnapKV +- 8× gradient over signSGD = roughly the rANS theoretical floor for + heavy-tail-distributed gradients + +**Stretch targets** (recorded separately, not blocking): +- Video at 1.5× x265 ultrafast at same PSNR (would justify HG1 strongly) +- 3DGS at sub-1-bit/Gaussian (M:H-6 / B:HG2 — see R-10 for math) +- KV cache at 8× (matches the FlashAttention-3 ceiling) +- Gradient at 16× (peer-reviewed federated-SGD upper bound) + +**Cite as R-4 in Plan G's PR description; the binary's `--threshold` +flag must enforce all four pass criteria.** + +--- + +## 5. Restored detail from session A + +### R-5 — DCT-II vs GEMM crossover at 64 blocks (from A:§5.3) + +**Problem.** The merge punts to Plan A4-impl without preserving the +operational decision rule for transform dispatch. + +**Resolution.** Pin the crossover number in Plan A4-impl's spec. 
+ +**Decision rule for A4 transform dispatch:** + +```text +N = number of contiguous transform blocks to apply + +if N < 64: per-block butterfly path + ~80 ops/block for 32×32 DCT-II via Loeffler/Lengwehasatit + Fits L1 trivially; no batching cost + +if N >= 64: batched GEMM path + ~32K ops/block (matrix form) but 256 blocks/cycle in AMX bf16 + ~128 KB working-set, fits L1 + Amortises hardware fusion + reduces dispatch overhead + +Crossover empirically at ~64 blocks on Sapphire Rapids; calibrate +per architecture during A4-impl. +``` + +**Why crossover at 64.** AMX TDPBF16PS does one 16×16 BF16 tile per +cycle. 64 blocks at 32×32 → 256 tile operations → ~256 cycles for +batched GEMM. The per-block butterfly at 80 ops/block × 64 blocks = +5120 ops, which at ~4 IPC = 1280 cycles. Crossover is approximate; +real measurement during A4-impl pins per-arch. + +**Per-architecture override matrix (Plan A4-impl deliverable):** + +| Architecture | Per-block path | Crossover N | Batched path | +|--------------|----------------|-------------|--------------| +| Sapphire Rapids (AMX-BF16) | Loeffler 1D + transpose | ~64 | AMX TDPBF16PS via `bf16_tile_gemm` | +| Skylake-X / Ice Lake (AVX-512F) | Loeffler 1D + transpose | ~32 | AVX-512 ZMM batched DCT | +| Zen 4 (AVX-512) | Loeffler 1D + transpose | ~96 | AVX-512 ZMM (no AMX) | +| Apple Silicon (NEON) | Loeffler 1D | ~256 | NEON 4×4 GEMM via `bf16_tile_gemm` NEON stub | + +**Cite as R-5 in A4-impl PR descriptions.** + +--- + +### R-6 — SSD reformulation for VNNI block-match ME (from A:E-7) + +**Problem.** Merge cites "Block-matched ME via i8gemm" without the +SSD reformulation math. That math is what *proves* ME goes through +BLAS at all; without it the BLAS-synergy claim is decorative. + +**Resolution.** Restore the math and the speedup citation. + +**SAD (HEVC native) — not a GEMM:** + +```text +SAD(A, B) = Σ_{ij} |A_{ij} - B_{ij}| +``` + +The absolute-value inside the sum has no matrix shape. 
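The contrast with a squared-difference cost can be checked numerically. The sketch below is plain scalar Rust, not the VNNI kernel: it assumes `i8` pixels with `i32` accumulation purely for illustration, and every function name is hypothetical. It shows that squaring the difference, unlike taking its absolute value, expands into two norms plus a dot-product cross term, and the cross term is the part that a GEMV or GEMM can batch across candidates.

```rust
/// Sum of absolute differences; the |.| prevents any matrix factorisation.
/// Slices are assumed to have equal length.
pub fn sad(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| (x as i32 - y as i32).abs()).sum()
}

/// Plain dot product; stands in for the GEMV/GEMM cross term.
pub fn dot(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

/// Sum of squared differences, computed element by element.
pub fn ssd_direct(a: &[i8], b: &[i8]) -> i32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = x as i32 - y as i32;
            d * d
        })
        .sum()
}

/// The same quantity via the expansion ||a||^2 - 2(a.b) + ||b||^2.
pub fn ssd_expanded(a: &[i8], b: &[i8]) -> i32 {
    dot(a, a) - 2 * dot(a, b) + dot(b, b)
}

/// Score every candidate against one reference block.
/// ||reference||^2 is computed once; the per-candidate dot products
/// together form one matrix-vector product.
pub fn ssd_search(candidates: &[Vec<i8>], reference: &[i8]) -> Vec<i32> {
    let ref_norm = dot(reference, reference);
    candidates
        .iter()
        .map(|c| dot(c, c) - 2 * dot(c, reference) + ref_norm)
        .collect()
}
```

In a real kernel the candidate norms would also be cached per search window, so only the cross term is recomputed per reference block.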
+ +**SSD (PR-X12 reformulation) — has a GEMM:** + +```text +SSD(A, B) = Σ_{ij} (A_{ij} - B_{ij})² + = Σ A_{ij}² - 2·Σ A_{ij}·B_{ij} + Σ B_{ij}² + = ||A||² - 2·(A·B) + ||B||² + ▲ + │ + └── this term IS a GEMM +``` + +**For N motion-vector candidates** at one reference block: + +```text +Candidates A_1, A_2, ..., A_N each 16×16 = 256 pixels = 256-d vector +Reference B 16×16 = 256-d vector + +Middle term: A_batch @ B (N×256) @ (256×1) = N×1 + one GEMV; or for batched ME + over multiple reference blocks, + N×K matrix. + +||A_i||² precomputed once per candidate window +||B||² precomputed once per reference +``` + +**VNNI VPDPBUSD throughput:** 64 i8·i8 → i32 dot-product ops per cycle +on Cascade Lake+ . One 256-element dot product = 4 VPDPBUSD ops = ~4 +cycles. Vs hand-tuned SAD via VPSADBW: ~8 cycles per 16-pixel row, so +~128 cycles per 16×16 SAD. **Speedup: ~32× to ~50× depending on +batch dispatch.** + +**Implication for PR-X12 E-7 (block-matched ME via i8gemm):** ME path +in A4 or A5 ships as a `batched_ssd_search` primitive in `ndarray::hpc:: +blas_level2` that downstream consumers (video, splat scene flow) call +into. **Not a codec-specific function** — landing in BLAS L2 keeps the +factoring clean (codec uses the math; BLAS owns the math). + +**Cite as R-6 in any ME-path or splat scene-flow PR description.** + +--- + +### R-7 — CTU partition as tropical-GEMM (from A:§13.3) + +**Problem.** Merge mentions "tropical-GEMM" in §11 Phase 3 but drops +the `O(4^d) → O(d²)` complexity bound. That bound is the architectural +justification for the `lance-graph::blasgraph` dependency. + +**Resolution.** Restore the complexity argument and pin the algorithm. 
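For readers new to the term, a tropical (min, +) relaxation replaces multiply with add and add with min. The following is a minimal dense sketch in plain Rust, assuming a small square weight matrix that uses `u32::MAX` for "no edge". It does not use or represent the `lance-graph::blasgraph` API, and the names `INF`, `min_plus_step` and `relax` are hypothetical.

```rust
/// Marker for "no edge" / "not yet reachable".
pub const INF: u32 = u32::MAX;

/// One (min, +) relaxation step over a dense weight matrix:
/// out[j] = min(dist[j], min over i of dist[i] + w[i][j]).
/// Reads only the old `dist`, so each call extends paths by at most one edge.
pub fn min_plus_step(dist: &[u32], w: &[Vec<u32>]) -> Vec<u32> {
    let n = dist.len();
    let mut out = dist.to_vec();
    for i in 0..n {
        if dist[i] == INF {
            continue; // node i is not reachable yet
        }
        for j in 0..n {
            if w[i][j] == INF {
                continue; // no edge i -> j
            }
            let cand = dist[i].saturating_add(w[i][j]);
            if cand < out[j] {
                out[j] = cand;
            }
        }
    }
    out
}

/// Cheapest cost from `source` to every node after `iters` relaxation steps.
pub fn relax(source: usize, w: &[Vec<u32>], iters: usize) -> Vec<u32> {
    let mut dist = vec![INF; w.len()];
    dist[source] = 0;
    for _ in 0..iters {
        dist = min_plus_step(&dist, w);
    }
    dist
}
```

Repeating the step `d` times settles every node within `d` edges of the source, which is why the iteration count in a partition solver would be tied to the split depth.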
+ +**HEVC's recursive partition RDO:** + +```text +For each CTU at depth d: + for each of 4 children: + recursive RDO at depth d+1 + combine children's costs + +Time: O(4^d) where d = max split depth (4 in PR-X12, giving 256 nodes +worst case per CTU) +``` + +**Tropical-semiring formulation (R-7 commitment):** + +```text +1. Represent the 85-node tree as a DAG (parent → child edges). +2. Edge weights W[parent, child] = ΔRDO cost of choosing child. +3. Compute shortest-path costs to every node via matrix relaxation: + + D ← min(D, D + W) ← tropical-GEMM iteration + + Repeat for d iterations where d = depth. +4. Optimal partition = argmin_n D[root, n] for n in leaf nodes. + +Time: O(d² × |nodes|) using batched tropical-GEMM on `lance-graph:: +blasgraph`. For d=4, |nodes|=85: O(16 × 85) = O(1360) ops per CTU. +Vs. O(4^4 × |nodes|) = O(21,760) ops for the naive recursive RDO. +``` + +**Speedup: ~16×.** For a 4K frame at ~132K CTUs, this is the difference +between ~4 ms and ~64 ms per frame just for partition RDO. At 60 fps, +that's the difference between fitting and missing the latency budget. + +**Why this needs `lance-graph::blasgraph`:** Standard BLAS GEMM uses +(× , +) semiring. Tropical uses (+ , min) semiring. blasgraph already +ships tropical-GEMM kernels. No new code in ndarray; cross-repo dep +from ndarray-codec → lance-graph::blasgraph (after Plan H extraction, +this is dep-allowed because ndarray-codec is a sibling, not the bottom). + +**Plan A6 RDO (1 week) ships this.** The λ-RDO knob (per A:§10.3) and +the tropical-GEMM partition solver are the same kernel: λ scales the +edge weights, the relaxation computes the optimal mode tree. + +**Cite as R-7 in Plan A6 PR description; required reading for anyone +touching `RdoConfig` or `predict_intra` policy.** + +--- + +## 6. 
Restored detail from session B + +### R-8 — Plan G framing: confidence-degradation gate + +**Problem.** Merge promoted my B:D-STACK-13 (no multi-domain bench +harness) to Plan G + M:E-H but lost the rationale for *why* it goes +in Phase 0 vs Phase 1. + +**Resolution.** Make the framing explicit in canon and in this doc. + +**46 debt items across A's T-1..T-23 and B's D-CODEC-1..10 + +D-STACK-1..13. 45 of them degrade either performance or correctness:** + +- A:T-1, T-2, T-7: correctness (already-fixed CodeRabbit findings) or + performance (SIMD-batched encode) +- B:D-CODEC-1..10: correctness (cross-tier, RDO, stream framing) or + performance (no SIMD batch) +- B:D-STACK-1..12: performance (block-size mismatch, SIMD lookup) or + correctness (sacred file, mandatory AVX-512) + +**One debt item — B:D-STACK-13 — degrades *confidence*:** + +> Without a single-binary four-loads benchmark, the entire architectural +> claim is unproven. Every other debt item degrades performance or +> correctness; this one degrades **confidence**. (B's original framing.) + +**Implication for sequencing.** Performance/correctness debt is +incremental and recoverable; confidence debt is foundational and +self-reinforcing. A single performance regression makes the codec +slow; a single confidence gap makes every other resolution +unverifiable. Plan G must precede A7 because: + +1. If A7's trait shape is wrong, fixing it after A7 ships is 4-8x + the cost of getting it right under bench pressure +2. If the architectural claim is wrong, no amount of A7 perf work + makes it right +3. "Two weeks of bench-harness work front-loaded saves six months of + trait-shape rework" — original B framing, preserved. + +**Plan G is the falsifiability gate.** Without it, M:H-1 through +M:H-9 are claims. With it, they are demonstrably true or +demonstrably false against the R-4 thresholds.
+ +**Cite as R-8 in Plan G's PR body; the framing belongs in the body, +not buried in commit messages.** + +--- + +### R-9 — `MergeDir` is topology-FREE, not just topology-generic + +**Problem.** Merge folds B:E1 into M:E-B (`trait CurveOrder`) but +weakens the claim. M:E-B says "different curve, same kernel" — implies +a curve still exists at the codec layer. B:E1's stronger claim: the +4-way alphabet has *no spatial semantics at all* at the codec layer. + +**Resolution.** Pin the topology-free contract on `PredictiveSignal`. + +**The codec layer sees neighbours as `(slot_0, slot_1, slot_2, slot_3)`. +Period.** No `MergeDir::North/East/West/South` semantic labels exist +inside the codec. Consumers attach semantic labels *outside* the codec +boundary. + +**`PredictiveSignal::neighbours` contract:** + +```rust +pub trait PredictiveSignal { + /// Returns the 4 neighbour slots in implementation-defined order. + /// The codec NEVER interprets "slot 0" as "north" or any direction. + /// + /// Implementor semantic: + /// - cognitive: slot 0 = N, slot 1 = E, slot 2 = W, slot 3 = S + /// - splat: slot 0 = prev-Morton, slot 1 = next-Morton, + /// slot 2 = parent-LOD, slot 3 = child-LOD + /// - attention: slot 0 = prev-token, slot 1 = next-token, + /// slot 2 = prev-head, slot 3 = next-head + /// - gradient: slot 0 = prev-iter, slot 1 = next-iter, + /// slot 2 = prev-layer, slot 3 = next-layer + /// + /// The codec writes `MergeDir = slot index (0..=3)`. Consumers + /// reinterpret on decode. No spatial semantic crosses the boundary. + fn neighbours(&self) -> [Option<Self::NeighbourRef<'_>>; 4]; + + type NeighbourRef<'a> where Self: 'a; +} +``` + +**Implication for Plan I (PredictiveSignal trait, 3 days, Phase 0).** + +- The codec body never has "`if dir == North { ...
}`" anywhere +- The 4-slot neighbour array is treated as an opaque categorical +- `MergeDir` enum becomes a *consumer-side* name for slot indices, + exposed via `mode.rs::pack_merge_dir(MergeDir) -> u8` but never used + in the predict / RDO / stream / rANS paths + +**Why this is stronger than M:E-B.** `CurveOrder` says "different curve, +same kernel" — the curve is an attribute of the consumer's data layout. +Topology-free goes further: even *with* a curve, the codec doesn't see +it. The curve exists only in `nearest_basin` resolution (consumer-side) +and `escape_vector_decode` (consumer-side). + +**Falsifies if.** Any codec-body code references slot 0 / 1 / 2 / 3 by +semantic name (north / east / etc.). The grep for that pattern is the +audit. Currently `predict.rs` does this in tests but never in non-test code; the +production path is already topology-free. Keep it that way through A6 / +A7 / A8. + +**Cite as R-9 in Plan I PR description and in any future codec-body PR +that touches `predict_intra`.** + +--- + +### R-10 — Sub-1-bit/Gaussian math breakdown (from B:HG2) + +**Problem.** B:HG2 / M:H-6 claim sub-1-bit/Gaussian 3DGS compression. +Neither my original nor the merge back-of-envelopes this. The claim +floats without justification. + +**Resolution.** Commit the factor breakdown; mark sub-1-bit as +*stretch*, ~4 bits/Gaussian as the *floor* (R-4 quality floor). + +**Stock 3DGS-PLY baseline (Inria trim):** ~50 bytes/Gaussian. + +**Factor 1: k-means palette mode coding (≈20×)** + +Most Gaussians in a trained scene cluster around a few hundred +"archetype" (color, scale, opacity) tuples. 
After k-means basin +assignment + Skip-heavy mode coding (flat regions all Skip): + +- Stock: 50 bytes/Gaussian = 400 bits +- After mode coding: ~20 bits/Gaussian average (Skip=16, Merge=24, + Delta=24, Escape=48; with 60% Skip, 20% Merge, 15% Delta, 5% Escape): + +```text +0.60 × 16 + 0.20 × 24 + 0.15 × 24 + 0.05 × 48 = 9.6 + 4.8 + 3.6 + 2.4 = 20.4 bits +``` + +After this factor: **~20 bits/Gaussian = 2.5 bytes/Gaussian = 20× over PLY.** + +**Factor 2: rANS entropy coding (≈3×)** + +Mode-distribution is heavy-tailed (60% Skip, 20% Merge, etc.). rANS +entropy of that distribution: + +```text +H = -(0.60 log₂ 0.60 + 0.20 log₂ 0.20 + 0.15 log₂ 0.15 + 0.05 log₂ 0.05) + = -(0.60 × -0.737 + 0.20 × -2.322 + 0.15 × -2.737 + 0.05 × -4.322) + = 0.442 + 0.464 + 0.411 + 0.216 + = 1.533 bits per mode tag +``` + +Versus 2 bits flat for the mode tag. Savings on the mode field: 2 → 1.5 bits. +Savings on the basin field (heavy-tail): 12 → ~6 bits. Savings on the +8-bit delta (also heavy-tail): 8 → ~5 bits. + +Per-Gaussian average after rANS: ~7 bits. + +After this factor: **~7 bits/Gaussian = ~2.9× over factor 1 = ~57× over PLY.** + +**Factor 3: SH-residual cross-LOD prediction (≈2×)** + +L=2 and L=3 SH coefficients are highly predictable from L=0 and L=1. +A linear basis (R-1's `Basis`) for SH spectral prediction reduces +L=2/L=3 residuals to near-zero in flat regions. Skip-mode dominates +SH ≥ L=2 coefficients in trained scenes. + +Per-Gaussian average after SH cross-prediction: ~4 bits. + +After this factor: **~4 bits/Gaussian = ~100× over PLY.** + +**Where the stretch comes from (sub-1-bit):** + +- **Factor 4a (≈2×)**: Per-asset codebook training (offline). Today + the basin codebook builds per-frame. For 3DGS, a single trained + scene = one asset = one codebook. Offline-trained codebooks + eliminate per-frame codebook overhead in the wire format. Gets to + ~2 bits/Gaussian. +- **Factor 4b (≈2×)**: Higher-order rANS context modeling (CABAC-style + or tiny-transformer per A:E-9). 
Per-mode-given-neighbour-mode + probabilities are far more concentrated than per-mode marginals. + Gets to ~1 bit/Gaussian. +- **Factor 4c (≈2×)**: Inter-frame coding for video-of-3DGS scenes + (Plan E2, post-MVP). Per-frame delta from previous frame's + reconstruction. Gets to ~0.5 bit/Gaussian. + +**Honest near-term target: ~4 bits/Gaussian (factor 1+2+3).** That's +**100× over PLY trim, ~3.3× over the R-4 floor of 30×.** + +**Stretch target: ~1 bit/Gaussian.** Requires factor 4a (offline +codebook training, ~1 week) + 4b (CABAC-style context, ~2 weeks) = +3 weeks beyond Plan E baseline. + +**Sub-1-bit target: ~0.5 bit/Gaussian.** Requires factor 4c (inter-frame +coding), which is Plan E2 or later. + +**Cite as R-10 in Plan E PR description and Plan G's `--mode splat` +threshold doc.** + +--- + +## 7. New commitments missing from both originals and from the merge + +### R-11 — Per-CTU encoder latency budget + +**Problem.** Neither doc nor the merge states ms-per-CTU at 60 fps 4K. +Without it, B:D-CODEC-8 / A:T-7 (no SIMD-batched encode) have no +falsifiability criterion. + +**Resolution.** Commit the budget; pin the SIMD-batched-encode debt +to the budget. 
+ +**4K @ 60 fps frame budget:** + +```text +4K = 3840 × 2160 = 8.3 M pixels +60 fps = 16.67 ms/frame +At 8×8 leaf granularity (HEVC's smallest CU; the unit at which the +encoder's inner-loop work is paid): + 132,710 leaves/frame + (= 2,040 CTUs/frame at 64×64, × ~64 + leaves/CTU at maximum split depth; + 130,560 = 2,040 × 64, versus 129,600 for the unpadded 3840·2160/64, with + ~1.6 % bias for chroma alignment) +Per-leaf budget: 16.67 ms / 132,710 = 125 ns/leaf +``` + +**Encoder per-leaf breakdown (scalar reference, current):** + +| Stage | Scalar cost | SIMD-batched target | +|-------|-------------|---------------------| +| basin lookup (4096 entries, Hamming dist) | ~800 ns | ~50 ns (SIMD batched) | +| mode decide (Skip → Merge → Delta → Escape) | ~80 ns | ~80 ns (already cheap) | +| header pack (`pack_header`) | ~5 ns | ~5 ns | +| transform (A4, 8×8 DCT-II butterfly) | ~30 ns | ~30 ns | +| quantize (i8 round) | ~5 ns | ~5 ns | +| rANS encode (A7) | ~40 ns | ~40 ns | +| **Total per-leaf** | **~960 ns** | **~210 ns** | + +**At scalar reference (960 ns/leaf): 4K @ 60 fps requires 132,710 × +960 ns = 127 ms/frame. Misses 60 fps by 7.6×.** + +**At SIMD-batched (210 ns/leaf): 132,710 × 210 ns = 28 ms/frame. Misses +60 fps by 1.7×; needs further work but in the same order of magnitude.** + +**Hitting 60 fps 4K real-time** requires the SIMD-batched-encode path +to land. **This promotes B:D-CODEC-8 / A:T-7 from P2 to P1.** Plan A4-impl +and Plan A6 should both ship with SIMD-batched paths, not scalar +reference only. + +**Implication for Plan G.** The `--mode video` threshold (R-4) +includes a latency assertion: total encode time for the Big Buck Bunny +1080p clip must complete within (clip duration × 0.5). At 1080p that's +~32,400 leaves/frame × 210 ns × 30 fps = ~204 ms of encode per second of clip, well within +the 500 ms budget. 4K is the stretch target. 
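
The budget arithmetic above can be re-derived mechanically. The following is a minimal sketch, assuming the leaf counts and per-leaf costs quoted in this section; the function names `per_leaf_budget_ns` and `frame_time_ms` are hypothetical helpers for illustration and are not part of the codec.

```rust
/// Nanoseconds available per leaf at a given frame rate and leaf count.
/// (Hypothetical helper; not part of the codec.)
fn per_leaf_budget_ns(fps: f64, leaves_per_frame: u64) -> f64 {
    1.0e9 / fps / leaves_per_frame as f64
}

/// Encode time for one frame in milliseconds at a given per-leaf cost.
/// (Hypothetical helper; not part of the codec.)
fn frame_time_ms(leaves_per_frame: u64, ns_per_leaf: f64) -> f64 {
    leaves_per_frame as f64 * ns_per_leaf / 1.0e6
}

fn main() {
    // Leaf count and per-leaf costs are the figures quoted in R-11.
    const LEAVES_4K: u64 = 132_710;
    const FRAME_MS_60FPS: f64 = 1000.0 / 60.0;

    println!("per-leaf budget: {:.1} ns", per_leaf_budget_ns(60.0, LEAVES_4K));
    for (label, ns) in [("scalar", 960.0), ("simd-batched", 210.0)] {
        let ms = frame_time_ms(LEAVES_4K, ns);
        println!("{label}: {ms:.1} ms/frame, {:.2}x over budget", ms / FRAME_MS_60FPS);
    }
}
```

Under those inputs the budget comes out near 125.6 ns/leaf, the scalar path near 127.4 ms/frame, and the SIMD-batched path near 27.9 ms/frame, which matches the 7.6× and 1.7× misses stated above.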
+ +**Cite as R-11 in any encoder-path PR description; the latency +budget is the gate that determines whether SIMD-batched encode is P0 +or P1.** + +--- + +### R-12 — Streaming-buffer flush granularity + +**Problem.** Neither doc nor the merge says whether the buffer flushes per-CTU, per-frame, or per-GOP. +Different answers give Plan A8 substantially different shapes. + +**Resolution.** Commit per-CTU as the default; per-bucket for Plan F. + +**Per-CTU flush (committed default; CTU = 64×64 cells, so 4096 cells/CTU, +2,040 CTUs/frame at 4K and ~510 CTUs/frame at 1080p):** + +```text +Buffer size: ~12 KB per CTU + = 4096 cells × avg 3 bytes (mode-distribution per R-10) +Flush rate: ~122,400 flushes/sec at 4K 60 fps (2,040 CTUs/frame × 60) + ~30,600 flushes/sec at 1080p 60 fps (510 CTUs/frame × 60) +Latency: sub-ms per CTU; consumer can start decoding the first + CTU before encoder finishes the frame +``` + +**Why per-CTU and not per-frame:** + +- per-frame buffer = ~24 MB at 4K (2,040 CTUs × ~12 KB); latency cost = 16.67 ms (one frame + latency added to encode-decode pipeline) +- per-GOP buffer = ~390 MB at 16-frame GOP; latency = 267 ms, + unacceptable for live attention / KV-cache use cases +- per-CTU = ~12 KB; latency = ~8 µs at 4K 60 fps (16.67 ms / 2,040 CTUs) + +**Per-bucket override for Plan F (federated SGD):** + +```text +Bucket = 4096 weights (one BlockedGrid L1 block of gradients) +Buffer size: ~12 KB per bucket (same envelope as per-CTU) +Flush rate: per-iteration, per-bucket +Latency: bucket-local; all-reduce happens after bucket flush +``` + +**Wire format implication:** A8 frame header has a `FlushUnit` tag +(2-bit field): + +```text +FlushUnit::Ctu → 00 (default, video / splat / attention) +FlushUnit::Bucket → 01 (gradient SGD) +FlushUnit::Frame → 10 (offline batch encode) +FlushUnit::Reserved → 11 +``` + +**Plan A8 implementation note:** Flush granularity lives in the frame +header alongside `ConsumerProfile` (R-2) and the per-frame basin +codebook ref. Stream readers route on `FlushUnit` for buffer +allocation. 
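
The 2-bit `FlushUnit` tag above could be modelled roughly as follows. This is a sketch under stated assumptions: the variant codes come from the table in this section, but the position of the two bits inside the A8 frame header is not specified here, so the example simply uses the low two bits of a byte. The methods `to_bits`, `from_bits`, and `unit_buffer_bytes` are hypothetical names.

```rust
/// Flush granularity tag, using the 2-bit codes listed for the A8 frame header.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum FlushUnit {
    Ctu = 0b00,      // default: video / splat / attention
    Bucket = 0b01,   // gradient SGD
    Frame = 0b10,    // offline batch encode
    Reserved = 0b11,
}

impl FlushUnit {
    /// Encode as a 2-bit value (hypothetical helper).
    pub fn to_bits(self) -> u8 {
        self as u8
    }

    /// Decode from the low two bits of `bits`; higher bits are ignored.
    /// ASSUMPTION: the real header may place the field elsewhere.
    pub fn from_bits(bits: u8) -> FlushUnit {
        match bits & 0b11 {
            0b00 => FlushUnit::Ctu,
            0b01 => FlushUnit::Bucket,
            0b10 => FlushUnit::Frame,
            _ => FlushUnit::Reserved,
        }
    }

    /// Per-unit buffer size a reader might allocate: 4096 cells × ~3 bytes
    /// for CTU and bucket flushes. Frame-sized buffers are left to the caller.
    pub fn unit_buffer_bytes(self) -> Option<usize> {
        match self {
            FlushUnit::Ctu | FlushUnit::Bucket => Some(4096 * 3),
            FlushUnit::Frame | FlushUnit::Reserved => None,
        }
    }
}

fn main() {
    // Unrelated header bits sit above the tag in this toy byte.
    let header_byte: u8 = 0b1010_0100 | FlushUnit::Bucket.to_bits();
    let unit = FlushUnit::from_bits(header_byte);
    println!("{unit:?} -> buffer hint {:?}", unit.unit_buffer_bytes());
}
```

Keeping `Reserved` as an explicit variant means a reader never has to fail while parsing the tag itself; whether a reserved value should be rejected at a higher level is a policy decision this sketch does not take.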
+ +**Cite as R-12 in Plan A8 PR description and Plan F PR description.** + +--- + +### R-13 — Basin codebook distribution policy for Plan F + +**Problem.** Plan F is 2 weeks × 2 workers; the merge doesn't address +whether the 4096-entry codebook is replicated across workers or +partitioned. Either answer is fine; not deciding makes Plan F +undefined. + +**Resolution.** Commit Option A (per-shard codebook) for Plan F v1; +list alternatives as Phase 3 exploration. + +**Option A — Per-shard codebook (Plan F v1, committed):** + +```text +Each worker holds 1 parameter shard, builds its own 4096-entry codebook +over its shard, encodes its gradients against its own codebook. +Wire format: each LeafCu carries (worker_id, basin_idx) in the per-frame +escape vector lookup. No cross-worker comm during codebook build. + +Pro: zero cross-worker codebook-build comm + worker independence + no global codebook drift +Con: loses cross-shard correlation (Merge-mode never fires across shards) + may compress worse than Option B by 1.5-2× per parameter +``` + +**Wire format extension for Option A:** + +```text +Frame header (per worker, per iteration): + FlushUnit::Bucket + ConsumerProfile::Gradient + WorkerId: u16 ← NEW: per-shard codebook index + CodebookHash: u64 ← integrity check + rANS frequency table +``` + +**Option B — Replicated codebook (alternative, Phase 3):** + +```text +One global 4096-entry codebook, all workers consume identical codebook. +Cross-worker codebook-build comm: one all-reduce per epoch. + +Pro: Merge-mode fires across shards (cross-parameter correlation) + better compression by 1.5-2× +Con: cross-worker codebook-build comm cost + codebook stale-ness if epoch boundary misses a parameter + complex resync after worker failure +``` + +**Option C — Hierarchical codebook (Phase 3+):** + +```text +Per-shard codebook + global "override" codebook (256 entries) for the +heavy-hitters that cross shards. +LeafCu first checks global override; falls through to per-shard. 
+ +Pro: best compression in expectation (combines A and B) +Con: complex protocol; requires global hot-set tracking + worker-failure recovery non-trivial +``` + +**Plan F v1 commits Option A.** v2 (post-stability) evaluates Option B +empirically; v3 (research-grade) tries Option C. + +**Falsifies if.** Option A on BERT-large fine-tune fails to clear the +R-4 gradient threshold (8× compression at <0.5% loss delta). At that +point, Plan F v1 escalates to Option B in a follow-up PR. + +**Cite as R-13 in Plan F PR description.** + +--- + +## 8. The canonical contracts — concrete trait signatures + +All three plug-points (per M:E-E) get concrete signatures here. These +are the contracts Plans G / H / I / A4-design commit to in Phase 0. + +```rust +// ──────────────────────────────────────────────────────────────────── +// Plug-point 1: PredictiveSignal — what the consumer ships +// ──────────────────────────────────────────────────────────────────── + +/// Implemented by each domain's per-element data type: +/// - cognitive cell `Fingerprint` +/// - 3D Gaussian splat tuple +/// - attention slot `(Q, K)` pair +/// - gradient weight `(param_id, ∂L/∂w)` +/// +/// Single trait surface; ~50 LoC per consumer impl. +pub trait PredictiveSignal: Copy + Eq { + /// Basin codebook entry type. Often the same as `Self` (e.g. + /// cognitive: Fingerprint ↔ Fingerprint), but consumers like + /// gradient may use a tuple like `(GradientPattern, magnitude)`. + type Basin: Copy + Eq; + + /// Residual after subtracting the nearest basin. Should fit + /// in i8 when "Delta-mode worthy". + type Residual: Copy; + + /// What lives in the per-frame escape vector. Stock 3DGS: + /// `[f16; 48]` for SH≥L=2 + (μ, scale, rot, opacity, color). + /// Cognitive: `u64` Fingerprint. + /// Attention: `[f16; head_dim]`. + /// Gradient: `f32`. + type Escape: Copy; + + /// Find the nearest basin in the codebook. + /// Returns (basin_idx, residual). basin_idx must be ≤ MAX_BASIN_IDX. 
+ fn nearest_basin(&self, codebook: &[Self::Basin]) -> (u16, Self::Residual); + + /// Is this residual small enough for Delta mode (fits i8)? + fn fits_delta(residual: &Self::Residual) -> bool; + + /// Encode residual as the u8 byte that goes into the LeafCu. + fn pack_residual(residual: &Self::Residual) -> u8; + + /// Type-erased neighbour reference (consumer-defined topology). + /// Codec NEVER interprets slot semantics — per R-9. + type NeighbourRef<'a>: Copy where Self: 'a; + fn neighbours(&self) -> [Option<Self::NeighbourRef<'_>>; 4]; + + /// Convert self into the escape payload (for Escape-mode encode). + fn to_escape(&self) -> Self::Escape; +} + +// ──────────────────────────────────────────────────────────────────── +// Plug-point 2: Basis + LinearReduce — per R-1 +// ──────────────────────────────────────────────────────────────────── + +pub trait Basis<T> { + fn dim(&self) -> usize; + fn apply(&self, src: &[T], dst: &mut [T]); + fn invert(&self, src: &[T], dst: &mut [T]); +} + +pub trait LinearReduce { + type Symbol: Copy; + type Output; + type Basis: Basis<Self::Symbol>; + + fn reduce(&self, src: &[Self::Symbol], basis: &Self::Basis) -> Self::Output; + fn reduce_batch( + &self, + src: &[&[Self::Symbol]], + basis: &Self::Basis, + ) -> Vec<Self::Output>; +} + +// Concrete impls (each ~30-80 LoC, lives in consumer crate): +// - IdentityBasis in ndarray-codec +// - DctIIBasis in ndarray::hpc::fft +// - HadamardBasis in ndarray::hpc::fft +// - AdamPrecondBasis in burn-codec +// - KFACBlockBasis in burn-codec +// - ShSpectralBasis in ndarray::hpc::splat3d +// +// - AlphaCompositeReduce in ndarray::hpc::splat3d +// - RansEncodeReduce in ndarray-codec::ans +// - SumReduce in ndarray-codec::reduce +// - SoftmaxReduce in ndarray::hpc::activations + +// ──────────────────────────────────────────────────────────────────── +// Plug-point 3: CurveOrder — per M:E-B +// ──────────────────────────────────────────────────────────────────── + +/// Space-filling curve that linearises a multi-dim consumer payload +/// into 1D 
for codec processing. The codec sees only the 1D stream. +/// +/// Concrete impls (each ~20-40 LoC): +/// - RasterScan for cognitive cells +/// - MortonOrder for 3DGS in 3D +/// - HilbertOrder for splat in 3D (alternative) +/// - TokenSequence for attention +/// - LayerSequence for gradient +pub trait CurveOrder<const N: usize> { + /// Total points on the curve. + fn len(&self) -> usize; + /// (i+1)-th neighbour of point i along the curve, or None at boundary. + fn next(&self, i: usize) -> Option<usize>; + /// Per-point coordinate (in consumer's native dimensionality). + fn coord(&self, i: usize) -> [i32; N]; +} + +// ──────────────────────────────────────────────────────────────────── +// Plug-point 4 (lower priority, M:E new): RdoMetric +// ──────────────────────────────────────────────────────────────────── + +pub trait RdoMetric { + type Distortion: Copy + PartialOrd; + fn distortion(&self, reconstructed: &[u8], original: &[u8]) -> Self::Distortion; + fn rate(&self, bits_used: usize) -> f32; + fn cost(&self, d: Self::Distortion, r: f32, lambda: f32) -> f32; +} + +// Concrete impls (consumer crate): +// - PsnrMetric for video +// - SsimMetric for splat +// - LossDeltaMetric for gradient +// - KlDivergence for attention +``` + +**The trait surface is the contract.** Plan I (3 days, Phase 0) +implements `PredictiveSignal` for cognitive cells as the reference +consumer. Plan A4-design (1 day) commits the `Basis` + `LinearReduce` +shapes. Plans D / E / F each ship one `impl PredictiveSignal for ...` +plus their `CurveOrder` / `Basis` / `RdoMetric` impls. + +--- + +## 9. Falsifiability matrix + +Every load-bearing claim from the canon and from this doc has a +test, a metric, and a pass condition. The matrix is the audit +that decides whether each holy-grail claim is demonstrated. 
+ +| Claim | Source | Test | Metric | Pass condition | +|-------|--------|------|--------|----------------| +| M:H-1 / HG1 (4 loads → 1 codec) | canon | Plan G binary runs all 4 modes | 4 Lance columns emitted | All 4 emit successfully | +| M:H-2 / H-2 (transform = optimizer) | canon | A4 + burn-codec ship | AdamPrecondBasis impls LinearReduce | Bench Adam-as-codec on BERT-glue | +| M:H-3 / HG3 (bit-exact attention) | canon | Plan D ships | KV cache compresses + RULER score | ≥4× ratio, ≤0.5% accuracy loss | +| M:H-4 / H-4 (Shannon-optimal grad) | canon | Plan F + signSGD bench | rANS frequency-table entropy match | Empirical entropy within 5% of H(p) | +| M:H-5 / HG4 (ZeRO generalisation) | canon | Plan F + DeepSpeed bench | 8-16× compression at 16+ workers | ≥8× at ≤0.5% loss delta | +| M:H-6 / HG2 (sub-1-bit/Gaussian) | canon + R-10 | Plan E + offline codebook | bits/Gaussian on Mip-NeRF 360 | Near: ≤4 bit; stretch: ≤1 bit | +| M:H-7 / HG5 (Lance substrate) | canon | Plan H + Plan I land | 4-load Lance columns same schema | Schema check; per-load `read_codec_lance` | +| M:H-8 / H-6 (64×64 universal) | canon | M:E-G `Ctu<N>` | Compiles for N ∈ {16, 32, 64} | All 3 sizes pass codec tests | +| M:H-9 / HG6 (splat3d × x265 one lib) | canon | Plan E + ndarray-codec | 1 binary, 1 dep tree | Binary size <10 MB; deps tree-clean | +| M:H-NEW-1 (single binary, 4 loads) | canon | Plan G binary | `codec-bench --mode {video,splat,kv,grad}` | Executes each in <60s on ref data | +| M:H-NEW-2 (~2 KLoC stack) | canon + R-3 | LoC audit per PR | Generic-codec LoC | <1500 LoC; per-consumer <200 LoC | +| R-1 (LinearReduce shape correct) | this doc | Plan A7 builds against the trait | Trait isn't subclassed by A7 | A7 uses public trait surface only | +| R-2 (bit 14 consumer-typed) | this doc | Plan A8 ships ConsumerProfile demux | All 4 profile decoders run | 4 profile-specific tests pass | +| R-3 (LoC envelope) | this doc | LoC audit per PR | Cumulative generic LoC | <1500 after A4-A8 | 
+| R-4 (Plan G thresholds) | this doc | `codec-bench --threshold` flag | Ratio + quality + LoC | All 4 thresholds clear | +| R-5 (DCT crossover at 64) | this doc | A4-impl bench at varying N | Per-block vs batched dispatch time | Crossover within [32, 96] empirically | +| R-6 (SSD via VNNI ≥30×) | this doc | `batched_ssd_search` micro-bench | Cycles per 16×16 ME candidate | ≤4 cycles per 256-d dot (VNNI) | +| R-7 (tropical-GEMM ≥10×) | this doc | Plan A6 partition bench | Per-CTU partition RDO time | ≥10× over naive recursive on Zen 4 | +| R-8 (Plan G is confidence gate) | this doc | Phase order | Plan G ships before A7 | A7 PR doesn't merge until Plan G binary green | +| R-9 (topology-free) | this doc | grep audit | Codec body has no spatial-semantic refs | `grep -rE 'North\|East\|West\|South' src/hpc/codec/*.rs` returns only test/doc | +| R-10 (4 bit/Gaussian floor) | this doc | Plan E bench | bits/Gaussian on Mip-NeRF 360 | ≤4 bits/Gaussian without offline codebook | +| R-11 (4K 60fps SIMD-batched) | this doc | Plan G video latency assert | Per-leaf encode time | ≤210 ns/leaf on Sapphire Rapids | +| R-12 (per-CTU flush) | this doc | A8 frame-header parse + decode | First-CTU latency | First CTU decodable before frame complete | +| R-13 (Option A per-shard) | this doc | Plan F on BERT-glue | 8× compression + accuracy | Holds; else escalate to Option B | + +**Every row of this matrix is a test.** Plan G's bench harness binary +emits a JSON report containing the actual measurement for each row; +the merge job for Phase 2 consumer PRs reads that report and gates on +pass-fail. + +--- + +## 10. 
Sequencing diagram (canonical) + +```text + T+0 + ┌─────────────────────────────────────────────┐ + │ PHASE 0 — substrate gates │ + │ │ + │ Plan H — extract ndarray-codec (3d) │ + │ Plan I — PredictiveSignal trait (3d) │ + │ Plan A4-design — Basis + LinearReduce (1d) + │ Plan G — multi-domain bench (2w) ★ GATE │ + └─────────────────────┬───────────────────────┘ + │ + ▼ + Plan G binary green; thresholds testable + │ + ▼ + ╔════════════════════════════════════════════════╗ + ║ Plan A7 — rANS (1.5 w) CRITICAL PATH ║ + ║ gates on Plan G; ships against R-1 trait ║ + ╚════════════════════════╦═══════════════════════╝ + │ + ┌──────┬───────┬──┴───┬──────┬──────────┐ + ▼ ▼ ▼ ▼ ▼ ▼ + Plan B Plan A4 Plan A6 Plan A8 Plan C (R-11 SIMD + (inter) (impl) (RDO) (stream) (EWA) batch path + 1 wk 1 wk 1 wk 1 wk 1 wk lands in + └──────┴───────┴──┬───┴──────┘ each) + ▼ + ┌─────────────────────────────────┐ + │ PHASE 1 closes — codec mech │ + │ complete; thresholds re-run │ + └────────────────┬────────────────┘ + │ + ┌────────┬───────┴────────┐ + ▼ ▼ ▼ + Plan E Plan D Plan F (after D) + (3DGS (attention (federated SGD, + 3 wk×2) 2 wk×2) 4 wk×2, R-13) + │ │ │ + └────────┴──────┬─────────┘ + ▼ + Plan G runs all 4 thresholds + │ + ▼ + HG1 / HG6 / M:H-NEW-1 demonstrated + (or specific claims falsified) +``` + +★ **Gate semantics.** Plan G is a *blocking* gate: Plan A7 cannot +merge until Plan G's bench-harness binary is green (i.e., runs all 4 +modes end-to-end, even if compression ratios are below threshold — +those calibrate in Phase 1). The threshold pass-fail bind on Phase 2 +consumer PRs, not on Phase 1 codec PRs. + +--- + +## 11. End-state recap and exit conditions + +**The end state, recapped from §1:** + +After ~10.5 weeks of trajectory work: + +1. One binary `codec-bench` runs four modes end-to-end (HG1 demonstrated). +2. Generic codec LoC ≤1500 (R-3 / M:H-NEW-2 demonstrated). +3. Each consumer ≤200 LoC of trait impl (R-3 demonstrated). +4. 
Compression ratios meet R-4 thresholds for all four loads: + - Video: ≥0.95× x265 ultrafast at parity PSNR + - Splat: ≥30× over Inria PLY-trim at SSIM parity + - KV cache: ≥4× over FP16 raw at ≤0.5% RULER loss + - Gradient: ≥8× over signSGD at ≤0.5% loss delta +5. `ndarray-codec` crate extracted (M:E-D / Plan H demonstrated). +6. Three traits land at type-erased boundaries (Plan I + A4-design). +7. CLAUDE.md "Architecture Rule" lists 5 categories (M:T-3 closed). + +**Exit conditions per claim:** + +- **M:H-1 met** when `codec-bench --mode {video, splat, kv, grad}` + emits 4 Lance columns within the LoC envelope. +- **M:H-2 met** when AdamPrecondBasis impl ships in burn-codec and + reduces BERT-glue training to within 5% of stock Adam loss curve + using the same `LinearReduce` trait surface as A7 rANS. +- **M:H-3 met** when Plan D ships and Llama-3 inference on RULER 64K + passes R-4 threshold. +- **M:H-4 met** when Plan F + signSGD bench shows rANS frequency-table + entropy within 5% of empirical H(p) for ≥3 layer types. +- **M:H-5 met** when 16-worker BERT fine-tune via Plan F clears 8× + compression at ≤0.5% loss delta on GLUE. +- **M:H-6 met** when Mip-NeRF 360 garden scene compresses to ≤4 + bits/Gaussian (near target per R-10). +- **M:H-7 met** when each of the 4 loads writes to a Lance column + with identical schema (one read path serves all). +- **M:H-8 met** when `Ctu<16>`, `Ctu<32>`, `Ctu<64>` all pass codec + tests in a single build. +- **M:H-9 met** when one binary `codec-bench` ships <10 MB with all + 4 modes wired (no `--feature splat` gating; everything compiled in). + +**If any claim fails its exit condition**, the corresponding consumer +PR scopes down (e.g., M:H-6 stretch sub-1-bit/Gaussian fails → ship +4-bit/Gaussian as near-term reality, mark sub-1-bit as Plan E2). The +falsifiability is the point; not every claim has to hold for the +architecture to be valuable. 
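
The four R-4 threshold pairs listed in the end-state recap lend themselves to a mechanical pass/fail check. The sketch below is illustrative only: `Load`, `Measurement`, and `clears_r4` are hypothetical names, and "parity" for video and splat is modelled as a quality loss of at most 0%, which is an assumption since this doc does not state a parity tolerance.

```rust
/// The four consumer loads gated by R-4 (hypothetical enum for illustration).
#[derive(Debug, Clone, Copy)]
pub enum Load {
    Video,
    Splat,
    KvCache,
    Gradient,
}

/// One measured result from a bench run (hypothetical shape).
#[derive(Debug, Clone, Copy)]
pub struct Measurement {
    /// Compression ratio relative to the load's baseline
    /// (x265 ultrafast, Inria PLY-trim, FP16 raw, signSGD).
    pub ratio: f64,
    /// Quality loss in percent (PSNR/SSIM shortfall, RULER loss, loss delta).
    pub loss_pct: f64,
}

/// Returns true when a measurement clears the R-4 threshold for its load.
/// ASSUMPTION: "parity" is treated as loss_pct <= 0.0.
pub fn clears_r4(load: Load, m: Measurement) -> bool {
    let (min_ratio, max_loss_pct) = match load {
        Load::Video => (0.95, 0.0),
        Load::Splat => (30.0, 0.0),
        Load::KvCache => (4.0, 0.5),
        Load::Gradient => (8.0, 0.5),
    };
    m.ratio >= min_ratio && m.loss_pct <= max_loss_pct
}

fn main() {
    // The R-10 near-term target (~100x over PLY) would clear the splat gate.
    let splat = Measurement { ratio: 100.0, loss_pct: 0.0 };
    println!("splat clears R-4: {}", clears_r4(Load::Splat, splat));
}
```

A gate of this shape keeps the scope-down rule explicit: a failing load reports which half of the pair (ratio or quality) missed, and the corresponding consumer PR can be narrowed accordingly.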
+ +**Worst-case fallback.** If even Plan G itself fails (e.g., trait +shape from R-1 turns out wrong under bench pressure), the canon's +M:E-A factoring is wrong. Recovery: revise traits before A7 ships. +Cost: ~3 days extra in Phase 0. **This is exactly why Plan G goes +first — to catch this before A7 wastes 1.5 weeks against the wrong +shape.** + +--- + +## 12. Compaction-preservation contract + +When this doc is summarised across context windows, preserve at minimum: + +1. **The five "merged well" items** from §3 (M:E-A / M:E-D / M:E-G / + M:E-I / M:E-F) — these are the canon's load-bearing synthesis. +2. **The thirteen resolutions** R-1 through R-13 with their summary + commitments: + - R-1: `LinearReduce` two-trait shape + - R-2: bit 15 universal, bit 14 consumer-typed + - R-3: ≤1500 LoC generic, ≤200 LoC per consumer + - R-4: 4 threshold pairs (video, splat, kv, grad) + - R-5: DCT crossover ~64 blocks + - R-6: SSD via VNNI ≥30× over SAD + - R-7: tropical-GEMM partition O(4^d) → O(d²) + - R-8: Plan G is confidence gate + - R-9: topology-FREE codec layer + - R-10: ~4 bits/Gaussian near target, ~1 bit stretch + - R-11: 210 ns/leaf SIMD-batched encode + - R-12: per-CTU flush default; per-bucket Plan F + - R-13: Option A (per-shard codebook) for Plan F v1 +3. **The trajectory** from §2 — Phase 0 → A7 → parallelise → Phase 2 +4. **The five-category architecture** including `ndarray-codec` +5. **The four traits** as the canonical contracts: + `PredictiveSignal`, `Basis`, `LinearReduce`, `CurveOrder` + (plus `RdoMetric` for A6) +6. **Plan G as the gate** — A7 cannot merge until Plan G binary green +7. **The falsifiability matrix in §9** — every claim has a test; + not every claim will pass; that's the design + +**Citation IDs in this doc** (R-1 .. R-13) are stable. Canon IDs +(M:E-*, M:H-*, M:H-NEW-*, M:T-*, A:E-*, A:H-*, A:T-*, B:E-*, B:HG-*, +B:D-*) remain stable per canon's §10. Append, never renumber. + +--- + +## 13. 
The single load-bearing paragraph + +If you read nothing else: + +> *The merged canon committed to the right architectural synthesis +> (M:E-A, M:E-D, M:E-G, M:E-I) but left the load-bearing contracts +> unsigned. This doc commits them: `Basis` + `LinearReduce` are +> two traits not one (R-1); bit 14 of the leaf header is consumer- +> typed and bit 15 universal (R-2); generic codec body ≤1500 LoC +> with ≤200 LoC per consumer (R-3); four threshold pairs gate +> Plan G's pass criteria (R-4); the trajectory is Plan G (2 wks) → +> Plan A7 critical path (1.5 wks) → Phase 2 consumers parallel +> (3 wks); end state is one binary, four loads, ~2 KLoC stack +> demonstrating M:H-NEW-1 in ~10.5 weeks of wall-clock. Every claim +> in §9 has a test; Plan G's bench-harness binary is the audit. The +> falsifiability is the point.* + +--- + +_Last edit: 2026-05-22 — companion to merged canon `bc9da4ad`. +Edit when an R-N resolves to ship, when a falsifiability test pin +shifts, or when an exit condition closes. Renumber only by appending._ diff --git a/.claude/knowledge/pr-x12-substrate-merged-canon.md b/.claude/knowledge/pr-x12-substrate-merged-canon.md index b6948cca..81a20b67 100644 --- a/.claude/knowledge/pr-x12-substrate-merged-canon.md +++ b/.claude/knowledge/pr-x12-substrate-merged-canon.md @@ -13,6 +13,24 @@ Two independent sessions reached the same architectural claim — *PR-X12 is the universal predictive-coder substrate that subsumes four industries* — through different routes. Each session surfaced angles the other missed. This doc is the **canonical fusion**, designed to be the single doc a fresh agent reads to inherit the entire claim. +> **Post-merge resolutions index** (2026-05-22): the claims and tensions in this doc were further formalised into 13 numbered resolutions R-1..R-13. See `pr-x12-canon-resolutions-delta.md` for the canonical list. 
Cross-section pointers inline below: +> +> - §M:E-A (Mode-decide + reduce pipeline kernel) → **R-1** (`LinearReduce` + `Basis` trait split) +> - §M:E-G (`Ctu<N>`) and §M:E-J (header bits 14-15) → **R-2** (16-bit header layout pinned), **R-8** (Plan G arch-conditional gate) +> - §M:E-H (D-STACK-13 bench harness as P0) → **R-4** (codec-bench in Plan G), **R-11** (latency assertions per arch) +> - §M:H-NEW-2 (codec body LoC envelope ≤ 1500) → **R-3** (LoC audit rule, scope-fence definition) +> - §M:H-6 (sub-1-bit basin + Gaussian-tail rANS) → **R-10** (commitment to sub-1-bit-per-token where source supports it) +> - §M:E-D (codec breaks ndarray ↔ lance-graph cycle) → **R-7** (tropical-GEMM lives in lance-graph, called from codec — dep direction allowed) +> +> Perspective companions written 2026-05-22: +> - `pr-x12-x265-blasgraph-gemm.md` — every codec inner loop as a GEMM +> - `pr-x12-x266-3dgs-spacetime-upscaling.md` — Basis + EWA splat → free space-time codec upscaling +> - `pr-x12-woa-multiarch-orchestration.md` — how WoA / q2 / consumer crates inherit the substrate's per-arch dispatch +> - `pr-x12-anti-neural-lookup-inversion.md` — lookup tables as frozen 1-layer NNs; the codec is the anti-neural codec +> - `pr-x12-gguf-llm-weights-encoding.md` — the fifth load: GGUF attention/FFN tensors as Skip/Merge/Delta/Escape +> - `pr-x12-bgz-jc-substrate-synergies.md` — **CRITICAL**: the PR-X12 substrate is *already implemented* in `lance-graph/crates/{bgz17,highheelbgz,bgz-tensor}`, formally proven in `lance-graph/crates/jc`. Skip/Merge/Delta/Escape ≡ Scent/Palette/ZeckBF17/Full. 4096-entry basin ≡ HHTL 16×16×16 lattice. bgz-hhtl-d ships LLM weight encoding at 343:1 on Qwen3-TTS-1.7B today. Two gaps identified: `jd-nd` (ndarray-side proof crate) and Cronbach/ICC encoding-reliability research crate. 
+> - `pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md` — **substrate bindings**: cam_pq trains all bgz palettes (CAM bytes map onto HHTL bits 1:1); sigker provides Chen-Lyons signature uniqueness (arXiv:2006.14794, Hambly-Lyons 2010, CST 2021) as the formal-correctness bedrock cited by jc Pillar 11 (DEFERRED); dn_tree + merkle_tree are the online-update + integrity substrate for R-13 SharedClusterWide. **Seven new gaps catalogued (G-1..G-7), ~11-17 weeks of wiring** to fully bind. R-14 (formal correctness) + R-15 (signature basis) candidates surfaced. + The merge is not a re-statement. **It is the new epiphanies that emerge only when both halves sit side by side.** They get their own §3. ### Identity-preservation rules @@ -122,6 +140,8 @@ These are the insights that emerge **only when both docs sit next to each other* ### M:E-A — Mode-decide + reduce IS the universal pipeline kernel +> [Formalised post-merge as **R-1**: `LinearReduce` decomposes into `Basis` (basis-as-data) + `Reducer` (reduction operator). See `pr-x12-canon-resolutions-delta.md` §R-1.] + A's E-4 (transform IS optimizer preconditioner) + B's E9 (splat3d × codec = same pipeline shifted 90°) combined: The *reduction operator* in B's "unified mode-decide+reduce trait" is **exactly the basis-times-source product** A's transform claim points at: @@ -258,6 +278,19 @@ pub trait PredictiveSignal { ### M:E-J — The reserved header bits 14-15 carry causal-edge metadata for free +> [Formalised post-merge as **R-2**: 16-bit header bit layout pinned — +> bits 0-1 = `header_kind`, bits 2-13 = `basin_index`, +> **bit 15 = UNIVERSAL "has inter-tier reference"** (identical across +> all four consumers; A3-inter cross-tier link), +> **bit 14 = CONSUMER-TYPED via the frame header's `ConsumerProfile` +> tag** (cognitive: Pearl-rung high bit; video: reserved=0; +> splat: LOD-cascade-source flag; gradient: worker-shard parity). 
+> Leaf size (8/16/32/64) is encoded structurally via M:E-G's +> `Ctu<N>` at the type level, NOT in header bits 14-15. The +> causal-tier reading below is the historical motivation for bit 14; +> R-2 generalises it to the four-consumer demux. See +> `pr-x12-substrate-canon-resolutions.md` §R-2.] + A's E-15 (reserved bits 14-15 are inter-tier link) + A's T-22 (causal-edge v2 mantissa: Intervention=+6, Counterfactual=-6): Two reserved bits = 4 states. The natural 4-state encoding for cognitive content: @@ -290,6 +323,8 @@ Merge of A's H-1..H-7 + B's HG1..HG6 + two new M:H-* claims that emerge from the **M:H-6** *(from B:HG2 alone)* — Sub-1-bit-per-Gaussian 3DGS compression. 30-60× over current state-of-the-art PLY-trim. A 1M-Gaussian scene = ~500 KB, streamable as video. **Most economically valuable single claim** — directly attacks the bandwidth bottleneck for cloud-rendered 3D content. + +> [Formalised post-merge as **R-10**: PR-X12 commits to sub-1-bit-per-token via Gaussian-tail rANS where the source distribution supports it (basin codebook + heavy-tailed residual). See `pr-x12-canon-resolutions-delta.md` §R-10 for the falsification path (Plan G entropy bench).] + **M:H-7** *(merge of A:H-1 + B:HG5)* — Lance column substrate identity becomes ground truth. `SpoDistanceMatrices` at 611M lookups/sec serves as universal palette codebook lookup across all four loads. ndarray = hardware, ndarray-codec = compression substrate (new, per M:E-D), lance-graph = thinking, causal-edge = protocol, p64 = convergence. Five-category architecture. **M:H-8** *(from A:H-6 alone)* — 64×64 CTU is the right unit for both 4K video luma blocks and 7B-parameter LLM head dim × 16 heads. Convergent evolution from two unrelated industries arriving at the same arithmetic block size. 
@@ -302,6 +337,8 @@ Merge of A's H-1..H-7 + B's HG1..HG6 + two new M:H-* claims that emerge from the **M:H-NEW-2** — `trait PredictiveSignal` + `trait LinearReduce` + `trait CurveOrder` factor the codec into three plug-points (per M:E-E + M:E-A + M:E-B). The codec body is `<150 LoC` of generic glue. Domain consumers ship `<200 LoC` of trait impls. **Total stack for all four industries: ~2 KLoC.** Compare ~50 KLoC for per-domain implementations elsewhere. The 25× code-density delta is the architectural payoff that justifies the eight sub-cards. +> [Formalised post-merge as **R-3**: the LoC envelope is `≤ 1500 lines of generic codec body` (revised upward from `<150` for realism after counting glue), enforced via an explicit scope-fence audit rule in CI. The substrate (`ndarray::hpc::blas_level2` etc.) is excluded from the budget. See `pr-x12-canon-resolutions-delta.md` §R-3 for the exact audit definition.] + --- ## 5. Unified integration plan (canonical sequencing) diff --git a/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md b/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md new file mode 100644 index 00000000..0da19ed7 --- /dev/null +++ b/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md @@ -0,0 +1,345 @@ +# PR-X12 — WoA Orchestration & Multi-Arch Dispatch Lens + +> Date: 2026-05-22 +> Status: **perspective doc** — examines how the orchestration crates (`woa-rs`, `woa`, `q2`, `surrealdb`, `MedCare-rs`, `smb-office-rs`) consume the PR-X12 substrate, and how PR-X12's per-arch dispatch decisions (R-4, R-5, R-11) generalise to the entire HPC stack. +> +> Premise: PR-X12 is not just a codec project. It's the **per-arch dispatch contract** that every consumer above `ndarray` will inherit. The codec is the first non-trivial test of whether that contract holds. + +--- + +## 0. 
Thesis + +**Every consumer crate dispatches kernels across {Intel SPR, AMD Zen 4-5, ARM Graviton 3-4, Apple Silicon, NVIDIA Hopper-Blackwell} via the same `ndarray::hpc` capability traits.** PR-X12's per-arch DCT crossover (R-5) and latency assertion (R-11) aren't codec-specific — they're the canonical shape of how any consumer crate gates fast-paths. If the codec's per-arch story is wrong, the entire HPC consumer ecosystem inherits the bug. + +--- + +## 1. The orchestration problem PR-X12 must solve + +In a real deployment, a `woa-rs` agent processing a request might: + +1. Receive a video stream (codec: PR-X12) +2. Run a perception model on extracted frames (`burn`/`candle`) +3. Query graph state (`lance-graph::blasgraph` tropical-GEMM) +4. Update node-local cache (`surrealdb`) +5. Emit response stream (codec again) + +Steps 1, 2, 3, and 5 all hit the `ndarray::hpc` BLAS layer. Each step has a per-arch fast-path: SPR uses AMX, Zen 4 uses VNNI+AVX-512, Graviton 3 uses SVE (SVE2 arrives with Graviton 4), Apple uses NEON/AMX, Hopper uses tensor cores. **None of the consumer crates knows which fast-path is active.** They call `blas_level2::batched_gemm` and the substrate dispatches. + +This is what makes PR-X12's R-4 / R-11 architecture-conditional bench gates *substrate policy*, not codec policy. R-4 says "Plan G clears at most on 1 of: SPR / Zen 4 / Graviton 3 / Apple M-class," and R-11 adds latency assertions. That same gate structure applies to: + +- `burn` model serving (forward pass per arch) +- `candle` quantized inference (q4/q8 per arch) +- `lance-graph::blasgraph` graph queries (tropical-GEMM per arch) +- `surrealdb` HNSW search (vector dist per arch) +- `MedCare-rs` DICOM transform (DCT + wavelet per arch) +- `smb-office-rs` OCR + layout (conv + attention per arch) + +Every one of these inherits the dispatch contract. PR-X12 is the first to make it visible. + +--- + +## 2. 
WoA's place in the stack + +```text +┌────────────────────────────────────────────────────┐ +│ WoA agent (woa-rs, woa) │ +│ Request orchestration, scheduling, transport │ +└────────────────────┬───────────────────────────────┘ + │ async dispatch, no SIMD + ▼ +┌────────────────────────────────────────────────────┐ +│ Domain consumer crates │ +│ ndarray-codec, burn, candle, lance-graph, │ +│ surrealdb, MedCare-rs, smb-office-rs │ +│ (Each: ~1-5K LoC of generic code + traits) │ +└────────────────────┬───────────────────────────────┘ + │ capability traits, target_feature + ▼ +┌────────────────────────────────────────────────────┐ +│ ndarray::hpc (the dispatch substrate) │ +│ blas_level{1,2,3}, fft, cam_pq, activations, │ +│ simd_int_ops, bf16_tile_gemm │ +│ (~15K LoC; PR-X12 ratchets at this layer) │ +└────────────────────┬───────────────────────────────┘ + │ per-arch SIMD intrinsics + ▼ +┌────────────────────────────────────────────────────┐ +│ Hardware: SPR / Zen / Graviton / Apple / Hopper │ +└────────────────────────────────────────────────────┘ +``` + +**WoA never touches `target_feature` directly.** Its job is async scheduling, transport (Q2 over QUIC), persistence (surrealdb), and policy. The SIMD dispatch happens one layer below, in the consumer crates calling `ndarray::hpc`. + +This separation is what makes R-3's LoC envelope (≤1500 LoC codec body) tractable. The codec crate doesn't dispatch — it calls the substrate. WoA doesn't dispatch — it calls the codec, which calls the substrate. Per-arch code lives once, in `ndarray::hpc`. + +--- + +## 3. Per-arch dispatch as a substrate property + +The PR-X12 substrate (per merged canon §M:E-G, §M:E-H, R-4, R-5, R-11) implements per-arch dispatch via three mechanisms: + +### 3.1 Compile-time `target_feature` + +```rust +// In ndarray::hpc::blas_level2::batched_gemm: + +#[cfg(target_arch = "x86_64")] +mod x86_dispatch { + #[target_feature(enable = "avx512f,avx512bw,avx512vnni")] + pub unsafe fn batched_gemm_vnni(...) 
{ /* VNNI path */ } + + #[target_feature(enable = "amx-tile,amx-int8,amx-bf16")] + pub unsafe fn batched_gemm_amx(...) { /* AMX path */ } +} + +#[cfg(target_arch = "aarch64")] +mod arm_dispatch { + #[target_feature(enable = "sve2")] + pub unsafe fn batched_gemm_sve2(...) { /* SVE2 path */ } + + #[target_feature(enable = "neon,fp16")] + pub unsafe fn batched_gemm_neon_fp16(...) { /* Apple Silicon */ } +} +``` + +### 3.2 Runtime feature detection (cached at process start) + +```rust +// In ndarray::hpc::capability: +pub static CAP: OnceLock<HwCaps> = OnceLock::new(); + +pub struct HwCaps { + pub has_amx: bool, + pub has_vnni: bool, + pub has_sve2: bool, + pub has_neon_fp16: bool, + pub l1_cache_size: usize, + pub vec_width_bits: u16, + // ... more as new features land +} + +pub fn batched_gemm(input: ...) { + // CAP is populated once at process start; reading it before then is a bug. + let caps = CAP.get().expect("HwCaps not initialised at process start"); + if caps.has_amx { unsafe { batched_gemm_amx(input) } } + else if caps.has_vnni { unsafe { batched_gemm_vnni(input) } } + else if caps.has_sve2 { unsafe { batched_gemm_sve2(input) } } + // ... + else { batched_gemm_scalar(input) } +} +``` + +### 3.3 Per-arch tunable crossover (R-5 generalised) + +Some operations have a "small N: scalar, large N: SIMD" crossover that varies per arch: + +```rust +const DCT_BATCH_CROSSOVER: usize = match Arch::CURRENT { + Arch::SapphireRapids => 64, // AMX wins above this + Arch::IceLakeServer => 32, // AVX-512 narrower; lower crossover + Arch::Zen4 => 96, // Zen 4 double-pumps AVX-512 over 256-bit units; higher crossover + Arch::AppleM3 => 256, // NEON is narrower; only worth it at large N + Arch::GravitonV3 => 128, // SVE mid-range + Arch::Generic => usize::MAX, // Always scalar fallback +}; + +// N = number of contiguous transform blocks in the batch. +pub fn dct_apply<const N: usize>(input: &[i16], output: &mut [i16]) { + if N >= DCT_BATCH_CROSSOVER { + unsafe { dct_gemm_path(input, output) } + } else { + dct_butterfly_path(input, output) + } +} +``` + +R-5 commits these crossovers as **bench-tunable constants**, not hand-guessed numbers. 
Plan G's codec-bench includes a calibration sub-target that emits the right `const` values per arch via build script. + +--- + +## 4. The latency budget split — codec / orchestration / network + +A WoA agent processing a video stream end-to-end has three latency contributors: + +```text +Total latency = T_codec + T_orchestration + T_network +``` + +PR-X12 (R-11) commits a budget on `T_codec`: + +| Stage | Budget (per encode) | Source | +|---|---|---| +| Codec encode | ≤ 0.5× wall-clock for 1080p @ 30 fps | R-11 | +| Codec decode | ≤ 0.25× wall-clock for 1080p @ 30 fps | R-11 | +| Block-level ME | ≤ 10 µs per CTU on SPR | R-11 spec, calibrated by codec-bench | +| Tropical-GEMM RDO | ≤ 50 µs per CTU on SPR | derived from R-7 cost analysis | +| Basis::apply (DCT) | ≤ 2 µs per 32×32 block on SPR | derived from R-5 | + +**WoA's contract:** if any of these are violated on a supported arch, the consumer can either accept the slowdown or refuse to schedule the request. WoA has visibility into per-arch dispatch quality via the substrate's metrics endpoint: + +```rust +ndarray::hpc::metrics::stage_latency_p99(stage: StageId) -> Duration; +``` + +This is wired through to woa-rs's request scheduler. If `T_codec` p99 exceeds budget, woa-rs can: + +- Reroute to a different node (better hardware available) +- Degrade the request (lower codec quality, smaller batch) +- Fail fast with a clear error (don't tie up the client) + +**Without R-11's commitment to latency assertions in CI, this whole chain falls over.** The substrate-to-orchestrator contract is empty unless someone ratchets on it. + +--- + +## 5. 
Federated codebook policy (R-13) — the orchestration angle + +R-13 commits that the codec's 4096-entry basin codebook can be either: + +- **Per-instance** (each PR-X12 encoder builds its own from the input stream) +- **Federated** (a cluster of encoders shares a codebook, periodically updated) +- **Per-domain pretrained** (a hand-curated codebook ships with the binary for {video, text, image, audio} domain segments) + +The orchestration layer (WoA / Q2) is where federation policy lives. Specifically: + +```rust +// In q2 (transport / coordination): +pub enum CodebookPolicy { + LocalEphemeral, // each encoder owns its codebook + SharedClusterWide { ttl: Duration }, // gossip protocol distributes + SharedRegional { region: Region }, // edge-tier sharing + PretrainedStatic { id: BlobId }, // immutable, served from CAS +} + +impl WoaAgent { + fn select_codebook(&self, request: &Request) -> CodebookHandle { + match request.payload_class() { + PayloadClass::HumanText => self.pretrained("english-text-v3"), + PayloadClass::VideoFrame => self.shared_cluster_wide(), + PayloadClass::EphemeralBlob => self.local_ephemeral(), + // ... + } + } +} +``` + +**R-13 says:** the codec layer exposes the basin-codebook as a swappable handle. The orchestration layer chooses which codebook to use per request. PR-X12 ships with the substrate hook; q2 owns the policy. + +**Why this matters for PR-X12 scope:** the basin-codebook is currently a hard-coded 4096-entry array per encoder. R-13 commits to making it swappable (replacing the array reference with a handle/trait) — this is a ~30-line change in the codec crate, not a 300-line rewrite. The federation logic itself lives in q2, outside PR-X12's body. 
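The handle indirection described above can be sketched as follows. This is a minimal illustration, not the codec's actual API: the `Codebook` trait, the `CodebookHandle` wrapper, and the `LocalCodebook` type are hypothetical names, `Fingerprint` is stood in for by `u64`, and the seeded table is placeholder data.

```rust
use std::sync::Arc;

/// Number of basin centroids (12-bit index), as stated in the text.
const BASINS: usize = 4096;

/// Hypothetical stand-in for the codec's fingerprint type.
type Fingerprint = u64;

/// Hypothetical trait: the codec reads centroids through this
/// instead of referencing a hard-coded array.
trait Codebook: Send + Sync {
    fn centroid(&self, basin: u16) -> Fingerprint;
    fn len(&self) -> usize;
}

/// The per-encoder case: one fixed array owned locally.
struct LocalCodebook {
    entries: Box<[Fingerprint; BASINS]>,
}

impl Codebook for LocalCodebook {
    fn centroid(&self, basin: u16) -> Fingerprint {
        self.entries[basin as usize % BASINS]
    }
    fn len(&self) -> usize {
        BASINS
    }
}

/// Cheaply clonable handle that an orchestration layer could swap per request.
#[derive(Clone)]
struct CodebookHandle(Arc<dyn Codebook>);

impl CodebookHandle {
    /// Builds a placeholder local codebook: entry i is `seed ^ i`.
    fn local(seed: Fingerprint) -> Self {
        let mut entries = Box::new([0u64; BASINS]);
        for (i, e) in entries.iter_mut().enumerate() {
            *e = seed ^ (i as u64);
        }
        CodebookHandle(Arc::new(LocalCodebook { entries }))
    }

    /// Nearest basin by Hamming distance. Linear scan for clarity;
    /// the text routes the real search through batched kernels.
    fn nearest(&self, cell: Fingerprint) -> u16 {
        (0..self.0.len() as u16)
            .min_by_key(|&b| (self.0.centroid(b) ^ cell).count_ones())
            .expect("codebook is non-empty")
    }
}
```

Because the encoder only ever sees `CodebookHandle`, a federated or pretrained implementation of the same trait could be substituted without touching the call sites, which is the property the ~30-line estimate relies on.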
+ +This is a model for many features that look "out of scope" for PR-X12 but actually need a tiny anchor in PR-X12 to be reachable later: + +- Federated codebook → swap pointer to handle (R-13) +- 3DGS scene anchor → add SceneAnchor header_kind (x266 doc) +- GPU offload → add `Reducer::dispatch_target() -> DispatchTarget` (Plan E adjacent) +- Speculative decode → add `Frame::is_speculative()` bit in header reserved field + +None of these is in PR-X12 scope. All of them require ≤50 LoC of "anchor" in PR-X12. The discipline of M:H-NEW-2 + R-3's LoC envelope is what makes future anchoring possible without forking the codec. + +--- + +## 6. Cross-arch determinism — the consumer's hardest requirement + +A WoA agent that runs on SPR in the data center and Apple Silicon at the edge must produce **the same answer** for the same input. Floating-point order-of-operations differs across SIMD widths, so naive parallel reductions break this. + +PR-X12's `LinearReduce` abstraction (R-1, M:E-A) is the answer: + +```rust +pub trait Reducer<T> { + fn reduce_pair(&self, lhs: T, rhs: T) -> T; +} + +// For bit-exact reduction across archs: +pub struct OrderedKahanReducer; + +impl Reducer<f32> for OrderedKahanReducer { + fn reduce_pair(&self, lhs: f32, rhs: f32) -> f32 { + // Kahan compensated sum, with explicit left-to-right order + // Same bit pattern on every arch + kahan_add(lhs, rhs) + } +} +``` + +The codec uses `OrderedKahanReducer` for any sum that crosses a wire-format boundary — basin assignment, rate-distortion accumulation, transform coefficient sum. Same input → same bits, regardless of arch. Determinism is paid for in some throughput (Kahan is ~3× slower than naive sum), but it's a tunable choice per use site. + +**Without R-1's basis/reducer split, cross-arch determinism is a substrate-wide audit nightmare.** With it, the audit is per-use-site: grep for places that use `NaiveSimdReducer` on cross-wire-format paths, replace with `OrderedKahanReducer`. + +--- + +## 7. 
Failure modes and mitigations + +### 7.1 ABI drift between substrate and consumer + +If `ndarray::hpc::blas_level2::batched_gemm`'s signature changes, every consumer breaks. **Mitigation:** R-3's LoC envelope explicitly excludes the substrate API from "codec body LoC" — meaning the API gets the same review scrutiny as a public crate API. SemVer applies. + +### 7.2 Per-arch CI flake + +R-4 commits that codec-bench gates the merge on at most one arch. **Mitigation:** the bench must pass on the canonical arch (SPR); the other arches are *informational* on each PR but blocking on the release tag. This is the standard "fast PR / slow release" gate pattern. + +### 7.3 Version skew across the WoA fleet + +A cluster running mixed PR-X12 versions could produce inconsistent codec output. **Mitigation:** the wire format header includes a version byte (one of M:E-J's reserved bits in future revisions); decoder rejects incompatible streams with a clean error. The federation gossip in q2 propagates the codec version as part of the node descriptor. + +### 7.4 Federated codebook poisoning + +If R-13's federated codebook is updated by a compromised node, the cluster compresses badly. **Mitigation:** codebook updates are signed; q2 ignores updates not signed by quorum. Out of PR-X12 scope (it's a transport/auth concern) but the substrate exposes the hook. + +--- + +## 8. The consumer crates in detail + +Quick tour of what each crate inherits from PR-X12 substrate decisions: + +### 8.1 `burn` (model training/inference) + +Uses `blas_level3::gemm` for matrix multiply, `activations` for nonlinearities, `cam_pq` for KV cache compression. Per-arch dispatch via the same target_feature paths. Will benefit directly from PR-X12's R-4 / R-11 latency-assertion infrastructure when it lands (burn has wanted this for ~14 months). + +### 8.2 `candle` (quantized inference) + +Heavy user of `simd_int_ops` and `bf16_tile_gemm`. 
Most-affected consumer by R-5's per-arch crossover constants, because candle's q4/q8 paths have similar crossover decisions. Will likely adopt the same crossover-as-const pattern within the next 1-2 quarters. + +### 8.3 `lance-graph::blasgraph` (graph queries) + +Owner of tropical-GEMM (R-7); the codec is a consumer, not an owner, of that kernel. PR-X12's allowed dependency direction (`ndarray-codec → lance-graph::blasgraph`) was confirmed under R-7 only after careful audit; previously lance-graph could only consume `ndarray`, not be consumed by sibling crates. M:E-H clarifies this dep direction is fine because both crates sit above ndarray and below woa/q2. + +### 8.4 `surrealdb` (vector + relational DB) + +Uses `cam_pq::hnsw_search` for vector lookups, `simd_int_ops` for filter expressions. Will inherit R-13's federated-codebook pattern for its own quantized vector indexes (long-discussed, not scheduled). + +### 8.5 `MedCare-rs` (medical imaging) + +The doc most likely to drive R-1's basis trait to its limits — medical imaging uses DCT, DWT (wavelet), and 3D radon transforms, all of which want to be `Basis` impls. Provides the second non-trivial test of the basis trait after PR-X12 ships. Federated-codebook policy (R-13) is *required* for medical imaging because PHI rules prohibit per-instance codebooks leaking patient-specific symbol distributions. + +### 8.6 `smb-office-rs` (office document OCR) + +Heavy user of conv (`activations::conv2d`) and attention (within `burn`-backed models). Less affected by PR-X12's specific reservations; more affected by R-11's latency assertions, because office OCR is latency-sensitive for interactive use cases. + +### 8.7 `q2` (transport / coordination) + +Owns the federation policy (R-13), the codec version negotiation, and the per-arch capability gossip. q2 doesn't itself touch `ndarray::hpc` — it routes requests to consumers that do. 
q2's interaction with PR-X12 is at the orchestration layer: scheduling, codec version constraints, federated codebook policy. + +--- + +## 9. What PR-X12 must NOT break + +In light of the above, the irreducible commitments PR-X12 must keep for the consumer ecosystem: + +1. **Substrate API stability** — `blas_level2::batched_gemm`, `cam_pq::kmeans`, `fft::dct_apply`, `activations::conv2d` keep their signatures across PR-X12 changes. Additions OK, breaks not OK. +2. **Per-arch dispatch transparency** — consumers continue calling capability-trait methods; the substrate continues choosing the right SIMD path. +3. **`Reducer` ordered-sum guarantee** — any consumer using `OrderedKahanReducer` (or similar) continues to get bit-exact cross-arch reductions. +4. **Latency-assertion CI infrastructure** — R-11's framework is consumer-callable for their own benches; not codec-private. +5. **Codebook handle indirection** (R-13) — the codec ships with the handle pattern, consumers can swap codebooks without forking. + +If PR-X12 keeps these five things stable, the consumer crates inherit the win. If any one breaks, the cascade across burn/candle/lance-graph/surrealdb is weeks of remediation per affected crate. + +--- + +## 10. 
Cross-references + +- **Substrate canon:** `pr-x12-substrate-merged-canon.md` +- **Resolutions:** R-3, R-4, R-5, R-7, R-11, R-13 in `pr-x12-canon-resolutions-delta.md` +- **GEMM lens:** `pr-x12-x265-blasgraph-gemm.md` +- **Future capability lens:** `pr-x12-x266-3dgs-spacetime-upscaling.md` +- **WoA-side architecture:** check `woa-rs` repo `docs/architecture.md` (not in this repo) +- **Q2 transport:** see `q2` repo for codebook gossip protocol design +- **Federation policy reading:** R-13 calls out the model; q2 will implement + +_Last edit: 2026-05-22._ diff --git a/.claude/knowledge/pr-x12-x265-blasgraph-gemm.md b/.claude/knowledge/pr-x12-x265-blasgraph-gemm.md new file mode 100644 index 00000000..493b8cb0 --- /dev/null +++ b/.claude/knowledge/pr-x12-x265-blasgraph-gemm.md @@ -0,0 +1,281 @@ +# PR-X12 — x265 / HEVC through the BLAS-GEMM Lens + +> Date: 2026-05-22 +> Status: **perspective doc** — re-reads the HEVC/x265 design space as a sequence of GEMM operations. Companion to `pr-x12-substrate-merged-canon.md` and `pr-x12-canon-resolutions-delta.md`. +> +> Premise: every x265 inner loop has a GEMM form. HEVC was designed in 2013 against hardware that made per-pixel butterflies the fast path; modern hardware (VNNI, AMX, BF16) inverts that ranking. PR-X12 is what x265 would have been with the right hardware floor. + +--- + +## 0. The thesis in one sentence + +**x265 implements roughly nine inner loops, six of which collapse to GEMM under the SSD/k-means/tropical reformulations, three of which stay non-GEMM and live in cheap per-byte paths.** PR-X12 spends ~80% of encode time inside BLAS calls; HEVC reference spends ~30%. The reframing is not metaphor — it is an algebraic identity per stage. + +--- + +## 1. 
The nine HEVC primitives, classified + +| # | Primitive | HEVC native form | GEMM form | Where it lands | +|---|---|---|---|---| +| 1 | Motion estimation | SAD `Σ \|A-B\|` | SSD `\|\|A\|\|² - 2A·B + \|\|B\|\|²` → GEMV | `ndarray::hpc::blas_level2::batched_ssd_search` | +| 2 | Forward transform | 4×4 / 8×8 / 16×16 / 32×32 DCT-II butterflies | Batched DCT as GEMM at N≥64 | `ndarray::hpc::fft::DctIIBasis` + `bf16_tile_gemm` | +| 3 | Quantization | Scalar divide + round | Dot product against quant matrix | Inline; uses existing `simd_int_ops` | +| 4 | Mode decision (CTU split) | Recursive RDO, `O(4^d)` | Tropical-GEMM Bellman-Ford, `O(d²)` | `lance-graph::blasgraph::tropical_gemm` | +| 5 | Basin assignment (palette / k-means) | Linear scan distance comparisons | Batched Hamming/L2 dist as GEMM | `ndarray::hpc::cam_pq::kmeans` | +| 6 | Deblocking filter | 3×3 / 5×5 per-pixel separable conv | im2col + GEMM at block size ≥ 16 | `ndarray::hpc::activations` (existing conv path) | +| 7 | rANS state advance | u32 state machine | Symbol-frequency lookup; **not GEMM** | `ndarray-codec::ans` | +| 8 | Header bit-pack | u16 shift+mask | Not GEMM (per-leaf, ~5 ns) | `src/hpc/codec/mode.rs::pack_header` | +| 9 | Stream framing / sync | Byte-level append | Not GEMM | `ndarray-codec::stream` | + +Stages 1-6 (the inner-loop-cost-dominant ones) all have a GEMM or dot-product form, although quantization (stage 3) stays inline on the SIMD integer path (see §2.3). Stages 7-9 are I/O-bound and stay per-byte. The boundary between them is sharp because GEMM amortises hardware fusion (AMX, VNNI) while state-machine code can't. + +--- + +## 2. Per-stage detail — the algebraic moves + +### 2.1 Motion estimation: SAD → SSD (R-6) + +HEVC's reference encoder uses SAD because in 2007-2013, hand-tuned x86 `PSADBW` / `VPSADBW` was the fastest 16×16-block-difference primitive. SAD has no matrix structure — the absolute value inside the sum doesn't factor. 
+ +SSD is algebraically richer: + +```text +SSD(A, B) = Σ_ij (A_ij - B_ij)² + = Σ A² - 2 Σ (A·B) + Σ B² + = ||A||² - 2·(A·B) + ||B||² + ▲ + └── this is a GEMM/GEMV +``` + +For N motion-vector candidates against one reference block: + +- Candidate matrix `A_batch`: `(N × 256)` — 256 = 16×16 pixels per block +- Reference vector `B`: 256-d +- Middle term: `A_batch @ B` → `(N × 1)` GEMV +- `||A_i||²` precomputed once per candidate window; `||B||²` once per reference + +**On Cascade Lake+ with VNNI:** `VPDPBUSD` = 64 i8·i8→i32 ops/cycle. One 256-elem dot product = 4 ops = ~4 cycles. Versus `VPSADBW` SAD path: ~128 cycles per 16×16. **Speedup: 30-50× depending on batch.** + +**On Sapphire Rapids with AMX:** TDPBUSD tile op = 256 i8·i8→i32 ops in one tile cycle. 16 candidates batched fits one AMX tile; throughput rises by another factor of 4. + +Net: motion estimation is ~50× faster than HEVC reference, *for the same wire-format semantics*. Same MV grid, same precision, same RDO. The math is identical; the substrate is BLAS. + +### 2.2 Transform: per-block butterflies → batched DCT (R-5) + +HEVC ships Loeffler / Lengwehasatit 1D DCT-II butterflies — fast at single-block sizes (~80 ops per 32×32 transform), bad at batched dispatch. The Loeffler factoring is what made 2010-era CPUs (no SIMD GEMM at small sizes) able to encode HEVC at all. + +PR-X12 keeps the butterflies for small N and dispatches to BLAS GEMM at N ≥ 64: + +```text +N = number of contiguous transform blocks + +if N < 64: per-block butterfly (Loeffler) — fits L1, no batching overhead +if N >= 64: batched DCT as GEMM via DctIIBasis + bf16_tile_gemm + ~256 cycles for 64 blocks (AMX) vs ~1280 cycles butterfly +``` + +Crossover (R-5) varies per arch: SPR=64, SKX/ICL=32, Zen 4=96, Apple Silicon=256. + +**The trait pattern (R-1):** `DctIIBasis` implements `Basis` — the basis is data (the cosine matrix, computed once at startup). 
The reduction (`A4 transform path` and `EWA splat rasterizer Plan E`) both call `basis.apply(src, dst)`. **Same basis, two consumers.** + +### 2.3 Quantization: stays per-byte, doesn't need GEMM + +Scalar quantization is `q_ij = (coeff_ij * scale_ij) >> 15`. Per-coefficient cost ~1 ns; the entire 32×32 block quantizes in ~1000 ns scalar, no batching benefit. Stays at SIMD-batched i16 path (`simd_int_ops`), no GEMM layer. + +### 2.4 Mode decision: recursive RDO → tropical-GEMM (R-7) + +HEVC's partition decision walks the quad-tree recursively, computing Lagrangian cost at each split: + +```text +For each CTU at depth d: + for each of 4 children: + recursive RDO at depth d+1 + compute mode + transform + quant + rate + distortion + combine via min(D + λ·R) + +Time: O(4^d) per CTU. At d=4 (PR-X12): 256 leaves worst case. +``` + +Tropical-semiring reformulation: the (+, min) algebra has GEMM. Build the 85-node DAG with edge weights `W[parent, child] = ΔRDO`, then iterate `D ← min(D, D + W)` (one tropical-GEMM step). Repeat for d iterations. + +```text +Naive recursive: O(4^4) = 256 ops × |nodes| = ~22 K ops/CTU +Tropical-GEMM: O(d²) × |nodes| = 16 × 85 = ~1.4 K ops/CTU + ~16× speedup +``` + +For 4K @ 60 fps with 132K CTUs/frame, this is the difference between **4 ms and 64 ms per frame just for partition RDO**. At 60 fps's 16.67 ms budget, naive RDO doesn't fit. + +**Dep direction:** the tropical-GEMM kernel lives in `lance-graph::blasgraph` (it's been the cognitive-side substrate for years). Post-Plan-H, `ndarray-codec → lance-graph::blasgraph` is allowed because both are sibling crates above `ndarray` hardware. 
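The min-plus iteration `D ← min(D, D + W)` from this section can be sketched as below. This is a generic tropical relaxation on a dense cost matrix, written for clarity; it is not the `lance-graph::blasgraph` kernel, and the function names and the tiny example graph in the usage note are assumptions made for illustration.

```rust
/// One min-plus (tropical) relaxation step:
///   next[j] = min(d[j], min over i of (d[i] + w[i][j])).
/// `w[i][j]` is the cost of edge i -> j, or `f32::INFINITY` if there is no edge.
fn tropical_step(d: &[f32], w: &[Vec<f32>]) -> Vec<f32> {
    let n = d.len();
    let mut next = d.to_vec();
    for j in 0..n {
        for i in 0..n {
            // INFINITY + x stays INFINITY, so missing edges never win.
            let cand = d[i] + w[i][j];
            if cand < next[j] {
                next[j] = cand;
            }
        }
    }
    next
}

/// Minimum cost from `root` to every node after `depth` relaxation steps.
/// On a tree or DAG of height `depth`, this is enough to converge.
fn min_costs(w: &[Vec<f32>], root: usize, depth: usize) -> Vec<f32> {
    let mut d = vec![f32::INFINITY; w.len()];
    d[root] = 0.0;
    for _ in 0..depth {
        d = tropical_step(&d, w);
    }
    d
}
```

For a four-node graph with edges 0→1 (cost 2), 0→2 (cost 5), 1→2 (cost 1), and 2→3 (cost 1), three steps from node 0 yield costs `[0, 2, 3, 4]`: the cheaper two-hop route to node 2 replaces the direct edge on the second step. Each step is a matrix-vector product in the (min, +) semiring, which is why it can be handed to a GEMM-shaped kernel.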
+ +### 2.5 Basin assignment: k-means as batched dist + argmin + +For each cell, find the nearest of 4096 basin centroids: + +```text +distances[c] = ||cell - centroid_c||² for c in 0..4096 +basin = argmin(distances) +``` + +Both the distance computation and the argmin are batched primitives: + +- **Distance computation:** if cells are i8 fingerprints, batched Hamming distance via `VPOPCNTDQ` (Ice Lake+). If cells are f32/bf16, batched L2 via `_mm512_add_ps` after `_mm512_sub_ps`. +- **Across 4096 centroids:** matrix form. `dist = ||cells||² ⊕ ||centroids||² − 2 · (cells @ centroids^T)`. Same SSD identity as ME, scaled to codebook size. + +`cam_pq::kmeans` already ships this in `src/hpc/`. The codec's basin-assign step is a thin wrapper. + +### 2.6 Deblocking filter: per-pixel conv → im2col GEMM (only at scale) + +3×3 / 5×5 separable filters at block edges. For a single CU's deblocking pass (~64 edge pixels), per-pixel conv wins. For batched deblocking across many CUs in a frame, im2col + GEMM wins by ~3-5× on AMX-class hardware. + +x265's deblocking is one of the few stages that explicitly has per-block-size branches; PR-X12 keeps the same structure but dispatches the batched form through `ndarray::hpc::activations`. + +### 2.7 rANS: stays as state machine + +Not a GEMM. State machine that reads symbols, looks up `(freq, cumfreq)`, advances u32 state. ~10 ns/symbol on modern x86. Per-frame rebuild of the frequency table is the only batchable step (a sum-reduce, trivially SIMD). + +### 2.8 Header bit-pack / stream framing + +Per-leaf, 5-30 ns. No GEMM. Lives in `mode.rs::pack_header` / `pack_leaf` and the future `stream.rs`. + +--- + +## 3. Why HEVC's 2013 design space was BLAS-impoverished + +The HEVC spec was finalised in early 2013, against the following hardware: + +- **No VNNI** — Cascade Lake shipped 2019. `VPDPBUSD` is six years after HEVC was frozen. +- **No AMX** — Sapphire Rapids shipped 2023. Ten years after the spec. 
+- **No bfloat16** — on x86 it first shipped with Cooper Lake (2020, AVX512-BF16); AMX-BF16 tiles followed on SPR. HEVC's transform precision was set to fit i16 because i16 GEMM on Sandy Bridge SSE4 was the only practical option. +- **No `VPOPCNTDQ`** — Ice Lake 2019. HEVC's palette mode (SCC profile) was frozen with the assumption that 64-entry palettes were the cap, because larger palettes would have needed Hamming-distance GEMM that didn't exist. + +**The HEVC team made the right choices for 2013 hardware.** Per-pixel butterflies were faster than batched GEMM at small sizes. SAD via `VPSADBW` was faster than SSD via any 2013-era integer SIMD. 64-entry palettes were the largest size where the linear-scan k-means inner loop fit the L1 budget. + +**Every one of those choices is now obsolete.** The PR-X12 substrate isn't a redesign of HEVC's wire format — it's HEVC's wire format with the inner loops swapped out for what 2026 hardware actually wants. + +--- + +## 4. The reframing: PR-X12 IS x265 done as BLAS + +| Aspect | HEVC reference | PR-X12 | +|---|---|---| +| Wire format | 16-bit header + per-mode tail | **same** | +| Mode taxonomy | Skip / Merge / Delta / Escape | **same** | +| Quad-tree partition | 64×64 CTU → 8×8 leaf | **same**, `Ctu` runtime-flex (M:E-G) | +| Palette / basin codebook | 64 entries max | 4096 entries (12-bit, full HHTL Leaf tree) | +| RDO criterion | `D + λ·R` Lagrangian | **same** | +| RDO solver | recursive `O(4^d)` | tropical-GEMM `O(d²)` (R-7) | +| ME criterion | SAD | SSD (R-6) — algebraically lossless reframing | +| Transform | per-block Loeffler | batched DCT GEMM at N≥64 (R-5) | +| Entropy coder | CABAC | rANS — better Shannon-efficiency, simpler state | +| In-loop deblocking | per-pixel conv | im2col GEMM at batch (existing infra) | + +**The wire format is unchanged.** A PR-X12-encoded video should be decodable by an HEVC-spec decoder (modulo the rANS↔CABAC swap and the 4096-entry palette), because the semantic primitives — Skip/Merge/Delta/Escape, quad-tree CTU, RDO Lagrangian, DCT-II basis — are 
identical. + +**What changed is the implementation.** Each GEMM-form inner loop is now a BLAS call. + +--- + +## 5. What lands in `ndarray::hpc::blas_level2` (the codec's BLAS surface) + +The codec uses, but does not own, these four primitives: + +```rust +// R-6: ME via SSD identity +pub fn batched_ssd_search( + candidates: &[i8], // (N × 256) row-major, length n_candidates * 256 + n_candidates: usize, + reference: &[i8; 256], + out_distances: &mut [u32], // length N +); + +// R-5: batched DCT-II via GEMM +pub fn batched_dct_ii( + blocks: &[i16], // (M blocks × N×N) row-major + n_blocks: usize, + out: &mut [i16], // output coefficients +); + +// R-7: tropical-GEMM partition (lives in blasgraph, called from codec) +pub fn tropical_partition_rdo( + edge_weights: &[f32; 85], + out_min_costs: &mut [f32; 85], +); + +// k-means basin assignment (uses existing cam_pq) +pub fn kmeans_predict_batched( + cells: &[Fingerprint], + centroids: &[Fingerprint; 4096], + out_basin_idx: &mut [u16], +); +``` + +**Codec layer:** ~30-50 LoC per stage to wrap the BLAS call into the predict/A6/A4 flow. **BLAS layer:** zero new lines — all four already exist or land via existing infrastructure (`bf16_tile_gemm`, `cam_pq`, `simd_int_ops`). + +This is what makes R-3's ≤1500 generic-codec-LoC ceiling reachable. Most of the heavy lifting is already in `blas_level2`; the codec adds wrappers and orchestration, not new BLAS code. + +--- + +## 6. The "blasgraph synergy" claim made precise + +Earlier docs cited "blasgraph + MKL synergies" loosely. 
Quantified: + +**Of nine codec inner loops, six dispatch to BLAS:** + +| Loop | BLAS primitive | Existing infra | +|---|---|---| +| ME | SSD via VNNI GEMV | `blas_level2` after R-6 lands | +| Transform | Batched DCT GEMM | `bf16_tile_gemm` + `DctIIBasis` | +| Quant | Stays per-byte | n/a | +| Mode decision | Tropical-GEMM | `lance-graph::blasgraph` | +| Basin assign | Hamming/L2 batched dist | `cam_pq::kmeans` | +| Deblocking | im2col GEMM | `activations` (existing conv path) | +| rANS | Stays state-machine | n/a | +| Header | Stays per-byte | n/a | +| Framing | Stays per-byte | n/a | + +**On SPR with all six BLAS-dispatch paths active**, profile-guided estimate (calibrated during Plan G): + +- ~80% of total encode time spent inside BLAS calls +- ~15% in rANS + header + framing (the per-byte paths) +- ~5% in quantize + scalar housekeeping + +**HEVC reference encoder on the same SPR:** ~30% inside BLAS (mostly deblocking and ME bookkeeping); the rest is per-pixel butterflies + recursive RDO + SAD. The hardware sits idle 70% of the time at peak SIMD width. + +**The 50× ME speedup, 16× partition RDO speedup, and 4× transform speedup compose** because they sit in different stages of the encode pipeline. End-to-end encode at 4K @ 60 fps becomes feasible on a single SPR socket. + +--- + +## 7. Plan G video lane: the falsifier + +Per R-4, the video lane of `codec-bench` clears `≥0.95× x265 ultrafast ratio at PSNR ±0.1 dB on Big Buck Bunny 1080p`. The R-11 latency assertion adds: total encode time for the clip must complete within (clip duration × 0.5). + +**The hidden falsifier in §6's BLAS-synergy claim:** if Plan G's video lane profile shows <60% time-in-BLAS, the BLAS reframing is decorative — actually a critical bug, because it means the per-byte stages (rANS / header / framing) are dominating, which means SIMD-batched-encode (R-11) didn't actually land on the codec hot path. 
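The 60% threshold above can be expressed as a small check over per-stage cycle counts. This is a hedged sketch only: `StageCycles`, `blas_share`, and `check_blas_share` are hypothetical names, and how the per-stage cycles are collected is left open.

```rust
/// Hypothetical per-stage sample: cycles attributed to one encode stage.
struct StageCycles {
    name: &'static str,
    cycles: u64,
    dispatches_to_blas: bool,
}

/// Fraction of all cycles spent in BLAS-dispatch stages (0.0 if nothing was measured).
fn blas_share(stages: &[StageCycles]) -> f64 {
    let total: u64 = stages.iter().map(|s| s.cycles).sum();
    if total == 0 {
        return 0.0;
    }
    let blas: u64 = stages
        .iter()
        .filter(|s| s.dispatches_to_blas)
        .map(|s| s.cycles)
        .sum();
    blas as f64 / total as f64
}

/// Ok(share) when the BLAS share meets `min_share`; otherwise an error
/// naming the most expensive non-BLAS stage as a starting point for triage.
fn check_blas_share(stages: &[StageCycles], min_share: f64) -> Result<f64, String> {
    let share = blas_share(stages);
    if share >= min_share {
        return Ok(share);
    }
    let worst = stages
        .iter()
        .filter(|s| !s.dispatches_to_blas)
        .max_by_key(|s| s.cycles)
        .map(|s| s.name)
        .unwrap_or("none");
    Err(format!(
        "BLAS share {:.2} below {:.2}; largest non-BLAS stage: {}",
        share, min_share, worst
    ))
}
```

Reporting the largest non-BLAS stage in the error is a design choice for this sketch: a dispatcher that silently fell through to scalar would show up as an unexpectedly heavy per-byte stage.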
+ +**Suggested Plan G profile assert:** `perf stat -e cycles,instructions,L1_DCACHE_LOAD_MISSES` over the encode, with a sub-test breaking down cycles per stage. If the BLAS-dispatch stages don't sum to ≥60% of cycles, the abstraction is wrong somewhere. + +This is the kind of test that catches "we wrote the code but it's not actually using the GEMM path because the dispatcher fell through to scalar" — a class of bug that ate weeks of PR #134 / #175 SIMD work and only surfaced in CI. + +--- + +## 8. What this lens unlocks for x266 / next-gen codecs + +The next document (`pr-x12-x266-3dgs-spacetime-upscaling.md`) asks what's possible if the substrate isn't x265-compatible — if we *replace* in-loop filters with 3DGS space-time interpolation. The answer becomes obvious once the codec is read as a GEMM pipeline: the in-loop filter is just another GEMM stage in the pipeline, and replacing it with a different GEMM (one whose output is a 3DGS-rendered reference frame) costs no architectural complexity — only ships a different `Basis` impl. + +That's the bridge to the next doc. + +--- + +## 9. 
Cross-references + +- **R-N citations:** `pr-x12-canon-resolutions-delta.md` +- **Architecture canon:** `pr-x12-substrate-merged-canon.md` +- **Mechanical spec:** `pr-x12-codec-x265-design.md` (what's getting reimplemented) +- **Next lens:** `pr-x12-x266-3dgs-spacetime-upscaling.md` +- **In-tree code:** + - `src/hpc/blas_level1.rs`, `blas_level2.rs`, `blas_level3.rs` — host for `batched_ssd_search`, `batched_dct_ii` + - `src/hpc/cam_pq.rs` — k-means basin assignment + - `src/hpc/bf16_tile_gemm.rs` — AMX-class GEMM dispatch + - `src/hpc/codec/{ctu,mode,predict}.rs` — codec wire format + +_Last edit: 2026-05-22._ diff --git a/.claude/knowledge/pr-x12-x266-3dgs-spacetime-upscaling.md b/.claude/knowledge/pr-x12-x266-3dgs-spacetime-upscaling.md new file mode 100644 index 00000000..14ba0f2d --- /dev/null +++ b/.claude/knowledge/pr-x12-x266-3dgs-spacetime-upscaling.md @@ -0,0 +1,328 @@ +# PR-X12 — x266 / Next-Gen Codec via 3DGS Space-Time Upscaling + +> Date: 2026-05-22 +> Status: **speculative perspective doc** — explores what becomes possible when the codec substrate (PR-X12) is extended one step beyond HEVC compatibility, into territory that subsumes both AI-frame-interpolation and AI-super-resolution as codec-native deterministic operations. Companion to `pr-x12-x265-blasgraph-gemm.md`. +> +> Status caveat: nothing in this document is committed as PR-X12 scope. It's the future shape that PR-X12's substrate makes obvious. Plan E + Plan G prerequisites must land first. +> +> Premise: in-loop reference-frame reconstruction in HEVC is a 2D pixel-grid render. In a 3DGS-augmented codec, it's a re-rasterization from a 3D Gaussian scene model. Same trait (`Basis`), different impl. The decoder becomes responsible for (resolution, frame-rate) at playback time, not (encoder, capture). + +--- + +## 0. 
One-sentence thesis + +**HEVC's in-loop filter is a `Basis::apply` call whose output happens to be a 2D pixel array.** Replace that with an EWA-splat `Basis::apply` whose output is a 2D rasterization of a 3D Gaussian scene at a parameter-controlled (resolution, time), and *the same encoder produces a free space-time upscalable bitstream*: no AI frame interpolation, no neural super-resolution, just deterministic re-rasterization from a scene model that already lives in the wire format. + +--- + +## 1. The capability gap PR-X12 closes + +Current state-of-the-art for high-quality playback at non-native resolution or frame rate: + +- **Frame interpolation:** DAIN, RIFE, FILM. These are learned optical flow models that hallucinate intermediate frames. Per-frame inference cost ~30-100 ms on a GPU. Output differs across model versions. No codec integration. +- **Super-resolution:** ESRGAN, Real-ESRGAN, DLSS. These are learned upscalers. Per-frame cost similar. Same version-dependence and integration gap. +- **Codec-native upscaling:** Lanczos/bicubic. Deterministic but low-quality; H.266/VVC adds Reference Picture Resampling, but it's still a 2D resample, not a 3D-scene-aware reconstruction. + +**The PR-X12 substrate exposes another option:** ship a 3D scene model in the bitstream, and let the decoder render at arbitrary (res, fps). The 3D scene model is *the reference frame*, not a precomputed 2D image. This isn't novel as research (3DGS papers from 2023-2025); it's novel as a *codec primitive*, because no codec has been able to express "the in-loop filter is a basis swap, swap it" cleanly. PR-X12 can. + +--- + +## 2. 3DGS as a `Basis` impl: the trait shape + +Recall (R-1, M:E-A): `LinearReduce` decomposes into `Basis` + reduction. The basis is data; the reduction is the inner loop. The codec's transform path calls `basis.apply(src, dst)`; Plan E's EWA splat rasterizer calls the same. + +```rust +pub trait Basis { + /// Apply this basis to a source array, writing into a destination.
+ /// For DCT: src = pixel block, dst = coefficient block. + /// For EWA: src = 3DGS scene params, dst = rasterized 2D pixel frame. + // `Reducer` is an assumed name for the reduction-side trait of `LinearReduce`. + fn apply<T, R: Reducer<T>>( + &self, + src: &[T], + dst: &mut [T], + params: &Self::Params, // basis-specific: viewport, time, etc. + reducer: R, + ); + + type Params; +} + +// Existing (R-1, R-5): +impl Basis for DctIIBasis { + type Params = (); + fn apply<R: Reducer<i16>>(&self, src: &[i16], dst: &mut [i16], _: &(), r: R) { + // batched DCT-II via bf16_tile_gemm at N >= 64 + } +} + +// Future (Plan E, then x266): +impl Basis for EwaSplatBasis { + type Params = ViewportTime; // camera pose + timestamp + fn apply<R: Reducer<f16>>( + &self, + gaussians: &[GaussianRecord], // 3DGS scene (5-7 KB per cell, see §6) + out_frame: &mut [f16], // 2D pixel buffer at target res + vp: &ViewportTime, // (W, H, t), chosen by decoder + r: R, + ) { + // Rasterize 3DGS scene at (W, H, t) + // Same per-tile GEMM pattern as ndarray-image's existing EWA path + } +} + +struct ViewportTime { + width: u32, + height: u32, + time_ms: u64, // frame timestamp; 3DGS scene is continuous in t + camera_pose: Mat3x4f, // identity for monoscopic; non-trivial for VR +} +``` + +**The crucial property:** the codec body (`ndarray-codec`) doesn't know whether it's calling `DctIIBasis` or `EwaSplatBasis`. It dispatches via the trait. The bitstream header (`Ctu` header bits, see M:E-J) selects which basis is in play. + +This is exactly the kind of substrate flexibility R-1 was designed to provide. Without R-1, this paragraph is fantasy; with R-1, it's a 6-week engineering effort to land Plan E and wire the trait. + +--- + +## 3. The encoder problem: fitting a Gaussian scene model to a clip + +Encoding a video clip with a 3DGS scene anchor means: given N input frames at known camera pose (or estimated pose), find a 3DGS scene S such that rendering S at each frame's (pose, time) reproduces the input frames to within a target PSNR. + +This is a standard 3DGS fitting problem (Kerbl et al. 2023, Mip-Splatting 2024).
The relevant fact for PR-X12: + +```text +Input: N frames @ (1080p, 24 fps) for 10 seconds = 240 frames +Output: scene S = ~100K-500K anisotropic Gaussians + ~32 bytes per Gaussian (position 3×f16, scale 3×f16, + rotation quaternion 4×f16, color 3×f16, opacity 1×f16 = 28 B) + + quantized SH coefficients for view-dependent color (~8-16 B) + Total: ~40-50 bytes per Gaussian × 200K = 8-10 MB per scene anchor + +Compare to: + 240 frames × 3 MB (Bbb 1080p I-frame at HEVC Q=20) = ~720 MB raw I-frame + HEVC encoded @ ~5 Mbps = ~6.3 MB for the whole clip +``` + +So the scene-anchor encoding is the same order of magnitude as standard HEVC encoding *for one anchor period*. The win comes from: + +1. **Re-rasterization is free** — render at 4K, 8K, 60 fps, 120 fps, all from the same 8 MB scene model +2. **Anchor periods stretch** — if motion is low, one anchor lasts 10+ seconds; HEVC has to re-encode I-frames every ~2 sec for random-access seek +3. **View interpolation** — for VR/stereo, render two views from one scene; HEVC needs to encode two streams + +**The encoder pipeline:** + +```text +Anchor frame n: + 1. Estimate camera pose from frame n+0 through n+anchor_period (~240 frames) + 2. Initialize Gaussian cloud from frame n's depth estimate + 3. Optimize cloud via gradient descent: minimize Σ |render(S, t_k) - frame_k|² + (This is k-means-like; uses cam_pq infrastructure) + 4. Quantize to scene-anchor format (see §5) + +Per-frame delta n+1, n+2, ...: + Standard HEVC inter-prediction against the 3DGS-rendered ref frame. + The 3DGS-rendered ref is computed by the decoder too, so the delta is + in the same algebraic space as HEVC. +``` + +The clever part: **the decoder's reference frame for inter-prediction is the 3DGS render at the previous frame's (pose, t)**. So the per-frame delta is small — most motion is already captured in the scene model. + +--- + +## 4. 
The decoder problem: rasterizing at arbitrary (res, fps) + +The decoder receives: + +- Scene anchor: scene S (8-10 MB) at clip start, then refreshes every ~250 frames +- Per-frame deltas: standard HEVC-like residual, against the 3DGS-rendered ref + +At playback time: + +```text +For each output frame at (W_target, H_target, t_target): + 1. Render scene S at (W_target, H_target, t_target) via EwaSplatBasis::apply + Output: ref_frame in pixel buffer + 2. Decode per-frame delta against ref_frame + 3. Apply standard HEVC in-loop filtering (deblock + SAO) + 4. Emit pixel buffer +``` + +**Key observation:** step 1 is parametrised in (W, H, t). The encoder shipped a 1080p @ 24 fps clip; the decoder renders at 4K @ 60 fps by choosing different (W_target, H_target, t_target) tuples. The scene model is continuous in (W, H, t); the rasterizer interpolates. + +This is **codec-native space-time upscaling**, deterministic across decoder implementations because the math (EWA splat rasterization) is well-specified. Same scene model, same camera pose, same t → same pixels. No model versioning. No "frame interpolation v3 hallucinates differently than v2." + +**Cost per frame:** EWA splat raster of 200K Gaussians at 4K → ~5-15 ms on a modern GPU; ~50-100 ms on CPU. Tight for real-time decode at 60 fps on CPU; comfortable at 24-30 fps. R-11 latency assertion applies — Plan G's decoder lane must hit real-time at the target playback rate. + +--- + +## 5. Wire format: scene-anchor frames + per-frame deltas + +Building on M:E-J's 16-bit header layout (header_kind ∈ {Skip, Merge, Delta, Escape}), x266 needs one new header_kind: **SceneAnchor**. 
+ +```text +HEVC-compatible PR-X12 header (16 bits, R-2): + bits 0-1: header_kind {Skip, Merge, Delta, Escape} + bits 2-13: basin_index (12 bits, M:E-J) + bit 14: CONSUMER-TYPED (semantic per frame-header `ConsumerProfile`; + cognitive: Pearl-rung high bit; video: reserved=0; + splat: LOD-cascade-source flag; gradient: worker-shard parity) + bit 15: UNIVERSAL "has inter-tier reference" (A3-inter); identical + across all four consumers + NOTE: leaf-size (8/16/32/64) is encoded structurally via `Ctu` + (M:E-G) at the type level, not via header bits. + +x266 extension (NOT in PR-X12 scope, future): + bits 0-1: header_kind, now 4 variants + 00 = Skip (HEVC-compatible) + 01 = Merge (HEVC-compatible) + 10 = Delta (HEVC-compatible) + 11 = Escape OR SceneAnchor (escape bit at byte boundary disambiguates) + bits 2-15: basin_index (12 bits) + scene_anchor_id (2 bits) when in anchor mode +``` + +**Anchor frame payload** (after the 16-bit header): + +```text +SceneAnchorFrame: + scene_id: u8 // which anchor in the GOP + num_gaussians: u24 // typically 50K - 500K + cam_pose_keyframes: u8 // number of pose anchors + [GaussianRecord; N]: // 40-50 bytes each, quantized + position: [u16; 3] // q15 fixed-point per axis + scale_log: [u8; 3] // log-quantized + rot_quat: [u8; 4] // quantized to 8-bit + sh_coeffs: [u8; 27] // 9 coefs per channel × 3 channels, q7 + opacity: u8 + pose_keyframes: [(t_ms: u32, Mat3x4f); cam_pose_keyframes] +``` + +Per-frame deltas after the anchor are standard HEVC-like, with one difference: the reference frame is derived by rasterizing the anchor scene at the frame's (pose, t), not by decoding a prior I-frame. + +**Bitstream compatibility:** an HEVC-spec decoder that doesn't understand `SceneAnchor` headers can fall back to displaying the inter-frame deltas as zero-padded macroblocks (visibly broken, but won't crash). A PR-X12 decoder with EwaSplatBasis loaded plays back at native quality. + +--- + +## 6. Bandwidth math: when does this beat HEVC? 
+ +Rough rule (calibrated against published 3DGS papers): + +```text +Clip: 10 seconds, 1080p, 30 fps, modest motion (e.g. Bbb sample) + +HEVC reference (5 Mbps avg, hardware encoded): + bytes = 5 × 10⁶ × 10 / 8 = 6.25 MB + +PR-X12 + 3DGS anchor (single anchor for the clip): + anchor: 200K Gaussians × 40 B = 8 MB + deltas: ~300 frames × 1 KB avg = 300 KB + Total: 8.3 MB + +→ HEVC wins by ~25% for native (1080p, 30 fps) playback. + +BUT for 4K @ 60 fps playback: + HEVC: re-encode at 4K/60fps target = 4 (res) × 2 (fps) × 6.25 = 50 MB + (4× pixel scaling × 2× framerate scaling × 6.25 MB native size, + assuming size scales linearly with pixel count and frame rate; + or super-res upscaling at decode = 6.25 MB + neural inference) + PR-X12 + 3DGS: same 8.3 MB + decoder rasterizes at (4K, 60 fps); the math is in the scene + +→ PR-X12 wins by ~6× for high-resolution playback, + AND playback is deterministic (no neural model versioning). +``` + +**Where the crossover sits:** PR-X12 + 3DGS becomes a win when the playback target (W × H × fps) exceeds the encode target by ~1.3× (the point at which HEVC's re-encoded size crosses the fixed 8.3 MB PR-X12 budget). At 1× (native), HEVC is ~25% smaller. At 8× pixel-bandwidth (4K@60 from 1080p@30), PR-X12 dominates by ~6×. + +This matches the intuition that **3DGS is a scene model**, not a frame model: its size stays fixed as the playback target grows, while HEVC's grows with it. + +--- + +## 7. The "free upscaling" insight: why this isn't AI + +Critics will read §6 and say "this is just AI upscaling rebranded." The distinction is sharper than it sounds. + +**AI upscaling** (DLSS, ESRGAN, Real-ESRGAN, RIFE, DAIN, FILM): +- Input: 2D pixel array at low res +- Model: learned NN with millions of parameters; output differs across versions +- Output: 2D pixel array at high res, with hallucinated detail +- Failure mode: hallucinates wrong detail (e.g.
wrong text on a sign) +- Latency: per-frame ~30-100 ms on a GPU +- Codec integration: zero + +**PR-X12 + 3DGS rasterization** (this doc): +- Input: 3D Gaussian scene + camera pose +- Model: closed-form EWA splat formula (Zwicker et al. 2001, refined in 3DGS papers) +- Output: 2D pixel array at any res, computed deterministically +- Failure mode: misses detail that wasn't in the scene model — but never hallucinates +- Latency: per-frame ~5-15 ms on a GPU; ~50-100 ms on CPU +- Codec integration: full, basis trait dispatch + +The 3DGS scene captures the actual 3D geometry of what was in front of the camera. Rasterizing at higher resolution doesn't invent detail — it *samples the 3D scene more finely*. If the encoder couldn't fit a detail (e.g. the text on a small sign), the decoder can't recover it. That's a **failure of completeness**, not a failure of fidelity. Compare to AI upscaling, which has both modes and can't tell you which is happening. + +For high-stakes video (legal evidence, medical imaging, scientific recording), this distinction matters. PR-X12 + 3DGS is **legally and scientifically defensible** in a way no learned upscaler can be. + +--- + +## 8. PR-X12 prerequisites + +Nothing in this doc is in PR-X12 scope. What it requires from PR-X12: + +| Requirement | Source | Status | +|---|---|---| +| `Basis` trait with parametric `apply` | R-1, M:E-A | landed in concept; implementation in Plan A4 | +| EWA splat rasterizer as `Basis` impl | Plan E | scheduled | +| Codec body decoupled from specific basis | M:H-NEW-2 LoC envelope | enforced via R-3 audit | +| Header byte stable across basis swaps | R-2, M:E-J bits 0-1 | landed | +| Plan G video lane validates per-arch latency | R-4, R-11 | scheduled | +| Federated codebook policy for scene anchors | R-13 | landed | + +The path to x266-like capability is: + +1. Land PR-X12 (HEVC-compatible, no 3DGS). Plan A4 → Plan H. +2. Plan E ships EWA splat as `Basis`. +3. 
New crate `ndarray-codec-scene` (or extension within `ndarray-codec`) adds `SceneAnchor` header kind + scene encoder/decoder pipelines. +4. Bench against AI upscaling pipelines (RIFE / Real-ESRGAN) on quality and latency. +5. Standardise the wire format extension (separate spec, not HEVC-compatible). + +Conservative estimate: **24-36 months from PR-X12 merge**, assuming Plan E lands on schedule and 3DGS encoder math is taken from existing research (no novel algorithms required). + +--- + +## 9. Falsifiers + +What kills this path? Be specific: + +**F-1: Encoder math doesn't converge for general video.** 3DGS papers focus on static scenes with controlled camera motion. Real video has occlusion, transparency, fast motion. If 3DGS fitting can't hit PSNR ≥ 35 dB on motion-heavy clips (e.g. sports footage) within reasonable encode time, the substrate is decorative. **Mitigation:** restrict scope to slow-camera-motion content (talking heads, drone footage, security cameras); HEVC stays the fallback for sports. + +**F-2: Decoder rasterization too slow.** If EwaSplatBasis::apply can't hit real-time at 4K @ 60 fps on a 2026-class CPU, the codec is server-side only. **Mitigation:** PR-X12's R-11 latency assertion catches this in CI; if the CPU path fails, the codec emits a GPU-required flag in the bitstream. + +**F-3: Wire format ossifies.** If HEVC stays dominant and x266 adoption is slow (the H.266/VVC story so far — 2020 release, still <5% market share in 2026), the SceneAnchor extension never sees a standards body. **Mitigation:** ship it as a non-standard extension first, in an open-source decoder; let market traction force standardisation. + +**F-4: Patents.** 3DGS-as-codec-primitive may sit in a patent thicket. Some 3DGS rendering optimisations (tile binning, depth sorting) are likely patented. 
**Mitigation:** the basis trait is general; if Gaussian splats are patented, swap to another basis (TensoRF, NeRF compaction, point cloud + bilinear) — same architecture, different math. + +None of these falsifiers invalidate PR-X12 itself. They only constrain the post-PR-X12 path. + +--- + +## 10. Why this lens matters now, for PR-X12 scoping + +The temptation in scoping PR-X12 is to optimise for HEVC compatibility only — strip out anything that doesn't directly serve the x265-replacement story. **The basis trait (R-1) and the EWA-splat schedule (Plan E) survive that pruning** because they were independently motivated. This doc makes the case that they were also the right call by another measure: they're the substrate that lets x266 happen at all. + +Concretely: + +- **Do not** weaken `Basis` to be DCT-only "for now." The generality has zero LoC cost (the trait is the same) and unlocks 3DGS later. +- **Do** keep Plan E on the roadmap even if Plan H/codec-fast-path pressure tries to defer it. EWA splat is the first non-DCT basis and validates the trait shape. +- **Do** keep the codec body free of basis-specific code. M:H-NEW-2's "ratchet on codec LoC at the basis boundary" already enforces this; the x266 lens is why it matters. + +--- + +## 11. Cross-references + +- **Substrate dependencies:** R-1, R-2, R-3, R-11, R-13 in `pr-x12-canon-resolutions-delta.md` +- **Basis trait architecture:** §M:E-A in `pr-x12-substrate-merged-canon.md` +- **EWA splat planning:** Plan E in `pr-arc-inventory.md` +- **Codec foundation:** `pr-x12-codec-x265-design.md` +- **GEMM lens:** `pr-x12-x265-blasgraph-gemm.md` +- **Bandwidth comparison reading list:** 3DGS (Kerbl et al. SIGGRAPH 2023), Mip-Splatting (Yu et al. 2024), 4DGS (Wu et al. 
2024) + +_Last edit: 2026-05-22._ +_Status: speculative — explores what's possible after PR-X12 lands; not in PR-X12 scope._ diff --git a/.gitignore b/.gitignore index bf2b8312..286c990a 100644 --- a/.gitignore +++ b/.gitignore @@ -10,3 +10,6 @@ target/ # Claude Code: agent isolation worktrees (temporary, per-agent) .claude/worktrees/ + +# Claude Code: per-user permission overrides (survives branch switches) +.claude/settings.local.json