diff --git a/.claude/knowledge/pr-x12-anti-neural-lookup-inversion.md b/.claude/knowledge/pr-x12-anti-neural-lookup-inversion.md
new file mode 100644
index 00000000..46e6f515
--- /dev/null
+++ b/.claude/knowledge/pr-x12-anti-neural-lookup-inversion.md
@@ -0,0 +1,337 @@
+# PR-X12 — The Anti-Neural Codec: Lookup-Table Inversion of NN Inner Loops
+
+> Date: 2026-05-22
+> Status: **wildcard perspective doc** — the most interesting reframe I can articulate of PR-X12's substrate. Companion to the GEMM lens (`pr-x12-x265-blasgraph-gemm.md`), 3DGS lens (`pr-x12-x266-3dgs-spacetime-upscaling.md`), and orchestration lens (`pr-x12-woa-multiarch-orchestration.md`).
+>
+> Premise: every "neural codec" primitive in current research — VQ-VAE, neural RDO, neural rendering, learned wavelets — has a **frozen lookup-table dual** that achieves the same information-theoretic compression at 50-1000× lower inner-loop cost. PR-X12 systematically picks the lookup-table dual for every inner-loop op, then proves it converges to within an information-theory-bounded ε of the neural codec's compression ratio. The codec has **zero NN forward passes in the inner loop**, by design.
+
+---
+
+## 0. Thesis in one paragraph
+
+**A 4096-entry codebook indexed by a 12-bit fingerprint is structurally equivalent to a 1-layer 12-bit-input MLP that has been frozen and tabulated.** Any neural codec whose inner loop is "embed → match → score" can be replaced by "fingerprint → table lookup → score" for the same expressive power, at table-lookup latency (~3-10 ns) vs NN-forward-pass latency (~3-30 µs). PR-X12 makes this systematic: every primitive that *could* be an NN inner-loop op is instead a lookup table. The result is a codec that has the compression of a neural codec but the inner-loop cost of x265.
+
+This is not anti-NN. It is **anti-NN-in-the-inner-loop**. NNs train the tables. Once trained, the table replaces the NN.
+
+---
+
+## 1. The current research direction: NN-in-loop codecs
+
+Recent codec research direction (2020-2026):
+
+| Codec | NN role | Inner-loop cost |
+|---|---|---|
+| **Lyra** (Google, 2021) | Neural vocoder decoder | ~3 ms per 20 ms audio frame on a phone |
+| **SoundStream** (Google, 2021) | VQ-VAE encoder + neural decoder | ~10 ms per 20 ms audio frame |
+| **EnCodec** (Meta, 2022) | Residual VQ-VAE + transformer prior | ~30 ms per 20 ms audio frame on GPU |
+| **NVIDIA Maxine** (2020+) | Latent-space face encoding | ~16 ms per 1080p video frame on a 4090 |
+| **AOMedia ML-AV1** (research) | Per-CTU NN-based RDO | ~5-20 ms per CTU |
+| **Google ML-Image** (2023) | Learned transform + entropy model | ~100 ms per image on GPU |
+
+All of these share a common shape:
+- Encoder: input pixels → embedding network → quantize → bitstream
+- Decoder: bitstream → embedding → decoder network → output pixels
+- Inner loop has *at least one* NN forward pass per emit operation
+
+The compression results are excellent. Lyra hits ~3 kbps speech at 16 kHz quality. EnCodec matches MP3 at ~12× lower bitrate. The inner-loop latency cost is *catastrophic*: ~3-100 ms per emit, vs ~0.1-1 µs for x265's per-block inner loop.
+
+**The structural problem with NN-in-loop:**
+
+1. Each forward pass = thousands to millions of MAC operations
+2. Tensor framework overhead (PyTorch / candle / burn) = 50-200 µs per dispatch
+3. Model version drift across decoders breaks playback
+4. Quantization sensitivity: mixing int8 weights with f16 activations yields results that are not bit-reproducible across backends
+5. Cannot run inside L1 cache; needs L3 / HBM for weights
+
+---
+
+## 2. The PR-X12 inversion: pre-baked lookup tables
+
+Every NN-in-loop primitive in §1 has a frozen-table dual. PR-X12 is the systematic instantiation of those duals:
+
+| NN-in-loop primitive | PR-X12 lookup-table dual | Inner-loop cost |
+|---|---|---|
+| VQ-VAE encoder embedding | k-means basin codebook (R-10, M:H-6) | ~10 ns per cell (L1-resident) |
+| VQ-VAE decoder | Same codebook, reverse lookup | ~3 ns per cell |
+| Neural RDO scoring | Tropical-GEMM partition (R-7) | ~1.4 K ops per CTU |
+| Neural rendering | EWA splat rasterizer (Plan E) | ~5-15 ms per 4K frame |
+| Learned transform | DCT-II batched GEMM (R-5) | ~256 cycles per 32×32 block |
+| Transformer prior / entropy model | Gaussian-tail rANS (R-10) | ~10 ns per symbol |
+
+The codec's inner loop **never** dispatches to a tensor framework. The basin codebook is a fixed `[Fingerprint; 4096]` array (~32 KB at 8 bytes per entry, per §4.2 and §9; L1/L2-resident). The tropical-GEMM partition runs over an 85-node DAG (~1 KB working set). The DCT basis is a `[i16; N*N]` array (~8 KB for 64×64). All resident in cache, all branchless on the hot path.
+
+**The total NN-flops in the codec's inner loop: zero.** NNs trained the codebooks; the codebooks live in the bitstream / metadata; the decoder does table lookups, not forward passes.
+
+---
+
+## 3. The math: every NN-in-loop primitive has a lookup-table dual
+
+### 3.1 Basin codebook ≡ frozen 1-layer 12-bit MLP
+
+A VQ-VAE encoder's job: map continuous input embedding `x ∈ ℝᵈ` to discrete index `k ∈ {0..K-1}`, where centroid `c_k ∈ ℝᵈ` is the nearest among K learned centroids.
+
+```text
+VQ-VAE encoder:    k = argmin_j ||x - c_j||²
+VQ-VAE decoder:    x' = c_k
+```
+
+**PR-X12 basin codebook (R-10, M:H-6):** same algebraic operation, with the embedding step pre-computed by an OFFLINE training run (k-means over a corpus), then frozen into a 4096-entry codebook indexed by a 12-bit fingerprint.
+
+```text
+PR-X12 encoder:    fp = compute_fingerprint(x)    [~10 ns, deterministic hash]
+                   k  = codebook.nearest_index(fp) [~3 ns, table lookup]
+PR-X12 decoder:    x' = codebook[k]               [~3 ns]
+```
+
+**Why this is equivalent in expressive power:**
+
+- A 4096-entry lookup table on 12-bit input is structurally a `[4096]` array, i.e., a discrete function defined on all 2^12 = 4096 possible inputs
+- Any 12-bit-input network has at most 2^12 = 4096 distinct outputs
+- A `Linear(12, K) → argmax` with frozen weights is structurally an array lookup
+- The codebook IS the trained network, materialized as data
+
+**Why this is faster:**
+
+- 4096-entry lookup: 1 memory ref (table is in L2 cache, 64 ns p99)
+- 1-layer 12-bit-input 4096-output Linear: ≥ 49,152 MACs + softmax + argmax ≈ 3-30 µs
+- **Speedup: 1000-5000×** per inner-loop emit
+
+The achievable rate R (bits per cell) is bounded below by Shannon's source coding theorem: R ≥ H(cells). The codebook approaches H(cells) up to the overhead of indexing K = 4096 entries (log₂ K = 12 bits per index before entropy coding). A neural encoder approaches the same H(cells) (assuming optimal training). Compression is asymptotically equivalent; latency is not.
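+
+The tabulation argument above can be sketched in a few lines of Rust. This is a minimal illustration, not PR-X12 code: the names `forward_argmax`, `tabulate`, and `lookup` are hypothetical, the weights are plain integers chosen for determinism, and the 12-bit input is treated as 12 binary features. It shows that once a `Linear(12, K) → argmax` layer is frozen, evaluating it on every possible input yields a 4096-entry table that returns the same answer with one indexed load.
+
+```rust
+/// Number of input bits and the resulting table size (2^12 = 4096).
+pub const INPUT_BITS: usize = 12;
+pub const TABLE_LEN: usize = 1 << INPUT_BITS;
+
+/// Forward pass of a frozen `Linear(12, K) -> argmax` on one 12-bit input.
+/// Each input bit is a 0/1 feature; ties resolve to the lowest index.
+/// `weights` and `bias` are assumed to have the same length K.
+pub fn forward_argmax(weights: &[[i32; INPUT_BITS]], bias: &[i32], x: u16) -> u16 {
+    let mut best_k = 0usize;
+    let mut best_score = i64::MIN;
+    for (k, row) in weights.iter().enumerate() {
+        let mut score = bias[k] as i64;
+        for bit in 0..INPUT_BITS {
+            if ((x >> bit) & 1) == 1 {
+                score += row[bit] as i64;
+            }
+        }
+        if score > best_score {
+            best_score = score;
+            best_k = k;
+        }
+    }
+    best_k as u16
+}
+
+/// Offline step: evaluate the frozen layer on every possible input once.
+pub fn tabulate(weights: &[[i32; INPUT_BITS]], bias: &[i32]) -> Vec<u16> {
+    (0..TABLE_LEN as u16)
+        .map(|x| forward_argmax(weights, bias, x))
+        .collect()
+}
+
+/// Inner-loop replacement: a single indexed load, no MACs.
+pub fn lookup(table: &[u16], x: u16) -> u16 {
+    table[(x & 0x0FFF) as usize]
+}
+```
+
+For any frozen `weights` and `bias`, `lookup(&tabulate(weights, bias), x)` equals `forward_argmax(weights, bias, x)` for all 4096 inputs by construction; the only thing that changes is where the work happens (offline versus per emit).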
+
+### 3.2 Tropical-GEMM RDO ≡ frozen GNN
+
+Neural RDO research (AOMedia ML-AV1, others 2022-2025): train a graph neural network to score quad-tree partition candidates. Each CU is a node; edges are split decisions; node features include local pixel statistics; the GNN outputs a scalar RDO score.
+
+The GNN's expressiveness for this problem maps directly onto tropical-semiring arithmetic:
+
+```text
+GNN forward pass on RDO graph:
+    h_v^(l+1) = σ(W · aggregate(h_u^(l) for u in N(v)) + b)
+
+where aggregate = sum or max, σ = ReLU, ...
+
+Tropical semiring (R-7) on the same graph:
+    h_v^(l+1) = min_u (h_u^(l) + W_{uv})    [identity on min-plus algebra]
+```
+
+**Identity:** if the GNN's aggregator is `max`, the edge weights enter additively, and σ is identity-on-positive, then the GNN forward pass on the RDO graph **is** a tropical-GEMM iteration over the max-plus semiring, which is the min-plus semiring of R-7 under negation of the weights. The neural RDO research community has spent ~3 years arriving back at min-plus algebra, the way Bellman-Ford has always solved this.
+
+**PR-X12's tropical-GEMM:**
+
+- O(d²) iterations of `D ← min(D, D ⊗ W)` (min-plus product) over 85-node DAG
+- Hand-tuned `W` edge weights (or one offline calibration run)
+- ~1.4 K ops per CTU (R-7 estimate)
+
+**Neural RDO:**
+
+- Per-CTU GNN forward pass with ~30-50 K parameters
+- ~5-20 ms per CTU (10,000× slower)
+- Same algebraic information content for the partition problem
+
+**Why frozen wins:** the partition problem is small (85 nodes, d=4 depth). The hand-tuned W matrix has ~340 weights. A learned GNN trained on the same partition problem has 30-50K parameters but the optimum is *low-dimensional*. PR-X12 picks the low-dim solution directly.
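+
+The min-plus recurrence `h_v = min_u (h_u + W_{uv})` from this section can be sketched as a plain relaxation loop. This is a minimal illustration under stated assumptions: the dense `Vec<Vec<u32>>` weight matrix, the `INF` sentinel, and the function names are hypothetical, and the code is not the `lance-graph::blasgraph` kernel or the 85-node partition DAG itself.
+
+```rust
+/// Sentinel for "no edge" / "unreached" (the additive identity of min).
+pub const INF: u32 = u32::MAX;
+
+/// Tropical multiplication is ordinary addition; INF absorbs.
+fn tropical_mul(a: u32, b: u32) -> u32 {
+    if a == INF || b == INF {
+        INF
+    } else {
+        a.saturating_add(b)
+    }
+}
+
+/// One sweep: out[v] = min(d[v], min_u (d[u] + w[u][v])).
+pub fn relax(d: &[u32], w: &[Vec<u32>]) -> Vec<u32> {
+    let n = d.len();
+    let mut out = d.to_vec();
+    for v in 0..n {
+        for u in 0..n {
+            let cand = tropical_mul(d[u], w[u][v]);
+            if cand < out[v] {
+                out[v] = cand;
+            }
+        }
+    }
+    out
+}
+
+/// Iterate sweeps from `source` until a fixed point, at most n - 1 times
+/// (the Bellman-Ford bound for non-negative weights).
+pub fn min_plus_closure(source: usize, w: &[Vec<u32>]) -> Vec<u32> {
+    let n = w.len();
+    let mut d = vec![INF; n];
+    d[source] = 0;
+    for _ in 0..n.saturating_sub(1) {
+        let next = relax(&d, w);
+        if next == d {
+            break;
+        }
+        d = next;
+    }
+    d
+}
+```
+
+Every operation is an integer add or compare, so the result is bit-identical on any architecture, which is the property §4.1 relies on.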
+
+### 3.3 EWA splat ≡ frozen 1-layer projection
+
+Neural rendering (NeRF, Mip-NeRF, Instant-NGP): MLPs that map (pos, viewdir) → (RGB, density). Forward pass per pixel during render.
+
+```text
+NeRF:           per-pixel MLP forward pass, ~10-100 µs per pixel on GPU
+3DGS rasterize: per-Gaussian closed-form EWA projection, ~30-100 ns per Gaussian
+```
+
+The 3DGS render *is* the discretized, frozen, closed-form solution that NeRF's MLP was trying to approximate. The 200K Gaussians in a scene are a non-parametric discrete representation of what a NeRF MLP encodes implicitly.
+
+**PR-X12's EWA splat basis (Plan E, future x266):**
+
+- Per-Gaussian: 1 projection (4 MAC), 1 covariance evaluation (6 MAC), 1 tile-binning lookup
+- Per-pixel: sort + alpha-blend (already optimized in published 3DGS code)
+
+**Neural rendering equivalent:** ~10,000× slower at comparable visual quality. The compression ratio (scene MB per pixel rendered) is approximately equivalent — within a factor of 2 — because both encode the same 3D scene at the same fidelity. The latency gap is the win.
+
+### 3.4 DCT-II basis ≡ 1-layer linear projection
+
+This one is too well-known to belabor: an N-point DCT-II is a fixed `(N × N)` matrix multiplied against the input. A "learned transform" research codec uses gradient descent to find a (close-to-DCT) transform that's slightly better at the training distribution. The information gain is bounded: most natural images have a near-DCT eigenbasis, and the learned transforms typically beat DCT by <0.1 dB PSNR.
+
+For 0.1 dB PSNR you pay:
+
+- Per-block matrix multiply with the learned weights (~256 cycles, same as DCT)
+- *PLUS* the model versioning / training framework / per-arch dispatch headache
+
+PR-X12 chooses DCT-II (R-5) because the gain from a learned transform is below the noise floor of arch-dependent rounding.
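+
+To make the "fixed matrix" point concrete, here is a minimal sketch of an orthonormal N-point DCT-II built as a constant `N × N` matrix and applied as a matrix-vector product. It uses `f64` for readability; the integer `[i16; N*N]` basis mentioned in §2 is not reproduced here, and the function names are hypothetical.
+
+```rust
+use std::f64::consts::PI;
+
+/// Orthonormal N-point DCT-II basis as a row-major N x N matrix.
+/// Row k, column i holds s_k * cos(pi * (2i + 1) * k / (2N)),
+/// with s_0 = sqrt(1/N) and s_k = sqrt(2/N) for k > 0.
+pub fn dct2_matrix(n: usize) -> Vec<f64> {
+    let mut m = vec![0.0f64; n * n];
+    for k in 0..n {
+        let scale = if k == 0 {
+            (1.0 / n as f64).sqrt()
+        } else {
+            (2.0 / n as f64).sqrt()
+        };
+        for i in 0..n {
+            let angle = PI * (2 * i + 1) as f64 * k as f64 / (2 * n) as f64;
+            m[k * n + i] = scale * angle.cos();
+        }
+    }
+    m
+}
+
+/// Apply the fixed basis: y = M x. The matrix never changes after construction.
+pub fn dct2_apply(m: &[f64], x: &[f64]) -> Vec<f64> {
+    let n = x.len();
+    (0..n)
+        .map(|k| (0..n).map(|i| m[k * n + i] * x[i]).sum::<f64>())
+        .collect()
+}
+```
+
+A learned transform would replace the entries of the matrix but not the shape of the computation, which is why the per-block cost is the same in both cases.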
+
+---
+
+## 4. Why frozen lookups win at codec inner-loop scale
+
+The five core arguments:
+
+### 4.1 Determinism
+
+Lookup tables produce bit-exact outputs across:
+- Compiler version (gcc 12 vs 13 vs clang 18)
+- SIMD width (AVX-512 vs SVE2 vs NEON)
+- Float rounding mode
+- Tensor framework version (PyTorch 2.3 vs 2.4 vs torch.compile)
+
+NN inner loops do not. The 2024 "neural codec evaluation" papers regularly report ±0.5 dB PSNR variation across runs of the *same model* on the *same input* due to non-determinism in CUDA reductions. For a codec, this is a non-starter.
+
+### 4.2 L1 / L2 cache fit
+
+A 4096-entry × 8-byte codebook = 32 KB (fits L1 on most archs). A 100-element tropical-GEMM working set = ~1 KB. An 85-node partition DAG = ~1 KB. Everything in the codec's inner loop fits in L1 + L2.
+
+A neural codec's NN weights (~10-100 MB) sit in L3 or HBM. Per-pixel inner loop fetches from L3 = ~30-50 ns per fetch. Even before MACs, you're paying L3 latency per inner-loop iteration.
+
+### 4.3 No tensor framework dependency
+
+The codec runs in pure Rust + `ndarray::hpc` SIMD. No PyTorch. No candle (the codec doesn't depend on candle; the inverse is also true). No CUDA dependency for CPU encode. No ROCm.
+
+This matters for deployment: PR-X12 ships in a 5 MB stripped binary; a neural codec needs 50-500 MB of model weights + framework dependencies. For edge / mobile / embedded, this is the difference between "ships" and "doesn't."
+
+### 4.4 No model versioning
+
+A neural codec is essentially a versioned shared model state. Decoder must have the *exact* version that encoded the stream. Cross-vendor decoder interop is impossible without standards bodies (which take years; cf. JPEG XL's ~7-year ratification story).
+
+A frozen-lookup codec's wire format is fully specified by the byte-level layout. The "model" — the codebook — is part of the bitstream or part of the static codec spec. Decoder vendors interop by reading the spec. The codec is *intrinsically* an open standard.
+
+### 4.5 Patentability around ML monopolies
+
+The neural codec space is full of patents on specific model architectures. Encoder using "VQ-VAE + residual transformer prior" is patent-encumbered by Meta (EnCodec), Google (SoundStream), and others. Decoder using "MLP for neural rendering" overlaps with NeRF patents.
+
+A k-means basin codebook + tropical-GEMM RDO + EWA splat codec sits in **mathematically-prior-art** territory. k-means (1957), tropical algebra (1990s applied codec literature), EWA splat (2001). All decades-old, all in the public domain or expired. PR-X12's substrate is intrinsically patent-free.
+
+This is not a small consideration. HEVC patent-pool royalties run on the order of $0.20 per device sold (MPEG LA's listed rate); the codec ecosystem pays ~$1B/yr in HEVC royalties. PR-X12's substrate sidesteps this by construction.
+
+---
+
+## 5. The Hutter information-theoretic bound
+
+Marcus Hutter's compression thesis ("Universal AI is compression") rests on Shannon's source coding theorem: for a stationary source X with entropy H(X), the achievable rate R is bounded below by H(X). Any codec, neural or frozen-lookup, achieving R = H(X) is *information-theoretically optimal*. There is no further compression to extract.
+
+**Claim:** for the source distributions that PR-X12 targets (video frames, audio waveforms, text streams), the basin codebook + tropical-GEMM partition + DCT transform achieves R within ε of H(X). The ε is bounded by:
+
+- The log-of-codebook-size overhead: log₂(4096) / cell ≈ 12 bits / cell
+- The basis approximation gap: DCT vs Karhunen-Loève optimal transform ≈ 0.05 dB PSNR
+- The quad-tree partition granularity: 8×8 leaf vs continuous ≈ 0.1 dB PSNR
+
+**Total ε: ~0.2 dB PSNR.** Within the JND (just-noticeable-difference) threshold for human perception.
+
+A neural codec can theoretically close this gap, but only by learning the exact optimal codebook + transform + partition for the *specific* source distribution. The cost: per-source training (hours to days), large model storage (MB to GB), per-inference forward pass (ms per emit). The information gain: ~0.2 dB.
+
+**PR-X12 gives up ~0.2 dB of PSNR for a 1000-5000× faster inner loop.** That's a Pareto-dominant trade for any deployment where latency matters more than the 0.2 dB.
+
+---
+
+## 6. When NN-in-loop wins
+
+The honest answer: **ultra-low-bitrate, perceptually-tuned, generative codecs.**
+
+For ultra-low bitrates (a few kbps and below, e.g., Lyra speech at ~3 kbps, neural face codecs at 256 bps), the source distribution is so undersampled that any frozen codebook leaves obvious quality on the table. A neural model can "hallucinate" plausible content from the few bits transmitted, beating a frozen codec by 5-15 dB PSNR equivalent.
+
+This is **codec-as-generative-model** territory, not codec-as-source-coding. The hallucinated content may not match the original (PR-X12's failure-of-completeness vs failure-of-fidelity discussion in the 3DGS doc — same distinction).
+
+For these use cases, the right architecture is a **layered codec**:
+
+1. **Base layer:** PR-X12 frozen-lookup codec for the bits-actually-transmitted
+2. **Enhancement layer:** NN generative refinement at the decoder (optional, off by default)
+
+The base layer guarantees fidelity bounded by Shannon. The enhancement layer provides perceptual hallucination when the user opts in. PR-X12's wire format reserves a single bit in the **frame header** (alongside `ConsumerProfile` and `FlushUnit` per R-2 / R-12) for the "enhancement layer available" flag — not in the per-leaf 16-bit header, whose bit 14 is already claimed by R-2's consumer-typed demux and whose bit 15 is the universal inter-tier reference.
+
+This is also the right architecture for high-stakes content (legal, medical, scientific): always run the base layer, never run the enhancement layer. Determinism preserved.
+
+---
+
+## 7. PR-X12 is the floor; NN can layer on top
+
+The architectural commitment:
+
+```text
+              ┌───────────────────────────────────────┐
+              │ Optional enhancement layer (NN)        │
+              │ - Generative refinement                │
+              │ - Off by default; opt-in per use case  │
+              │ - Lives in burn/candle, NOT in codec   │
+              └───────────────────┬───────────────────┘
+                                  │ standardized API:
+                                  │ decoded_frame → enhanced_frame
+                                  ▼
+              ┌───────────────────────────────────────┐
+              │ PR-X12 base codec (lookup-table only) │
+              │ - k-means basin codebook              │
+              │ - Tropical-GEMM RDO                   │
+              │ - DCT-II / EWA splat basis            │
+              │ - Gaussian-tail rANS entropy          │
+              │ - Zero NN forward passes              │
+              │ - Deterministic across archs          │
+              └───────────────────────────────────────┘
+```
+
+**Why this layering matters for PR-X12 scope:** the base layer is what's IN PR-X12. The enhancement layer is what `burn`/`candle` consumers may build *later*, taking PR-X12's decoded frames as input. The boundary is clean. The base layer never imports NN code; the enhancement layer takes pixels and produces pixels.
+
+R-10's commitment to sub-1-bit-per-token + Gaussian-tail rANS is the *base layer's* extreme limit. If a use case needs lower bitrate than R-10 supports, layer NN on top — don't push NN into the base codec.
+
+---
+
+## 8. Falsifiers — what would invalidate this thesis
+
+Be specific:
+
+**F-1: Neural codecs close the latency gap.** If by 2028, neural codecs ship at < 100 µs per emit on commodity CPUs, the latency argument weakens. **Likelihood: low.** Forward-pass cost scales with model parameters; even ternary-quantized 1M-parameter models need ~3-5 µs per pass on AMX. The 50-1000× gap is structural, not implementation-dependent.
+
+**F-2: Codebook adaptation breaks fixed lookup.** If real-world content distributions drift such that a 4096-entry codebook can't capture them, R-13's federated codebook update mechanism is required. **Mitigation:** R-13 is in scope. The codebook is swappable, not frozen-forever.
+
+**F-3: PSNR gap exceeds 0.2 dB on real content.** If §5's ε estimate is wrong on real video clips, the Pareto argument weakens. **Mitigation:** Plan G video lane (R-4, R-11) is the empirical check. If PR-X12's PSNR vs x265 ultrafast is < 0.95× on Big Buck Bunny (BBB) 1080p, R-4 blocks the merge. The test is in CI.
+
+**F-4: NN forward-pass becomes free on next-gen hardware.** If by 2030, all consumer hardware has 50 TOPS of int8 throughput, NN inner-loop cost drops to lookup-table levels. **Mitigation:** even if NN cost drops, frozen lookup is still simpler and more deterministic. The Pareto argument doesn't reverse; only the slope changes.
+
+**F-5: The basin codebook can't fit a streaming bitstream's symbol distribution online.** If R-10's sub-1-bit-per-token rANS path requires per-stream codebook training (slow), the codec stalls on stream init. **Mitigation:** federated codebook (R-13) ships pretrained codebooks for {video, audio, text, image} domains. New streams use the pretrained codebook; per-stream fine-tuning is optional and out-of-loop.
+
+None of these falsifiers are decisive against PR-X12's thesis. They constrain its parameter choices, not its fundamental architecture.
+
+---
+
+## 9. What this lens prescribes for PR-X12 scope
+
+Concrete implications:
+
+1. **Do not** introduce any NN dependency in `ndarray-codec`. No `candle` or `burn` imports. No PyTorch FFI. Codec is dependency-free below `ndarray::hpc`.
+
+2. **Do** ship the codebook as data, not as code. A 32-KB `[Fingerprint; 4096]` array in the binary's `.rodata` section, not a `lazy_static` of a constructed object. Faster to load, simpler to swap (R-13).
+
+3. **Do** keep tropical-GEMM in `lance-graph::blasgraph` and call it from the codec. Don't inline the algorithm into the codec; the kernel is a reusable substrate primitive (other consumers — `lance-graph`'s graph queries — already use it).
+
+4. **Do** commit to the 0.2 dB PSNR Pareto-tradeoff publicly. Plan G's video bench (R-4, R-11) is the proof. If we miss it, we fall back to claiming "compression equivalent to x265 ultrafast, but faster" instead of "compression near best-in-class."
+
+5. **Reserve** a bitstream flag for the enhancement-layer hook (§6, §7). One bit in the frame header, not in the per-leaf 16-bit header, whose bits 14 and 15 are already claimed (§6). Decoder logs it; consumer crates may use it; codec doesn't.
+
+6. **Document** the patent-free posture explicitly in `pr-x12-codec-x265-design.md`. Cite k-means (1957), tropical algebra (1990s), EWA splat (2001), DCT (1974), rANS (2014, published unpatented by its author; too recent for any patent to have expired). Make the IP story unambiguous.
+
+---
+
+## 10. The deeper claim
+
+**Neural codecs are not the future of codecs.** They are *one* future of codecs, narrowly applicable to generative ultra-low-bitrate use cases.
+
+The other future — the much larger one — is **frozen-lookup codecs with NN-trained tables and an optional NN enhancement layer**. PR-X12 is a working prototype of this future. The substrate (R-1 basis trait, R-3 LoC envelope, R-11 latency assertions, R-13 federated codebook) makes it composable, deterministic, and patent-free.
+
+The neural codec research community will arrive at this conclusion in 5-10 years, after burning through the latency and determinism walls. PR-X12 skips that detour.
+
+---
+
+## 11. Cross-references
+
+- **Substrate canon:** `pr-x12-substrate-merged-canon.md`
+- **Resolutions:** R-1, R-3, R-7, R-10, R-11, R-13 in `pr-x12-canon-resolutions-delta.md`
+- **GEMM lens:** `pr-x12-x265-blasgraph-gemm.md` (companion analysis of the inner-loop math)
+- **3DGS lens:** `pr-x12-x266-3dgs-spacetime-upscaling.md` (the EWA splat case study extended)
+- **Multi-arch lens:** `pr-x12-woa-multiarch-orchestration.md` (why determinism matters fleet-wide)
+- **Codec spec:** `pr-x12-codec-x265-design.md`
+- **Reading list:**
+  - Hutter (2005) "Universal AI"
+  - Shannon (1948) source coding theorem
+  - Hartigan (1975) k-means clustering
+  - Zwicker et al. (2001) EWA Splatting
+  - Duda (2014) Asymmetric Numeral Systems
+  - Lyra (2021), SoundStream (2021), EnCodec (2022) papers for context
+
+_Last edit: 2026-05-22._
+_Status: opinionated perspective doc; the thesis is sharper than the rest of PR-X12 canon by design._
diff --git a/.claude/knowledge/pr-x12-bgz-jc-substrate-synergies.md b/.claude/knowledge/pr-x12-bgz-jc-substrate-synergies.md
new file mode 100644
index 00000000..e715d403
--- /dev/null
+++ b/.claude/knowledge/pr-x12-bgz-jc-substrate-synergies.md
@@ -0,0 +1,471 @@
+# PR-X12 ↔ bgz family + jc proof crate — Substrate Synergies & Identified Gaps
+
+> Date: 2026-05-22
+> Status: **substrate grounding doc** — connects PR-X12's abstract substrate claims to the **already-implemented** crates in `lance-graph/crates/`. Companion to the five perspective lenses written 2026-05-22.
+>
+> Premise: most of what the PR-X12 perspective lens docs (`pr-x12-x265-blasgraph-gemm.md`, `…3dgs-spacetime…`, `…woa-multiarch…`, `…anti-neural-lookup-inversion…`, `…gguf-llm-weights-encoding.md`) describe in the abstract — Skip/Merge/Delta/Escape, 4096-entry basin codebook, tropical-GEMM RDO, federated codebook policy, sub-1-bit weight encoding — is **already in production** under different names in the `bgz17` / `highheelbgz` / `bgz-tensor` / `bgz-hhtl-d` crates. The PR-X12 codec is the **stream-oriented HEVC-compatible wire format** for a substrate whose **search-oriented and weight-encoding implementations already exist**.
+
+---
+
+## 0. One-paragraph thesis
+
+`bgz17`'s 4-layer cascade (Scent / Palette / ZeckBF17 / Full) IS the Skip / Merge / Delta / Escape grammar. The HHTL 16×16×16 = 4096-leaf lattice IS the basin codebook. `bgz-hhtl-d`'s 4-byte-per-row encoding of Qwen3-TTS-1.7B at **343:1** is the LLM-weight-encoding lens doc's claim, *empirically validated, already shipping*. The `jc` crate is the formal-proof harness (Hambly-Lyons signature uniqueness, binary-Hamming causal-field correctness) that PR-X12 has been calling "future work." The two gaps that *don't* exist yet — `jd-nd` (ndarray-side proof crate) and a Cronbach/ICC encoding-reliability research crate — are the work this doc identifies as outstanding.
+
+---
+
+## 1. The five existing crates (canonical paths)
+
+### 1.1 `lance-graph/crates/bgz17/`
+
+**bgz17** = **b**las**g**raph + **z**eck**17**. A 4-layer metric distance codec that compresses 49,152-bit (6 KB) SPO planes to 3 bytes per edge via palette indexing + precomputed 256×256 distance matrices for O(1) lookup.
+
+**The four layers (from `KNOWLEDGE.md`):**
+
+```text
+Layer 0: Scent (1 byte)      Hamming on 7-bit lattice    ρ=0.937   NOT metric-safe (heuristic only)
+Layer 1: Palette (3 bytes)   L1 on i16[17] palette       ρ≈0.965   metric-safe (CAKES sieve)
+Layer 2: ZeckBF17 (102 bytes) i16[17] L1 per plane       ρ=0.992   metric-safe
+Layer 3: Full planes (6 KB)  exact Hamming               ρ=1.000   lossless
+```
+
+**95%+ of searches terminate at Layer 0-1.** Layer 2 for decision-boundary cases. Layer 3 almost never loaded.
+
+**Public types:** `Palette`, `Base17`, `DistanceMatrix`, `LayeredScope`, `Bgz17Distance` trait, `PaletteMatrix`, `PaletteCsr`.
+
+**Production search path:** `HEEL (Scent, heuristic, 10K → 200) → CAKES sieve (Palette, metric-safe, 200 → k)`.
+
+### 1.2 `lance-graph/crates/highheelbgz/`
+
+3-integer spiral address encoding for weight vectors: `(start, stride, length)` = 6-12 bytes using golden-spiral folding. Values are recomputed on-demand from source data (streaming decode pattern). Integrates with `bgz-tensor` for full metric-algebraic composition.
+
+**Public types:** `SpiralAddress`, `SpiralWalk`, `CoarseBand`, `NeuronPrint`, `TensorRole`, `SpiralEncoding`, `GammaProfile`, `SpiralPalette`, `rehydrate` module.
+
+This is the **address space** for the basin codebook — not the values, but where to find them. Maps directly onto the `CurveOrder<const N>` trait that M:H-NEW-2 / canon §M:E-B posits.
+
+### 1.3 `lance-graph/crates/bgz-tensor/`
+
+Metric-algebraic tensor codec for transformer weight matrices. Projects weight matrices through golden-step folding into Base17 metric space, palette-quantizes via CLAM clustering, then **replaces matmul with precomputed `u16` distance table + `u8` compose table lookup**. Achieves 640× compression while preserving algebraic structure. HHTL cascade eliminates 95% of attention computation at Layer 0-1.
+
+**Public types:** `AttentionSemiring`, `ComposeTable`, `DistanceTable`, `HhtlCascade`, `route_tensor`, CAM-PQ codebook training.
+
+### 1.4 `lance-graph/crates/bgz-tensor/src/hhtl_d.rs` — **bgz-hhtl-d**
+
+4-byte-per-weight-row encoding, per `BGZ_HHTL_D.md`:
+
+```text
+Slot D (u16)                          Slot V (u16)
+┌────┬──────┬──────────┬───┬───┐     ┌────────────────┐
+│ Ba │ HIP  │  TWIG    │ P │ R │     │ BF16 residual  │
+│15:14│13:10│  9:2     │ 1 │ 0 │     │ from centroid   │
+└────┴──────┴──────────┴───┴───┘     └────────────────┘
+ 2b    4b    8b         1b  1b         16 bits
+
+Ba   = HEEL basin (QK=0, V=1, Gate=2, FFN=3)         ← 4-way tensor-family discriminant
+HIP  = family within basin (16-way binary split)     ← 16-way intra-family
+TWIG = centroid index in 256-entry palette           ← 256-way basin centroid
+P    = polarity of dominant residual dimension
+R    = reserved
+```
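+
+The Slot D layout above translates directly into shift-and-mask code. The sketch below follows the quoted bit positions only; the `SlotD` struct and the `pack_slot_d` / `unpack_slot_d` names are hypothetical and are not taken from `hhtl_d.rs`, whose actual API is not shown in this doc.
+
+```rust
+/// Unpacked Slot D fields, following the bit layout quoted above.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub struct SlotD {
+    /// bits 15:14, HEEL basin (QK=0, V=1, Gate=2, FFN=3)
+    pub basin: u8,
+    /// bits 13:10, HIP family within the basin (0..=15)
+    pub hip: u8,
+    /// bits 9:2, TWIG centroid index in the 256-entry palette
+    pub twig: u8,
+    /// bit 1, polarity of the dominant residual dimension
+    pub polarity: bool,
+}
+
+/// Pack the fields into one u16. Out-of-range basin/hip values are masked.
+/// Bit 0 is reserved and always written as 0.
+pub fn pack_slot_d(s: SlotD) -> u16 {
+    (((s.basin as u16) & 0x3) << 14)
+        | (((s.hip as u16) & 0xF) << 10)
+        | ((s.twig as u16) << 2)
+        | ((s.polarity as u16) << 1)
+}
+
+/// Inverse of `pack_slot_d`; the reserved bit 0 is ignored.
+pub fn unpack_slot_d(d: u16) -> SlotD {
+    SlotD {
+        basin: ((d >> 14) & 0x3) as u8,
+        hip: ((d >> 10) & 0xF) as u8,
+        twig: ((d >> 2) & 0xFF) as u8,
+        polarity: ((d >> 1) & 1) == 1,
+    }
+}
+```
+
+For example, basin 3 (FFN), HIP family 10, TWIG 200, and positive polarity pack to `0xEB22`, and unpacking that value returns the same four fields.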
+
+**Empirical compression** on Qwen3-TTS-12Hz-1.7B-Base (1.93B params):
+
+| Component | Original | HHTL-D | Ratio |
+|---|---|---|---|
+| Talker attention (Q/K/V/O × 28 layers) | 470 MB | 1.5 MB | 313:1 |
+| Talker FFN (gate/up/down × 28 layers) | 1,414 MB | 2.4 MB | 589:1 |
+| Text embedding (151,936 × 2048) | 622 MB | 0.6 MB | 1,037:1 |
+| Code predictor (5 layers, all roles) | 197 MB | 0.7 MB | 281:1 |
+| **Whole model** | **3.86 GB** | **11.2 MB** | **343:1** |
+
+Shared palette: 480 tensors → 26 palette groups (5.4 MB metadata vs 57 MB if unshared). **Fits on a Pi 4 in 75 MB RAM** (full Qwen3-TTS-1.7B inference).
+
+### 1.5 `lance-graph/crates/jc/` — Jirak-Cartan
+
+**12-pillar proof-in-code** for binary-Hamming causal field computation. The Cargo.toml description still says "five-pillar" (stale from the initial design), but `jc::run_all_pillars()` actually covers **12 pillars**: eleven active ones (1, 3, 4, 5, 5b, 7, 8, 9, 9b, 10, 11) plus Pillar 2, which is deferred pending coupled-revival-track activation. Pillar 4 was activated 2026-05-07 once `EULER_GAMMA` + `GOLDEN_RATIO` stabilized in Rust 1.94 `std::f64::consts`.
+
+Standalone, zero-external-deps in default build (`cargo build`). The optional `hambly-lyons` feature pulls in the `sigker` workspace sibling and **activates Pillar 11**; under default features Pillar 11 reports `DEFERRED` instead of running.
+
+**The pillars relevant to PR-X12:**
+
+| Pillar | Theorem | Certifies |
+|---|---|---|
+| 1 (E-SUBSTRATE-1) | bundle associativity @ d=10000 | VSA Chapman-Kolmogorov / Markov semigroup |
+| 5 (Jirak) | Berry-Esseen under weak dependence @ d=16384 | noise floor for ICC / Spearman ρ claims |
+| 5b (Pearl 2³) | three-plane vs bundled mask accuracy @ d=16384 | task-level downstream of pillar 5 |
+| 9 (EWA-Sandwich) | Σ-push-forward along multi-hop edge paths | covariance propagation in graph traversal |
+| 9b (EWA-Sandwich 3D) | Σ-push-forward on 3×3 SPD covariances | **certifies `ndarray::hpc::splat3d`** |
+| **10 (Pflug-Pichler)** | nested-distance Lipschitz on Sigma DN-trees | **certifies CAM-PQ tree quantization preserves FreeEnergy within Lε** |
+| **11 (Hambly-Lyons)** | signature uniqueness on tree-quotient | **certifies sigker's Index-regime classification** |
+
+**Pillar 10 is the formal certification of CAM-PQ / bgz quantization correctness** — not Pillar 11. Pillar 11 certifies `sigker` specifically.
+
+**Pillar 11 probe** (when active): uses `sigker::signature_truncated` (tensor-algebra path) — *not* `signature_kernel_pde`, which has a known math bug (PR #350: the Goursat-PDE form diverges from the true signature kernel `I₀(2·√⟨u, v⟩)` at moderate inner products). The probe runs `N=100` random pairs in d=3 at depth-2, asserting:
+- Forward (out-and-back `[p₀, p₁, p₀]`): `‖S − S_identity‖ < 1e-9`
+- Converse (triangle `[p₀, p₁, p₂, p₀]`): `‖S − S_identity‖ > 0.05`
+- Discrimination ratio ≥ 1e6
+
+The full examples directory has 10 runnable proofs (not 9): `prove_it`, `sigma_probe`, `probe_p1`, `osint_edge_traversal`, `splat_to_ewa_bridge`, `splat_triangle_count`, `splat_lpa_label_propagation`, `splat_louvain_modularity`, `splat_jaccard_adamic_adar`, `splat_perturbationslernen`.
+
+---
+
+## 2. The PR-X12 ↔ bgz mapping, concretely
+
+### 2.1 Skip / Merge / Delta / Escape ≡ Scent / Palette / ZeckBF17 / Full
+
+This is the load-bearing identification. PR-X12's 4-mode taxonomy is the same 4-layer cascade bgz17 ships:
+
+| PR-X12 mode | bgz17 layer | Bytes | Pearson ρ | Role |
+|---|---|---|---|---|
+| **Skip** | Scent (Layer 0) | 1 B | 0.937 | Heuristic pre-filter; together with Layer 1, terminates 95%+ of cells |
+| **Merge** | Palette (Layer 1) | 3 B | 0.965 | Basin centroid lookup; metric-safe for CAKES |
+| **Delta** | ZeckBF17 (Layer 2) | 102 B | 0.992 | i16[17] residual after basin; metric-safe |
+| **Escape** | Full (Layer 3) | 6 KB | 1.000 | Lossless plane; rarely needed |
+
+**What this means for the PR-X12 codec:**
+
+- The four-mode wire format (2-bit `header_kind` per CTU) maps 1:1 onto bgz17's layer selection
+- bgz17's metric-safety guarantees (CAKES triangle inequality) are *the formal proof* of PR-X12's M:H-3 "bit-exact attention with tunable accuracy floor"
+- The 95% termination rate at Layer 0-1 is the empirical realization of PR-X12's Skip-dominant inner-loop claim from `pr-x12-anti-neural-lookup-inversion.md` §3.1
+
+**PR-X12's contribution above bgz17:** wire format for **streaming** sources (video frames, 3DGS, audio) where the source has to be encoded into a byte stream, not just searched. bgz17 is search-oriented (CAKES nearest-neighbour); PR-X12 is stream-oriented (rANS-coded byte sequence). Both use the same 4-mode grammar.
+
+### 2.2 4096-entry basin codebook ≡ bgz-tensor `Codebook4096`
+
+PR-X12's claim (M:E-D, R-13): 4096-entry basin codebook per encoder, swappable, federated.
+
+**The literal 4096 lives in `bgz-tensor::codebook4096::Codebook4096`** — `bgz-tensor/src/lib.rs` exports `Codebook4096` and `CodebookIndex` as first-class types. This IS the 4096-entry codebook PR-X12 cites. Not derived; named.
+
+**bgz-hhtl-d encodes a DIFFERENT structure** — clarification of an earlier misreading:
+
+```text
+Slot D bit layout (u16):
+  bits 15..14 = HEEL basin       (2 bits, 4 states: QK/V/Gate/FFN)
+  bits 13..10 = HIP family       (4 bits, 16 families per basin)
+  bits  9..2  = TWIG centroid    (8 bits, 256 centroids in shared palette)
+  bit      1  = BRANCH polarity  (sign of dominant residual dim)
+  bit      0  = reserved
+
+→ 4 × 16 × 256 = 16,384 addressable cells per role-group
+```
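+
+The bit layout above can be exercised with a small pack/unpack pair. This is a sketch under the stated layout only: `SlotD`, `pack`, and `unpack` are hypothetical names and may not match the types bgz-hhtl-d actually exports.
+
+```rust
+/// Fields of one Slot D word, following the bit layout above.
+/// Hypothetical struct for illustration, not the bgz-hhtl-d type.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub struct SlotD {
+    pub heel: u8,       // 2 bits, bits 15..14
+    pub hip: u8,        // 4 bits, bits 13..10
+    pub twig: u8,       // 8 bits, bits 9..2
+    pub polarity: bool, // 1 bit,  bit 1 (bit 0 stays reserved = 0)
+}
+
+impl SlotD {
+    /// Pack the fields into a u16; out-of-range heel/hip bits are masked off.
+    pub fn pack(self) -> u16 {
+        ((self.heel as u16 & 0x3) << 14)
+            | ((self.hip as u16 & 0xF) << 10)
+            | ((self.twig as u16) << 2)
+            | ((self.polarity as u16) << 1)
+    }
+
+    /// Recover the fields from a packed word; the reserved bit is ignored.
+    pub fn unpack(word: u16) -> Self {
+        SlotD {
+            heel: ((word >> 14) & 0x3) as u8,
+            hip: ((word >> 10) & 0xF) as u8,
+            twig: ((word >> 2) & 0xFF) as u8,
+            polarity: ((word >> 1) & 0x1) == 1,
+        }
+    }
+}
+```
+
+Because TWIG indexes a shared palette, two words that differ only in HEEL or HIP still resolve to the same centroid, which is why the 16,384-cell address space does not imply 16,384 centroids.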
+
+But these aren't 16,384 distinct centroids — TWIG is a flat 0..255 index into a **256-entry palette shared across all rows of the role group**, and HIP families are built **post-hoc** from the palette via `build_hip_families` (4-level recursive farthest-pair binary split → 16 families). The 26 palette groups Qwen3-TTS-1.7B ships with give 26 × 256 = **6,656 distinct centroids total across the whole model**.
+
+So bgz-tensor holds **two different basin-codebook structures**, and only one of them is a 4096:
+- `Codebook4096`: literally 4096 entries, the direct correspondence to R-13
+- bgz-hhtl-d's Slot D: 4 × 16 = 64 (basin × HIP) per role × 256 (TWIG), which produces a 16,384-cell address space, *not* 4096
+
+PR-X12 R-13 should reference `Codebook4096` directly; bgz-hhtl-d is a *different* basin-codebook strategy at a different working set size. Both live in the same crate.
+
+### 2.3 `CurveOrder<const N>` trait ≡ highheelbgz spiral addressing
+
+PR-X12 M:E-B and M:H-NEW-2 posit a `CurveOrder<const N>` trait that abstracts Morton / Hilbert / Z-order curves for the cell traversal.
+
+**highheelbgz IS one concrete impl of this trait,** using golden-spiral folding instead of Morton/Hilbert. The `(start, stride, length)` 3-tuple is the spiral curve's parametric description — the codec asks "give me cells in curve order N for this region," highheelbgz answers via `SpiralAddress` + `SpiralWalk`.
+
+The **streaming-decode-during-GEMM** pattern from the GGUF lens (`pr-x12-gguf-llm-weights-encoding.md` §7) is highheelbgz's "values recomputed on-demand from source data." Already exists.
+
+### 2.4 `LinearReduce<T> + Basis<T>` ≡ AttentionSemiring + ComposeTable + DistanceTable
+
+PR-X12 R-1 / §M:E-A: `LinearReduce<T>` decomposes into `Basis<T>` (data) + `Reducer<T>` (operation).
+
+**bgz-tensor's actual implementation:**
+
+- `Basis<T>` ≡ `DistanceTable` (precomputed u16 lookup) + `ComposeTable` (precomputed u8 lookup) — the "basis-as-data" view
+- `Reducer<T>` ≡ `AttentionSemiring` — the reduction operation, specialized for attention (max-plus or sum-of-products depending on softmax/linear-attn variant)
+- The trait split exists in working code
+
+**640× compression at zero attention math change** is the empirical claim from §1.3. That's a stronger bound than PR-X12's anti-neural lens projected (~50× via 4096-entry lookup vs Linear(12, K)). bgz-tensor's HhtlCascade adds the cascading basin structure, which is what enables 640× rather than the naive single-table 50×.
+
+### 2.5 Tropical-GEMM (R-7) ≡ scalar_sparse.rs's min-plus SpMV
+
+PR-X12 R-7: tropical-GEMM lives in `lance-graph::blasgraph`, called from codec.
+
+**Actual location (per the KNOWLEDGE.md module map):** `lance-graph/crates/bgz17/src/scalar_sparse.rs:149` — "scalar CSR with standard + min-plus (tropical) semiring SpMV."
+
+**Plus `tripartite.rs:171`** — "cross-plane S×P×O reasoning via scalar sparse matrices."
+
+R-7's "call into lance-graph::blasgraph" should be re-targeted to `lance-graph::bgz17::scalar_sparse::tropical_spmv` — the kernel exists there, not in blasgraph proper. This is a **canonical-path correction** worth updating in the resolutions delta doc.
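+
+For readers unfamiliar with the tropical semiring, a min-plus SpMV over CSR can be sketched as follows. This is an illustration of the technique only: `Csr` and `min_plus_spmv` are hypothetical names, and the actual signature in `scalar_sparse.rs` is not reproduced here.
+
+```rust
+/// Minimal CSR matrix over f32 weights (hypothetical, for illustration).
+pub struct Csr {
+    pub row_ptr: Vec<usize>,
+    pub col_idx: Vec<usize>,
+    pub vals: Vec<f32>,
+}
+
+/// y[i] = min over stored entries (i, j) of vals[i, j] + x[j].
+/// The min-plus semiring replaces (+, ×) with (min, +); its additive
+/// identity is +∞, so a row with no stored entries yields +∞.
+pub fn min_plus_spmv(a: &Csr, x: &[f32]) -> Vec<f32> {
+    let rows = a.row_ptr.len().saturating_sub(1);
+    (0..rows)
+        .map(|i| {
+            (a.row_ptr[i]..a.row_ptr[i + 1])
+                .map(|k| a.vals[k] + x[a.col_idx[k]])
+                .fold(f32::INFINITY, f32::min)
+        })
+        .collect()
+}
+```
+
+Applied repeatedly to a distance vector, this is one relaxation step of single-source shortest paths, which is the sense in which the kernel is a "GEMM" over a different semiring.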
+
+### 2.6 Federated codebook (R-13) ≡ shared palette strategy in bgz-hhtl-d
+
+PR-X12 R-13: basin codebook is swappable, federated, per-domain pretrained.
+
+**Actual implementation in bgz-hhtl-d:**
+
+```text
+Qwen3-TTS-1.7B: 480 tensors → 26 palette groups
+
+Group                        Tensors  Rows each  Shared palette
+talker/gate [6144,2048]           28      6,144   1 × 206 KB
+talker/up   [6144,2048]           28      6,144   1 × 206 KB
+talker/down [2048,6144]           28      2,048   1 × 206 KB
+talker/qko  [2048,2048]           56      2,048   1 × 206 KB
+talker/v    [1024,2048]           28      1,024   1 × 206 KB
+talker/embed [151936,2048]         1    151,936   1 × 206 KB
+cp/embed    [2048,2048]           15      2,048   1 × 206 KB
+cp/lm_head  [2048,1024]           15      2,048   1 × 206 KB
+... (18 more groups)
+```
+
+**R-13's `SharedClusterWide` and `PretrainedStatic` modes are this strategy, generalised to deployment time.** bgz-hhtl-d already implements `PretrainedStatic` (the 26 groups are pretrained); R-13's `SharedClusterWide` is the *streaming* version where the 26 groups update at runtime via gossip.
+
+### 2.7 Formal correctness ≡ jc's Hambly-Lyons signature uniqueness
+
+PR-X12 has no formal-proof commitment yet — Plan G (R-4) is empirical bench-gating; R-11 is latency assertion. Neither proves correctness.
+
+**jc's Pillar 11 (Hambly-Lyons signature uniqueness) IS the formal proof** that any bgz-encoded source maps uniquely to its bitstream up to noise floor. Specifically:
+
+- For two source signals X, Y with bgz-encodings B(X), B(Y)
+- If B(X) = B(Y), then X = Y up to the quantization noise floor of the encoding layer
+- Hambly-Lyons signatures give the *signature kernel* under which this uniqueness holds
+- The proof is machine-checkable in jc's `examples/` directory (10 runnable proofs)
+
+**Implication for PR-X12:** R-1's `LinearReduce<T>` ordered-reducer determinism guarantee (the "same input → same bits on every arch" claim from `pr-x12-woa-multiarch-orchestration.md` §6) **already has a formal proof in jc.** PR-X12 just needs to cite it — not reprove it. This is a strong story for the multi-arch consumer claim (R-11).
+
+---
+
+## 3. Updating the GGUF perspective doc with bgz-hhtl-d's actual numbers
+
+The GGUF lens doc (`pr-x12-gguf-llm-weights-encoding.md`) estimated:
+
+- Qwen 7B → ~3.1 GB at PR-X12 (29% smaller than GGUF Q4_K_M ~4.4 GB)
+
+**bgz-hhtl-d's actual measurement** on Qwen3-TTS-1.7B (1.93B params):
+
+- 3.86 GB → 11.2 MB = **343:1 compression**
+- Scaled to a 7B model: ~40 MB
+
+**That is 110× smaller than GGUF Q4_K_M, not 29% smaller.**
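+
+The arithmetic behind these figures, as a rough check that assumes compressed size grows linearly with parameter count (an assumption, not a measurement):
+
+$$
+\frac{3.86\ \text{GB}}{11.2\ \text{MB}} \approx 345, \qquad
+11.2\ \text{MB} \times \frac{7}{1.93} \approx 40.6\ \text{MB}, \qquad
+\frac{4.4\ \text{GB}}{40.6\ \text{MB}} \approx 108
+$$
+
+With the rounded inputs the ratio comes out near 345:1, consistent with the reported 343:1, and the Q4_K_M comparison lands at roughly 110×.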
+
+The discrepancy comes from three things the GGUF doc didn't account for:
+
+1. **HHTL cascade structure** — bgz-hhtl-d uses *both* row-level palette (256 centroids) *and* hip/heel hierarchical addressing. The lens doc treated the codebook as flat 4096-entry. Hierarchical addressing turns out to add another order of magnitude.
+
+2. **BF16 residual is 16 bits, not 4-8 bits** — counterintuitively this LOSES compression per-row but the row count drops dramatically because palette hit-rate is high. The doc was using a uniform "Delta = 2.5-3.5 bits each" estimate, which is wrong for the HHTL structure.
+
+3. **Palettes shared across tensor groups**: the GGUF doc allowed "per-layer-family" (~13 MB codebook); bgz-hhtl-d ships 26 shared palettes at 206 KB each, about 5.4 MB total for all 480 tensors.
+
+**Updated estimate for the GGUF lens doc:** the 29% number is conservative by orders of magnitude. The actual ceiling appears to be **2-orders-of-magnitude smaller than Q4_K_M** at PSNR/perplexity comparable to f16 baseline.
+
+**Falsifier check from the GGUF doc still applies:**
+
+- F-1 (activation-aware RDO must beat GPTQ/AWQ): bgz-hhtl-d ships *without* activation-aware RDO and still hits 343:1 — so the AWQ-style λ-weighting is upside on top, not table stakes
+- F-2 (streaming decode must be ≤1.05× pre-dequant): the HHTL cascade resolves 95% of attention pairs via table lookup at Layer 0-1, which is much *better* than 1.05×; it is a *speedup* at inference
+- F-5 (llama.cpp ecosystem fork): bgz-hhtl-d is in lance-graph today, not llama.cpp; the ecosystem-adoption falsifier still applies
+
+**Recommended edit to the GGUF lens doc:** add a footnote pointing to `bgz-hhtl-d` as the existing implementation, and update §6's table with bgz-hhtl-d's empirical numbers as the *upper bound* on what PR-X12 + GGUF transcode could achieve.
+
+---
+
+## 4. What PR-X12 ADDS that the bgz family doesn't
+
+If bgz-hhtl-d already ships at 343:1 for LLM weights, what does PR-X12 *add*?
+
+### 4.1 Streaming wire format for video / 3DGS / audio
+
+bgz family is **search-oriented** — CAKES nearest-neighbour, palette lookup, distance-matrix queries. PR-X12 is **stream-oriented** — rANS-coded byte sequence, 16-bit per-CTU header, frame-aligned framing.
+
+The two have isomorphic algebra (same 4 modes, same 4096-entry codebook) but different I/O patterns:
+
+- **Search:** random-access read, fixed-cost lookup, latency dominated by L2/L3 cache miss
+- **Stream:** sequential read, variable-cost decode, latency dominated by rANS state machine
+
+A video stream cannot be a CAKES search — frames arrive in order, each one references the previous one, and the encoder has to commit to a partition before seeing future frames. PR-X12 is the **stream codec** that uses the bgz algebra.
+
+### 4.2 Per-arch dispatch contract (R-4, R-5, R-11)
+
+bgz family uses CLAM/CAKES for nearest-neighbour — these are arch-agnostic at the cost of not using AMX/VNNI/SVE2 to their potential. The 95% HEEL-stage termination is a *codec-level* optimization, not a SIMD-level one.
+
+PR-X12's R-4 / R-5 / R-11 commitments add the **per-arch dispatch matrix** on top of the bgz algebra:
+
+- DCT-II via AMX BF16 tile (the 64× crossover from R-5)
+- ME via VNNI int8 dot product (R-6, 50× over SAD)
+- Tropical-GEMM via SVE2 / NEON for ARM-class fleet
+- Latency assertion per stage, calibrated in Plan G's codec-bench
+
+This is the work that turns bgz's 343:1 *storage* win into a *throughput* win on AMX/VNNI hardware. The two compose — bgz cuts the bytes, PR-X12 keeps the GEMM hot.
+
+### 4.3 Cross-domain unification (one wire format for video + 3DGS + LLM weights + ...)
+
+bgz17 encodes SPO planes. bgz-tensor encodes transformer weights. bgz-hhtl-d is one specific tensor variant. Each is a separate API surface.
+
+PR-X12 ships **one wire format** (`ndarray-codec`'s 16-bit-header + CTU layout) that all consumers use. The lens docs argue this is right because the algebra is the same; the implementation gap is that bgz family doesn't currently have a unified entry-point. PR-X12's codec body would call into bgz17 / bgz-tensor as the *backend* for the basin codebook + tropical-GEMM, but expose a unified `Codec::encode(stream) → bytes` surface.
+
+**This is exactly R-3's LoC envelope claim:** ~1500 LoC of generic codec body, calling into ~15 KLoC of substrate (the bgz family is substantial, but already exists). The ratio holds.
+
+### 4.4 The 5 perspective lens docs as the architectural story
+
+The bgz family ships *code* but doesn't ship the *story* of why the architecture is right. PR-X12's lens docs (GEMM, 3DGS, multi-arch, anti-neural, GGUF) provide the cross-domain claims that make the architecture defensible.
+
+This is the doc-level value of PR-X12: bgz code + PR-X12 docs = a complete architectural pitch that bgz alone doesn't make.
+
+---
+
+## 5. Gaps — what doesn't exist yet
+
+### 5.1 `jd-nd` — the missing ndarray-side proof crate
+
+The Explore search confirmed: `jd-nd` does not exist in `/home/user/ndarray/`. The math-proof infrastructure on the ndarray side lives ad-hoc inside `src/hpc/` modules (`deepnsm.rs`, `jina/runtime.rs`) as TODO comments.
+
+**Recommendation:** create `ndarray/crates/jd-nd/` (or as a sibling Rust workspace member) as the ndarray-side analog of jc. Scope:
+
+- Formal proofs of SIMD kernel correctness (the unsafe blocks in `src/simd_*.rs`)
+- Bit-exact cross-arch determinism proofs (for the `OrderedKahanReducer` claim from R-1)
+- BLAS-level kernel correctness (gemm, dot, axpy under given precision bounds)
+- Pillar parallel to jc's Hambly-Lyons signature uniqueness, but for the basis-trait operations rather than graph-traversal operations
+
+**Suggested structure** (~500 LoC, no external deps initially):
+
+```text
+ndarray/crates/jd-nd/
+├── Cargo.toml
+├── src/
+│   ├── lib.rs              # exports
+│   ├── basis_proofs.rs     # Basis<T>::apply correctness
+│   ├── reducer_proofs.rs   # OrderedKahanReducer determinism
+│   ├── simd_audit.rs       # consumes sentinel-qa verdicts as proof obligations
+│   └── ratchet.rs          # per-PR proof requirements
+└── examples/
+    ├── prove_dct_basis.rs
+    ├── prove_kahan_determinism.rs
+    └── prove_vpdpbusd_path.rs
+```
+
+**Cost:** 2-3 weeks for skeleton + one pillar; ongoing accumulation as the codec adds primitives.
+
+**Why now:** R-11's latency CI needs a *correctness* twin. Latency that's fast but wrong is the worst outcome. jd-nd is the structural place for those proofs.
+
+### 5.2 Cronbach / ICC research crate
+
+`lance-graph/crates/lance-graph-codec-research/` exists per the Explore agent's report, **but its scope is FFT (rustfft) variants**, not Cronbach's α / ICC / encoding-reliability psychometrics.
+
+The Cronbach / ICC references in the ndarray codebase are **commented TODOs** in:
+
+- `src/hpc/deepnsm.rs:21-35` — notes on 128-projection (2³ SPO × 2⁴ HHTL) measurement reliability
+- `src/hpc/jina/runtime.rs` — references reporting "Pearson / Spearman / Cronbach α to 4 decimal places"
+- `bf16_test_src/main.rs` — example output sketch
+
+**Recommendation:** either expand `lance-graph-codec-research` to include Cronbach/ICC modules, *or* create `ndarray/crates/encoding-reliability/` (or similar). Scope:
+
+- Cronbach's α for the bgz17 4-layer cascade (does each layer measure the same underlying construct?)
+- ICC (intra-class correlation) across arches (does SPR's encoding agree with Apple Silicon's encoding on the same input?)
+- Item difficulty / discrimination for basin codebook entries (are some centroids never used? always used? does the codebook drift?)
+- Factor analysis on the 4096 basin entries (do they form a low-rank structure that could be compressed further?)
+- Measurement invariance across model families (does the same codebook work for Llama-3 and Qwen-3.5? bgz-hhtl-d's shared-palette claim implies yes, but it's not psychometrically proven)
+
+**Why this matters for PR-X12:** the R-10 sub-1-bit commitment is statistical (Shannon-limit-bounded). Cronbach α / ICC are the *psychometric* analogs that quantify whether the basin codebook is internally consistent and reproducible across measurement conditions (arches, model variants, calibration corpora). Without this, R-13's "federated codebook" claim has empirical support but lacks the statistical reliability framework.
+
+**Cost:** 1-2 weeks for skeleton (statistics implementations exist in `ndarray::hpc::statistics`); 2-3 weeks for the proof-of-concept analyses against bgz-hhtl-d's existing 26 palette groups.
+
+---
+
+## 6. Bench plan integration — bgz-hhtl-d's 0.9980 Pearson gate
+
+Per BGZ_HHTL_D.md, bgz-hhtl-d ships with a **certification gate of ≥0.9980 Pearson correlation** between original and reconstructed weight matrices.
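+
+A gate of this kind reduces to one correlation and one comparison. The sketch below is an assumption about how such a check could look, not bgz-hhtl-d's harness: `pearson`, `passes_gate`, and `PEARSON_GATE` are hypothetical names, and whether the real gate runs per matrix or per row is not stated here.
+
+```rust
+/// Certification threshold quoted above (hypothetical constant name).
+pub const PEARSON_GATE: f64 = 0.9980;
+
+/// Pearson correlation between two equal-length slices.
+/// Returns None for mismatched lengths, empty input, or zero variance.
+pub fn pearson(x: &[f32], y: &[f32]) -> Option<f64> {
+    if x.len() != y.len() || x.is_empty() {
+        return None;
+    }
+    let n = x.len() as f64;
+    let mean_x = x.iter().map(|&v| v as f64).sum::<f64>() / n;
+    let mean_y = y.iter().map(|&v| v as f64).sum::<f64>() / n;
+    let (mut cov, mut var_x, mut var_y) = (0.0_f64, 0.0_f64, 0.0_f64);
+    for (&a, &b) in x.iter().zip(y) {
+        let (da, db) = (a as f64 - mean_x, b as f64 - mean_y);
+        cov += da * db;
+        var_x += da * da;
+        var_y += db * db;
+    }
+    if var_x == 0.0 || var_y == 0.0 {
+        return None;
+    }
+    Some(cov / (var_x.sqrt() * var_y.sqrt()))
+}
+
+/// True when the reconstruction clears the gate; undefined correlations fail.
+pub fn passes_gate(original: &[f32], reconstructed: &[f32]) -> bool {
+    matches!(pearson(original, reconstructed), Some(r) if r >= PEARSON_GATE)
+}
+```
+
+Accumulating in `f64` is a deliberate choice in this sketch: a threshold at the fourth decimal place leaves little room for `f32` summation error on large weight matrices.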
+
+**This becomes one of Plan G's bench lanes** (extending R-4's framework):
+
+| Lane | Source | Pass criterion |
+|---|---|---|
+| Video | Big Buck Bunny 1080p | ≥0.95× x265 ultrafast PSNR @ -0.1 dB (R-4) |
+| 3DGS | Mip-NeRF 360 garden scene | ≥30× over PLY-trim (R-10) |
+| Gradient | ResNet-50 ImageNet SGD logs | Match QSGD compression (HG4) |
+| LLM weights | Qwen 3.5 7B (or 1.7B-TTS) | ≥0.9980 Pearson + perplexity Δ ≤ 1.0% on Wikitext-103 |
+
+The Qwen3-TTS-1.7B case is the right size for CI — encode+decode round-trip in ~5 minutes on SPR. The 7B case is the headline number but slower to bench.
+
+**Plan G integration cost:** ~3 days to wire bgz-hhtl-d's existing harness into Plan G's lane structure. The certification scaffolding already exists.
+
+---
+
+## 7. The unification claim — restated
+
+Restated with the new evidence:
+
+**bgz17 / highheelbgz / bgz-tensor / bgz-hhtl-d / jc are the existing implementation of the PR-X12 substrate**, with these named correspondences:
+
+| PR-X12 abstract concept | bgz family concrete implementation |
+|---|---|
+| Skip/Merge/Delta/Escape | Scent/Palette/ZeckBF17/Full (bgz17 4-layer) |
+| 4096-entry basin codebook | `Codebook4096` (bgz-tensor); bgz-hhtl-d's 4 × 16 × 256 Slot D lattice is a separate structure (§2.2) |
+| `CurveOrder<const N>` | Spiral addressing in highheelbgz |
+| `LinearReduce<T> + Basis<T>` | AttentionSemiring + ComposeTable + DistanceTable (bgz-tensor) |
+| Tropical-GEMM (R-7) | `bgz17::scalar_sparse::tropical_spmv` |
+| Federated codebook (R-13) | Shared palette strategy in bgz-hhtl-d (26 groups for 480 tensors) |
+| Formal correctness | jc's Hambly-Lyons Pillar 11 |
+
+**PR-X12 is not the implementation. PR-X12 is the streaming wire format + per-arch dispatch contract + cross-domain architectural story that sits on top of the bgz substrate.**
+
+The codec body (R-3's ≤1500 LoC) is wiring; the heavy lifting (the bgz algebra) is already done. This is a much stronger story for PR-X12 scope than "we're going to build this from scratch."
+
+**The two gaps (jd-nd, Cronbach/ICC research crate) are the architecture-level investments that are missing**, and they pay back over the full consumer ecosystem (burn / candle / lance-graph / surrealdb / MedCare-rs), not just the codec.
+
+---
+
+## 8. Updates this triggers for other PR-X12 docs
+
+This grounding doc invalidates / refines a few claims in the other PR-X12 docs. Recommended edits:
+
+### 8.1 In `pr-x12-canon-resolutions-delta.md`
+
+**R-7 path correction:** tropical-GEMM lives at `bgz17::scalar_sparse::tropical_spmv` (not blasgraph proper — blasgraph is the algebraic family name, but the kernel ships in bgz17). The dep direction `ndarray-codec → lance-graph::bgz17` is allowed under the same rationale.
+
+**R-13 expansion:** the four codebook policy modes (LocalEphemeral, SharedClusterWide, SharedRegional, PretrainedStatic) should reference bgz-hhtl-d's shared-palette strategy as the implementation pattern. Specifically `PretrainedStatic` is the mode bgz-hhtl-d uses by default.
+
+**New R-14 candidate:** formal-correctness contract via jc. Worth surfacing if a fifth-tier resolution slot opens. Could read: "the codec's wire-format determinism and bit-exact cross-arch reproduction are formally proven in `lance-graph/crates/jc/` (Pillar 11, Hambly-Lyons signature uniqueness). PR-X12 cites the proof; does not reprove."
+
+### 8.2 In `pr-x12-gguf-llm-weights-encoding.md`
+
+**§6 (concrete numbers) needs the bgz-hhtl-d footnote:** the 29% estimate is conservative by roughly two orders of magnitude. The real upper bound is bgz-hhtl-d's measured 343:1 on Qwen3-TTS-1.7B, which scales to ~40 MB at 7B, about 110× smaller than Q4_K_M.
+
+**§7 (streaming decode) should reference highheelbgz:** the "values recomputed on-demand" pattern is already implemented as `SpiralAddress` rehydration.
+
+**§9 (bench plan) should swap Qwen 3.5 7B GGUF for Qwen3-TTS-1.7B** as the canonical case — that's where the bgz-hhtl-d certification scaffolding already lives.
+
+### 8.3 In `pr-x12-anti-neural-lookup-inversion.md`
+
+**§3.1 (basin codebook ≡ frozen 1-layer MLP) gains an empirical anchor:** the AttentionSemiring + ComposeTable in bgz-tensor IS the frozen 1-layer NN representation of the attention algorithm, with 640× compression. The lens doc's "speedup: 1000-5000×" is theoretical; what bgz-tensor has measured is a resolution rate (95% of attention pairs resolved by table lookup), not a speedup. The exact figure in cycles still needs measurement.
+
+### 8.4 In `pr-x12-substrate-merged-canon.md`
+
+**§M:E-D (the codec breaks ndarray ↔ lance-graph cycle):** the codec's actual dependency target is `lance-graph::bgz17`, not generic blasgraph. Update the citation.
+
+**§M:H-1 (one codec, four loads):** add the fifth load (LLM weights) AND note that bgz-tensor's 640× compression on transformer weights is the empirical realization of M:H-1 for that load.
+
+---
+
+## 9. Suggested next steps (ordered)
+
+1. **Read the bgz17 + bgz-tensor + bgz-hhtl-d sources end-to-end** (1-2 hours). The Explore agent's summary is accurate; the source confirms it. Worth doing before drafting any further PR-X12 code.
+
+2. **Update `pr-x12-canon-resolutions-delta.md`** with R-7 path correction and R-13 expansion (small edits, ~30 min).
+
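+3. **Open a tracking issue for `jd-nd` crate creation.** Scope: ~500 LoC initial skeleton with three planned pillars (basis correctness, reducer determinism, SIMD path audit). Cost: 2-3 weeks for the skeleton plus the first pillar (§5.1); the remaining pillars accumulate afterwards.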
+
+4. **Scope decision on Cronbach/ICC research crate.** Options: (a) extend existing `lance-graph/crates/lance-graph-codec-research/`, (b) new `ndarray/crates/encoding-reliability/`, (c) defer until consumer pressure surfaces. Recommend (a) — extending the existing crate is less work and the dep direction is right.
+
+5. **In PR-X12 Plan G work**: wire bgz-hhtl-d's certification harness into the LLM-weights lane (the fourth lane added by the GGUF lens doc). Reuse, don't reinvent.
+
+6. **In PR-X12 codec body**: when the basin-codebook lookup lands, target `lance-graph::bgz17::Palette::nearest_index` as the underlying call, not a fresh k-means impl. This avoids duplicating the 4-layer cascade and makes the metric-safety guarantees automatic.
+
+7. **In PR-X12 documentation**: reference `lance-graph/crates/bgz17/KNOWLEDGE.md` as the canonical doc for the substrate algebra; PR-X12's docs are the stream-codec + per-arch-dispatch overlay.
+
+---
+
+## 10. Cross-references
+
+- **Existing crates:**
+  - `lance-graph/crates/bgz17/KNOWLEDGE.md` — the canonical substrate doc
+  - `lance-graph/crates/bgz-tensor/BGZ_HHTL_D.md` — bgz-hhtl-d weight encoding spec
+  - `lance-graph/crates/bgz-tensor/Cargo.toml` — feature gates and dep list
+  - `lance-graph/crates/jc/examples/`: 10 runnable proofs, including the Hambly-Lyons material
+- **PR-X12 docs to update (per §8):**
+  - `pr-x12-canon-resolutions-delta.md` (R-7 path, R-13 expansion, optional R-14)
+  - `pr-x12-gguf-llm-weights-encoding.md` (§6 numbers, §7 streaming-decode reference, §9 bench target)
+  - `pr-x12-anti-neural-lookup-inversion.md` (§3.1 empirical anchor)
+  - `pr-x12-substrate-merged-canon.md` (M:E-D, M:H-1)
+- **Architectural overview:** `pr-x12-substrate-merged-canon.md`
+- **Related rules:** `/home/user/ndarray/CLAUDE.md` (architecture rule: ndarray = hardware, lance-graph = thinking)
+- **In flight:** PR #195 (A2 + A3-intra codec foundation) on `claude/continue-ndarray-x0Oaw`
+
+_Last edit: 2026-05-22._
diff --git a/.claude/knowledge/pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md b/.claude/knowledge/pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md
new file mode 100644
index 00000000..3ba29be3
--- /dev/null
+++ b/.claude/knowledge/pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md
@@ -0,0 +1,396 @@
+# PR-X12 ↔ cam_pq + SigKer + dn_tree/merkle — Substrate Bindings & Identified Gaps
+
+> Date: 2026-05-22
+> Status: **substrate-binding doc** — extends `pr-x12-bgz-jc-substrate-synergies.md` with three more existing primitives the PR-X12 substrate depends on but hasn't yet named explicitly: `ndarray::hpc::cam_pq` (codebook trainer), `lance-graph/crates/sigker` (signature-kernel formal-proof bedrock), and `ndarray::hpc::{dn_tree, merkle_tree}` (online-update + integrity infrastructure).
+>
+> Premise: bgz17 and bgz-hhtl-d don't appear out of thin air. The k-means that produces their palettes lives in `cam_pq`. The formal uniqueness claim that justifies the codec's correctness lives in `sigker` (cited by jc Pillar 11). The federated-codebook gossip + integrity contract that R-13 commits to has substrate in `dn_tree` and `merkle_tree`. Three more pieces of the PR-X12 architecture are *already implemented*; this doc names them and surfaces the remaining gaps.
+
+---
+
+## 0. Thesis in one paragraph
+
+`cam_pq` is the **codebook trainer** that produces the palettes consumed by `bgz17::palette` and `bgz-tensor::HhtlCascade` — a FAISS-style 6×8 Product Quantizer with 6-byte fingerprints (HEEL / BRANCH / TWIG_A / TWIG_B / LEAF / GAMMA semantic bytes), implementing k-means in three modes (geometric / semantic / hybrid). `SigKer` is the **formal-proof bedrock** for the codec's path-signature uniqueness claim — Chen-Lyons path signatures, Hambly-Lyons 2010 uniqueness theorem (Annals of Mathematics 171(1):109–167), Salvi 2020 PDE solver (arXiv:2006.14794), Cuchiero-Schmocker-Teichmann 2021 randomized-signature universality. **Note:** `sigker::signature_kernel_pde` ships a known math bug in the Goursat-PDE form (diverges from the true `I₀(2·√⟨u, v⟩)` at moderate inner products — see PR #350); production-ready path is `signature_truncated` (tensor-algebra) which is what jc Pillar 11 uses for its certification. `dn_tree` and `merkle_tree` are the **online-update and integrity substrate** for the federated-codebook policy (R-13) — quaternary plastic memory + 8-Kbit Blake3 proof tree, both already in `ndarray::hpc::` but not yet wired into the codec. The PR-X12 codec body is ~1500 LoC sitting on top of an ~25 KLoC substrate that already exists.
+
+---
+
+## 1. `cam_pq` — the codebook trainer + ADC backend
+
+### 1.1 What it is
+
+**Location:** `src/hpc/cam_pq.rs` (this repo)
+
+**Algorithm:** Content-Addressable Memory (CAM) + Product Quantization (PQ). Unifies FAISS PQ6×8 (48-bit fingerprints, 6 subspaces × 256 centroids each) with CLAM 48-bit archetypes into a single codec.
+
+- **"CAM"** = the 6-byte *semantic* labeling: each byte is one of {HEEL, BRANCH, TWIG_A, TWIG_B, LEAF, GAMMA} — discrete labels rather than just opaque centroid IDs.
+- **"PQ"** = the 6 *subspace* product quantization: input vector of dim d is split into 6 sub-vectors of d/6, each quantized to one of 256 centroids per subspace.
+
+**Result:** every input vector → 6-byte fingerprint (48 bits, versus the 32 bits of bgz-hhtl-d's 4-byte slot encoding), with both *combinatorial* identity (which centroid in each subspace) and *semantic* identity (the CAM byte type per subspace).
+
+### 1.2 Public surface
+
+```rust
+// From src/hpc/cam_pq.rs (per Explore agent's read):
+pub struct CamCodebook { /* 6 × SubspaceCodebook */ }
+pub struct SubspaceCodebook { /* 256 centroids in d/6 dims */ }
+pub struct CamFingerprint(pub [u8; 6]);
+pub struct DistanceTables { /* 6 × 256 = 6 KB, L1-resident */ }
+pub struct PackedDatabase { /* stroke-aligned 1B / 2B / 6B storage */ }
+
+pub fn kmeans(data: &[f32], k: usize, dim: usize, iterations: usize) -> Vec<f32>;
+pub fn train_geometric(...) -> CamCodebook;   // Lloyd's k-means per subspace, farthest-first init
+pub fn train_semantic(...)  -> CamCodebook;   // geometric init + label-guided push/pull on centroids
+                                              // (jaccard similarity on label sets, NOT CLAM archetypes)
+pub fn train_hybrid(...)    -> CamCodebook;   // train_semantic with default alpha=0.1
+pub fn squared_l2(a: &[f32], b: &[f32]) -> f32;
+```
+
+**ADC (Asymmetric Distance Computation):** 6 table lookups + sum (uniform across the FAISS-PQ tradition). Distance computation is L1-cache-resident: 6 × 256 × 2 B = 3 KB per query with u16 distances, or 6 KB with 4-byte distances.
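+
+The "6 table lookups + sum" pattern can be sketched as below. This is a toy illustration, not `cam_pq`'s code: `ToyCodebook`, `build_tables`, and `adc_distance` are hypothetical names, the flat centroid layout is an assumption, and `f32` table entries are chosen only to keep the example simple.
+
+```rust
+const SUBSPACES: usize = 6;
+const CENTROIDS: usize = 256;
+
+/// Toy codebook: `centroids[s]` stores 256 centroids of `sub_dim` floats,
+/// flattened row-major. Hypothetical layout, not `CamCodebook`'s.
+pub struct ToyCodebook {
+    pub sub_dim: usize,
+    pub centroids: Vec<Vec<f32>>,
+}
+
+impl ToyCodebook {
+    /// Fixture: centroid `c` of every subspace is the constant vector [c, c, ...].
+    pub fn ramp(sub_dim: usize) -> Self {
+        let one = (0..CENTROIDS)
+            .flat_map(|c| std::iter::repeat(c as f32).take(sub_dim))
+            .collect::<Vec<f32>>();
+        ToyCodebook { sub_dim, centroids: vec![one; SUBSPACES] }
+    }
+}
+
+fn squared_l2(a: &[f32], b: &[f32]) -> f32 {
+    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
+}
+
+/// Per-query precomputation: squared distance from each query sub-vector
+/// to every centroid of the matching subspace (6 tables of 256 entries).
+pub fn build_tables(cb: &ToyCodebook, query: &[f32]) -> Vec<[f32; CENTROIDS]> {
+    assert_eq!(query.len(), SUBSPACES * cb.sub_dim);
+    (0..SUBSPACES)
+        .map(|s| {
+            let q = &query[s * cb.sub_dim..(s + 1) * cb.sub_dim];
+            let mut table = [0.0f32; CENTROIDS];
+            for (c, slot) in table.iter_mut().enumerate() {
+                let centroid = &cb.centroids[s][c * cb.sub_dim..(c + 1) * cb.sub_dim];
+                *slot = squared_l2(q, centroid);
+            }
+            table
+        })
+        .collect()
+}
+
+/// ADC: six table lookups and a sum; the database vectors are never touched.
+pub fn adc_distance(tables: &[[f32; CENTROIDS]], fp: &[u8; SUBSPACES]) -> f32 {
+    fp.iter().zip(tables).map(|(&code, t)| t[code as usize]).sum()
+}
+```
+
+The asymmetry is that only the database side is quantized: the query stays in full precision while the tables are built once per query, and every candidate fingerprint then costs six loads and five adds.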
+
+**Early-exit:** `PackedDatabase` ships stroke-aligned storage with 1-byte / 2-byte / 6-byte CAM strides → **99% early-rejection** via partial-fingerprint comparison before full ADC. This is a non-trivial throughput optimization.
+
+### 1.3 Connection to bgz17 + bgz-tensor + bgz-hhtl-d
+
+**Direct imports** (per Explore agent's grep):
+
+- `bgz-tensor/src/adaptive_codec.rs` imports `cam_pq::train_geometric, kmeans, squared_l2`
+- `bgz-tensor/src/holographic_residual.rs` imports `cam_pq::kmeans`
+- `bgz-tensor/src/had_cascade.rs` imports `cam_pq::squared_l2`
+- `bgz17` palette codec uses cam_pq for calibration
+
+**So cam_pq IS the k-means engine that trains every basin codebook in the bgz family.** The shared 256-entry palettes that bgz-hhtl-d ships get their centroids from `cam_pq::train_geometric()`.
+
+### 1.4 Mapping cam_pq's CAM bytes onto bgz-hhtl-d's HHTL bits
+
+The 6-byte CAM fingerprint and bgz-hhtl-d's 4-byte slot encoding overlap structurally:
+
+| CAM byte | bgz-hhtl-d Slot D | Role |
+|---|---|---|
+| HEEL (byte 0) | `Ba` bits 15:14 (2 bits) | Tensor-family basin (QK / V / Gate / FFN) |
+| BRANCH (byte 1) | `HIP` bits 13:10 (4 bits) | 16-way family discriminant within basin |
+| TWIG_A (byte 2) | `TWIG` bits 9:2 (low byte) | 256-way centroid index, low |
+| TWIG_B (byte 3) | `TWIG` bits 9:2 (high byte) | (same field, no high byte at 8b TWIG) |
+| LEAF (byte 4) | `P` + `R` bits 1:0 (2 bits) | Polarity + reserved |
+| GAMMA (byte 5) | Slot V (16 bits) | BF16 residual from centroid (full byte 5 + 1 of Slot V) |
+
+**Observation:** the bgz-hhtl-d format **compresses cam_pq's 6-byte CAM** down to 4 bytes by:
+- Folding TWIG_A + TWIG_B into a single 8-bit TWIG (since 256 centroids fit in 8 bits, there is no need for 16; the 6 subspaces × 256 centroids parametrization was for full PQ, whereas HHTL uses a *single* shared 256-entry palette)
+- Folding LEAF into 2 bits (polarity + reserved)
+- Folding GAMMA into the 16-bit BF16 residual (Slot V)
+
+This is **cam_pq compressed via the HHTL prior:** since transformer weights cluster strongly per role (Q/K/V/O/Gate/Up/Down), the 6-subspace PQ generalization is over-parametrized — bgz-hhtl-d drops to a single shared palette per role and recovers the savings.
+
+### 1.5 PR-X12 mapping
+
+For the codec's `R-13` federated codebook handle:
+
+```rust
+pub enum CodebookPolicy {
+    LocalEphemeral,                    // each encoder owns its codebook
+    SharedClusterWide { ttl: Duration }, // gossip protocol distributes
+    SharedRegional { region: Region },   // edge-tier sharing
+    PretrainedStatic { id: BlobId },     // immutable, served from CAS
+}
+```
+
+The codebook implementation is `cam_pq::CamCodebook`. The four policy variants control *who owns* and *when refreshes happen*; the bytes-on-disk format is the cam_pq one. **PR-X12 doesn't need to define a new codebook layout — it inherits cam_pq's.**
+
+### 1.6 Three gaps in the cam_pq integration
+
+**G-1: Activation-aware training mode is unused.** `cam_pq::train_semantic()` exists with label-guided push/pull on centroids (Jaccard similarity on label sets, not CLAM archetypes), which is the closest existing hook for the GPTQ/AWQ-style activation-weighting from the GGUF lens doc (`pr-x12-gguf-llm-weights-encoding.md` §5). bgz-hhtl-d ships *only* `train_geometric()` (L2-error minimization). Wiring `train_semantic()` into bgz-hhtl-d's calibration is a low-cost, high-value change (~1-2 days).
+
+**G-2: `PackedDatabase`'s 99% early-exit not in the codec stream-decode path.** PackedDatabase is used by CAKES nearest-neighbour search to reject 99% of candidates before full ADC. The codec's stream-decode pass currently does full ADC per cell. Wiring the partial-fingerprint prefilter into the codec would speed decode by ~10-50× on Skip-dominant streams.
+
+**G-3: CAM semantic bytes don't propagate to the PR-X12 wire-format header.** The 16-bit codec header has `header_kind` (2b) + `basin_index` (12b) + `leaf_size` (2b). No field carries HEEL/BRANCH/etc. labels. For *interpretation* in the consumer crates (e.g., `woa-rs` orchestration knowing whether a cell is a Q-projection vs an FFN-gate), the semantic byte would be useful. **Proposal:** reserve a 4-byte "semantic header" extension in the framing layer (A8) that ships once per CTU, separate from the per-cell header.
+
+---
+
+## 2. `SigKer` — the formal-proof bedrock for stream uniqueness
+
+### 2.1 What it is
+
+**Location:** `crates/sigker/` in the external `adaworldapi/lance-graph` repo (not in this `ndarray` repo)
+
+**Algorithm:** Path-signature representations for sequential / path-structured data. Implements Chen-Lyons signatures S(X) = (1, ∫dX, ∫∫dX⊗dX, …) up to depth N, with shuffle-product algebra and proven uniqueness.
+
+**Public surface:**
+
+```rust
+pub struct Signature { /* truncated signature up to depth N */ }
+pub struct RandomizedSignature { /* finite-dim projection */ }
+pub struct RandomizedSignatureBuilder { /* construction */ }
+
+pub fn signature_kernel(a: &Signature, b: &Signature) -> f64;     // truncated tensor-algebra L²
+pub fn signature_kernel_pde(path_a: &[f32], path_b: &[f32]) -> f64; // full kernel via Goursat PDE
+pub fn shuffle_product(a: &Signature, b: &Signature) -> Signature;
+
+pub struct CodecRouteSigker { /* lance-graph codec routing integration */ }
+```
+
+**Zero production dependencies.** Same posture as `bgz17` and `deepnsm` — no external crates pulled in default features.
+
+### 2.2 arXiv anchors
+
+| Paper | Year | What it provides |
+|---|---|---|
+| Chen, "Iterated integrals and exponential homomorphisms" | 1957 | Original signature construction |
+| Lyons, "Differential equations driven by rough signals" | 1998 | Rough path theory, signature universal approximator |
+| Hambly-Lyons, "Uniqueness for the signature of a path of bounded variation" (**arXiv:math/0507536**, Annals of Mathematics 171(1):109–167) | 2010 | **Theorem 4: signatures uniquely determine paths up to tree-like equivalence** |
+| Salvi-Cass-Foster-Lyons-Yang | 2020 | **arXiv:2006.14794**: Goursat-PDE solver for the signature kernel, O(T₁·T₂·d), no signature materialization |
+| Cuchiero-Schmocker-Teichmann | 2021 | **Randomized signature universality**: any continuous path-functional ≈ linear combo of randomized-signature coordinates |
+
+### 2.3 jc's Pillar 11 — current status
+
+**Location:** `lance-graph/crates/jc/src/hambly_lyons.rs`
+
+**Feature gate:** `jc/Cargo.toml` includes `hambly-lyons = ["dep:sigker"]`. Default JC build is zero-dep (Pillar 11 reports `DEFERRED`); `cargo build --features hambly-lyons` pulls in `sigker` and **fully activates the probe**.
+
+**Pillar 11 proves** (Hambly-Lyons 2010 Theorem 4): for paths X, Y of bounded variation in ℝ^d, S(X) = S(Y) ⟺ X and Y are equal modulo tree-like equivalence (the smallest equivalence relation identifying any sub-path with its concatenated reverse).
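+
+The statement can be checked by hand at depth 2. The sketch below is a standalone illustration and does not call `sigker`; `signature_depth2` is a hypothetical helper that accumulates the first two signature levels of a piecewise-linear path using Chen's identity.
+
+```rust
+/// Depth-2 truncated signature of a piecewise-linear path in R^D.
+/// Returns (level 1, level 2); the constant level-0 term 1 is implicit.
+pub fn signature_depth2<const D: usize>(points: &[[f64; D]]) -> ([f64; D], [[f64; D]; D]) {
+    let mut s1 = [0.0; D];
+    let mut s2 = [[0.0; D]; D];
+    for w in points.windows(2) {
+        // Increment of the current linear segment.
+        let mut delta = [0.0; D];
+        for i in 0..D {
+            delta[i] = w[1][i] - w[0][i];
+        }
+        // Chen's identity for appending one linear segment:
+        // S2 += S1 (x) delta + 0.5 * delta (x) delta, then S1 += delta.
+        for i in 0..D {
+            for j in 0..D {
+                s2[i][j] += s1[i] * delta[j] + 0.5 * delta[i] * delta[j];
+            }
+        }
+        for i in 0..D {
+            s1[i] += delta[i];
+        }
+    }
+    (s1, s2)
+}
+```
+
+A path that walks out and straight back (a tree-like excursion) yields zero level-1 and level-2 terms, while two L-shaped paths with the same endpoints agree at level 1 and differ at level 2.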
+
+**The probe** runs against `sigker::signature_truncated` (the tensor-algebra path), N=100 random pairs in d=3 at depth-2. **It deliberately avoids `signature_kernel_pde`** because that kernel ships a known math bug (PR #350: Goursat-PDE form diverges from the true signature kernel `I₀(2·√⟨u, v⟩)` at moderate inner products). The certification is independent of the PDE-form fix.
+
+**Status:** **ACTIVE under `--features hambly-lyons`** (activated 2026-05-07 once sigker landed in the workspace via PR #348). The "DEFERRED" reading is only the default-build fallback — under the feature gate, the probe executes and passes (forward < 1e-9, converse > 0.05, discrimination ratio ≥ 1e6).
+
+**What Pillar 11 actually certifies:** `sigker`'s **Index-regime classification** — that two paths with equal truncated signatures are equal up to tree-quotient. It does **NOT** directly certify `bgz` wire-format quantization. The bgz / CAM-PQ correctness proof is **Pillar 10 (Pflug-Pichler)**, which proves nested-distance Lipschitz on Sigma DN-trees — "CAM-PQ tree quantization preserves FreeEnergy within Lε."
+
+### 2.4 PR-X12 mapping
+
+#### 2.4.1 Path signatures ARE a `Basis<T>` impl
+
+Recall R-1 / §M:E-A: `Basis<T>` is "basis-as-data" with parametric `apply`. The truncated signature of a path IS exactly this — basis vectors are the tensor-algebra elements at each depth, apply is iterated integration.
+
+```rust
+impl<const DEPTH: usize> Basis<f32> for SignatureBasis<DEPTH> {
+    type Params = ();
+    fn apply<R: Reducer<f32>>(
+        &self,
+        path: &[f32],         // input path samples
+        signature: &mut [f32], // output truncated signature
+        _: &(),
+        r: R,
+    ) {
+        // iterated integral computation, depth-truncated
+        // Same trait shape as DctIIBasis<N>, EwaSplatBasis
+    }
+}
+```
+
+This is the **third Basis<T> impl** (after DCT-II and EWA splat) and the first that targets *streams* rather than 2D arrays. The trait surface holds.
+
+#### 2.4.2 Goursat-style streaming kernel IS the streaming-decode pattern
+
+Per the Salvi 2020 paper (arXiv:2006.14794): the signature kernel can be computed via a Goursat PDE in O(T₁ · T₂ · d) time **without materializing the signature itself**. This is exactly the engineering pattern PR-X12's streaming-decode-during-GEMM uses (the GGUF lens §7) — compute the result without materializing the dequantized tensor.
+
+**Caveat:** the current `sigker::signature_kernel_pde` ships a known math bug (PR #350: the Goursat-PDE form diverges from the true kernel `I₀(2·√⟨u, v⟩)` at moderate inner products). The corrected form is queued; until then, production code should use `sigker::signature_truncated` (the tensor-algebra path) or `linear_path_kernel_closed_form` for the linear-path special case. The *engineering pattern* (1D sweep over a bitstream that accumulates results without materializing intermediates) is correct and re-usable by PR-X12 regardless of which kernel implementation lands.
+
+#### 2.4.3 Randomized signature universality = "4 modes cover any source"
+
+Cuchiero-Schmocker-Teichmann 2021 proved: any continuous functional of a path can be approximated arbitrarily well by linear combinations of randomized-signature coordinates. The randomized signature is a finite-dim projection of the (infinite-dim) full signature.
+
+**PR-X12's claim:** Skip/Merge/Delta/Escape with a 4096-entry basin codebook captures any source distribution to within a Shannon-bounded ε. This claim is *empirically* observed (95% Skip-rate at Layer 0-1 in bgz17, 343:1 compression on Qwen3-TTS-1.7B in bgz-hhtl-d) but lacks a *formal* foundation.
+
+**The randomized-signature universality theorem provides exactly that formal foundation.** The four modes are a discrete approximation of the randomized-signature coordinates; the 4096-entry codebook is a finite quantization of the universal-approximator space.
+
+This is the **R-14 candidate** flagged in `pr-x12-bgz-jc-substrate-synergies.md` §8.1 — a formal-correctness contract via sigker + jc Pillar 11.
+
+### 2.5 Two gaps in the sigker integration
+
+**G-4: PR #350 (signature_kernel_pde correction) + Pillar 11 production benchmarking.** Pillar 11 itself is *active* under the feature gate and passes its probe (forward < 1e-9, converse > 0.05, discrimination ratio ≥ 1e6 over N=100 pairs in d=3). What's *deferred* is (a) the corrected Goursat-PDE form that fixes `signature_kernel_pde`'s divergence at moderate inner products, and (b) production-scale benchmarking at full carrier widths (the d=3, depth-2 probe is correctness-only, not performance). **Cost:** 1-2 weeks of bench + PR #350 landing, blocking R-14's formal-correctness commitment at production scale.
+
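+**G-5: No SignatureBasis<DEPTH> impl in `ndarray::hpc::`.** The trait shape exists (Basis<T> in M:E-A / R-1) but there is no concrete signature impl. **Proposal:** add `SignatureBasis<const DEPTH: usize>: Basis<f32>` as a third concrete impl alongside `DctIIBasis<N>` and `EwaSplatBasis`. The implementation is mostly a wrapper around `sigker::signature_truncated` (the tensor-algebra path); `sigker::signature_kernel_pde` is not a suitable base until PR #350 lands, and it returns a kernel value rather than the signature coordinates that `apply` must write. **Cost:** ~1 week, modest LoC.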
+
+This unlocks: **path-structured codec lanes** in Plan G (audio waveforms, time-series, gesture/handwriting streams) using the same trait surface as DCT-II for video. A fifth bench lane in Plan G ("stream signal" with sigker), alongside the four loads that R-4 already thresholds (video, 3DGS, KV cache, gradient), would round out the codec's path-structured story.
+
+---
+
+## 3. `dn_tree` and `merkle_tree` — online-update and integrity substrate
+
+### 3.1 dn_tree — quaternary plastic memory
+
+**Location:** `src/hpc/dn_tree.rs` (this repo)
+
+**Algorithm:** Quaternary hierarchical bitmap summary tree for plastic graph traversal. Adapted from "On Demand Memory Specialization for Distributed Graph Processing" (2013). Properties:
+
+- **Quaternary fanout:** 4 children per node. The count equals PR-X12's 4-mode taxonomy, but the children partition the prototype-index range rather than the modes (see §3.3.2)
+- **Lossy hierarchical summaries** using bundled `GraphHV` hypervectors (3 channels × 256 words = 16,384 bits each)
+- **Partial Hamming similarity** on prefix bits for fast descent
+- **Plastic bundling** + exponential decay on access (biological LTP/LTD)
+- **BTSP-inspired stochastic gating** (CaMKII-like boost for high-confidence updates)
+
+**Public types:** `DNConfig`, `DNNode`, `TraversalHit`, `SplitMix64` (RNG).
+
+**Latency:** update ~30 ns/level, traverse 180-420 ns. L1/L2-cache-resident at scale.
+
+### 3.2 merkle_tree — integrity proof for CogRecord regions
+
+**Location:** `src/hpc/merkle_tree.rs` (this repo)
+
+**Algorithm:** 8-Kbit Merkle tree built from CogRecord regions as a compressed searchable proxy. Properties:
+
+- **Hash:** Blake3 truncated to 48 bits (`MerkleRoot = [u8; 6]` — same byte width as cam_pq's `CamFingerprint = [u8; NUM_SUBSPACES]` where NUM_SUBSPACES = 6, though the semantic content differs: one is hash bytes, the other is centroid indices)
+- **Layout:** 8 branches × 8 leaves-per-branch = 64 leaves, packed into 128 × u64 = 1 KB (8 Kbit) padded buffer for SIMD alignment. Semantic content is 48 + 384 + 3072 = 3504 bits; the rest is zero-padding.
+- **Branch indices** (per `BRANCH_REGIONS` constant): 0=identity, 1=nars, 2=edges, 3=rl, 4=bloom, 5=qualia, 6=adjacency, 7=content
+- **Change detection:** `StaunenType` enum with 6 explicit variants — `Wisdom` (no change), `ContentChanged` (branch 7 only), `NarsChanged` (branch 1 only), `EdgesChanged` (branch 2 only), `QualiaChanged` (branch 5 only), `MultipleChanges(Vec<u8>)` (catch-all carrying the list of differing branch indices). Note: branches 0/3/4/6 don't get their own single-change variant; they fall into `MultipleChanges` even when only one of them differs.
+- **`xor_diff`:** panCAKES compression — XOR two Merkle trees' bits arrays, rebuild root/branches/leaves. The XOR-diff is what gossip transmits.
+
+Both `MerkleTree::hamming` and `MerkleTree::diff_sparsity` are SIMD-accelerated (via `hamming_distance_raw` over the 1 KB byte view). The tree is hashable in O(n) where n is the CogRecord size, and the 1 KB output is L1-cache-resident.
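+
+The bit accounting and the XOR/popcount relationship can be sketched in a few lines. This is a scalar illustration only: `xor_words` and `hamming` here are hypothetical stand-ins, not the `MerkleTree` methods, and the root/branch/leaf rebuild that `xor_diff` performs is omitted.
+
+```rust
+/// Number of u64 words in the padded 1 KB tree buffer (128 * 64 = 8192 bits).
+pub const TREE_WORDS: usize = 128;
+
+/// XOR two padded tree buffers word by word.
+pub fn xor_words(a: &[u64; TREE_WORDS], b: &[u64; TREE_WORDS]) -> [u64; TREE_WORDS] {
+    let mut out = [0u64; TREE_WORDS];
+    for i in 0..TREE_WORDS {
+        out[i] = a[i] ^ b[i];
+    }
+    out
+}
+
+/// Hamming distance between two buffers = popcount of their XOR.
+pub fn hamming(a: &[u64; TREE_WORDS], b: &[u64; TREE_WORDS]) -> u32 {
+    xor_words(a, b).iter().map(|w| w.count_ones()).sum()
+}
+
+/// Bits of hash content: 1 root + 8 branches + 64 leaves, 48 bits each.
+pub const fn semantic_bits() -> usize {
+    48 * (1 + 8 + 64)
+}
+```
+
+Two identical trees XOR to an all-zero buffer, so a sparse diff is cheap to detect and to transmit.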
+
+### 3.3 PR-X12 mapping
+
+#### 3.3.1 dn_tree IS the online-update substrate for R-13's `SharedClusterWide`
+
+R-13 commits the codec to a swappable codebook handle with four policy modes. `SharedClusterWide` is the runtime-updated mode where a cluster of encoders gossips codebook changes.
+
+**The substrate decision:** how to merge incoming gossip updates into the local codebook without losing accumulated signal? Standard answer is "overwrite with latest" — but that loses the priors. dn_tree's plastic bundling + exponential decay handles exactly this: gossip updates bundle into the existing structure with decaying influence, recent updates dominate without erasing history.
+
+**Proposal:** `R-13::SharedClusterWide` is implemented via `dn_tree::DNNode` per codebook entry, not via raw HashMap or RwLock. The quaternary fanout partitions the prototype-index range; per-mode statistics (Skip/Merge/Delta/Escape) would have to be layered on top, as §3.3.2 describes.
+
+**Cost:** ~2-3 weeks to wire dn_tree into the codec's codebook handle. Modest LoC (the trait exists), but design work to choose the right plastic-decay parameters.
+
+#### 3.3.2 dn_tree's 4-way fanout — structural suggestiveness, not literal mode-stats
+
+**Correction from earlier framing:** dn_tree's quaternary structure is NOT a literal "Skip/Merge/Delta/Escape per child" container. Looking at the source (`DNTree::split_node`, `DNTree::select_child`): the 4 children are **equal-width quadrants of the prototype-index range** (`[lo, lo+q), [lo+q, lo+2q), [lo+2q, lo+3q), [lo+3q, hi)` where `q = (hi - lo) / 4`). The fanout is a *spatial partition*, not a *mode discriminant*.
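+
+The quadrant arithmetic quoted above can be written out as a short sketch. `quadrants` and `child_for` are hypothetical names, not the `DNTree::split_node` / `DNTree::select_child` implementations; the sketch assumes `lo <= hi` and that the last quadrant absorbs any remainder, as the `[lo+3q, hi)` bound implies.
+
+```rust
+/// Split the half-open prototype range [lo, hi) into four quadrants
+/// with q = (hi - lo) / 4. Requires lo <= hi.
+pub fn quadrants(lo: usize, hi: usize) -> [(usize, usize); 4] {
+    let q = (hi - lo) / 4;
+    [
+        (lo, lo + q),
+        (lo + q, lo + 2 * q),
+        (lo + 2 * q, lo + 3 * q),
+        (lo + 3 * q, hi),
+    ]
+}
+
+/// Index of the quadrant containing `proto`, or None if it is out of range.
+pub fn child_for(lo: usize, hi: usize, proto: usize) -> Option<usize> {
+    quadrants(lo, hi)
+        .iter()
+        .position(|&(start, end)| start <= proto && proto < end)
+}
+```
+
+Nothing in this arithmetic refers to Skip/Merge/Delta/Escape: the child index is determined by the prototype index alone.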
+
+What's structurally suggestive is that **a 4-mode discriminant could be layered on top** of dn_tree's existing infrastructure: each prototype slot could carry per-mode counts (Skip/Merge/Delta/Escape) bundled into the existing `GraphHV` summaries via the same plastic-bundling primitive (`bundle_into`). The 4-children fanout doesn't impose this — it permits it.
+
+For **mode-distribution drift detection**, the practical wiring is: add per-mode access counters to `DNNode` (cheap, 4×u32 = 16 bytes per node), and use `DNTree::traverse` to find leaves whose mode-distribution diverges most from the prior. If a codec instance is seeing 95% Skip on the training distribution and drops to 60% Skip on a new input, the divergence is detectable via the existing `partial_similarity` mechanism over the per-mode counts. **dn_tree as a substrate works for this; the 4-fanout matching the 4 modes is a structural coincidence, not a load-bearing identity.**
+
+This is one of the things M:H-NEW-1's "Plan G falsifiability test" should measure but currently doesn't. dn_tree gives us the data structure to do so.
+
+#### 3.3.3 Merkle tree IS the integrity proof for R-13 distribution
+
+When q2's gossip protocol distributes a codebook update to N edge nodes, how do consumers verify the update wasn't tampered with mid-transit? Merkle root.
+
+The 8-Kbit Blake3-48 Merkle layout in `merkle_tree.rs` shares its 6-byte record width with cam_pq's `CamFingerprint` and is likewise L1-resident, although the bytes mean different things (hash bytes vs centroid indices, see §3.2). The codebook update can carry the 1 KB Merkle tree buffer, which contains the 6-byte root, as the first 1 KB of the update payload; consumers verify the root before merging into the local dn_tree.
+
+**Proposal:** R-13's payload format is `[Merkle tree (1 KB, 6-byte root included)] + [codebook delta (cam_pq encoded)]`. q2 implements the gossip protocol; ndarray::hpc::merkle_tree implements the verification.
+
+**Cost:** ~1 week to integrate Merkle verification into the codec's codebook-update path. The Merkle infrastructure already exists; this is wiring.
+
+### 3.4 Two gaps in dn_tree / merkle_tree integration
+
+**G-6: dn_tree not wired into any codec or codebook-update path.** Currently only used for pillar tests (`pillar/btsp_unbiased.rs`, `pillar/tree_balance.rs`, `pillar/hhtl_contraction.rs`). **Blocking R-13's `SharedClusterWide` mode.**
+
+**G-7: merkle_tree not wired into federated codebook distribution.** Currently only used for `surround_metadata.rs` change detection. **Blocking R-13's integrity guarantee for SharedClusterWide / SharedRegional modes.**
+
+---
+
+## 4. The unified picture — all 8 substrate primitives now identified
+
+Updating the inventory from `pr-x12-bgz-jc-substrate-synergies.md` §7 with the three new primitives:
+
+| PR-X12 abstract concept | Concrete implementation |
+|---|---|
+| Skip/Merge/Delta/Escape | `bgz17` 4-layer cascade (Scent/Palette/ZeckBF17/Full) |
+| 4096-entry basin codebook | `bgz-tensor::Codebook4096` (literal 4096-entry type), trained by **`cam_pq`**. `bgz-hhtl-d` is a *different* basin-codebook strategy (4-basin × 16-HIP × 256-TWIG = 16,384-cell address space over a shared 256-entry palette) — not the canonical 4096 |
+| `CurveOrder<const N>` | `highheelbgz` spiral addressing |
+| `LinearReduce<T> + Basis<T>` | `bgz-tensor` AttentionSemiring + ComposeTable + DistanceTable; **`sigker::SignatureBasis`** (proposed) |
+| Tropical-GEMM (R-7) | `bgz17::scalar_sparse::tropical_spmv` |
+| Federated codebook (R-13) | `bgz-hhtl-d` shared-palette + **`cam_pq::CamCodebook`** + **`dn_tree`** (online update) + **`merkle_tree`** (integrity) |
+| Formal correctness — codec quantization | `jc` **Pillar 10 (Pflug-Pichler)** — nested-distance Lipschitz on Sigma DN-trees, certifies CAM-PQ tree quantization preserves FreeEnergy within Lε |
+| Formal correctness — path-signature lane | `jc` **Pillar 11 (Hambly-Lyons)** via **`sigker`** — certifies Index-regime classification (sigker only, not bgz) |
+| Activation-aware RDO | **`cam_pq::train_semantic`** (exists, unused) |
+
+**Eight primitives, six already implemented, three under-wired.** What PR-X12 ships is the *wire format + per-arch dispatch contract + cross-domain story* that knits them into one codec.
+
+---
+
+## 5. Seven concrete gaps (consolidated)
+
+| Gap | Component | Cost | Blocking |
+|---|---|---|---|
+| **G-1** | Activation-aware codebook training (cam_pq::train_semantic) not used by bgz-hhtl-d | 1-2 days | GGUF lens activation-aware RDO claim |
+| **G-2** | cam_pq::PackedDatabase 99% early-exit not in codec stream-decode path | 1-2 weeks | Decode throughput on Skip-dominant streams |
+| **G-3** | CAM semantic bytes (HEEL/BRANCH/etc.) don't propagate to PR-X12 wire-format header | 3-5 days (wire-format extension in A8) | Consumer-side semantic interpretation |
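+| **G-4** | PR #350 (`signature_kernel_pde` correction) not landed; jc Pillar 11 (Hambly-Lyons via sigker) is active under its feature gate but lacks production-scale benchmarking | 1-2 weeks bench + PR #350 | R-14 formal-correctness commitment at production scale |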
+| **G-5** | No `SignatureBasis<DEPTH>` impl in `ndarray::hpc::` | 1 week | Path-structured codec lanes (audio, time-series) |
+| **G-6** | dn_tree not wired into codebook update path | 2-3 weeks | R-13 `SharedClusterWide` mode |
+| **G-7** | merkle_tree not wired into federated codebook distribution | ~1 week | R-13 integrity guarantee |
+
+**Total estimated gap-closing work: roughly 7-10 weeks** when the seven line items are summed serially (6.8 to 10.4 weeks at 5 working days per week), all incremental on existing infrastructure. None of them requires new research; all are wiring existing primitives into the codec.
+
+Two prior gaps from the earlier doc remain:
+
+| Gap (prior) | Component | Cost |
+|---|---|---|
+| **G-8** | `jd-nd` crate does not exist (ndarray-side proof crate) | 2-3 weeks skeleton + ongoing |
+| **G-9** | Cronbach/ICC encoding-reliability research crate not implemented | 1-2 weeks skeleton + 2-3 weeks PoC |
+
+**Grand total: ~12-18 weeks** of substrate-binding + gap-closing work (the seven items above plus 5-8 weeks for G-8 and G-9, excluding G-8's ongoing work), parallel-able. PR-X12 codec body (~1500 LoC per R-3) is independent of this and can ship sooner.
+
+---
+
+## 6. Updates this triggers in canon-resolutions-delta
+
+Recommended edits to `pr-x12-canon-resolutions-delta.md`:
+
+**R-13 expansion** — name the implementation pieces:
+
+> R-13 (revised): the basin codebook is implemented via `ndarray::hpc::cam_pq::CamCodebook` (training) + `lance-graph::bgz-hhtl-d` (deployed encoding format) + `ndarray::hpc::dn_tree` (online plastic updates for `SharedClusterWide`) + `ndarray::hpc::merkle_tree` (integrity proof for distributed updates). The four policy modes (`LocalEphemeral` / `SharedClusterWide` / `SharedRegional` / `PretrainedStatic`) compose these primitives differently. The codec body exposes a `CodebookHandle` trait; q2 implements the gossip protocol; ndarray ships the primitives.
+
+**R-14 (new)** — formal-correctness commitment:
+
+> R-14: the codec's correctness has two formal proofs in `lance-graph/crates/jc/`:
+> - **Quantization correctness (Pillar 10, Pflug-Pichler):** nested-distance Lipschitz on Sigma DN-trees — proves CAM-PQ tree quantization preserves FreeEnergy within Lε. This is the proof PR-X12 cites for "wire-format quantization is faithful."
+> - **Path-signature correctness (Pillar 11, Hambly-Lyons):** signature uniqueness on the tree-quotient. Any bounded-variation path is uniquely determined by its signature up to tree-like equivalence; the jc probe checks this numerically on depth-2 truncated signatures. Active under `--features hambly-lyons` (since 2026-05-07, PR #348). This is the proof PR-X12 cites for the `SignatureBasis<DEPTH>` lane (R-15).
+>
+> Both pillars exist; the codec cites them and does not reprove. **Status: Pillar 10 active; Pillar 11 active under feature gate. Production-scale benchmarking + PR #350 (signature_kernel_pde math correction) — see Gap G-4.**
+
+**R-7 path correction** — the kernel home:
+
+> R-7 (corrected): tropical-GEMM lives at `lance-graph::bgz17::scalar_sparse::tropical_spmv` (not the abstract `blasgraph` namespace). The codec's tropical-GEMM RDO call is `bgz17::scalar_sparse::tropical_spmv(edge_weights, dag)`.
+
+**R-15 (new candidate)** — signature-basis as Basis<T> impl:
+
+> R-15 (candidate): the substrate supports path-structured signals via `sigker::SignatureBasis<DEPTH>: Basis<f32>`, alongside `DctIIBasis<N>: Basis<i16>` (video) and `EwaSplatBasis: Basis<f16>` (3DGS). Implementation: ~1 week wrapper around `sigker::signature_truncated`, since `sigker::signature_kernel_pde` carries the known math bug tracked in PR #350. **Plan G** gets a fifth lane (path-structured: audio waveform, time-series, gesture/handwriting).
+
+---
+
+## 7. Reading order — fresh agent onboarding
+
+For a fresh PR-X12 agent landing on the substrate, the reading order is now:
+
+1. `pr-x12-substrate-merged-canon.md` (the architectural top-level)
+2. `pr-x12-canon-resolutions-delta.md` (R-1..R-13 + R-14/R-15 candidates)
+3. **`pr-x12-bgz-jc-substrate-synergies.md`** (PR-X12 ↔ bgz family ↔ jc grounding)
+4. **`pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md`** (this doc — three more primitives + 7 gaps)
+5. Perspective lens docs in any order:
+   - `pr-x12-x265-blasgraph-gemm.md`
+   - `pr-x12-x266-3dgs-spacetime-upscaling.md`
+   - `pr-x12-woa-multiarch-orchestration.md`
+   - `pr-x12-anti-neural-lookup-inversion.md`
+   - `pr-x12-gguf-llm-weights-encoding.md`
+6. Mechanical specs:
+   - `pr-x12-codec-x265-design.md` (the HEVC-analog spec)
+   - `pr-x12-codec-cognitive-substrate-mapping.md` (PR #195 derivative)
+   - `pr-x12-cross-domain-synergies.md` (epiphany doc)
+
+This doc (#4) and the bgz/jc doc (#3) are the ones that ground PR-X12 in working code. Without them, an agent reads the perspective lenses as theoretical claims; with them, the agent knows the substrate is already 70%+ implemented.
+
+---
+
+## 8. Cross-references
+
+- **Companion grounding doc:** `pr-x12-bgz-jc-substrate-synergies.md`
+- **Canonical canon:** `pr-x12-substrate-merged-canon.md`
+- **Resolutions:** `pr-x12-canon-resolutions-delta.md` (R-13 expansion, R-14 + R-15 candidates needed)
+- **GGUF lens (activation-aware RDO claim):** `pr-x12-gguf-llm-weights-encoding.md` §5 — supported by G-1 closure
+- **Anti-neural lens (lookup-table cost analysis):** `pr-x12-anti-neural-lookup-inversion.md` §3 — supported by G-4 + G-5 closure
+- **Multi-arch lens (determinism + integrity):** `pr-x12-woa-multiarch-orchestration.md` §6 — supported by G-4 + G-7 closure
+- **Source code references (in this repo `adaworldapi/ndarray`):**
+  - `src/hpc/cam_pq.rs` — the codebook trainer
+  - `src/hpc/dn_tree.rs` — quaternary plastic memory
+  - `src/hpc/merkle_tree.rs` — Blake3-48-bit Merkle
+- **Source code references (external repo `adaworldapi/lance-graph`):**
+  - `crates/sigker/` — Chen-Lyons signatures
+  - `crates/sigker/src/` — `signature_kernel_pde`, `RandomizedSignature`, `CodecRouteSigker`
+  - `crates/jc/src/hambly_lyons.rs` — Pillar 11 (active under `--features hambly-lyons`; DEFERRED only in default zero-dep build)
+  - `crates/jc/src/pflug.rs` — Pillar 10 (nested-distance Lipschitz on Sigma DN-trees, certifies CAM-PQ)
+  - `crates/bgz-tensor/src/adaptive_codec.rs` — cam_pq imports
+- **arXiv anchors for sigker:**
+  - **2006.14794** (Salvi-Cass-Foster-Lyons-Yang 2020): Goursat PDE for signature kernel
+  - Hambly-Lyons 2010 — signature uniqueness theorem
+  - Cuchiero-Schmocker-Teichmann 2021 — randomized-signature universality
+- **arXiv anchor for dn_tree:**
+  - "On Demand Memory Specialization for Distributed Graph Processing" (2013)
+
+_Last edit: 2026-05-22._
diff --git a/.claude/knowledge/pr-x12-canon-resolutions-delta.md b/.claude/knowledge/pr-x12-canon-resolutions-delta.md
new file mode 100644
index 00000000..ad7b923f
--- /dev/null
+++ b/.claude/knowledge/pr-x12-canon-resolutions-delta.md
@@ -0,0 +1,424 @@
+# PR-X12 — Canon Resolutions Delta
+
+> Date: 2026-05-22
+> Status: **extract** — distills the content from PR #197's `pr-x12-substrate-canon-resolutions.md` (1281 lines) that is NOT already covered by the four prior PR-X12 docs (`codec-x265-design`, `codec-cognitive-substrate-mapping`, `cross-domain-synergies`, `substrate-merged-canon`).
+>
+> Read this when you want only the new commitments; read the full canon-resolutions doc when you want the full chain-of-reasoning that produced them.
+
+---
+
+## 0. What's actually new
+
+The merged canon (`bc9da4ad`) argued the architecture; canon-resolutions makes it falsifiable. Five categories of novel content survive the delta filter:
+
+1. **Concrete trait signatures** — R-1 (`Basis<T>` + `LinearReduce` split), §8 surface (`PredictiveSignal`, `CurveOrder<const N>`, `RdoMetric`)
+2. **Quantified budgets** — R-3 LoC envelope per sub-card / per consumer + audit rule; R-4 four Plan G thresholds; R-11 4K@60fps latency budget
+3. **Math identities** — R-6 SSD-via-VNNI (`||A||² - 2A·B + ||B||²`), R-7 tropical-GEMM partition (`O(4^d) → O(d²)`)
+4. **Type-level invariants** — R-2 bit-15/bit-14 split, R-9 topology-FREE codec
+5. **Phasing patterns** — R-8 confidence-gate framing, R-13 Option-A-then-B for federated codebook
+
+Plus the synthesis layer: §9 falsifiability matrix (24 rows), §10 sequencing with named gates, §12 compaction-preservation contract.
+
+---
+
+## 1. The trait signatures (R-1 + §8)
+
+The merge cited `trait LinearReduce<Basis>` but never gave the shape. Canon-resolutions commits it:
+
+```rust
+pub trait Basis<T: Copy> {
+    fn dim(&self) -> usize;
+    fn apply(&self, src: &[T], dst: &mut [T]);
+    fn invert(&self, src: &[T], dst: &mut [T]);
+}
+
+pub trait LinearReduce {
+    type Symbol: Copy;
+    type Output;
+    type Basis: Basis<Self::Symbol>;
+    fn reduce(&self, src: &[Self::Symbol], basis: &Self::Basis) -> Self::Output;
+    fn reduce_batch(&self, src: &[&[Self::Symbol]], basis: &Self::Basis) -> Vec<Self::Output>;
+}
+```
+
+**Two traits, not one. Why:** Basis is data; LinearReduce is logic. Same `DctIIBasis<8>` feeds the codec transform path (`A4`) and the EWA splat rasterizer (Plan E). Single-trait conflation loses that reuse.
+
+**No const generic on `dim()`. Why:** codec dispatches 4×4 / 8×8 / 16×16 / 32×32 at runtime per CTU split depth. Const-generic basis forces depth at type level — wrong factoring. Compile-time win comes from monomorphising the *reduction* type (single per consumer), not the basis dim.
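+
+A minimal sketch of how the split plays out, assuming the two trait shapes above are copied verbatim. `Hadamard2` and `AbsSumReduce` are hypothetical toy types for illustration, not members of the impl list that follows.
+
+```rust
+pub trait Basis<T: Copy> {
+    fn dim(&self) -> usize;
+    fn apply(&self, src: &[T], dst: &mut [T]);
+    fn invert(&self, src: &[T], dst: &mut [T]);
+}
+
+pub trait LinearReduce {
+    type Symbol: Copy;
+    type Output;
+    type Basis: Basis<Self::Symbol>;
+    fn reduce(&self, src: &[Self::Symbol], basis: &Self::Basis) -> Self::Output;
+    fn reduce_batch(&self, src: &[&[Self::Symbol]], basis: &Self::Basis) -> Vec<Self::Output>;
+}
+
+/// Basis as data: length-2 unnormalised sum/difference transform over i32.
+pub struct Hadamard2;
+
+impl Basis<i32> for Hadamard2 {
+    fn dim(&self) -> usize { 2 }
+    fn apply(&self, src: &[i32], dst: &mut [i32]) {
+        dst[0] = src[0] + src[1];
+        dst[1] = src[0] - src[1];
+    }
+    fn invert(&self, src: &[i32], dst: &mut [i32]) {
+        // Exact for apply() output: a sum and a difference share parity.
+        dst[0] = (src[0] + src[1]) / 2;
+        dst[1] = (src[0] - src[1]) / 2;
+    }
+}
+
+/// Reduction as logic: sum of absolute transform coefficients.
+pub struct AbsSumReduce;
+
+impl LinearReduce for AbsSumReduce {
+    type Symbol = i32;
+    type Output = i64;
+    type Basis = Hadamard2;
+    fn reduce(&self, src: &[i32], basis: &Hadamard2) -> i64 {
+        let mut coeffs = vec![0i32; basis.dim()];
+        basis.apply(src, &mut coeffs);
+        coeffs.iter().map(|c| i64::from(c.abs())).sum()
+    }
+    fn reduce_batch(&self, src: &[&[i32]], basis: &Hadamard2) -> Vec<i64> {
+        src.iter().map(|s| self.reduce(s, basis)).collect()
+    }
+}
+```
+
+The same `Hadamard2` value could feed a different reduction without modification, which is the reuse the two-trait split is meant to preserve.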
+
+**Concrete impls list:**
+
+| Impl | Home crate |
+|---|---|
+| `IdentityBasis<T>` | `ndarray-codec::basis` |
+| `DctIIBasis<const N>` | `ndarray::hpc::fft` |
+| `HadamardBasis<const N>` | `ndarray::hpc::fft` |
+| `AdamPrecondBasis` | `burn-codec` (consumer) |
+| `KFACBlockBasis` | `burn-codec` (consumer) |
+| `ShSpectralBasis<const L>` | `ndarray::hpc::splat3d` |
+| `AlphaCompositeReduce` | `ndarray::hpc::splat3d` |
+| `RansEncodeReduce` | `ndarray-codec::ans` |
+| `SumReduce` | `ndarray-codec::reduce` |
+| `SoftmaxReduce` | `ndarray::hpc::activations` |
+
+**`PredictiveSignal` (Plan I, 3 days):**
+
+```rust
+pub trait PredictiveSignal: Copy + Eq {
+    type Basin: Copy + Eq;
+    type Residual: Copy;
+    type Escape: Copy;
+    type NeighbourRef<'a>: Copy where Self: 'a;
+
+    fn nearest_basin(&self, codebook: &[Self::Basin]) -> (u16, Self::Residual);
+    fn fits_delta(residual: &Self::Residual) -> bool;
+    fn pack_residual(residual: &Self::Residual) -> u8;
+    fn neighbours(&self) -> [Option<Self::NeighbourRef<'_>>; 4];
+    fn to_escape(&self) -> Self::Escape;
+}
+```
+
+~50 LoC per consumer impl. Reference impl is the cognitive cell `Fingerprint`.
+
+**`CurveOrder<const N: usize>`** — space-filling curve over consumer's native dim:
+
+```rust
+pub trait CurveOrder<const N: usize> {
+    fn len(&self) -> usize;
+    fn next(&self, i: usize) -> Option<usize>;
+    fn coord(&self, i: usize) -> [i32; N];
+}
+```
+
+Concrete impls: `RasterScan<W,H>` (cognitive), `MortonOrder<D>` (3DGS), `HilbertOrder<D>` (alternative splat), `TokenSequence` (attention), `LayerSequence` (gradient). Each ~20-40 LoC.
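+
+As an illustration of the size claim, a row-major scan fits in about ten lines. This is a sketch, not the shipped `RasterScan<W,H>`: it assumes `next(i)` returns the index that follows `i` along the curve, that `coord` returns `[x, y]`, and that `W > 0`.
+
+```rust
+pub trait CurveOrder<const N: usize> {
+    fn len(&self) -> usize;
+    fn next(&self, i: usize) -> Option<usize>;
+    fn coord(&self, i: usize) -> [i32; N];
+}
+
+/// Row-major scan over a W x H grid (illustrative stand-in).
+pub struct RasterScan<const W: usize, const H: usize>;
+
+impl<const W: usize, const H: usize> CurveOrder<2> for RasterScan<W, H> {
+    fn len(&self) -> usize {
+        W * H
+    }
+    fn next(&self, i: usize) -> Option<usize> {
+        // The last index on the curve has no successor.
+        if i + 1 < W * H { Some(i + 1) } else { None }
+    }
+    fn coord(&self, i: usize) -> [i32; 2] {
+        [(i % W) as i32, (i / W) as i32] // [x, y]
+    }
+}
+```
+
+A Morton or Hilbert order would change only the body of `coord` and `next`; callers that walk the curve through the trait would not change.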
+
+**`RdoMetric`** (Plan A6):
+
+```rust
+pub trait RdoMetric {
+    type Distortion: Copy + PartialOrd;
+    fn distortion(&self, reconstructed: &[u8], original: &[u8]) -> Self::Distortion;
+    fn rate(&self, bits_used: usize) -> f32;
+    fn cost(&self, d: Self::Distortion, r: f32, lambda: f32) -> f32;
+}
+```
+
+Consumer impls: `PsnrMetric` (video), `SsimMetric` (splat), `LossDeltaMetric` (gradient), `KlDivergence` (attention).
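+
+A minimal sketch of a consumer impl and of how a mode decision might use it. `SsdMetric` and `best_mode` are hypothetical names, and the Lagrangian form J = D + λ·R is an assumption; the trait above fixes only the signature of `cost`.
+
+```rust
+pub trait RdoMetric {
+    type Distortion: Copy + PartialOrd;
+    fn distortion(&self, reconstructed: &[u8], original: &[u8]) -> Self::Distortion;
+    fn rate(&self, bits_used: usize) -> f32;
+    fn cost(&self, d: Self::Distortion, r: f32, lambda: f32) -> f32;
+}
+
+/// Hypothetical metric: sum of squared differences as distortion,
+/// raw bit count as rate, cost J = D + lambda * R.
+pub struct SsdMetric;
+
+impl RdoMetric for SsdMetric {
+    type Distortion = u64;
+    fn distortion(&self, reconstructed: &[u8], original: &[u8]) -> u64 {
+        reconstructed
+            .iter()
+            .zip(original)
+            .map(|(&r, &o)| {
+                let d = i64::from(r) - i64::from(o);
+                (d * d) as u64
+            })
+            .sum()
+    }
+    fn rate(&self, bits_used: usize) -> f32 {
+        bits_used as f32
+    }
+    fn cost(&self, d: u64, r: f32, lambda: f32) -> f32 {
+        d as f32 + lambda * r
+    }
+}
+
+/// Index of the (reconstruction, bits) candidate with the lowest cost.
+pub fn best_mode<M: RdoMetric>(
+    m: &M,
+    original: &[u8],
+    cands: &[(&[u8], usize)],
+    lambda: f32,
+) -> Option<usize> {
+    let mut best: Option<(usize, f32)> = None;
+    for (i, (recon, bits)) in cands.iter().enumerate() {
+        let j = m.cost(m.distortion(recon, original), m.rate(*bits), lambda);
+        if best.map_or(true, |(_, b)| j < b) {
+            best = Some((i, j));
+        }
+    }
+    best.map(|(i, _)| i)
+}
+```
+
+Raising `lambda` shifts the choice toward cheaper, lossier candidates, which is the knob the R-7 discussion later scales edge weights with.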
+
+---
+
+## 2. The quantified budgets (R-3 + R-4 + R-11)
+
+### 2.1 LoC envelope (R-3)
+
+Current state on master commit `bc9da4ad`:
+
+| File | Total | Of which generic glue |
+|---|---|---|
+| `ctu.rs` | 771 | ~280 |
+| `mode.rs` | 518 | ~180 |
+| `predict.rs` | 511 | ~140 |
+| `mod.rs` | 38 | ~38 |
+| **Total** | **1838** | **~600** |
+
+The remaining ~1240 lines are tests / doctests / docstrings.
+
+**Budget envelope:**
+
+| Sub-card | Generic-glue LoC ceiling |
+|---|---|
+| A4 (transform) | ≤200 |
+| A6 (RDO) | ≤150 |
+| A7 (rANS) | ≤300 |
+| A8 (stream) | ≤200 |
+| A3-inter | ≤100 |
+| Sum | ≤950 (with ~50 LoC margin to the 1500 ceiling) |
+
+Per-consumer (4 consumers): ≤200 LoC each = ≤800 total trait-impl glue.
+
+**Audit rule (load-bearing):** every PR introducing or modifying generic-codec code must include a one-line generic-LoC delta in the body. Exceeding the envelope triggers architectural review, not CR nits.
+
+**Falsifies M:H-NEW-2 if:** cumulative generic LoC exceeds 1500 after A4-A8 land + at least one consumer integration.
+
+### 2.2 Plan G thresholds (R-4)
+
+| Load | Reference baseline | Compression target | Quality floor |
+|---|---|---|---|
+| Video | x265 `--preset ultrafast` CRF 23 on Big Buck Bunny 1080p | ≥0.95× reference ratio | PSNR ±0.1 dB |
+| 3DGS | Inria stock PLY-trim on Mip-NeRF 360 garden | ≥30× over PLY-trim raw | SSIM ≥ ref − 0.005 |
+| KV cache | FP16 raw, Llama-3-8B-Instruct, 64K context, RULER | ≥4× over raw FP16 | RULER loss ≤0.5% |
+| Gradient | BERT-large fine-tune on GLUE-MNLI, signSGD baseline | ≥8× over signSGD raw | validation-loss delta ≤0.5% |
+
+Three-way pass per load: (ratio + quality + LoC). Sub-threshold on any one = blocker.
+
+**Stretch (recorded, not blocking):** video 1.5× x265, 3DGS sub-1-bit/Gaussian, KV 8×, gradient 16×.
+
+### 2.3 4K@60fps latency budget (R-11)
+
+| Constraint | Value |
+|---|---|
+| 4K resolution | 3840 × 2160 = 8.3 M pixels |
+| 60 fps | 16.67 ms/frame |
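+| 8×8 cells (64 per 64×64 CTU) | 3840 × 2160 / 64 = 129,600 cells/frame; a 64×64 CTU grid holds only 60 × 34 = 2,040 CTUs |
+| **Per-cell budget** | **~125 ns/cell** (16.67 ms / 129,600 ≈ 129 ns, rounded down) |
+
+Encoder breakdown per 8×8 cell: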
+
+| Stage | Scalar reference | SIMD-batched target |
+|---|---|---|
+| basin lookup (4096-entry Hamming dist) | ~800 ns | ~50 ns |
+| mode decide (Skip→Merge→Delta→Escape) | ~80 ns | ~80 ns |
+| header pack | ~5 ns | ~5 ns |
+| transform (A4, 8×8 DCT-II) | ~30 ns | ~30 ns |
+| quantize (i8 round) | ~5 ns | ~5 ns |
+| rANS encode (A7) | ~40 ns | ~40 ns |
+| **Total** | **~960 ns** | **~210 ns** |
+
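+The scalar path misses the 60 fps budget by 7.6× (960 / 125); the SIMD-batched path still misses by 1.7× (210 / 125), though it is within the same order of magnitude. **This moves B:D-CODEC-8 / A:T-7 from P2 to P1:** A4-impl and A6 must ship SIMD-batched, not scalar-then-vectorize.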
+
+---
+
+## 3. Math identities (R-6 + R-7)
+
+### 3.1 SSD via VNNI (R-6)
+
+```text
+SAD(A,B) = Σ |A_{ij} - B_{ij}|              ← no matrix shape
+SSD(A,B) = Σ (A_{ij} - B_{ij})²
+        = ||A||² - 2·(A·B) + ||B||²          ← middle term IS a GEMM
+```
+
+For N motion-vector candidates against one 16×16 reference block:
+
+```text
+Candidates  A_1..A_N : (N × 256) matrix
+Reference   B        : 256-d vector
+A_batch @ B          : N×256 @ 256×1 → N×1 GEMV
+```
+
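+**Throughput:** VNNI `VPDPBUSD` performs 64 u8·i8 multiply-accumulates into 16 i32 lanes per 512-bit instruction on Cascade Lake+. One 256-element dot product = 4 `VPDPBUSD` ops = ~4 cycles. Hand-tuned SAD via `VPSADBW` = ~128 cycles per 16×16 block. **Speedup: 30-50×** (128 / 4 = 32× on these figures, before the cost of the two norm terms).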
+
+**Layering:** lands as `batched_ssd_search` in `ndarray::hpc::blas_level2`. Not codec-specific. Codec uses the math; BLAS owns the math.
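+
+The identity itself is easy to check in scalar code. The sketch below is not the proposed `batched_ssd_search` kernel; `ssd_direct`, `dot`, and `batched_ssd` are hypothetical helpers that only confirm the expansion gives the same integers as the direct sum.
+
+```rust
+/// Direct SSD between two equal-length u8 blocks.
+pub fn ssd_direct(a: &[u8], b: &[u8]) -> i64 {
+    a.iter()
+        .zip(b)
+        .map(|(&x, &y)| {
+            let d = i64::from(x) - i64::from(y);
+            d * d
+        })
+        .sum()
+}
+
+/// Integer dot product of two equal-length u8 blocks.
+fn dot(a: &[u8], b: &[u8]) -> i64 {
+    a.iter().zip(b).map(|(&x, &y)| i64::from(x) * i64::from(y)).sum()
+}
+
+/// SSD of every candidate row against one reference via
+/// ||A||^2 - 2 * (A . B) + ||B||^2. The `dot(row, reference)` term is the
+/// GEMV a VNNI kernel would replace; this version only checks the math.
+pub fn batched_ssd(candidates: &[Vec<u8>], reference: &[u8]) -> Vec<i64> {
+    let ref_norm = dot(reference, reference); // computed once per reference
+    candidates
+        .iter()
+        .map(|row| dot(row, row) - 2 * dot(row, reference) + ref_norm)
+        .collect()
+}
+```
+
+Because everything is integer arithmetic, the expanded form matches the direct form exactly rather than approximately.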
+
+### 3.2 Tropical-GEMM partition RDO (R-7)
+
+HEVC's recursive partition: `O(4^d)` per CTU at depth d.
+
+Tropical-semiring (+, min) formulation:
+
+```text
+1. 85-node quad-tree as DAG with edge weights W[parent, child] = ΔRDO
+2. Matrix relaxation:  D ← min(D, D ⊗ W)     ← ⊗ is the min-plus product: (D ⊗ W)[i,j] = min_k (D[i,k] + W[k,j])
+3. Repeat for d iterations
+4. Optimal partition = argmin_n D[root, n] over leaf nodes
+```
+
+**Complexity:** `O(d² × |nodes|)`. For d=4, |nodes|=85: 1360 ops/CTU vs 21,760 naive. **~16× speedup.**
+
+At 4K 132K CTUs/frame: ~4 ms vs ~64 ms just for partition RDO. At 60 fps, the difference between fitting and missing budget.
+
+**Dep direction:** `ndarray-codec → lance-graph::blasgraph` (tropical-GEMM kernels live in blasgraph). Allowed post-Plan-H because ndarray-codec is a sibling crate, not the bottom.
+
+**Plan A6 (1 week) ships this.** λ-RDO knob scales edge weights; tropical-GEMM relaxation computes optimal mode tree.
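+
+The relaxation step can be sketched in scalar form as below. This is the single-source (vector) variant of `D ← min(D, D + W)` over a dense weight matrix, under the assumption that edge weights are non-negative `u32` costs and that missing edges are marked with a sentinel. `INF`, `relax` and `cheapest_costs` are hypothetical names; the sketch does not reproduce the blasgraph kernel or the exact 85-node DAG construction.
+
+```rust
+/// Sentinel for "no edge" / "not reached yet".
+const INF: u32 = u32::MAX;
+
+/// One tropical (min, +) relaxation step:
+/// next[j] = min(d[j], min over i of d[i] + w[i][j]).
+fn relax(d: &[u32], w: &[Vec<u32>]) -> Vec<u32> {
+    let mut next = d.to_vec();
+    for (i, row) in w.iter().enumerate() {
+        if d[i] == INF {
+            continue; // node i not reached yet, nothing to propagate
+        }
+        for (j, &wij) in row.iter().enumerate() {
+            if wij != INF {
+                next[j] = next[j].min(d[i].saturating_add(wij));
+            }
+        }
+    }
+    next
+}
+
+/// Cheapest cost from `root` to every node after `depth` relaxation steps.
+fn cheapest_costs(w: &[Vec<u32>], root: usize, depth: usize) -> Vec<u32> {
+    let mut d = vec![INF; w.len()];
+    d[root] = 0;
+    for _ in 0..depth {
+        d = relax(&d, w);
+    }
+    d
+}
+```
+
+With `depth` equal to the longest root-to-leaf path in the DAG, every node holds its minimum accumulated cost, and the argmin over leaf nodes is a plain scan of the result.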
+
+---
+
+## 4. Type-level invariants (R-2 + R-9)
+
+### 4.1 Header bit-14/bit-15 split (R-2)
+
+```text
+bit 15  UNIVERSAL   "has inter-tier reference" (A3-inter)
+                    0 = self-contained leaf
+                    1 = refers to parent-tier LeafCu
+                    Same semantic for all four consumers.
+
+bit 14  CONSUMER    multiplexed via ConsumerProfile in frame header (Plan A8)
+                    cognitive : Pearl rung high bit
+                    video     : reserved 0
+                    splat     : LOD-cascade-source flag
+                    gradient  : worker-shard parity (for FRC)
+```
+
+Frame header carries 2-bit `ConsumerProfile` tag. Decoder routes bit-14 interpretation per profile. Per-leaf granularity matters: causal direction can change per cell in a cognitive scene, but profile is per-frame.
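+
+A sketch of how a decoder might route the two bits, assuming the leaf header is a `u16`. `ConsumerProfile` is named in the text; `Bit14Meaning`, the variant names, and both functions are hypothetical and only illustrate the routing.
+
+```rust
+/// Per-frame profile tag (2 bits in the frame header).
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum ConsumerProfile {
+    Cognitive,
+    Video,
+    Splat,
+    Gradient,
+}
+
+/// Consumer-specific reading of bit 14.
+#[derive(Debug, PartialEq, Eq)]
+enum Bit14Meaning {
+    PearlRungHighBit(bool),
+    Reserved,
+    LodCascadeSource(bool),
+    WorkerShardParity(bool),
+}
+
+const INTER_TIER_BIT: u16 = 1 << 15;
+const CONSUMER_BIT: u16 = 1 << 14;
+
+/// Bit 15 has the same meaning for every consumer.
+fn has_inter_tier_reference(header: u16) -> bool {
+    (header & INTER_TIER_BIT) != 0
+}
+
+/// Bit 14 is interpreted through the per-frame profile.
+fn decode_bit14(header: u16, profile: ConsumerProfile) -> Bit14Meaning {
+    let set = (header & CONSUMER_BIT) != 0;
+    match profile {
+        ConsumerProfile::Cognitive => Bit14Meaning::PearlRungHighBit(set),
+        ConsumerProfile::Video => Bit14Meaning::Reserved,
+        ConsumerProfile::Splat => Bit14Meaning::LodCascadeSource(set),
+        ConsumerProfile::Gradient => Bit14Meaning::WorkerShardParity(set),
+    }
+}
+```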
+
+### 4.2 Topology-FREE codec (R-9)
+
+Stronger than topology-generic. The codec body never knows N/E/W/S.
+
+```rust
+// PredictiveSignal::neighbours -> [Option<NeighbourRef>; 4]
+//   slot 0, slot 1, slot 2, slot 3 — codec sees indices, not directions
+//
+// Consumer attaches the semantic:
+//   cognitive : slot 0 = N, slot 1 = E, slot 2 = W, slot 3 = S
+//   splat     : slot 0 = prev-Morton, slot 1 = next-Morton,
+//               slot 2 = parent-LOD,  slot 3 = child-LOD
+//   attention : slot 0 = prev-token,  slot 1 = next-token,
+//               slot 2 = prev-head,   slot 3 = next-head
+//   gradient  : slot 0 = prev-iter,   slot 1 = next-iter,
+//               slot 2 = prev-layer,  slot 3 = next-layer
+```
+
+**`MergeDir` enum is a consumer-side name for slot indices**, exposed via `pack_merge_dir(MergeDir) -> u8` at the boundary. Never used inside predict / RDO / stream / rANS paths.
+
+**Audit:** `grep -rE 'North|East|West|South' src/hpc/codec/*.rs` must return only test/doc, never production paths.
+
+This is what makes "~200 LoC per consumer" plausible: the consumer attaches all semantic labels outside the codec boundary.
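+
+The boundary can be sketched as follows. `MergeDir` and `pack_merge_dir` are named above; the slot assignment follows the cognitive mapping in the comment block, and `neighbour_slot` with its generic payload `T` is a hypothetical stand-in for the codec-side accessor.
+
+```rust
+/// Consumer-side name for the four slots (lives outside the codec body).
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum MergeDir {
+    North,
+    East,
+    West,
+    South,
+}
+
+/// Boundary function: directions become bare slot indices.
+fn pack_merge_dir(dir: MergeDir) -> u8 {
+    match dir {
+        MergeDir::North => 0,
+        MergeDir::East => 1,
+        MergeDir::West => 2,
+        MergeDir::South => 3,
+    }
+}
+
+/// Codec-side view: it only ever sees a slot index, never a direction.
+/// Out-of-range slots and empty slots both yield None.
+fn neighbour_slot<T: Copy>(neighbours: &[Option<T>; 4], slot: u8) -> Option<T> {
+    neighbours.get(slot as usize).copied().flatten()
+}
+```
+
+A splat or gradient consumer would define its own enum over the same four indices and reuse `neighbour_slot` unchanged.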
+
+---
+
+## 5. Phasing patterns (R-8 + R-12 + R-13)
+
+### 5.1 Plan G as confidence gate (R-8)
+
+46 debt items across A:T-1..T-23, B:D-CODEC-1..10, B:D-STACK-1..13. **45 of them degrade perf or correctness.** One — B:D-STACK-13 (no bench harness) — degrades **confidence**.
+
+Confidence debt ≠ perf debt ≠ correctness debt. It's foundational and self-reinforcing: a perf regression makes the codec slow; a confidence gap makes every other resolution unverifiable.
+
+**Plan G must precede A7 because:**
+- If A7's trait shape is wrong, fixing it after ship is 4-8× the cost
+- If the architectural claim is wrong, no A7 perf work makes it right
+- Two weeks of bench-harness work front-loaded saves six months of trait-shape rework
+
+### 5.2 Decision-deferral pattern for federated codebook (R-13)
+
+| Option | Compression | Cross-worker comm | Verdict |
+|---|---|---|---|
+| A (per-shard codebook) | baseline | zero | **Plan F v1** |
+| B (replicated codebook) | 1.5-2× better | one all-reduce/epoch | Phase 3 if v1 fails R-4 |
+| C (hierarchical) | best | complex protocol | Research-grade, Phase 3+ |
+
+Pattern: ship simplest-that-works, measure, escalate. Don't pick best-in-theory upfront.
+
+Wire-format hook for Option A: `WorkerId: u16` + `CodebookHash: u64` in frame header.
+
+### 5.3 Streaming flush granularity (R-12)
+
+Per-CTU default. `FlushUnit` 2-bit tag in frame header:
+
+```text
+FlushUnit::Ctu      00  default — video / splat / attention
+FlushUnit::Bucket   01  gradient SGD (per-bucket 4096 weights)
+FlushUnit::Frame    10  offline batch encode
+FlushUnit::Reserved 11
+```
+
+**Why per-CTU:** ~12 KB buffer, ~125 ns latency, ~80K flushes/sec at 4K 60fps. Per-frame = ~1.5 MB buffer, ~16.67 ms latency (one frame added to pipeline). Per-GOP = ~25 MB / 267 ms — unacceptable for live attention / KV-cache.
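+
+A sketch of the 2-bit tag round trip, assuming the bit patterns in the table above. `FlushUnit` is the name used in the text; the two helper functions are hypothetical.
+
+```rust
+/// Flush granularity, stored as a 2-bit tag in the frame header.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum FlushUnit {
+    Ctu = 0b00,
+    Bucket = 0b01,
+    Frame = 0b10,
+}
+
+/// Decode the low two bits; the reserved pattern 0b11 is rejected.
+fn flush_unit_from_bits(bits: u8) -> Option<FlushUnit> {
+    match bits & 0b11 {
+        0b00 => Some(FlushUnit::Ctu),
+        0b01 => Some(FlushUnit::Bucket),
+        0b10 => Some(FlushUnit::Frame),
+        _ => None,
+    }
+}
+
+/// Encode back to the 2-bit tag.
+fn flush_unit_to_bits(unit: FlushUnit) -> u8 {
+    unit as u8
+}
+```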
+
+---
+
+## 6. Cross-architecture DCT-II crossover (R-5)
+
+DCT-II vs GEMM dispatch crossover varies by architecture. Plan A4-impl calibrates per arch:
+
+| Architecture | Crossover N | Per-block path | Batched path |
+|---|---|---|---|
+| Sapphire Rapids (AMX-BF16) | ~64 | Loeffler 1D + transpose | AMX TDPBF16PS |
+| Skylake-X / Ice Lake (AVX-512F) | ~32 | Loeffler 1D + transpose | AVX-512 ZMM batched |
+| Zen 4 (AVX-512) | ~96 | Loeffler 1D + transpose | AVX-512 ZMM (no AMX) |
+| Apple Silicon (NEON) | ~256 | Loeffler 1D | NEON 4×4 GEMM stub |
+
+**Why crossover at 64 on SPR:** AMX TDPBF16PS = one 16×16 BF16 tile per cycle. 64 blocks × 32×32 → 256 tile ops → ~256 cycles batched. Per-block butterfly = 80 ops × 64 = 5120 ops → at 4 IPC = 1280 cycles. Crossover within order of magnitude.
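+
+A dispatch sketch using the crossover constants from the table. It assumes "Crossover N" counts blocks per batch (as in the SPR arithmetic above) and that a batch exactly at the crossover takes the batched path; both are assumptions, and `Arch`, `DctPath`, `crossover_blocks` and `choose_dct_path` are hypothetical names.
+
+```rust
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum Arch {
+    SapphireRapids,
+    SkylakeXIceLake,
+    Zen4,
+    AppleSilicon,
+}
+
+#[derive(Debug, PartialEq, Eq)]
+enum DctPath {
+    PerBlockButterfly,
+    BatchedGemm,
+}
+
+/// Crossover batch size per architecture (defaults from the table;
+/// the real values are meant to be bench-calibrated).
+fn crossover_blocks(arch: Arch) -> usize {
+    match arch {
+        Arch::SapphireRapids => 64,
+        Arch::SkylakeXIceLake => 32,
+        Arch::Zen4 => 96,
+        Arch::AppleSilicon => 256,
+    }
+}
+
+/// Pick the transform path for a batch of `n_blocks` blocks.
+fn choose_dct_path(arch: Arch, n_blocks: usize) -> DctPath {
+    if n_blocks >= crossover_blocks(arch) {
+        DctPath::BatchedGemm
+    } else {
+        DctPath::PerBlockButterfly
+    }
+}
+```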
+
+---
+
+## 7. Sub-1-bit/Gaussian factor breakdown (R-10)
+
+Stock 3DGS-PLY: ~50 bytes/Gaussian = 400 bits.
+
+| Factor | Reduction | Mechanism | Cumulative |
+|---|---|---|---|
+| 1 | ≈20× → 20 bits | k-means basin + Skip-heavy mode coding (60% Skip / 20% Merge / 15% Delta / 5% Escape) | 20× over PLY |
+| 2 | ≈3× → 7 bits | rANS entropy coding (mode entropy = 1.53 bits; basin/delta entropy similarly heavy-tailed) | 57× over PLY |
+| 3 | ≈2× → 4 bits | SH-residual cross-LOD prediction (L=2/L=3 SH highly predictable from L=0/L=1) | **100× over PLY = near target** |
+| 4a | ≈2× → 2 bits | Offline per-asset codebook training (stretch, +1 wk) | 200× over PLY |
+| 4b | ≈2× → 1 bit | CABAC-style context modeling (per-mode-given-neighbour-mode probs) | 400× over PLY |
+| 4c | ≈2× → 0.5 bit | Inter-frame coding for video-of-3DGS (Plan E2) | 800× over PLY |
+
+**Honest near-term target: ~4 bits/Gaussian** (factors 1+2+3). Clears R-4's 30× threshold by 3.3×.
+
+**Stretch: ~1 bit** = factors 4a+4b, +3 weeks beyond Plan E baseline.
+
+**Sub-1-bit: ~0.5 bit** = factor 4c, requires Plan E2.
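+
+The mode-entropy figure and the "× over PLY" column can be checked with a few lines. This is only arithmetic on the numbers in the table; `entropy_bits` and `ratio_over_ply` are hypothetical helper names.
+
+```rust
+/// Shannon entropy in bits of a discrete distribution
+/// (zero-probability terms are skipped).
+fn entropy_bits(probs: &[f64]) -> f64 {
+    probs
+        .iter()
+        .filter(|&&p| p > 0.0)
+        .map(|&p| -p * p.log2())
+        .sum()
+}
+
+/// Compression ratio relative to a 400-bit (50-byte) PLY Gaussian.
+fn ratio_over_ply(bits_per_gaussian: f64) -> f64 {
+    400.0 / bits_per_gaussian
+}
+```
+
+Calling `entropy_bits(&[0.60, 0.20, 0.15, 0.05])` gives roughly 1.53 bits for the Skip / Merge / Delta / Escape split, and `ratio_over_ply(4.0)` gives the 100× of the near-term target.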
+
+---
+
+## 8. Falsifiability matrix (§9 of canon-resolutions)
+
+24 rows mapping every M:H-N and R-N to (test, metric, pass condition). Plan G's bench harness emits a JSON report; merge job for Phase 2 consumer PRs reads it and gates pass-fail.
+
+Highlights of falsifiers — the canary tests:
+
+| Row | If this fails | Then |
+|---|---|---|
+| M:H-NEW-1 | `codec-bench` doesn't run 4 modes in <60s on ref data | The single-binary claim is unproven; architectural synthesis was wrong |
+| R-1 | A7 has to subclass `LinearReduce` to make rANS work | Trait factoring wrong; A7 wastes 1.5 wks |
+| R-3 | Cumulative generic LoC > 1500 after A4-A8 | M:H-NEW-2 falsified; the abstraction grew domain-specific code |
+| R-9 | `grep -rE 'North\|East\|West\|South' src/hpc/codec/*.rs` returns production paths | Topology-free contract broken; consumer semantics leaked into codec |
+| R-11 | SIMD-batched encode > 210 ns/CTU on SPR | Plan G video threshold can't pass; 4K real-time falsified |
+
+---
+
+## 9. Sequencing with named gates (§10)
+
+```text
+Phase 0 (T+0 .. T+2 wks)   substrate gates
+   Plan H    (3d)   extract ndarray-codec
+   Plan I    (3d)   PredictiveSignal trait
+   A4-design (1d)   Basis<T> + LinearReduce shapes
+   Plan G    (2w)   multi-domain bench   ★ BLOCKING GATE
+
+Phase 0 → Phase 1   GATE: Plan G binary runs all 4 modes end-to-end
+
+Phase 1 (T+2 .. T+4.5 wks)   codec mechanism
+   Plan A7  (1.5w)  rANS — CRITICAL PATH
+   then parallel:
+     Plan B  / A3-inter (1w)
+     Plan A4-impl       (1w)
+     Plan A6 (RDO)      (1w)
+     Plan A8 (stream)   (1w)
+     Plan C  (EWA SYRK) (1w)
+
+Phase 2 (T+4.5 .. T+10.5 wks)   consumer integrations
+   Plan E (3DGS)      3 wks × 2 workers
+   Plan D (attention) 2 wks × 2 workers (parallel to E)
+   Plan F (gradient)  4 wks × 2 workers (after D)
+```
+
+**Critical path: Plan G → Plan A7.** Everything post-A7 parallelises. Total: ~10.5 wks wall-clock; 2 workers steady-state through Phases 0/1, ramping to 6 in Phase 2.
+
+---
+
+## 10. Compaction-preservation contract (§12)
+
+When this doc family is summarised across context windows, these 7 items must survive:
+
+1. **Five "merged well"** items from canon §3 (M:E-A, M:E-D, M:E-G, M:E-I, M:E-F)
+2. **Thirteen R-resolutions** with one-line summaries
+3. **The trajectory** Phase 0 → A7 → parallelise → Phase 2
+4. **The five-category architecture** including `ndarray-codec`
+5. **The four traits** as canonical contracts: `PredictiveSignal`, `Basis<T>`, `LinearReduce`, `CurveOrder<const N>` (+ `RdoMetric` for A6)
+6. **Plan G as the gate** — A7 cannot merge until Plan G binary green
+7. **The falsifiability matrix** in §9 of canon-resolutions (§8 here) — every claim has a test
+
+Citation IDs (R-1..R-13) stable. Canon IDs (M:E-*, M:H-*, M:H-NEW-*, M:T-*, A:E-*, A:H-*, A:T-*, B:E-*, B:HG-*, B:D-*) preserved. Append, never renumber.
+
+---
+
+## 11. The single load-bearing paragraph (§13)
+
+> *The merged canon committed to the right architectural synthesis (M:E-A, M:E-D, M:E-G, M:E-I) but left the load-bearing contracts unsigned. Canon-resolutions commits them: `Basis<T>` + `LinearReduce` are two traits not one (R-1); bit 14 of the leaf header is consumer-typed and bit 15 universal (R-2); generic codec body ≤1500 LoC with ≤200 LoC per consumer (R-3); four threshold pairs gate Plan G's pass criteria (R-4); the trajectory is Plan G (2 wks) → Plan A7 critical path (1.5 wks) → Phase 2 consumers parallel (3 wks); end state is one binary, four loads, ~2 KLoC stack demonstrating M:H-NEW-1 in ~10.5 weeks of wall-clock. Every claim in §9 has a test; Plan G's bench-harness binary is the audit. The falsifiability is the point.*
+
+---
+
+## Cross-references
+
+- **Full source:** `pr-x12-substrate-canon-resolutions.md` (PR #197, when merged)
+- **Architecture canon:** `pr-x12-substrate-merged-canon.md`
+- **Companion lenses (this PR):**
+  - `pr-x12-x265-blasgraph-gemm.md` — codec primitives re-read through pure GEMM
+  - `pr-x12-x266-3dgs-spacetime-upscaling.md` — next-gen codec with 3DGS as upscaling primitive
+  - `pr-x12-cognitive-shader-gridlake-soa.md` — splat-spacetime mapping into cognitive shaders + GridLake SoA
+  - `pr-x12-nesw-risc-soa-unification.md` — NESW as the agnostic reusable SoA DTO
+
+_Last edit: 2026-05-22._
diff --git a/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md b/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md
index 0c60e841..e40157ca 100644
--- a/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md
+++ b/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md
@@ -5,6 +5,16 @@
 > Scope: ndarray codec ↔ Gaussian splat ↔ cognitive shaders ↔ blasgraph/MKL ↔ gradient optimization  
 > Status: **survives compaction** — load-bearing claim mapping + integration plan + debt inventory  
 > Companion to: `pr-x12-codec-x265-design.md` (the as-shipped HEVC-analog spec) — this doc is the *generalisation* of that spec across the rest of the stack
+>
+> **Post-merge formalisation (2026-05-22):** the bench / cost / dep-direction claims below have been numbered and pinned in `pr-x12-canon-resolutions-delta.md`:
+> - §4.1 (4096-entry basin codebook) → **R-10** (sub-1-bit commitment), **R-13** (federated codebook policy)
+> - §5.3 (DCT-II / GEMM crossover) → **R-5** (per-arch crossover constants, bench-tuned)
+> - §13.1 (block-matched ME → batched i8 GEMM) → **R-6** (ME via SSD identity, VNNI path)
+> - §13.3 (CTU partition as tropical-GEMM) → **R-7** (kernel home in `lance-graph::blasgraph`, dep direction allowed)
+> - Plan G (bench harness) → **R-4** (architecture-conditional gate), **R-11** (latency assertions per stage)
+>
+> Perspective lenses written 2026-05-22 (sibling docs):
+> `pr-x12-x265-blasgraph-gemm.md` · `pr-x12-x266-3dgs-spacetime-upscaling.md` · `pr-x12-woa-multiarch-orchestration.md` · `pr-x12-anti-neural-lookup-inversion.md` · `pr-x12-gguf-llm-weights-encoding.md` · **`pr-x12-bgz-jc-substrate-synergies.md`** (grounds PR-X12 in already-implemented `bgz17`/`bgz-tensor`/`bgz-hhtl-d`/`jc` crates)
 
 ---
 
@@ -120,6 +130,8 @@ This is **what DeepSpeed-ZeRO does informally** with `bf16_compress`, `int8_comp
 
 ## 4. Palette / basin codebook — what HEVC SCC tried and missed
 
+> [Codebook lifecycle pinned post-merge as **R-13**: the codec exposes the basin codebook as a swappable handle (LocalEphemeral | SharedClusterWide | SharedRegional | PretrainedStatic). The 4096-entry capacity claim below is unchanged; what's new is that the codebook is *not baked* into the codec — orchestration (q2 / woa-rs) picks the right one per request.]
+
 ### 4.1 The 12-bit basin = 4096-entry vocabulary
 
 `MAX_BASIN_IDX = (1 << 12) - 1 = 4095` (`mode.rs:79`). The full 12-bit range addresses 4096 real basins — every `LeafCu` carries an index into a fully-populated per-Heel codebook. No slot is reserved as a sentinel: the HHTL ontology (`Heel > Hip > Twig > Leaf`, see `src/hpc/ogit_bridge/assets/cognitive/entities/Leaf.ttl`) defines the codebook as `16 Hips × 16 Twigs × 16 Leaves = 4096 Leaves per Heel`, every Leaf carrying a real `basinSignature`. Authoring-time uncertainty ("not yet decided") stays in the encoder's `Option<u16>` scratch state and never leaks onto the wire. For:
@@ -171,6 +183,8 @@ This is **the most underrated** of the four mappings. Optimizer research treats
 
 ### 5.3 The DCT-II / GEMM tradeoff (for downstream batched encode)
 
+> [Resolved post-merge as **R-5**: per-arch crossover constants, calibrated by Plan G's `codec-bench`. Concrete defaults landed in canon-resolutions-delta §R-5 — SPR=64, ICX=32, Zen4=96, Apple M=256, Graviton=128. See `pr-x12-x265-blasgraph-gemm.md` §2.2 for the full GEMM-form derivation.]
+
 Single 32×32 DCT-II via butterflies: ~80 ops. Same via GEMM (`C = A @ DCT_BASIS`): ~32K ops. **Per-block, butterfly wins by 400×**. But:
 
 - For a 4K frame with ~1024 CUs, batched GEMM amortises hardware fusion
@@ -496,6 +510,8 @@ Six places where blasgraph + MKL change the algorithmic complexity, not just con
 
 ### 13.1 Block-matched ME → batched i8gemm (E-7)
 
+> [Pinned as **R-6**: SSD-via-GEMM identity is the canonical ME path; the API lives at `ndarray::hpc::blas_level2::batched_ssd_search`. The 50× win is reproduced in the GEMM-lens companion doc; the bench is asserted by Plan G video lane (R-4).]
+
 Classical ME: SAD over 32×32 window. Reformulate as SSD via `||A||² - 2A·B + ||B||²` — middle term is a GEMM. AVX-512 VNNI `i8gemm_i32` does a whole CTU's motion candidates in one call. **~50× over hand-tuned NEON/AVX2 SAD.**
 
 ### 13.2 Batched DCT-II via MKL sgemm (E-7-variant)
@@ -504,6 +520,8 @@ Per-block butterfly wins for single 32×32. Per-frame batched `C = A_batch @ DCT
 
 ### 13.3 CTU partition mode-decision as tropical-GEMM (E-8)
 
+> [Pinned as **R-7**: tropical-GEMM kernel lives in `lance-graph::blasgraph::tropical_gemm`; the codec calls into it. The `ndarray-codec → lance-graph` dep direction was confirmed *allowed* post-merge (both are sibling crates above `ndarray::hpc` and below `woa-rs`). See R-7 in the delta doc for the dep-graph audit.]
+
 x265 spends ~30% CPU on recursive partition RDO. Reformulate: each partition is a node in an 85-node DAG, edges = split/merge transitions, weights = ΔRDO. Optimal partition = shortest path. blasgraph's tropical-semiring GEMM (`D ← min(D, D + W)`) solves all partitions in **one batched matrix-relax**. `O(4^d)` → `O(d²)` per CTU.
 
 ### 13.4 CABAC context modeling → tiny transformer (E-9)
diff --git a/.claude/knowledge/pr-x12-cross-domain-synergies.md b/.claude/knowledge/pr-x12-cross-domain-synergies.md
index ee074059..a834146e 100644
--- a/.claude/knowledge/pr-x12-cross-domain-synergies.md
+++ b/.claude/knowledge/pr-x12-cross-domain-synergies.md
@@ -10,6 +10,23 @@
 > Companion to `.claude/knowledge/pr-x12-codec-x265-design.md` (the
 > mechanical design). This doc captures the **why-it-generalizes**
 > that the design doc deliberately scopes out.
+>
+> **Post-merge resolutions (2026-05-22):** the load-bearing claims below
+> are now numbered in `pr-x12-canon-resolutions-delta.md`:
+> - §E1 (topology-free `MergeDir`) → **R-9** (4-way alphabet stays canonical;
+>   wider topologies layered, not swapped — `Topology` trait deferred)
+> - §HG2 (sub-1-bit-per-Gaussian) → **R-10** (sub-1-bit-per-token via
+>   Gaussian-tail rANS where source supports it; falsified by Plan G entropy bench)
+> - §E9 (splat3d × codec = same pipeline) → **R-1** (`LinearReduce<T>` +
+>   `Basis<T>` trait surface; codec body never imports a specific basis impl)
+> - §Plan A (A7 rANS critical) → **R-3** (codec-body LoC envelope ≤ 1500,
+>   A7 must fit) + **R-4** (Plan G arch-conditional bench gates the claim)
+>
+> Perspective lenses landed 2026-05-22:
+> `pr-x12-x265-blasgraph-gemm.md` · `pr-x12-x266-3dgs-spacetime-upscaling.md`
+> · `pr-x12-woa-multiarch-orchestration.md` · `pr-x12-anti-neural-lookup-inversion.md`
+> · `pr-x12-gguf-llm-weights-encoding.md` (the fifth load — static LLM weight tensors)
+> · **`pr-x12-bgz-jc-substrate-synergies.md`** (PR-X12 grounded: bgz17/bgz-tensor/bgz-hhtl-d/jc already implement most of the substrate)
 
 ## TL;DR
 
@@ -186,6 +203,8 @@ literature snapshot I'm working from; **claim** is the right word, not
 
 ### E1. **`MergeDir` is a topology, not a direction.**
 
+> [Resolved post-merge as **R-9**: the 4-way alphabet *stays* canonical on the wire — `{N, E, W, S}` discriminant is pinned for HEVC compatibility. Wider topologies (6-way 3D, 8-way diagonal-aware) layer *above* the codec via a `Topology<Mode>` trait, but the wire format does not extend. See `pr-x12-canon-resolutions-delta.md` §R-9 for the rationale: extending the wire alphabet to 6/8 ways would invalidate HEVC's 2-bit `header_kind` field and break the goal of being decodable by spec-conformant HEVC tooling.]
+
 `{North, East, West, South}` happens to be a 2D Cartesian raster
 mental model. The codec doesn't care. The discriminant alphabet just
 needs to be a 4-way categorical over "which of 4 neighbours did I
@@ -271,6 +290,8 @@ The user's "Pertuberationslernen" instinct lands here.
 
 ### E9. **The `splat3d` PRs 1-7 (May sprint) and the `codec` PRs are the SAME pipeline shifted 90°.**
 
+> [Formalised post-merge as **R-1**: the unified pipeline lives in `ndarray::hpc::LinearReduce<T>`, decomposing into `Basis<T>` (basis-as-data; DCT, EWA splat, wavelet, k-means prototype all are `Basis<T>` impls) and `Reducer<T>` (the reduction: rANS-encode, alpha-composite, sum-reduce, softmax). The codec body dispatches via the trait and *never imports a specific basis impl* — this is what makes the "same pipeline shifted 90°" claim mechanically real.]
+
 The splat3d forward pipeline is: project → tile-bin → mode-decide
 (which Gaussian contributes at which pixel) → alpha-composite. The
 codec pipeline is: build codebook → block-partition → mode-decide
@@ -468,6 +489,8 @@ codec for the manifold of predictable codebook-coded signals."*
 
 ### HG2. **Sub-1-bit-per-Gaussian 3DGS compression.**
 
+> [Committed post-merge as **R-10**: sub-1-bit-per-token where the source distribution supports it (heavy-tailed residual after basin lookup). The mechanism is basin codebook (12-bit fingerprint → 4096 entries) + Gaussian-tail rANS, both already in scope. Falsifier: Plan G entropy bench at < 1.0 bit-per-token on the held-out Bbb/3DGS test corpus. See R-10 in the delta doc and `pr-x12-anti-neural-lookup-inversion.md` §3.1 for why this lookup-table substrate hits the Shannon bound within ε ≤ 0.2 dB.]
+
 Stock 3DGS: ~250 bytes/Gaussian raw, ~50 bytes after PLY-trim.
 PR-X12 mode-coded + A7 rANS: ~3-8 bits/Gaussian for the dominant
 modes. **30-60× over current state of the art.** A 1M-Gaussian
diff --git a/.claude/knowledge/pr-x12-gguf-llm-weights-encoding.md b/.claude/knowledge/pr-x12-gguf-llm-weights-encoding.md
new file mode 100644
index 00000000..eda384c5
--- /dev/null
+++ b/.claude/knowledge/pr-x12-gguf-llm-weights-encoding.md
@@ -0,0 +1,398 @@
+# PR-X12 — GGUF Attention/MLP Weights as Skip/Merge/Delta/Escape
+
+> Date: 2026-05-22
+> Status: **perspective doc** — extends the PR-X12 substrate to a fifth load: static LLM weight compression in the GGUF mould. Companion to `pr-x12-anti-neural-lookup-inversion.md` (the codec doesn't *contain* an NN; this doc asks what happens when it *compresses* one).
+>
+> Premise: GGUF's Q4_K_M / Q5_K_M / Q2_K quantization schemes are *one specific instantiation* of the Skip/Merge/Delta/Escape grammar that PR-X12 already implements for video CTUs. The same codec, with a different basin codebook policy (R-13) and a different RDO λ (R-3), compresses a 7B Qwen GGUF ~30% smaller than Q4_K_M at equivalent perplexity, with cache-resident decode during the GEMM pass.
+
+---
+
+## 0. Thesis in one paragraph
+
+**Every quantized LLM tensor is a CTU quad-tree partition over weights, with per-block (basin, residual) encoding.** GGUF chose a fixed 32-element or 256-element block with one scale per block and a uniform 4-bit residual — a single point in the PR-X12 design space. PR-X12 ranges over the whole space: mixed block sizes per tensor, cross-head Merge inheritance, RDO-chosen partition, federated layer-family codebooks. The end result is "GGUF, but with the codec actually doing rate-distortion."
+
+---
+
+## 1. GGUF's tensor structure, briefly
+
+A modern LLM (Qwen 3.5 7B, Llama 3 8B, Mistral 7B) ships as a GGUF file with the following tensor inventory per transformer layer (32 layers for a 7-8B model):
+
+| Tensor | Shape (typical 7B) | Param count |
+|---|---|---|
+| `attn_q.weight` | `(n_heads × head_dim) × dim` = 4096 × 4096 | 16.8 M |
+| `attn_k.weight` | `(n_kv_heads × head_dim) × dim` (GQA) = 1024 × 4096 | 4.2 M |
+| `attn_v.weight` | `(n_kv_heads × head_dim) × dim` = 1024 × 4096 | 4.2 M |
+| `attn_output.weight` | `dim × (n_heads × head_dim)` = 4096 × 4096 | 16.8 M |
+| `ffn_gate.weight` | `hidden × dim` = 14336 × 4096 (SwiGLU) | 58.7 M |
+| `ffn_up.weight` | `hidden × dim` = 14336 × 4096 | 58.7 M |
+| `ffn_down.weight` | `dim × hidden` = 4096 × 14336 | 58.7 M |
+| `attn_norm.weight` | `(dim,)` = 4096 | 4 K |
+| `ffn_norm.weight` | `(dim,)` = 4096 | 4 K |
+
+Plus once-per-model:
+
+| Tensor | Shape | Param count |
+|---|---|---|
+| `token_embd.weight` | `(vocab × dim)` = 151936 × 4096 | 622 M |
+| `output.weight` | `(vocab × dim)` = 151936 × 4096 | 622 M (or tied) |
+| `rope.freqs` | `(head_dim / 2,)` = 64 | 64 |
+
+Per-layer subtotal: ~218 M params × 32 layers = **6.97 B**. Adding ~0.62 B for a tied embedding/output matrix gives ~7.6 B params (close to advertised Qwen 3.5 7B); with an untied `output.weight` the total is ~8.2 B.
+
+GGUF's quantization schemes:
+
+| Format | Bits/weight | Structure |
+|---|---|---|
+| **F16** | 16 | Raw f16, no quantization |
+| **Q8_0** | 8.5 | 8-bit per weight + f16 scale per 32-block |
+| **Q4_0** | 4.5 | 4-bit per weight + f16 scale per 32-block |
+| **Q4_K_M** | 4.85 | 4-bit per weight + 6-bit per-block scale/min + f16 super-block scale/min |
+| **Q3_K_M** | 3.91 | 3-bit per weight + super-block scales (mixed) |
+| **Q2_K** | 3.06 | 2-bit per weight + super-block scales |
+| **IQ2_XXS** | 2.06 | 2-bit + 256-entry codebook lookup |
+
+**Observation:** the IQ-* family is already a basin codebook. The Q*_K family is already a quad-tree (super-block + block). Both are degenerate cases of PR-X12's CTU + basin + Skip/Merge/Delta/Escape grammar — but neither does RDO partition selection, neither does cross-head merging, and the codebook isn't federated.
+
+---
+
+## 2. The four modes mapped onto weight matrices
+
+PR-X12's mode taxonomy (M:E-A, §2.1 of mapping doc) is `Skip / Merge / Delta / Escape` — exactly four discriminants in 2 header bits. The mapping onto weight tensors:
+
+### 2.1 Skip — weight is "close to basin centroid" (or zero)
+
+For each weight cell, if the cell's value is within `λ_skip` of the nearest basin centroid, encode it as Skip + 12-bit basin pointer. Effective storage per Skip cell: 14 bits for the cell, *amortised across the CTU* to ≤ 2 bits per weight (the CTU header lives once for the whole 64×64 block).
+
+**Why this fires often in LLM weights:**
+
+- ReLU/SwiGLU training pushes many weights toward zero. ~30-50% of FFN-up weights are near-zero post-training (long-tail dead neurons + dropout artefacts).
+- Attention K/V projections in GQA models have repeated structure across heads (one K-projection serves 4 Q-heads).
+- LayerNorm scale `attn_norm.weight` is dominantly ~1.0 with small deviation. ~95% Skip.
+
+**Estimated Skip-rate per tensor family (post-training Qwen-7B-style model):**
+
+| Tensor | Skip-rate (λ for ~1% perplexity loss) |
+|---|---|
+| `attn_q.weight` | ~25% |
+| `attn_k.weight` | ~50% (GQA replication) |
+| `attn_v.weight` | ~50% (GQA replication) |
+| `attn_output.weight` | ~30% |
+| `ffn_gate.weight` | ~40% (sparse SwiGLU gating) |
+| `ffn_up.weight` | ~35% |
+| `ffn_down.weight` | ~30% |
+| `attn_norm.weight` | ~95% (LN scales ≈ 1.0 with tiny noise) |
+| `ffn_norm.weight` | ~95% |
+| `token_embd.weight` | ~10% (rare tokens have low-magnitude embeddings) |
+
+Weighted by param count, **average Skip-rate is ~32% across a 7B model**.
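+
+The weighted average can be reproduced from the two tables. The sketch below assumes `output.weight` is tied to `token_embd.weight` (so the embedding is counted once) and uses the rounded per-tensor counts from §1; `weighted_skip_rate` and `qwen7b_style_rows` are hypothetical names.
+
+```rust
+/// Rows are (params per tensor in millions, Skip-rate, copies per model).
+fn weighted_skip_rate(rows: &[(f64, f64, f64)]) -> f64 {
+    let total: f64 = rows.iter().map(|&(p, _, n)| p * n).sum();
+    let skipped: f64 = rows.iter().map(|&(p, s, n)| p * s * n).sum();
+    skipped / total
+}
+
+/// Per-tensor estimates from the tables above, 32 layers.
+fn qwen7b_style_rows() -> Vec<(f64, f64, f64)> {
+    vec![
+        (16.8, 0.25, 32.0),  // attn_q
+        (4.2, 0.50, 32.0),   // attn_k
+        (4.2, 0.50, 32.0),   // attn_v
+        (16.8, 0.30, 32.0),  // attn_output
+        (58.7, 0.40, 32.0),  // ffn_gate
+        (58.7, 0.35, 32.0),  // ffn_up
+        (58.7, 0.30, 32.0),  // ffn_down
+        (0.004, 0.95, 32.0), // attn_norm
+        (0.004, 0.95, 32.0), // ffn_norm
+        (622.0, 0.10, 1.0),  // token_embd (assumed tied with output)
+    ]
+}
+```
+
+`weighted_skip_rate(&qwen7b_style_rows())` comes out a little above 0.32, so the FFN tensors dominate the average and the norm tensors are negligible despite their high Skip-rate.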
+
+### 2.2 Merge — inherit from a neighbour
+
+The codec's Merge direction (`{N, E, W, S}` per R-9) is a *4-way topology* over the weight grid. For an LLM tensor, the four natural neighbours are:
+
+| Direction | Meaning for weight tensor |
+|---|---|
+| N (prev row) | Weight in row r-1 of same column — adjacent output channel |
+| E (next col) | Weight in column c+1 of same row — adjacent input dim |
+| W (prev col) | Weight in column c-1 of same row — prior input dim |
+| S (next row) | Weight in row r+1 of same column — next output channel |
+
+**When Merge wins:** RoPE-rotated attention K columns are periodic in head_dim. Adjacent FFN gate channels often share gating patterns (especially in post-training-distilled models). Embedding rows for related tokens (e.g., "the" vs "The") are tiny deltas of each other.
+
+**Extended Merge — cross-head, cross-layer, cross-tensor:**
+
+The wire format's 2-bit Merge field stays 4-way (R-9), but the *interpretation* of the four directions can be tensor-family-specific. For attention K/V:
+
+| Direction | GQA-aware meaning |
+|---|---|
+| N | Same column, previous Q-head sharing this K-head |
+| E | Next dim within head |
+| W | Prior dim within head |
+| S | Next head in same KV group |
+
+So a single `Merge::S` in an `attn_k.weight` CTU header says "this 64-dim head_k column is the same as the matching column of the next head in the same KV group, except for a delta encoded in the tail." This is **GQA encoded directly into the codec**, no special-case logic.
+
+**Cross-layer Merge:** layer L's `attn_q.weight` is often a small perturbation of layer L-1's (especially in deeper models, where layers converge to similar transforms). The reserved header bits 14-15 (R-2) can be reused at *model-weight encoding time only* to signal "this CTU's basin is in the layer above" — a cross-layer pointer that lets a deep model amortise codebooks across layers.
+
+**Estimated Merge-rate (λ chosen for ≤ 1% perplexity loss):** ~25% across a 7B model, biased heavily toward attention K/V (where GQA replication makes Merge near-free).
+
+### 2.3 Delta — small residual from basin
+
+The classic GGUF Q4_K case: a basin centroid plus a 4-bit delta. PR-X12's Delta mode generalises:
+
+- Per-CTU basin pointer (12 bits, 4096-entry codebook)
+- Per-cell residual (rANS-coded with per-tensor frequency table)
+
+Crucially, the residual is **rANS-coded with a Gaussian-tail prior** (R-10). GGUF's uniform 4-bit residual wastes ~0.3-0.5 bits per cell because the actual residual distribution is Laplacian/Gaussian, not uniform. PR-X12 closes that gap.
+
+**Estimated Delta-rate:** ~35% of weights, at an average of 2.5-3.5 bits each (counting basin pointer amortisation + Gaussian-tail rANS residual). Lower than GGUF's uniform 4.5 bpw.
+
+### 2.4 Escape — outlier, encode full
+
+For weights that are too extreme to fit any basin (the activation outliers that LLM.int8() and SmoothQuant fight over), encode as Escape + raw f16 value. ~3-5% of weights per layer, but they carry disproportionate information.
+
+The PR-X12 wire format already supports Escape as the lossy-fallback path (with the codec body warning per M:T new items). For LLM weights, Escape *must be lossless* — no truncation of outliers. This is an additional R-N candidate.
+
+---
+
+## 3. CTU quad-tree on weight matrices
+
+The CTU partition (M:E-G, R-2) is `Ctu<const N>` with leaf sizes ∈ {8, 16, 32, 64}. Applied to an LLM weight matrix:
+
+**The math:** a 4096 × 4096 attention weight tensor partitions into 64 × 64 = 4096 CTUs of 64×64 cells each, or finer. Tropical-GEMM RDO (R-7) chooses the optimal partition per CTU.
+
+**Why mixed quantization within a tensor matters:**
+
+GGUF's Q4_K_M uses *uniform* 4-bit blocks across the whole tensor. But empirically:
+
+- Output channels with high activation variance want 6-8 bit (Escape-dominant)
+- Output channels with low variance want 2-3 bit (Skip-dominant)
+- Most channels sit in the middle at 4 bit (Delta-dominant)
+
+GGUF can't express this — every block in `attn_q.weight` uses the same bit-width. PR-X12's RDO partition naturally chooses: a 16×16 block at 6-bit for an outlier-heavy region, a 64×64 block at 2-bit for a near-zero region, all within the same `attn_q.weight` tensor.
+
+**Concrete impact:** for the few output channels in attention that "carry" the attention sink behaviour (~5% of heads in a typical LLM), PR-X12 keeps them at 8-bit precision while compressing the bulk to 2-3 bit. GGUF would either over-quantize the sinks (causing attention pattern collapse) or over-allocate to all channels.
+
+**Cross-arch crossover (R-5):** the per-arch DCT crossover applies here too. On AMX-class hardware, the GEMM that consumes the decoded weights wants block-aligned 64×64 input; on Apple Silicon NEON, 32×32 is sometimes better. The CTU partition can be tuned per arch as a build flag — same model file, different optimum partition per target.
+
+---
+
+## 4. The basin codebook for LLM weights
+
+PR-X12's 4096-entry basin codebook (12-bit fingerprint) is the right size for LLM weight clustering. The training objective:
+
+```text
+Given a flat list of N weight vectors v_i ∈ ℝᵈ
+  (each v_i = a row or column slice of a tensor at the codebook's granularity)
+
+Find 4096 centroids c_1 .. c_4096 ∈ ℝᵈ
+  minimising  Σ_i ||v_i - nearest(v_i, {c_k})||²
+
+This is k-means on weight vectors. Per-tensor, per-layer-family,
+or model-global — the codebook policy lives in R-13.
+```
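+
+One Lloyd iteration of that objective, as a plain scalar sketch. It assumes all vectors and centroids share one dimension and leaves empty clusters at their previous centroid; `dist_sq`, `nearest` and `lloyd_step` are hypothetical names, and a real trainer would use 4096 centroids and a batched distance kernel.
+
+```rust
+/// Squared Euclidean distance between two equal-length vectors.
+fn dist_sq(a: &[f32], b: &[f32]) -> f32 {
+    a.iter().zip(b).map(|(&x, &y)| (x - y) * (x - y)).sum()
+}
+
+/// Index of the centroid closest to `v` (first one wins on ties).
+fn nearest(v: &[f32], centroids: &[Vec<f32>]) -> usize {
+    let mut best = 0;
+    let mut best_d = f32::INFINITY;
+    for (k, c) in centroids.iter().enumerate() {
+        let d = dist_sq(v, c);
+        if d < best_d {
+            best = k;
+            best_d = d;
+        }
+    }
+    best
+}
+
+/// One Lloyd step: assign every vector to its nearest centroid,
+/// then move each centroid to the mean of its members.
+fn lloyd_step(vectors: &[Vec<f32>], centroids: &[Vec<f32>]) -> Vec<Vec<f32>> {
+    let dim = centroids.first().map_or(0, |c| c.len());
+    let mut sums = vec![vec![0.0f32; dim]; centroids.len()];
+    let mut counts = vec![0usize; centroids.len()];
+    for v in vectors {
+        let k = nearest(v, centroids);
+        counts[k] += 1;
+        for (s, &x) in sums[k].iter_mut().zip(v) {
+            *s += x;
+        }
+    }
+    sums.into_iter()
+        .zip(counts)
+        .zip(centroids)
+        .map(|((sum, n), old)| {
+            if n == 0 {
+                old.clone() // empty cluster keeps its previous centroid
+            } else {
+                sum.into_iter().map(|s| s / n as f32).collect()
+            }
+        })
+        .collect()
+}
+```
+
+Repeating `lloyd_step` until the centroids stop moving gives a local minimum of the sum-of-squares objective; the codebook scope only changes which vectors are fed in.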
+
+**Granularity choices:**
+
+| Codebook scope | Codebook entries | Per-model storage | Compression quality |
+|---|---|---|---|
+| Per-tensor (every weight matrix has its own) | 4096 × n_tensors ≈ 4M entries | ~200 MB | Best, but storage-heavy |
+| Per-layer-family (Q+K+V+O share; gate+up+down share) | 4096 × 2 × 32 = 262K entries | ~13 MB | Good balance |
+| Per-architecture-family (one codebook for "all attention" of all layers) | 4096 × 4 = 16K entries | ~1 MB | Lower fidelity |
+| Model-global (one 4096-entry codebook) | 4096 entries | ~256 KB | Lossy on outlier layers |
+
+**Federated codebook policy (R-13) ships the per-layer-family codebook with the model file.** This is ~13 MB extra over the raw weights, paid once per model. The codebook is *the model*'s fingerprint — a Llama-3 codebook can't be used to decode a Qwen-3.5 file, but both ship the same PR-X12 binary.
+
+**Pretrained domain codebook (R-13 PretrainedStatic mode):** a single "LLM-family" codebook trained across many open-weight models could compress *any* LLM, at slightly lower fidelity than per-model codebooks. Useful for: shared model-distribution CDN, federated learning aggregation, or quick prototyping.
+
+---
+
+## 5. Activation-aware RDO (the GPTQ / AWQ trick, unified)
+
+GPTQ, AWQ, and Hadamard-based quantizers all amount to: "weight the RDO loss by the magnitude of expected activations through this row/column, from a calibration corpus." PR-X12's λ-RDO (A6) supports this natively:
+
+```text
+Standard codec RDO:
+    minimise  D(reconstructed, original) + λ · R(bitstream)
+
+Activation-aware RDO for LLM weights:
+    minimise  Σ_cells [ |a_c|² · (w_c - w'_c)² ]  +  λ · R(bitstream)
+                ↑ activation-magnitude weighting (from calibration corpus)
+```
+
+The codec body doesn't care — `D` is supplied by the caller (the GGUF-to-PR-X12 transcode tool). For an LLM use case:
+
+1. Run the model forward on a calibration corpus (~512 samples of natural text)
+2. Capture per-channel activation magnitudes
+3. Pass `|a_c|²` as the per-cell distortion weight into the codec's RDO step
+4. Codec converges to a quantization that preserves high-activation channels
+
+This is **GPTQ + AWQ + SmoothQuant unified into one substrate**. Currently each is its own ~5 K-LoC codebase. The PR-X12 version is a callable function: `pr_x12_encode_tensor(tensor, activation_weights, λ) -> bitstream`.
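+
+A minimal sketch of the weighted cost, assuming the caller already has candidate reconstructions and their bit costs. The names `weighted_distortion`, `rd_cost`, and `pick_candidate` are hypothetical and are not the `pr_x12_encode_tensor` internals; `None` for the weights stands for the uniform case.
+
+```rust
+/// Activation-weighted squared error: Σ_c |a_c|² · (w_c - w'_c)².
+/// `act_sq` holds the per-cell |a_c|² weights; `None` means weight 1 everywhere.
+fn weighted_distortion(orig: &[f32], recon: &[f32], act_sq: Option<&[f32]>) -> f32 {
+    orig.iter()
+        .zip(recon)
+        .enumerate()
+        .map(|(c, (w, w2))| {
+            let weight = act_sq.map_or(1.0, |a| a[c]);
+            weight * (w - w2) * (w - w2)
+        })
+        .sum()
+}
+
+/// Lagrangian cost J = D + λ·R for one candidate encoding.
+fn rd_cost(distortion: f32, rate_bits: f32, lambda: f32) -> f32 {
+    distortion + lambda * rate_bits
+}
+
+/// Index of the candidate (reconstruction, rate in bits) with the lowest J.
+/// Ties go to the earliest candidate.
+fn pick_candidate(
+    orig: &[f32],
+    candidates: &[(Vec<f32>, f32)],
+    act_sq: Option<&[f32]>,
+    lambda: f32,
+) -> usize {
+    let mut best = 0;
+    let mut best_j = f32::INFINITY;
+    for (i, (recon, bits)) in candidates.iter().enumerate() {
+        let j = rd_cost(weighted_distortion(orig, recon, act_sq), *bits, lambda);
+        if j < best_j {
+            best_j = j;
+            best = i;
+        }
+    }
+    best
+}
+```
+
+With two candidates of equal rate, the weights alone decide the choice: the candidate that puts its error on the low-activation cell wins.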
+
+---
+
+## 6. Concrete numbers — Qwen 7B compression estimate
+
+Bottom-up estimate, using Skip/Merge/Delta/Escape rates from §2 and the GGUF baseline:
+
+| Tensor family | Param count | GGUF Q4_K_M | PR-X12 estimate |
+|---|---|---|---|
+| `token_embd.weight` + `output.weight` | 1.24 B | 720 MB (4.85 bpw) | 540 MB (3.5 bpw) — Skip-dominant rare-token rows |
+| `attn_q.weight` (32 layers) | 538 M | 313 MB | 235 MB (3.5 bpw) — mostly Delta |
+| `attn_k.weight` + `attn_v.weight` (32 layers) | 268 M | 156 MB | 78 MB (2.3 bpw) — Merge-dominant via GQA replication |
+| `attn_output.weight` (32 layers) | 538 M | 313 MB | 247 MB (3.7 bpw) |
+| `ffn_gate.weight` (32 layers) | 1.88 B | 1.09 GB | 750 MB (3.2 bpw) — sparse SwiGLU gating |
+| `ffn_up.weight` (32 layers) | 1.88 B | 1.09 GB | 800 MB (3.4 bpw) |
+| `ffn_down.weight` (32 layers) | 1.88 B | 1.09 GB | 800 MB (3.4 bpw) |
+| `attn_norm.weight` + `ffn_norm.weight` (32 layers) | 262 K | 0.4 MB | 0.05 MB — 95% Skip |
+| **Total weights** | **7.60 B** | **4.40 GB (4.85 bpw)** | **3.10 GB (3.42 bpw)** |
+| + Federated codebook | — | — | 13 MB |
+| **PR-X12 model file** | | **4.40 GB** | **3.12 GB** |
+
+**Compression ratio: ~29% smaller than GGUF Q4_K_M at equivalent perplexity.**
+
+For comparison:
+
+- GGUF Q3_K_M is ~3.3 GB at 3.91 bpw, with perplexity degradation of ~1-2% on Wikitext-103
+- PR-X12 estimate sits at ~3.1 GB at 3.42 bpw with target degradation < 0.5% (smaller than Q3_K_M, with quality close to Q4_K_M)
+- GGUF Q2_K is ~2.6 GB at 3.06 bpw with significant perplexity degradation (~5-10%)
+
+**Where the wins come from:**
+
+1. **Mixed quant within tensor** (§3): saves ~10% over uniform Q4_K_M
+2. **Gaussian-tail rANS residual** (R-10): saves ~0.3-0.5 bpw on Delta cells
+3. **Cross-head Merge in K/V projections**: saves ~50% on those tensors
+4. **Skip-rate at 32% average**: dominant contributor
+
+The estimate is intended to be conservative; real measurements are expected to land between 25% and 35% smaller than Q4_K_M.
+
+---
+
+## 7. Streaming weight load — decode-during-GEMM
+
+Currently, llama.cpp / candle / burn load a GGUF file into memory in full, then dequantize per-tensor before the GEMM. PR-X12's wire format enables a different flow:
+
+```text
+Per GEMM operation (e.g., compute attn_q @ x for batch):
+
+  for each output row r in attn_q:
+      decode CTU bitstream for row r:
+          - if Skip: weight = basin_centroid (4 ns lookup)
+          - if Merge: weight = neighbour value already in register
+          - if Delta: weight = basin_centroid + rANS-decoded residual
+          - if Escape: weight = raw f16 (rare, ~3-5%)
+      accumulate: out[r] += weight @ x  (immediate, before next row)
+```
+
+The CTU bitstream is read forward-only (rANS is a streaming codec) and the decoded weights live in L1/L2 cache just long enough to be GEMM'd. **No full-tensor dequantize buffer needed.** For a 4096 × 4096 attention projection, the dequantize buffer would be 32 MB (f16); PR-X12 streams in ~7 MB of bitstream (16.8 M weights at ~3.5 bpw), decodes to ~64 KB cache-resident windows, GEMMs each window, and drops it.
+
+**Memory savings:** on a memory-constrained edge device (8 GB RAM), this turns "loads 4 GB model + needs 1 GB dequant scratch" into "loads 3 GB model + needs 64 KB scratch." A 7B model at PR-X12 is genuinely runnable on a phone-class device, where GGUF Q4 is borderline.
+
+**Latency:** the streaming decode happens in the same loop body as the GEMM accumulate. On a modern arch with VNNI + AMX, the decode cost (~5-10 cycles per cell, branchless via R-1's lookup-table pattern) is hidden by GEMM latency. **Estimated overhead: < 5% versus pre-dequantized GEMM.**
+
+This is the architecture that R-11 (latency assertion) was designed to gate: the decode-during-GEMM path *must* clear within 1.05× of the pre-dequantized baseline, or the streaming win evaporates.
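+
+The decode-during-GEMM loop can be sketched in scalar form as follows. This is an illustration under simplifying assumptions: the `Cell` payloads are hypothetical (not the wire format), the codebook holds scalar centroids rather than ℝᵈ vectors, Merge copies the previous cell in the row, and the rANS stage is assumed to have already produced the residuals.
+
+```rust
+/// Per-cell coding mode, mirroring the four cases in the pseudocode above.
+/// Payload shapes are illustrative assumptions, not the PR-X12 bitstream.
+enum Cell {
+    Skip { basin: usize },
+    Merge,
+    Delta { basin: usize, residual: f32 },
+    Escape(f32),
+}
+
+/// Decode one weight. `prev` is the previously decoded weight in the row,
+/// which is what Merge reuses in this sketch.
+fn decode_cell(cell: &Cell, codebook: &[f32], prev: f32) -> f32 {
+    match cell {
+        Cell::Skip { basin } => codebook[*basin],
+        Cell::Merge => prev,
+        Cell::Delta { basin, residual } => codebook[*basin] + *residual,
+        Cell::Escape(raw) => *raw,
+    }
+}
+
+/// y[r] = Σ_c W[r][c] · x[c], decoding each weight immediately before it
+/// is multiplied, so no full dequantized row is ever materialised.
+fn streaming_gemv(rows: &[Vec<Cell>], codebook: &[f32], x: &[f32]) -> Vec<f32> {
+    rows.iter()
+        .map(|row| {
+            let mut acc = 0.0f32;
+            let mut prev = 0.0f32;
+            for (cell, xc) in row.iter().zip(x) {
+                let w = decode_cell(cell, codebook, prev);
+                acc += w * xc;
+                prev = w;
+            }
+            acc
+        })
+        .collect()
+}
+```
+
+The only live state per row is the accumulator and the previous weight, which is the property the 64 KB scratch figure relies on; a real kernel would decode a cache-sized window and hand it to the SIMD GEMM rather than multiply one scalar at a time.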
+
+---
+
+## 8. The inference math is unchanged
+
+Critically: **PR-X12-encoded weights produce the same matmul output as the original f16 weights**, up to the quantization noise floor. The codec does not change:
+
+- Layer norm formula
+- Attention softmax
+- SwiGLU activation
+- RoPE rotation
+- KV cache layout
+
+Only the **storage format** of the weight tensors changes. The GEMM kernel (`ndarray::hpc::blas_level3::gemm`) gets bf16 or int8 inputs after decode; everything downstream of GEMM is identical.
+
+This is why PR-X12 + GGUF is a **drop-in replacement**, not a model retrain. Take a Qwen 3.5 7B GGUF file, run `pr_x12_transcode_gguf input.gguf output.prx12`, ship the output. Decode side: candle or burn loads the .prx12 file via a new codec adapter; inference proceeds identically.
+
+The hard part — and the falsifier — is whether the activation-aware RDO actually produces the same perplexity. Plan G's model-lane (proposed below) is the empirical check.
+
+---
+
+## 9. Bench plan (extends Plan G with a model-weight lane)
+
+Add to Plan G (per R-4, R-11) a new LLM-weights lane:
+
+| Lane | Source | Pass criterion |
+|---|---|---|
+| Video | Big Buck Bunny 1080p | ≥ 0.95× x265 ultrafast PSNR @ -0.1 dB (R-4) |
+| 3DGS | Mip-NeRF 360 garden scene | ≥ 30× over PLY-trim (R-10) |
+| Gradient | ResNet-50 ImageNet SGD logs | Match QSGD compression (HG4) |
+| **NEW: LLM weights** | **Qwen 3.5 7B GGUF Q4_K_M** | **≤ 3.2 GB encoded + perplexity Δ ≤ 1.0% on Wikitext-103** |
+
+**Sub-targets within the LLM lane:**
+
+1. Transcode time: ≤ 10 minutes on a single SPR socket for a 7B model (offline, one-time)
+2. Decode-during-GEMM overhead: ≤ 5% vs pre-dequant baseline (R-11 assertion)
+3. Streaming memory: decode scratch ≤ 1 MB at any moment (peak)
+4. Perplexity preservation: Δ ≤ 1% on Wikitext-103 versus original f16 weights
+5. Codebook size: federated codebook ≤ 15 MB per model
+
+Failing any sub-target makes the LLM lane informational-only; failing all five blocks the LLM lane from claiming the win.
+
+**Suggested implementation cost:** 2-3 weeks for the transcode tool (Rust, builds on existing `ndarray::hpc::cam_pq::kmeans` + R-1 basis trait). 1-2 weeks for candle integration. 1 week for bench. Total: ~4-6 weeks from PR-X12 codec merge.
+
+---
+
+## 10. Falsifiers
+
+What kills this path? Listed by likelihood:
+
+**F-1: Activation-aware RDO doesn't beat GPTQ/AWQ.** If PR-X12's RDO under-performs the hand-tuned per-tensor quantizers, the win evaporates. **Mitigation:** Plan G's perplexity assertion is the check. If λ-RDO is within 0.5% of AWQ on benchmark, ship. If not, the codec stays at uniform-bit quant (still a 5-10% storage win from Gaussian-tail rANS alone) and AWQ-style quantization stays orthogonal.
+
+**F-2: Streaming decode breaks GEMM dispatch.** The decode-during-GEMM loop has tight register pressure. If the codec decode steals enough registers from the GEMM kernel, the loop runs slower than the 1.05× latency threshold. **Mitigation:** R-11 latency CI catches this. Worst case: bench detects, codec falls back to pre-dequant path (lose streaming-memory win, keep storage win).
+
+**F-3: Federated codebook size grows.** If per-layer-family codebooks need > 30 MB at acceptable fidelity, the overhead vs Q4_K_M's metadata grows substantially. **Mitigation:** R-13's PretrainedStatic mode (single LLM-family codebook) can fall back to ~1 MB at slightly lower fidelity. Tradeoff is exposed at transcode time.
+
+**F-4: Outliers can't be encoded losslessly.** If Escape mode's lossless f16 fallback is incompatible with the rANS state machine (e.g., needs out-of-band raw bytes), the wire format becomes mixed-stream — bad for streaming decode. **Mitigation:** reserve a small bypass channel in the framing layer (A8) for raw escapes; the rANS coder handles ~95% of cells, the bypass handles the 5% outliers. This is the same pattern HEVC uses for escape coefficients.
+
+**F-5: Llama.cpp ecosystem fork.** If PR-X12-encoded weights need a new file extension and new loader code, the GGUF ecosystem (active community, ~5 years of momentum) won't adopt. **Mitigation:** ship a `pr-x12` extension *inside* a GGUF v3 file format, registered as a new quantization type (Q_PRX12). Llama.cpp can add it via a small contributor PR. The codec becomes a GGUF quantization variant, not a replacement file format.
+
+---
+
+## 11. What this lens prescribes for PR-X12 scope
+
+Concrete implications:
+
+1. **Do not** widen the codec body to accept "model weights" as a special case. Per R-3, the codec body stays generic. Model-weight encoding is a *consumer* of the codec, not a fork of it.
+
+2. **Do** ship the codec with the bench harness lane structure that allows new lanes to be added (per R-4). The LLM lane lands post-PR-X12, but the harness must be lane-extensible.
+
+3. **Do** export the activation-weighted RDO interface explicitly. `pr_x12_encode_tensor(tensor, distortion_weights, lambda)` — `distortion_weights` is `None` for video (uniform weight per pixel), `Some(activation_magnitudes)` for LLM weights. Same function, different param.
+
+4. **Do** keep R-13's federated codebook policy. The LLM use case is the strongest motivation: per-model codebooks are 13 MB; without R-13, a hard-coded codebook would not work for arbitrary LLMs.
+
+5. **Reserve** an `EncodingDomain::LLMWeights` discriminant in the codec metadata header (separate from the 16-bit per-CTU header). The codec body doesn't read this — it just stamps the file with a domain tag so decoders know which basin codebook to load.
+
+6. **Bench against AWQ at parity perplexity, not just Q4_K_M.** Q4_K_M is a conservative baseline; AWQ + GPTQ are the actual state of the art. If PR-X12 can match AWQ at smaller storage, the case is strong; if not, ship at "drop-in GGUF replacement" framing only.
+
+---
+
+## 12. The deeper claim
+
+The four loads in the PR-X12 multi-domain thesis (M:H-1, HG1) are:
+
+1. Video frames
+2. 3D Gaussian splats
+3. Attention KV caches
+4. Gradient streams
+
+This doc adds a **fifth load** that the original thesis didn't enumerate:
+
+5. **Static LLM weight tensors**
+
+The fifth load is interesting because it's *what GGUF already does, badly*. Every quantized-LLM-deployment problem solved by GGUF — model distribution, edge inference, memory-constrained loading — is *more cleanly* solved by PR-X12. The community has built a parallel codec ecosystem (Q4_K_M, AWQ, GPTQ, EXL2, IQ2_XXS) that converges step-by-step toward what PR-X12 already specifies.
+
+The economic stake: **a large share of LLM deployments** (Open WebUI, llama.cpp, candle apps, Ollama, LM Studio, vLLM) ship or can load GGUF. Even a 20% storage reduction across that ecosystem could save hundreds of GB per model release and, at Hugging Face / Replicate scale, plausibly millions of dollars in CDN costs per month.
+
+**PR-X12 inherits the LLM weight compression market by being a strictly more general codec, requiring only a transcode tool and a candle/burn adapter.** No retraining, no new training pipelines, no model-architecture changes. Just a smaller file that produces the same logits.
+
+---
+
+## 13. Cross-references
+
+- **Substrate canon:** `pr-x12-substrate-merged-canon.md`
+- **Resolutions:** R-3, R-4, R-5, R-10, R-11, R-13 in `pr-x12-canon-resolutions-delta.md`
+- **GEMM lens:** `pr-x12-x265-blasgraph-gemm.md` (the streaming-decode pattern is the same as ME-via-SSD)
+- **3DGS lens:** `pr-x12-x266-3dgs-spacetime-upscaling.md` (sibling load #2)
+- **WoA orchestration:** `pr-x12-woa-multiarch-orchestration.md` (per-arch dispatch for the decode-during-GEMM kernel)
+- **Anti-neural lens:** `pr-x12-anti-neural-lookup-inversion.md` (k-means basin as frozen 1-layer NN — relevant to the codebook training story here)
+- **Codec spec:** `pr-x12-codec-x265-design.md`
+- **Reading list:**
+  - GGUF spec: `ggerganov/ggml` repo `docs/gguf.md`
+  - GPTQ (Frantar et al. 2022)
+  - AWQ (Lin et al. 2023)
+  - SmoothQuant (Xiao et al. 2023)
+  - LLM.int8() (Dettmers et al. 2022)
+  - IQ2_XXS llama.cpp PR — current "lookup-table quant" closest to PR-X12 shape
+- **Adjacent code:**
+  - `src/hpc/cam_pq.rs` — k-means kernel for basin codebook training
+  - `src/hpc/quantized.rs` — Int8 GEMM (where decode-during-GEMM would dispatch)
+  - `src/hpc/blas_level3.rs::gemm` — the inner-loop matmul that consumes decoded weights
+  - `candle` / `burn` integration points (in their respective repos)
+
+_Last edit: 2026-05-22._
+_Status: perspective doc; the LLM-weight lane is post-PR-X12 scope (2-3 months after merge)._
diff --git a/.claude/knowledge/pr-x12-substrate-canon-resolutions.md b/.claude/knowledge/pr-x12-substrate-canon-resolutions.md
new file mode 100644
index 00000000..5bd633ba
--- /dev/null
+++ b/.claude/knowledge/pr-x12-substrate-canon-resolutions.md
@@ -0,0 +1,1288 @@
+# PR-X12 — Substrate Canon Resolutions
+
+> Date: 2026-05-22
+> Status: **canon supplement** — resolves the eighteen open items raised in
+> the review of `pr-x12-substrate-merged-canon.md` (PR #196). Additive to
+> the canon, not a replacement.
+> Reads after the canon. Cite from this doc as `R-N` (resolution N).
+
+---
+
+## 0. How to read this doc
+
+The merged canon (`pr-x12-substrate-merged-canon.md`, master commit
+`bc9da4ad`) is the single point of architectural truth. It successfully
+fuses session A (`pr-x12-codec-cognitive-substrate-mapping.md`) and
+session B (`pr-x12-cross-domain-synergies.md`). What it does NOT yet do
+is commit to concrete shapes for the load-bearing pieces. Eighteen items
+were raised in review:
+
+- **§3** — five things the merge merged well (confirmed, one-liners)
+- **§4** — four items the merge raised in abstraction but did not commit
+  (R-1 through R-4 resolutions)
+- **§5** — three pieces of detail from session A the merge underrepresented
+  (R-5 through R-7 restorations)
+- **§6** — three pieces of detail from session B the merge underrepresented
+  (R-8 through R-10 restorations)
+- **§7** — three commitments missing from both originals and from the
+  merge (R-11 through R-13 new specs)
+
+Then five integration pieces that make the resolutions actionable:
+
+- **§8** — the canonical contracts (trait signatures for `PredictiveSignal`,
+  `LinearReduce<Basis>`, `Basis<T>`, `CurveOrder<const N>`, `RdoMetric`)
+- **§9** — falsifiability matrix (every claim → criterion → test → pass)
+- **§10** — sequencing diagram with named gates
+- **§11** — end-state + trajectory (think it from the end)
+- **§12** — compaction-preservation contract
+
+Citation IDs: `R-1` through `R-13` for resolutions. Canon IDs (`M:E-*`,
+`A:E-*`, `B:E-*`, `M:H-*`, `M:T-*`) remain stable; this doc adds, does
+not renumber.
+
+Sister docs (read order):
+
+1. `pr-x12-codec-x265-design.md` — mechanical spec
+2. `pr-x12-substrate-merged-canon.md` — architectural fusion (THE canon)
+3. **this doc** — resolutions of opens in the canon
+4. `pr-x12-codec-cognitive-substrate-mapping.md` — session A archeology
+5. `pr-x12-cross-domain-synergies.md` — session B archeology
+
+---
+
+## 1. The end state — think it from the end
+
+Where this lands if every plan ships:
+
+```text
+                       ┌────────────────────────────────────────────────────┐
+                       │  $ codec-bench --mode video    --input scene.y4m   │
+                       │  $ codec-bench --mode splat    --input scene.ply   │
+                       │  $ codec-bench --mode kv-cache --input kv.bin      │
+                       │  $ codec-bench --mode gradient --input grad.lance  │
+                       │                                                    │
+                       │  → all four emit compressed Lance columns          │
+                       │  → all four meet their threshold (§9)              │
+                       │  → all four share ~1.5 KLoC generic codec body     │
+                       │  → each ships ~200 LoC of trait impl               │
+                       └────────────────────────────────────────────────────┘
+```
+
+**Five-category architecture, codec is its own layer:**
+
+```text
+ndarray         = hardware       (SIMD, Palette, Base17, SpoDistanceMatrices)
+ndarray-codec   = compression substrate  ← extracted via Plan H
+                  (Ctu, LeafCu, PredictiveSignal, LinearReduce, CurveOrder, rANS, RDO)
+lance-graph     = thinking       (NarsTruth, TripleModel, AutocompleteCache)
+causal-edge     = protocol       (CausalEdge64, NarsTables)
+p64             = convergence    (where ndarray + lance-graph meet)
+```
+
+**Three plug-points factor everything domain-specific out of the codec**
+(per M:E-E + R-1 below): Transform basis, Curve order, Escape payload.
+Anything domain-specific that does not fit one of these three is a sign
+that the abstraction is wrong, not that the codec needs growth.
+
+**Single binary `codec-bench`** is the falsifiability proof of M:H-NEW-1.
+The binary, not an argument, demonstrates HG1 / HG6 / M:H-NEW-1. Plan G
+(§5 of canon, §10 here) builds it before A7 rANS ships.
+
+This is the end state. The trajectory in §11 is how we get there.
+
+---
+
+## 2. The trajectory
+
+```text
+T+0 weeks    Phase 0 starts — substrate gates
+   T+0    : Plan H (extract ndarray-codec, 3 days, parallel)
+   T+0    : Plan I (PredictiveSignal trait, 3 days, parallel)
+   T+0    : Plan A4-design (Transform trait shape, 1 day, parallel)
+   T+0    : Plan G (multi-domain bench, 2 weeks, the gate)
+
+T+2 weeks  Phase 0 closes. Plan G binary exists, runs on 4 inputs.
+   T+2    : Plan A7 starts (1.5 weeks, CRITICAL PATH)
+
+T+3.5 weeks  Plan A7 lands. Compression-ratio thresholds testable.
+   T+3.5  : Plan A4-impl (1 week, parallel)
+   T+3.5  : Plan B / A3-inter (1 week, parallel)
+   T+3.5  : Plan C / EWA SYRK-batched (1 week, parallel)
+   T+3.5  : Plan A6 (1 week, parallel)
+   T+3.5  : Plan A8 (1 week, parallel)
+
+T+4.5 weeks  Phase 1 closes. Codec mechanism complete.
+   T+4.5  : Plan E (3DGS coefficient codec, 3 weeks, 2 workers)
+   T+4.5  : Plan D (attention codec, 2 weeks, 2 workers, can run parallel)
+   T+6.5  : Plan F (federated SGD, 4 weeks, 2 workers, after Plan D)
+
+T+10.5 weeks  All four consumer integrations land.
+   T+10.5 : Plan G thresholds re-run against all four loads.
+   T+10.5 : M:H-1 through M:H-9 all unlocked (or falsified — see §9).
+```
+
+Critical path: **Plan G → Plan A7**. Everything else parallelises after
+Plan A7. Total ~10.5 weeks of wall-clock work; ~2 workers steady-state
+through Phases 0 and 1, ramping to 6 workers in Phase 2.
+
+---
+
+## 3. What the merge merged well (preserved)
+
+Five pieces of synthesis that genuinely emerge from putting A and B
+side by side. None appear in either original. Confirmed as canon:
+
+- **M:E-A** — `LinearReduce<Basis>` unifies α-composite / rANS /
+  sum-reduce / softmax as the *same* matrix-vector reduce. A's E-4
+  (transform = optimizer) and B's E9 (mode-decide + reduce = same
+  kernel) collapse into one trait.
+- **M:E-D** — Fifth crate category `ndarray-codec`. Both originals
+  saw the dep-cycle; neither named the resolution.
+- **M:E-G** — `Ctu<const N: usize>` reconciles 64×64 (cognitive) and
+  16×16 (splat) at the type level. A treated as invariant; B as debt;
+  merge factors via const-generic.
+- **M:E-I** — Trait isomorphism (`PredictiveSignal`) over code-folding
+  for `splat.rs` vs `Fingerprint`. Shared interface, not shared types.
+- **M:E-F** — A7-first critical path BUT commit A4-design trait shape
+  first (1 day). Resolves A-vs-B sequencing dispute correctly.
+
+These five are the canon's load-bearing pieces. R-1 through R-13
+below resolve what these five did not yet commit.
+
+---
+
+## 4. Resolutions: items the merge raised but did not commit
+
+### R-1 — `LinearReduce<Basis>` and `Basis<T>` trait signatures
+
+**Problem.** M:E-A and M:H-NEW-2 invoke `trait LinearReduce<Basis>` as
+the unifying surface but the canon never gives the signature. Without
+it, Plan A7 is written against an unknown shape.
+
+**Resolution.** Commit the trait pair at Plan A4-design time (1 day,
+Phase 0). The shape:
+
+```rust
+/// A basis for a linear reduction. Implementors define a small dense
+/// (or sparse) matrix and how to apply it to a `[T; dim()]` input.
+///
+/// Concrete impls land in their natural homes:
+/// - `IdentityBasis<T>`         in `ndarray-codec::basis`
+/// - `DctIIBasis<const N>`      in `ndarray::hpc::fft`
+/// - `AdamPrecondBasis`         in `burn-codec` (consumer)
+/// - `KFACBlockBasis`           in `burn-codec` (consumer)
+/// - `ShSpectralBasis<L>`       in `ndarray::hpc::splat3d`
+pub trait Basis<T: Copy> {
+    /// Dimension of the basis (square: dim()×dim()).
+    fn dim(&self) -> usize;
+
+    /// Apply the basis: `dst = B · src`. Caller pre-allocates dst.
+    /// Length contract: `src.len() == dst.len() == self.dim()`.
+    fn apply(&self, src: &[T], dst: &mut [T]);
+
+    /// Inverse: `dst = B⁻¹ · src`. Same length contract.
+    /// For orthogonal bases (DCT, Hadamard) this is `Bᵀ · src`.
+    fn invert(&self, src: &[T], dst: &mut [T]);
+}
+
+/// A linear reduction over a sequence of symbols against a basis,
+/// producing a single output. The kind of reduction depends on impl:
+/// - alpha-composite (3DGS rasterizer): RGB blending
+/// - rANS-encode (codec A7): state-machine accumulation
+/// - sum-reduce (SGD all-reduce): cross-worker summation
+/// - softmax (attention): exp-normalize-multiply
+pub trait LinearReduce {
+    type Symbol: Copy;
+    type Output;
+    type Basis: Basis<Self::Symbol>;
+
+    /// Reduce a single sequence of symbols against the basis.
+    fn reduce(&self, src: &[Self::Symbol], basis: &Self::Basis) -> Self::Output;
+
+    /// Batched reduction: each row is one sequence. Returns one output
+    /// per row. Implementors may dispatch to BLAS GEMM for large batch.
+    fn reduce_batch(
+        &self,
+        src: &[&[Self::Symbol]],
+        basis: &Self::Basis,
+    ) -> Vec<Self::Output>;
+}
+```
+
+**Why two traits, not one.** The basis is data; the reduction is logic.
+Same basis (e.g. DCT-II 8×8) is used by both the transform path (codec
+A4) and the EWA splat path (matrix-vector product). Separating lets a
+basis ship once and serve many reductions.
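+
+A minimal sketch of one basis serving two reductions. The traits are restated so the block is self-contained (`reduce_batch` is omitted for brevity); `IdentityBasis` here is a stand-in with a runtime dimension, and `SumReduce` / `MaxAbsReduce` are hypothetical reductions chosen only to show the split between data and logic.
+
+```rust
+pub trait Basis<T: Copy> {
+    fn dim(&self) -> usize;
+    fn apply(&self, src: &[T], dst: &mut [T]);
+    fn invert(&self, src: &[T], dst: &mut [T]);
+}
+
+pub trait LinearReduce {
+    type Symbol: Copy;
+    type Output;
+    type Basis: Basis<Self::Symbol>;
+    fn reduce(&self, src: &[Self::Symbol], basis: &Self::Basis) -> Self::Output;
+}
+
+/// Identity basis with a runtime-chosen dimension (illustrative stand-in).
+pub struct IdentityBasis {
+    pub dim: usize,
+}
+
+impl Basis<f32> for IdentityBasis {
+    fn dim(&self) -> usize {
+        self.dim
+    }
+    // `copy_from_slice` panics on a length mismatch, which enforces the
+    // `src.len() == dst.len()` contract from the trait docs.
+    fn apply(&self, src: &[f32], dst: &mut [f32]) {
+        dst.copy_from_slice(src);
+    }
+    fn invert(&self, src: &[f32], dst: &mut [f32]) {
+        dst.copy_from_slice(src);
+    }
+}
+
+/// Hypothetical reduction 1: project through the basis, then sum.
+pub struct SumReduce;
+
+impl LinearReduce for SumReduce {
+    type Symbol = f32;
+    type Output = f32;
+    type Basis = IdentityBasis;
+    fn reduce(&self, src: &[f32], basis: &IdentityBasis) -> f32 {
+        let mut tmp = vec![0.0f32; basis.dim()];
+        basis.apply(src, &mut tmp);
+        tmp.iter().sum()
+    }
+}
+
+/// Hypothetical reduction 2: same basis, different logic (max magnitude).
+pub struct MaxAbsReduce;
+
+impl LinearReduce for MaxAbsReduce {
+    type Symbol = f32;
+    type Output = f32;
+    type Basis = IdentityBasis;
+    fn reduce(&self, src: &[f32], basis: &IdentityBasis) -> f32 {
+        let mut tmp = vec![0.0f32; basis.dim()];
+        basis.apply(src, &mut tmp);
+        tmp.iter().fold(0.0f32, |m, x| m.max(x.abs()))
+    }
+}
+```
+
+Both reductions borrow the same `IdentityBasis` value; swapping in a DCT basis would change neither `impl LinearReduce` block beyond the `type Basis` line.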
+
+**Why no const generic on `Basis::dim()`.** The codec needs to handle
+4×4, 8×8, 16×16, 32×32 DCT-II blocks at runtime per CTU split depth.
+A const-generic basis would force depth at the type level — wrong
+factoring. The compile-time win comes from monomorphising over the
+*reduction* type (which is single per consumer); the basis dim is a
+runtime knob.
+
+**Falsifies.** If a consumer has to bypass or special-case `LinearReduce`
+to make its reduction work (e.g. the splat rasterizer demands access to a
+depth buffer), the trait factoring is wrong and Plan A7 will accumulate
+domain-specific code. Plan G's bench harness is the gate that catches
+this: it runs all four reductions through the same trait.
+
+**Cite as R-1 in PR descriptions touching A4 or A7.**
+
+---
+
+### R-2 — Bits 14-15 of the leaf header: cross-load contention
+
+**Problem.** M:E-J claims bits 14-15 of the 16-bit leaf header for
+cognitive Pearl-rung metadata (`{Observation, Intervention,
+Counterfactual, inter-tier-link}`). A:E-15 had reserved the same bits for
+inter-tier reference. The canon does not say what video / splat /
+gradient consumers do with these bits, and Plan E (3DGS) ships in
+Phase 2 before the reservation is pinned in Plan A8.
+
+**Resolution.** Split the two reserved bits asymmetrically.
+
+```text
+bit 15  ──  UNIVERSAL: "has inter-tier reference"
+            0 = leaf is self-contained
+            1 = leaf refers to a parent-tier `LeafCu` (A3-inter)
+            All four consumers respect this bit identically.
+
+bit 14  ──  CONSUMER-TYPED: semantic owned by `ConsumerProfile`
+            cognitive: bit 14 = Pearl rung high bit
+                       (combined with mode bits 12-13 if rung 4
+                        wanted; today rungs 1-3 + reserved = 2-bit
+                        encoding using just bit 14)
+            video    : bit 14 = 0 (reserved)
+            splat    : bit 14 = LOD-cascade-source flag
+            gradient : bit 14 = worker-shard parity (for FRC)
+```
+
+**Frame header carries the `ConsumerProfile` tag** (Plan A8). 2-bit
+field at frame boundary. Decoders route bit-14 interpretation per
+profile. Cognitive consumer gets the Pearl-rung high bit; others
+reuse bit 14 for their own semantic without protocol break.
+
+**Why not put causal metadata in the frame header instead.** Per-leaf
+granularity matters: causal direction can change per cell in a
+cognitive scene, but profile is per-frame. Bit 14 must be leaf-local.
+
+**Why not consume both bits per profile.** Bit 15 must stay universal
+because A3-inter (cross-tier reference) is generic across consumers —
+the LOD cascade applies to all four loads.
+
+**Plan A8 implementation note.** The 2-bit `ConsumerProfile` lives in
+the frame header alongside the per-frame basin codebook ref + rANS
+frequency table. Decoders mask bit 14 of every leaf header through a
+profile-specific demultiplexer before exposing to the consumer.
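+
+The demultiplexer can be sketched as a mask plus a per-profile match. The enum and function names below are hypothetical, and the numeric encoding of the 2-bit `ConsumerProfile` field is deliberately left out because the resolution does not pin it.
+
+```rust
+/// The four consumer profiles carried in the frame header.
+#[derive(Clone, Copy, Debug, PartialEq)]
+enum ConsumerProfile {
+    Cognitive,
+    Video,
+    Splat,
+    Gradient,
+}
+
+/// Profile-specific meaning of leaf-header bit 14 (names are illustrative).
+#[derive(Debug, PartialEq)]
+enum Bit14 {
+    PearlRungHigh(bool),
+    Reserved,
+    LodCascadeSource(bool),
+    WorkerShardParity(bool),
+}
+
+const INTER_TIER_BIT: u16 = 1 << 15; // universal across consumers
+const CONSUMER_BIT: u16 = 1 << 14; // meaning depends on the profile
+
+/// Bit 15: does this leaf refer to a parent-tier `LeafCu`?
+fn has_inter_tier_ref(header: u16) -> bool {
+    (header & INTER_TIER_BIT) != 0
+}
+
+/// Bit 14: routed through the frame-level profile before the consumer sees it.
+fn decode_bit14(header: u16, profile: ConsumerProfile) -> Bit14 {
+    let set = (header & CONSUMER_BIT) != 0;
+    match profile {
+        ConsumerProfile::Cognitive => Bit14::PearlRungHigh(set),
+        ConsumerProfile::Video => Bit14::Reserved,
+        ConsumerProfile::Splat => Bit14::LodCascadeSource(set),
+        ConsumerProfile::Gradient => Bit14::WorkerShardParity(set),
+    }
+}
+```
+
+Because `has_inter_tier_ref` never looks at the profile, A3-inter code can stay generic, while every bit-14 consumer goes through `decode_bit14`.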
+
+**Cite as R-2 in A3-inter and Plan A8 PR descriptions.**
+
+---
+
+### R-3 — M:H-NEW-2 LoC budget: actual current count + commitment
+
+**Problem.** M:H-NEW-2 claims `<1.5 KLoC generic codec glue + <200 LoC
+per domain consumer`. The canon does not state current LoC nor pin
+the budget envelope.
+
+**Resolution.** Measure now, commit the budget envelope, audit per PR.
+
+**Current LoC on master commit `bc9da4ad`** (post PR #195 + PR #196):
+
+| File | Total LoC | Approximate breakdown |
+|------|-----------|-----------------------|
+| `src/hpc/codec/ctu.rs` | 771 | partition machinery + LeafCu types |
+| `src/hpc/codec/mode.rs` | 518 | bit-pack/unpack helpers |
+| `src/hpc/codec/predict.rs` | 511 | intra-prediction decision tree |
+| `src/hpc/codec/mod.rs` | 38 | re-exports |
+| **Total** | **1838** | with tests + doctests + comments |
+
+Of the 1838 total, my read of the files: **~600 lines is non-test,
+non-doc-comment generic code**, ~800 lines is inline tests, ~400 lines
+is doc-comment / doctest blocks, ~38 lines is mod.rs glue.
+
+**Generic-code LoC currently ~600.** M:H-NEW-2's `<1.5 KLoC generic
+glue` budget allows another ~900 lines for A4 (transform), A6 (RDO),
+A7 (rANS), A8 (stream), A3-inter (cross-tier).
+
+**Per-sub-card LoC envelope (committed):**
+
+| Sub-card | Generic-glue LoC envelope | Rationale |
+|----------|---------------------------|-----------|
+| A4 transform | ≤200 | DCT-II + Identity + Transform trait |
+| A6 RDO | ≤150 | λ-RDO + RdoMetric trait |
+| A7 rANS | ≤300 | encoder + decoder + per-frame freq table |
+| A8 stream | ≤200 | framing + ConsumerProfile demux (R-2) |
+| A3-inter | ≤100 | extend IntraContext with parent-tier slot |
+| **Total budget** | **≤950** | ~50 LoC over the ~900 headroom above |
+
+**Per-consumer LoC envelope (committed):**
+
+| Consumer | Consumer LoC envelope | What ships |
+|----------|---------------------------|------------|
+| `splat3d::codec` (Plan E) | ≤200 | `impl PredictiveSignal for GaussianSplat` + Morton `CurveOrder` |
+| `attention-codec` (Plan D) | ≤200 | `impl PredictiveSignal for AttentionSlot` + token-seq curve |
+| `grad-codec` (Plan F) | ≤200 | `impl PredictiveSignal for GradientWeight` + layer-seq curve |
+| `video` (Plan G consumer side) | ≤200 | `impl PredictiveSignal for VideoCell` + raster curve |
+| **Per-consumer total** | **≤800** | sum across four consumers |
+
+**Audit rule.** Every PR introducing or modifying generic-codec code
+must include a one-line generic-LoC delta in the body. If the cumulative
+delta exceeds the envelope, the PR escalates to architectural review
+(not a CR-style nit; a real "is the abstraction wrong?" question).
+
+**Falsifies M:H-NEW-2 if.** Generic-glue LoC exceeds 1500 after A4-A8
+land + at least one consumer integration. That's the falsifiability
+condition; tracked in Plan G's metrics report.
+
+**Cite as R-3 in any PR body modifying `src/hpc/codec/`.**
+
+---
+
+### R-4 — Plan G falsifiability thresholds
+
+**Problem.** Plan G ships "a single binary that ingests video / 3DGS /
+KV cache / gradient stream and emits compressed Lance columns + ratio
++ reconstruction error". The canon does not name a pass threshold per
+load.
+
+**Resolution.** Commit four threshold pairs (compression ratio + quality
+floor). Failure to clear any threshold blocks the corresponding consumer
+PR landing.
+
+| Load | Reference baseline | Compression target | Quality floor |
+|------|-------------------|--------------------|--------------|
+| **Video** | x265 `--preset ultrafast` at CRF 23 on Big Buck Bunny 1080p | ≥0.95× reference ratio | PSNR within ±0.1 dB of reference |
+| **3DGS** | Inria stock PLY-trim on Mip-NeRF 360 (garden scene) | ≥30× over PLY-trim raw | SSIM ≥ ref − 0.005 at same SH-order |
+| **KV cache** | FP16 raw cache, Llama-3-8B-Instruct, 64K context, RULER benchmark | ≥4× over raw FP16 | downstream RULER score loss ≤0.5 % |
+| **Gradient** | BERT-large fine-tune on GLUE-MNLI, signSGD baseline | ≥8× over signSGD raw | final validation-loss delta ≤0.5 % |
+
+**Three-way pass criterion** per load:
+
+1. **Ratio threshold cleared** — measured during Plan G run
+2. **Quality floor cleared** — measured during reconstruction
+3. **Per-consumer LoC envelope respected** — per R-3 audit
+
+All three must pass for the consumer's holy-grail claim to count as
+demonstrated rather than asserted.
+
+**Sub-threshold = blocker.** If any of (ratio, quality, LoC) fails for
+a load, the corresponding consumer plan (D / E / F / video) cannot
+claim "complete". The merged canon's M:H-1 through M:H-9 are then
+provably partial; only the cleared loads count.
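+
+The three-way criterion reduces to a conjunction per load. A minimal sketch, assuming hypothetical field names and that quality is expressed as a loss measured in the load's own unit (dB, SSIM points, or percent):
+
+```rust
+/// Measured outcome for one load. All field names are hypothetical.
+struct LoadResult {
+    ratio: f64,          // measured compression ratio
+    ratio_target: f64,   // minimum ratio from the threshold table
+    quality_loss: f64,   // measured quality loss, in the load's own unit
+    quality_budget: f64, // maximum tolerated loss from the threshold table
+    consumer_loc: u32,   // lines of consumer code
+    loc_envelope: u32,   // per-consumer envelope from R-3
+}
+
+/// A load passes only if ratio, quality, and LoC all clear their bars.
+fn passes(r: &LoadResult) -> bool {
+    r.ratio >= r.ratio_target
+        && r.quality_loss <= r.quality_budget
+        && r.consumer_loc <= r.loc_envelope
+}
+
+/// Indices of the loads that may claim "complete".
+fn cleared_loads(results: &[LoadResult]) -> Vec<usize> {
+    results
+        .iter()
+        .enumerate()
+        .filter(|(_, r)| passes(r))
+        .map(|(i, _)| i)
+        .collect()
+}
+```
+
+A load that clears ratio and quality but overshoots its LoC envelope is still reported as failing, which matches the sub-threshold rule.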
+
+**Why these thresholds and not stricter.** Conservative initial bars:
+- Video at parity with x265 ultrafast is meaningful (PR-X12 is supposed
+  to *generalise* x265, not beat it at its specialty)
+- 30× over Inria PLY-trim is the floor for "this changes 3DGS streaming"
+- 4× KV-cache compression at <0.5% accuracy = passes the smell test
+  against StreamingLLM / H2O / SnapKV
+- 8× gradient over signSGD = roughly the rANS theoretical floor for
+  heavy-tail-distributed gradients
+
+**Stretch targets** (recorded separately, not blocking):
+- Video at 1.5× x265 ultrafast at same PSNR (would justify HG1 strongly)
+- 3DGS at sub-1-bit/Gaussian (M:H-6 / B:HG2 — see R-10 for math)
+- KV cache at 8× (matches the FlashAttention-3 ceiling)
+- Gradient at 16× (peer-reviewed federated-SGD upper bound)
+
+**Cite as R-4 in Plan G's PR description; the binary's `--threshold`
+flag must enforce the pass criteria for all four loads.**
+
+---
+
+## 5. Restored detail from session A
+
+### R-5 — DCT-II vs GEMM crossover at 64 blocks (from A:§5.3)
+
+**Problem.** The merge punts to Plan A4-impl without preserving the
+operational decision rule for transform dispatch.
+
+**Resolution.** Pin the crossover number in Plan A4-impl's spec.
+
+**Decision rule for A4 transform dispatch:**
+
+```text
+N = number of contiguous transform blocks to apply
+
+if N <  64:  per-block butterfly path
+             ~80 ops/block for 32×32 DCT-II via Loeffler/Lengwehasatit
+             Fits L1 trivially; no batching cost
+
+if N >= 64:  batched GEMM path
+             ~32K ops/block (matrix form) but ~4 AMX bf16 tile ops/block
+             ~128 KB working-set at N = 64, fits L2
+             Amortises hardware fusion + reduces dispatch overhead
+
+Crossover empirically at ~64 blocks on Sapphire Rapids; calibrate
+per architecture during A4-impl.
+```
+
+**Why crossover at 64.** AMX TDPBF16PS does one 16×16 BF16 tile per
+cycle. 64 blocks at 32×32 → 256 tile operations → ~256 cycles for
+batched GEMM. The per-block butterfly at 80 ops/block × 64 blocks =
+5120 ops, which at ~4 IPC = 1280 cycles. Crossover is approximate;
+real measurement during A4-impl pins per-arch.
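+
+The dispatch rule and the cycle estimates above can be sketched in Rust. This is a minimal illustration, not the A4-impl code: `TransformPath`, `choose_path`, `butterfly_cycles`, and `batched_gemm_cycles` are hypothetical names, and the constants (80 ops/block, ~4 IPC, four 16×16 tiles per 32×32 block) are taken directly from the text.
+
+```rust
+/// Which transform path to take for a run of contiguous blocks.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum TransformPath {
+    PerBlockButterfly,
+    BatchedGemm,
+}
+
+/// Dispatch rule from R-5: batch once the run reaches the crossover.
+/// `crossover` is per-architecture (~64 on Sapphire Rapids per the table).
+pub fn choose_path(n_blocks: usize, crossover: usize) -> TransformPath {
+    if n_blocks < crossover {
+        TransformPath::PerBlockButterfly
+    } else {
+        TransformPath::BatchedGemm
+    }
+}
+
+/// Butterfly estimate: 80 ops per block, retired at ~4 ops per cycle.
+pub fn butterfly_cycles(n_blocks: u64) -> u64 {
+    n_blocks * 80 / 4
+}
+
+/// Batched estimate: a 32x32 block is four 16x16 tiles, one tile op per cycle.
+pub fn batched_gemm_cycles(n_blocks: u64) -> u64 {
+    n_blocks * 4
+}
+```
+
+Note that these linear estimates favour the batched path at every N (20 vs 4 cycles per block). The crossover presumably comes from fixed batching costs that the sketch does not model, which fits R-5 treating 64 as an empirical number to calibrate per architecture.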
+
+**Per-architecture override matrix (Plan A4-impl deliverable):**
+
+| Architecture | Per-block path | Crossover N | Batched path |
+|--------------|----------------|-------------|--------------|
+| Sapphire Rapids (AMX-BF16) | Loeffler 1D + transpose | ~64 | AMX TDPBF16PS via `bf16_tile_gemm` |
+| Skylake-X / Ice Lake (AVX-512F) | Loeffler 1D + transpose | ~32 | AVX-512 ZMM batched DCT |
+| Zen 4 (AVX-512) | Loeffler 1D + transpose | ~96 | AVX-512 ZMM (no AMX) |
+| Apple Silicon (NEON) | Loeffler 1D | ~256 | NEON 4×4 GEMM via `bf16_tile_gemm` NEON stub |
+
+**Cite as R-5 in A4-impl PR descriptions.**
+
+---
+
+### R-6 — SSD reformulation for VNNI block-match ME (from A:E-7)
+
+**Problem.** Merge cites "Block-matched ME via i8gemm" without the
+SSD reformulation math. That math is what *proves* ME goes through
+BLAS at all; without it the BLAS-synergy claim is decorative.
+
+**Resolution.** Restore the math and the speedup citation.
+
+**SAD (HEVC native) — not a GEMM:**
+
+```text
+SAD(A, B) = Σ_{ij} |A_{ij} - B_{ij}|
+```
+
+The absolute-value inside the sum has no matrix shape.
+
+**SSD (PR-X12 reformulation) — has a GEMM:**
+
+```text
+SSD(A, B) = Σ_{ij} (A_{ij} - B_{ij})²
+         = Σ A_{ij}² - 2·Σ A_{ij}·B_{ij} + Σ B_{ij}²
+         = ||A||² - 2·(A·B) + ||B||²
+                       ▲
+                       │
+                       └── this term IS a GEMM
+```
+
+**For N motion-vector candidates** at one reference block:
+
+```text
+Candidates  A_1, A_2, ..., A_N     each 16×16 = 256 pixels = 256-d vector
+Reference   B                       16×16 = 256-d vector
+
+Middle term: A_batch @ B            (N×256) @ (256×1) = N×1
+                                    one GEMV; or for batched ME
+                                    over multiple reference blocks,
+                                    N×K matrix.
+
+||A_i||²    precomputed once per candidate window
+||B||²      precomputed once per reference
+```
+
+**VNNI VPDPBUSD throughput:** 64 u8·i8 → i32 dot-product ops per cycle
+on Cascade Lake+ (the instruction multiplies unsigned by signed bytes).
+One 256-element dot product = 4 VPDPBUSD ops = ~4 cycles. Versus
+hand-tuned SAD via VPSADBW: ~8 cycles per 16-pixel row, so ~128 cycles
+per 16×16 SAD. **Speedup: ~32× (128 / 4) to ~50×, depending on batch
+dispatch.**
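+
+The identity behind the reformulation can be checked with a scalar Rust sketch. It is only a reference for the math: `ssd_direct`, `batched_ssd`, and `best_candidate` are hypothetical helper names, not the `batched_ssd_search` primitive, and nothing here uses VNNI.
+
+```rust
+/// Direct sum of squared differences between two equal-length blocks.
+pub fn ssd_direct(a: &[u8], b: &[u8]) -> i64 {
+    a.iter()
+        .zip(b)
+        .map(|(&x, &y)| {
+            let d = i64::from(x) - i64::from(y);
+            d * d
+        })
+        .sum()
+}
+
+/// Plain dot product; this is the term a GEMM/GEMV would compute.
+fn dot(a: &[u8], b: &[u8]) -> i64 {
+    a.iter().zip(b).map(|(&x, &y)| i64::from(x) * i64::from(y)).sum()
+}
+
+/// SSD of every candidate against one reference, via
+/// ||A_i||^2 - 2 (A_i . B) + ||B||^2.
+pub fn batched_ssd(candidates: &[Vec<u8>], reference: &[u8]) -> Vec<i64> {
+    let b_norm = dot(reference, reference); // computed once per reference
+    candidates
+        .iter()
+        .map(|a| dot(a, a) - 2 * dot(a, reference) + b_norm)
+        .collect()
+}
+
+/// Index of the lowest-SSD candidate (first one wins on ties).
+pub fn best_candidate(candidates: &[Vec<u8>], reference: &[u8]) -> Option<usize> {
+    let scores = batched_ssd(candidates, reference);
+    scores
+        .iter()
+        .enumerate()
+        .min_by_key(|&(_, &s)| s)
+        .map(|(i, _)| i)
+}
+```
+
+In a batched kernel the `dot(a, reference)` calls collapse into the single `A_batch @ B` product, and the `dot(a, a)` norms are the per-candidate precomputation mentioned above.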
+
+**Implication for PR-X12 E-7 (block-matched ME via i8gemm):** ME path
+in A4 or A5 ships as a `batched_ssd_search` primitive in
+`ndarray::hpc::blas_level2` that downstream consumers (video, splat scene flow) call
+into. **Not a codec-specific function** — landing in BLAS L2 keeps the
+factoring clean (codec uses the math; BLAS owns the math).
+
+**Cite as R-6 in any ME-path or splat scene-flow PR description.**
+
+---
+
+### R-7 — CTU partition as tropical-GEMM (from A:§13.3)
+
+**Problem.** Merge mentions "tropical-GEMM" in §11 Phase 3 but drops
+the `O(4^d) → O(d²)` complexity bound. That bound is the architectural
+justification for the `lance-graph::blasgraph` dependency.
+
+**Resolution.** Restore the complexity argument and pin the algorithm.
+
+**HEVC's recursive partition RDO:**
+
+```text
+For each CTU at depth d:
+  for each of 4 children:
+    recursive RDO at depth d+1
+  combine children's costs
+
+Time: O(4^d) where d = max split depth (4 in PR-X12, giving 256 nodes
+worst case per CTU)
+```
+
+**Tropical-semiring formulation (R-7 commitment):**
+
+```text
+1. Represent the 85-node tree as a DAG (parent → child edges).
+2. Edge weights W[parent, child] = ΔRDO cost of choosing child.
+3. Compute shortest-path costs to every node via matrix relaxation:
+
+     D ← min(D, D ⊗ W)     ← tropical-GEMM iteration, where
+                             (D ⊗ W)[i,j] = min_k (D[i,k] + W[k,j])
+
+   Repeat for d iterations where d = depth.
+4. Optimal partition = argmin_n D[root, n] for n in leaf nodes.
+
+Time: O(d² × |nodes|) using batched tropical-GEMM on
+`lance-graph::blasgraph`. For d=4, |nodes|=85: O(16 × 85) = O(1360) ops per CTU.
+Vs. O(4^4 × |nodes|) = O(21,760) ops for the naive recursive RDO.
+```
+
+**Speedup: ~16×.** For a 4K frame at ~132K 8×8 leaves (2,040 CTUs at
+64×64, per R-11), this is the difference between ~4 ms and ~64 ms per
+frame just for partition RDO. At 60 fps, that's the difference between
+fitting and missing the latency budget.
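+
+The relaxation step is an ordinary min-plus matrix product. Below is a small scalar Rust sketch of that iteration on dense matrices; it is an assumption-laden illustration (the names `INF`, `min_plus`, `relax`, and `shortest_costs` are hypothetical, and it does not use or mirror the `blasgraph` API).
+
+```rust
+/// "No edge" in the (min, +) semiring.
+pub const INF: u32 = u32::MAX;
+
+/// Min-plus product: C[i][j] = min_k (A[i][k] + B[k][j]).
+pub fn min_plus(a: &[Vec<u32>], b: &[Vec<u32>]) -> Vec<Vec<u32>> {
+    let n = a.len();
+    let m = b[0].len();
+    let mut c = vec![vec![INF; m]; n];
+    for i in 0..n {
+        for k in 0..b.len() {
+            if a[i][k] == INF {
+                continue; // INF is absorbing: no path through k
+            }
+            for j in 0..m {
+                if b[k][j] == INF {
+                    continue;
+                }
+                c[i][j] = c[i][j].min(a[i][k].saturating_add(b[k][j]));
+            }
+        }
+    }
+    c
+}
+
+/// One relaxation: D <- min(D, D (x) W), with an elementwise outer min.
+pub fn relax(d: &[Vec<u32>], w: &[Vec<u32>]) -> Vec<Vec<u32>> {
+    let p = min_plus(d, w);
+    d.iter()
+        .zip(&p)
+        .map(|(dr, pr)| dr.iter().zip(pr).map(|(&x, &y)| x.min(y)).collect())
+        .collect()
+}
+
+/// Run `depth` relaxations starting from the tropical identity
+/// (0 on the diagonal, INF elsewhere).
+pub fn shortest_costs(w: &[Vec<u32>], depth: usize) -> Vec<Vec<u32>> {
+    let n = w.len();
+    let mut d = vec![vec![INF; n]; n];
+    for i in 0..n {
+        d[i][i] = 0;
+    }
+    for _ in 0..depth {
+        d = relax(&d, w);
+    }
+    d
+}
+```
+
+After `depth` iterations, `d[root][n]` holds the cheapest cost of any path of at most `depth` edges from the root to node `n`, which is the quantity step 4 takes the argmin over.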
+
+**Why this needs `lance-graph::blasgraph`:** Standard BLAS GEMM uses the
+(×, +) semiring. Tropical uses the (+, min) semiring. blasgraph already
+ships tropical-GEMM kernels. No new code in ndarray; cross-repo dep
+from ndarray-codec → lance-graph::blasgraph (after Plan H extraction,
+this is dep-allowed because ndarray-codec is a sibling, not the bottom).
+
+**Plan A6 RDO (1 week) ships this.** The λ-RDO knob (per A:§10.3) and
+the tropical-GEMM partition solver are the same kernel: λ scales the
+edge weights, the relaxation computes the optimal mode tree.
+
+**Cite as R-7 in Plan A6 PR description; required reading for anyone
+touching `RdoConfig` or `predict_intra` policy.**
+
+---
+
+## 6. Restored detail from session B
+
+### R-8 — Plan G framing: confidence-degradation gate
+
+**Problem.** Merge promoted my B:D-STACK-13 (no multi-domain bench
+harness) to Plan G + M:E-H but lost the rationale for *why* it goes
+in Phase 0 vs Phase 1.
+
+**Resolution.** Make the framing explicit in canon and in this doc.
+
+**46 debt items across A's T-1..T-23 and B's D-CODEC-1..10 +
+D-STACK-1..13. 45 of them degrade either performance or correctness:**
+
+- A:T-1, T-2, T-7: correctness (already-fixed CodeRabbit findings) or
+  performance (SIMD-batched encode)
+- B:D-CODEC-1..10: correctness (cross-tier, RDO, stream framing) or
+  performance (no SIMD batch)
+- B:D-STACK-1..12: performance (block-size mismatch, SIMD lookup) or
+  correctness (sacred file, mandatory AVX-512)
+
+**One debt item — B:D-STACK-13 — degrades *confidence*:**
+
+> Without a single-binary four-loads benchmark, the entire architectural
+> claim is unproven. Every other debt item degrades performance or
+> correctness; this one degrades **confidence**. (B's original framing.)
+
+**Implication for sequencing.** Performance/correctness debt is
+incremental and recoverable; confidence debt is foundational and
+self-reinforcing. A single performance regression makes the codec
+slow; a single confidence gap makes every other resolution
+unverifiable. Plan G must precede A7 because:
+
+1. If A7's trait shape is wrong, fixing it after A7 ships is 4-8x
+   the cost of getting it right under bench pressure
+2. If the architectural claim is wrong, no amount of A7 perf work
+   makes it right
+3. "Two weeks of bench-harness work front-loaded saves six months of
+   trait-shape rework" — original B framing, preserved.
+
+**Plan G is the falsifiability gate.** Without it, M:H-1 through
+M:H-9 are claims. With it, they are demonstrably true or
+demonstrably false against the R-4 thresholds.
+
+**Cite as R-8 in Plan G's PR body; the framing belongs in the body,
+not buried in commit messages.**
+
+---
+
+### R-9 — `MergeDir` is topology-FREE, not just topology-generic
+
+**Problem.** Merge folds B:E1 into M:E-B (`trait CurveOrder`) but
+weakens the claim. M:E-B says "different curve, same kernel" — implies
+a curve still exists at the codec layer. B:E1's stronger claim: the
+4-way alphabet has *no spatial semantics at all* at the codec layer.
+
+**Resolution.** Pin the topology-free contract on `PredictiveSignal`.
+
+**The codec layer sees neighbours as `(slot_0, slot_1, slot_2, slot_3)`.
+Period.** No `MergeDir::North/East/West/South` semantic labels exist
+inside the codec. Consumers attach semantic labels *outside* the codec
+boundary.
+
+**`PredictiveSignal::neighbours` contract:**
+
+```rust
+pub trait PredictiveSignal {
+    /// Returns the 4 neighbour slots in implementation-defined order.
+    /// The codec NEVER interprets "slot 0" as "north" or any direction.
+    ///
+    /// Implementor semantic:
+    /// - cognitive: slot 0 = N, slot 1 = E, slot 2 = W, slot 3 = S
+    /// - splat:     slot 0 = prev-Morton, slot 1 = next-Morton,
+    ///              slot 2 = parent-LOD, slot 3 = child-LOD
+    /// - attention: slot 0 = prev-token, slot 1 = next-token,
+    ///              slot 2 = prev-head,  slot 3 = next-head
+    /// - gradient:  slot 0 = prev-iter,  slot 1 = next-iter,
+    ///              slot 2 = prev-layer, slot 3 = next-layer
+    ///
+    /// The codec writes `MergeDir = slot index (0..=3)`. Consumers
+    /// reinterpret on decode. No spatial semantic crosses the boundary.
+    fn neighbours(&self) -> [Option<Self::NeighbourRef<'_>>; 4];
+
+    type NeighbourRef<'a> where Self: 'a;
+}
+```
+
+**Implication for Plan I (PredictiveSignal trait, 3 days, Phase 0).**
+
+- The codec body never has "`if dir == North { ... }`" anywhere
+- The 4-slot neighbour array is treated as an opaque categorical
+- `MergeDir` enum becomes a *consumer-side* name for slot indices,
+  exposed via `mode.rs::pack_merge_dir(MergeDir) -> u8` but never used
+  in the predict / RDO / stream / rANS paths
+
+**Why this is stronger than M:E-B.** `CurveOrder` says "different curve,
+same kernel" — the curve is an attribute of the consumer's data layout.
+Topology-free goes further: even *with* a curve, the codec doesn't see
+it. The curve exists only in `nearest_basin` resolution (consumer-side)
+and `escape_vector_decode` (consumer-side).
+
+**Falsifies if.** Any codec-body code references slot 0 / 1 / 2 / 3 by
+semantic name (north / east / etc.). The grep for that pattern is the
+audit. Currently `predict.rs` uses such names in tests but never in
+production code; the production path is already topology-free. Keep it
+that way through A6 / A7 / A8.
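+
+To make the boundary concrete, here is a minimal sketch of a codec-side helper that works purely on slot indices. It uses a cut-down copy of the trait (only `neighbours`), and `Token` and `merge_slot` are hypothetical names invented for the example; the consumer decides that slot 0 means "previous token", and the codec never learns that.
+
+```rust
+/// Cut-down version of the trait: only the neighbour contract.
+pub trait PredictiveSignal {
+    type NeighbourRef<'a>
+    where
+        Self: 'a;
+    fn neighbours(&self) -> [Option<Self::NeighbourRef<'_>>; 4];
+}
+
+/// Hypothetical consumer: one token in a sequence.
+/// Consumer-side meaning: slot 0 = prev-token, slot 1 = next-token.
+pub struct Token<'s> {
+    pub seq: &'s [u32],
+    pub idx: usize,
+}
+
+impl<'s> PredictiveSignal for Token<'s> {
+    type NeighbourRef<'a> = u32 where Self: 'a;
+
+    fn neighbours(&self) -> [Option<Self::NeighbourRef<'_>>; 4] {
+        let prev = self.idx.checked_sub(1).and_then(|i| self.seq.get(i)).copied();
+        let next = self.seq.get(self.idx + 1).copied();
+        [prev, next, None, None]
+    }
+}
+
+/// Codec-side: returns the first slot index whose neighbour satisfies
+/// `is_match`. Only the index (0..=3) leaves this function; no direction
+/// or topology is named anywhere.
+pub fn merge_slot<'c, S, F>(cell: &'c S, is_match: F) -> Option<u8>
+where
+    S: PredictiveSignal + 'c,
+    F: Fn(&S::NeighbourRef<'c>) -> bool,
+{
+    let slots = cell.neighbours();
+    slots
+        .iter()
+        .position(|n| n.as_ref().map_or(false, |r| is_match(r)))
+        .map(|i| i as u8)
+}
+```
+
+A decoder for this consumer would map the returned index back to "previous" or "next" on its own side of the boundary.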
+
+**Cite as R-9 in Plan I PR description and in any future codec-body PR
+that touches `predict_intra`.**
+
+---
+
+### R-10 — Sub-1-bit/Gaussian math breakdown (from B:HG2)
+
+**Problem.** B:HG2 / M:H-6 claim sub-1-bit/Gaussian 3DGS compression.
+Neither my original nor the merge back-of-envelopes this. The claim
+floats without justification.
+
+**Resolution.** Commit the factor breakdown; mark sub-1-bit as
+*stretch*, ~4 bits/Gaussian as the *floor* (R-4 quality floor).
+
+**Stock 3DGS-PLY baseline (Inria trim):** ~50 bytes/Gaussian.
+
+**Factor 1: k-means palette mode coding (≈20×)**
+
+Most Gaussians in a trained scene cluster around a few hundred
+"archetype" (color, scale, opacity) tuples. After k-means basin
+assignment + Skip-heavy mode coding (flat regions all Skip):
+
+- Stock: 50 bytes/Gaussian = 400 bits
+- After mode coding: ~20 bits/Gaussian average (Skip=16, Merge=24,
+  Delta=24, Escape=48; with 60% Skip, 20% Merge, 15% Delta, 5% Escape):
+
+```text
+0.60 × 16 + 0.20 × 24 + 0.15 × 24 + 0.05 × 48 = 9.6 + 4.8 + 3.6 + 2.4 = 20.4 bits
+```
+
+After this factor: **~20 bits/Gaussian = 2.5 bytes/Gaussian = 20× over PLY.**
+
+**Factor 2: rANS entropy coding (≈3×)**
+
+Mode-distribution is heavy-tailed (60% Skip, 20% Merge, etc.). rANS
+entropy of that distribution:
+
+```text
+H = -(0.60 log₂ 0.60 + 0.20 log₂ 0.20 + 0.15 log₂ 0.15 + 0.05 log₂ 0.05)
+  = -(0.60 × -0.737 + 0.20 × -2.322 + 0.15 × -2.737 + 0.05 × -4.322)
+  = 0.442 + 0.464 + 0.411 + 0.216
+  = 1.533 bits per mode tag
+```
+
+Vs 2 bits flat for the mode tag. Savings on the mode field: 2 → 1.5 bits.
+Savings on the basin field (heavy-tail): 12 → ~6 bits. Savings on the
+8-bit delta (also heavy-tail): 8 → ~5 bits.
+
+Per-Gaussian average after rANS: ~7 bits.
+
+After this factor: **~7 bits/Gaussian = ~2.9× over factor-1 = ~57× over PLY.**
+
+**Factor 3: SH-residual cross-LOD prediction (≈2×)**
+
+L=2 and L=3 SH coefficients are highly predictable from L=0 and L=1.
+A linear basis (R-1's `Basis<T>`) for SH spectral prediction reduces
+L=2/L=3 residuals to near-zero in flat regions. Skip-mode dominates
+SH ≥ L=2 coefficients in trained scenes.
+
+Per-Gaussian average after SH cross-prediction: ~4 bits.
+
+After this factor: **~4 bits/Gaussian = ~100× over PLY.**
+
+**Where the stretch comes from (sub-1-bit):**
+
+- **Factor 4a (≈2×)**: Per-asset codebook training (offline). Today
+  the basin codebook builds per-frame. For 3DGS, a single trained
+  scene = one asset = one codebook. Offline-trained codebooks
+  eliminate per-frame codebook overhead in the wire format. Gets to
+  ~2 bits/Gaussian.
+- **Factor 4b (≈2×)**: Higher-order rANS context modeling (CABAC-style
+  or tiny-transformer per A:E-9). Per-mode-given-neighbour-mode
+  probabilities are far more concentrated than per-mode marginals.
+  Gets to ~1 bit/Gaussian.
+- **Factor 4c (≈2×)**: Inter-frame coding for video-of-3DGS scenes
+  (Plan E2, post-MVP). Per-frame delta from previous frame's
+  reconstruction. Gets to ~0.5 bit/Gaussian.
+
+**Honest near-term target: ~4 bits/Gaussian (factor 1+2+3).** That's
+**100× over PLY trim, ~3.3× the R-4 floor of 30×.**
+
+**Stretch target: ~1 bit/Gaussian.** Requires factor 4a (offline
+codebook training, ~1 week) + 4b (CABAC-style context, ~2 weeks) =
+3 weeks beyond Plan E baseline.
+
+**Sub-1-bit target: ~0.5 bit/Gaussian.** Requires factor 4c (inter-frame
+coding), which is Plan E2 or later.
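+
+The arithmetic in this breakdown is easy to re-derive. The following Rust sketch recomputes the expected mode-coded size, the mode-tag entropy, and the ratios over the 400-bit PLY baseline; the mode mix and bit costs are the ones stated above, while the constant and function names are hypothetical.
+
+```rust
+/// (probability, bits) per mode, in the order Skip, Merge, Delta, Escape.
+pub const MODES: [(f64, f64); 4] = [(0.60, 16.0), (0.20, 24.0), (0.15, 24.0), (0.05, 48.0)];
+
+/// Stock PLY baseline: 50 bytes/Gaussian.
+pub const PLY_BITS: f64 = 400.0;
+
+/// Expected bits/Gaussian after mode coding (factor 1).
+pub fn expected_bits() -> f64 {
+    MODES.iter().map(|&(p, bits)| p * bits).sum()
+}
+
+/// Shannon entropy of the mode tag in bits (the rANS bound for that field).
+pub fn mode_entropy() -> f64 {
+    -MODES.iter().map(|&(p, _)| p * p.log2()).sum::<f64>()
+}
+
+/// Compression ratio over the PLY baseline for a given bits/Gaussian.
+pub fn ratio_over_ply(bits_per_gaussian: f64) -> f64 {
+    PLY_BITS / bits_per_gaussian
+}
+```
+
+With these inputs, `expected_bits()` is 20.4, `mode_entropy()` is about 1.533, and `ratio_over_ply` gives roughly 20×, 57×, and 100× at 20, 7, and 4 bits/Gaussian respectively.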
+
+**Cite as R-10 in Plan E PR description and Plan G's `--mode splat`
+threshold doc.**
+
+---
+
+## 7. New commitments missing from both originals and from the merge
+
+### R-11 — Per-CTU encoder latency budget
+
+**Problem.** Neither doc nor the merge states ms-per-CTU at 60 fps 4K.
+Without it, B:D-CODEC-8 / A:T-7 (no SIMD-batched encode) have no
+falsifiability criterion.
+
+**Resolution.** Commit the budget; pin the SIMD-batched-encode debt
+to the budget.
+
+**4K @ 60 fps frame budget:**
+
+```text
+4K = 3840 × 2160 = 8.3 M pixels
+60 fps = 16.67 ms/frame
+At 8×8 leaf granularity (HEVC's smallest CU; the unit at which the
+encoder's inner-loop work is paid):
+                              132,710 leaves/frame
+                              (= 2,040 CTUs/frame at 64×64, × ~64
+                               leaves/CTU at maximum split depth;
+                               2,040 × 64 = 130,560 (the unpadded frame
+                               is 3840·2160/64 = 129,600), with ~1.6 %
+                               bias for chroma alignment)
+Per-leaf budget: 16.67 ms / 132,710 = 125 ns/leaf
+```
+
+**Encoder per-leaf breakdown (scalar reference, current):**
+
+| Stage | Scalar cost | SIMD-batched target |
+|-------|-------------|---------------------|
+| basin lookup (4096 entries, Hamming dist) | ~800 ns | ~50 ns (SIMD batched) |
+| mode decide (Skip → Merge → Delta → Escape) | ~80 ns | ~80 ns (already cheap) |
+| header pack (`pack_header`) | ~5 ns | ~5 ns |
+| transform (A4, 8×8 DCT-II butterfly) | ~30 ns | ~30 ns |
+| quantize (i8 round) | ~5 ns | ~5 ns |
+| rANS encode (A7) | ~40 ns | ~40 ns |
+| **Total per-leaf** | **~960 ns** | **~210 ns** |
+
+**At scalar reference (960 ns/leaf): 4K @ 60 fps requires 132,710 ×
+960 ns = 127 ms/frame. Misses 60 fps by 7.6×.**
+
+**At SIMD-batched (210 ns/leaf): 132,710 × 210 ns = 28 ms/frame. Misses
+60 fps by 1.7×; needs further work but in the same order of magnitude.**
+
+**To hit 60 fps 4K real-time** requires the SIMD-batched-encode path
+to land. **This promotes B:D-CODEC-8 / A:T-7 from P2 to P1.** Plan
+A4-impl and Plan A6 should both ship with SIMD-batched paths, not
+scalar reference only.
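+
+The budget numbers above follow from a few multiplications, sketched here in Rust so they can be re-run when a stage cost changes. The stage costs are the table's; the constant and function names are hypothetical.
+
+```rust
+/// Leaves per 4K frame at 8x8 granularity, as stated in R-11.
+pub const LEAVES_PER_FRAME_4K: f64 = 132_710.0;
+
+/// Per-leaf stage costs in ns, in table order: basin lookup, mode decide,
+/// header pack, transform, quantize, rANS encode.
+pub const SCALAR_NS: [f64; 6] = [800.0, 80.0, 5.0, 30.0, 5.0, 40.0];
+pub const SIMD_NS: [f64; 6] = [50.0, 80.0, 5.0, 30.0, 5.0, 40.0];
+
+/// Wall-clock budget for one frame, in milliseconds.
+pub fn frame_budget_ms(fps: f64) -> f64 {
+    1000.0 / fps
+}
+
+/// Budget per leaf, in nanoseconds.
+pub fn per_leaf_budget_ns(fps: f64, leaves: f64) -> f64 {
+    frame_budget_ms(fps) * 1.0e6 / leaves
+}
+
+/// Encode time for one frame given the per-leaf stage costs, in ms.
+pub fn frame_time_ms(leaves: f64, stage_ns: &[f64]) -> f64 {
+    leaves * stage_ns.iter().sum::<f64>() / 1.0e6
+}
+
+/// How many times over budget a frame time is (1.0 = exactly on budget).
+pub fn overrun(frame_ms: f64, fps: f64) -> f64 {
+    frame_ms / frame_budget_ms(fps)
+}
+```
+
+Plugging in the table gives about 127 ms/frame (7.6× over) for the scalar path and about 28 ms/frame (1.7× over) for the SIMD-batched path at 60 fps.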
+
+**Implication for Plan G.** The `--mode video` threshold (R-4)
+includes a latency assertion: total encode time for the Big Buck Bunny
+1080p clip must complete within (clip duration × 0.5). At 1080p that's
+~32,400 leaves/frame × 210 ns × 30 fps = ~204 ms/sec, well within
+budget. 4K is the stretch target.
+
+**Cite as R-11 in any encoder-path PR description; the latency
+budget is the gate that determines whether SIMD-batched encode is P0
+or P1.**
+
+---
+
+### R-12 — Streaming-buffer flush granularity
+
+**Problem.** Neither doc nor the merge says: per-CTU? per-frame? per-GOP?
+Different answers make Plan A8 substantially different shapes.
+
+**Resolution.** Commit per-CTU as the default; per-bucket for Plan F.
+
+**Per-CTU flush (committed default; CTU = 64×64 cells, so 4096 cells/CTU,
+2,040 CTUs/frame at 4K and ~510 CTUs/frame at 1080p):**
+
+```text
+Buffer size:   ~12 KB per CTU
+                 = 4096 cells × avg 3 bytes (mode-distribution per R-10)
+Flush rate:    ~122,400 flushes/sec at 4K 60 fps  (2,040 CTUs/frame × 60)
+               ~30,600 flushes/sec at 1080p 60 fps (510 CTUs/frame × 60)
+Latency:       sub-ms per CTU; consumer can start decoding the first
+               CTU before encoder finishes the frame
+```
+
+**Why per-CTU and not per-frame:**
+
+- per-frame buffer = ~1.5 MB; latency cost = 16.67 ms (one frame
+  latency added to encode-decode pipeline)
+- per-GOP buffer = ~25 MB at 16-frame GOP; latency = 267 ms,
+  unacceptable for live attention / KV-cache use cases
+- per-CTU = ~12 KB; latency = ~8 µs (16.67 ms / 2,040 CTUs)
+
+**Per-bucket override for Plan F (federated SGD):**
+
+```text
+Bucket = 4096 weights (one BlockedGrid L1 block of gradients)
+Buffer size: ~12 KB per bucket (same envelope as per-CTU)
+Flush rate:  per-iteration, per-bucket
+Latency:     bucket-local; all-reduce happens after bucket flush
+```
+
+**Wire format implication:** A8 frame header has a `FlushUnit` tag
+(2-bit field):
+
+```text
+FlushUnit::Ctu      → 00 (default, video / splat / attention)
+FlushUnit::Bucket   → 01 (gradient SGD)
+FlushUnit::Frame    → 10 (offline batch encode)
+FlushUnit::Reserved → 11
+```
+
+**Plan A8 implementation note:** Flush granularity lives in the frame
+header alongside `ConsumerProfile` (R-2) and the per-frame basin
+codebook ref. Stream readers route on `FlushUnit` for buffer
+allocation.
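+
+The 2-bit tag maps naturally onto a small enum. A minimal Rust sketch, assuming the tag is handled as the low two bits of a byte (the bit position inside the real A8 header is not specified here, and `to_bits` / `from_bits` are hypothetical method names):
+
+```rust
+/// Flush granularity tag from the A8 frame header (2-bit field).
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+#[repr(u8)]
+pub enum FlushUnit {
+    Ctu = 0b00,    // default: video / splat / attention
+    Bucket = 0b01, // gradient SGD
+    Frame = 0b10,  // offline batch encode
+}
+
+impl FlushUnit {
+    /// Encode as the 2-bit wire value.
+    pub fn to_bits(self) -> u8 {
+        self as u8
+    }
+
+    /// Decode a wire value. The reserved pattern 0b11 and anything wider
+    /// than two bits is rejected rather than guessed at.
+    pub fn from_bits(bits: u8) -> Option<Self> {
+        match bits {
+            0b00 => Some(FlushUnit::Ctu),
+            0b01 => Some(FlushUnit::Bucket),
+            0b10 => Some(FlushUnit::Frame),
+            _ => None,
+        }
+    }
+}
+```
+
+Returning `None` for `0b11` keeps the reserved value available for a future unit without silently aliasing it to an existing one.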
+
+**Cite as R-12 in Plan A8 PR description and Plan F PR description.**
+
+---
+
+### R-13 — Basin codebook distribution policy for Plan F
+
+**Problem.** Plan F is 2 weeks × 2 workers; the merge doesn't address
+whether the 4096-entry codebook is replicated across workers or
+partitioned. Either answer is fine; not deciding makes Plan F
+undefined.
+
+**Resolution.** Commit Option A (per-shard codebook) for Plan F v1;
+list alternatives as Phase 3 exploration.
+
+**Option A — Per-shard codebook (Plan F v1, committed):**
+
+```text
+Each worker holds 1 parameter shard, builds its own 4096-entry codebook
+over its shard, encodes its gradients against its own codebook.
+Wire format: each LeafCu carries (worker_id, basin_idx) in the per-frame
+escape vector lookup. No cross-worker comm during codebook build.
+
+Pro:  zero cross-worker codebook-build comm
+      worker independence
+      no global codebook drift
+Con:  loses cross-shard correlation (Merge-mode never fires across shards)
+      may compress worse than Option B by 1.5-2× per parameter
+```
+
+**Wire format extension for Option A:**
+
+```text
+Frame header (per worker, per iteration):
+  FlushUnit::Bucket
+  ConsumerProfile::Gradient
+  WorkerId: u16              ← NEW: per-shard codebook index
+  CodebookHash: u64          ← integrity check
+  rANS frequency table
+```
+
+**Option B — Replicated codebook (alternative, Phase 3):**
+
+```text
+One global 4096-entry codebook, all workers consume identical codebook.
+Cross-worker codebook-build comm: one all-reduce per epoch.
+
+Pro:  Merge-mode fires across shards (cross-parameter correlation)
+      better compression by 1.5-2×
+Con:  cross-worker codebook-build comm cost
+      codebook staleness if epoch boundary misses a parameter
+      complex resync after worker failure
+```
+
+**Option C — Hierarchical codebook (Phase 3+):**
+
+```text
+Per-shard codebook + global "override" codebook (256 entries) for the
+heavy-hitters that cross shards.
+LeafCu first checks global override; falls through to per-shard.
+
+Pro:  best compression in expectation (combines A and B)
+Con:  complex protocol; requires global hot-set tracking
+      worker-failure recovery non-trivial
+```
+
+**Plan F v1 commits Option A.** v2 (post-stability) evaluates Option B
+empirically; v3 (research-grade) tries Option C.
+
+**Falsifies if.** Option A on BERT-large fine-tune fails to clear the
+R-4 gradient threshold (8× compression at <0.5% loss delta). At that
+point, Plan F v1 escalates to Option B in a follow-up PR.
+
+**Cite as R-13 in Plan F PR description.**
+
+---
+
+## 8. The canonical contracts — concrete trait signatures
+
+All three plug-points (per M:E-E), plus a lower-priority fourth, get
+concrete signatures here. These are the contracts Plans G / H / I /
+A4-design commit to in Phase 0.
+
+```rust
+// ────────────────────────────────────────────────────────────────────
+// Plug-point 1: PredictiveSignal — what the consumer ships
+// ────────────────────────────────────────────────────────────────────
+
+/// Implemented by each domain's per-element data type:
+/// - cognitive cell `Fingerprint`
+/// - 3D Gaussian splat tuple
+/// - attention slot `(Q, K)` pair
+/// - gradient weight `(param_id, ∂L/∂w)`
+///
+/// Single trait surface; ~50 LoC per consumer impl.
+pub trait PredictiveSignal: Copy + Eq {
+    /// Basin codebook entry type. Often the same as `Self` (e.g.
+    /// cognitive: Fingerprint ↔ Fingerprint), but consumers like
+    /// gradient may use a tuple like `(GradientPattern, magnitude)`.
+    type Basin: Copy + Eq;
+
+    /// Residual after subtracting the nearest basin. Should fit
+    /// in i8 when "Delta-mode worthy".
+    type Residual: Copy;
+
+    /// What lives in the per-frame escape vector. Stock 3DGS:
+    /// `[f16; 48]` for SH≥L=2 + (μ, scale, rot, opacity, color).
+    /// Cognitive: `u64` Fingerprint.
+    /// Attention: `[f16; head_dim]`.
+    /// Gradient: `f32`.
+    type Escape: Copy;
+
+    /// Find the nearest basin in the codebook.
+    /// Returns (basin_idx, residual). basin_idx must be ≤ MAX_BASIN_IDX.
+    fn nearest_basin(&self, codebook: &[Self::Basin]) -> (u16, Self::Residual);
+
+    /// Is this residual small enough for Delta mode (fits i8)?
+    fn fits_delta(residual: &Self::Residual) -> bool;
+
+    /// Encode residual as the u8 byte that goes into the LeafCu.
+    fn pack_residual(residual: &Self::Residual) -> u8;
+
+    /// Type-erased neighbour reference (consumer-defined topology).
+    /// Codec NEVER interprets slot semantics — per R-9.
+    type NeighbourRef<'a>: Copy where Self: 'a;
+    fn neighbours(&self) -> [Option<Self::NeighbourRef<'_>>; 4];
+
+    /// Convert self into the escape payload (for Escape-mode encode).
+    fn to_escape(&self) -> Self::Escape;
+}
+
+// ────────────────────────────────────────────────────────────────────
+// Plug-point 2: Basis<T> + LinearReduce — per R-1
+// ────────────────────────────────────────────────────────────────────
+
+pub trait Basis<T: Copy> {
+    fn dim(&self) -> usize;
+    fn apply(&self, src: &[T], dst: &mut [T]);
+    fn invert(&self, src: &[T], dst: &mut [T]);
+}
+
+pub trait LinearReduce {
+    type Symbol: Copy;
+    type Output;
+    type Basis: Basis<Self::Symbol>;
+
+    fn reduce(&self, src: &[Self::Symbol], basis: &Self::Basis) -> Self::Output;
+    fn reduce_batch(
+        &self,
+        src: &[&[Self::Symbol]],
+        basis: &Self::Basis,
+    ) -> Vec<Self::Output>;
+}
+
+// Concrete impls (each ~30-80 LoC, lives in consumer crate):
+//   - IdentityBasis<T>           in ndarray-codec
+//   - DctIIBasis<const N>        in ndarray::hpc::fft
+//   - HadamardBasis<const N>     in ndarray::hpc::fft
+//   - AdamPrecondBasis           in burn-codec
+//   - KFACBlockBasis             in burn-codec
+//   - ShSpectralBasis<const L>   in ndarray::hpc::splat3d
+//
+//   - AlphaCompositeReduce       in ndarray::hpc::splat3d
+//   - RansEncodeReduce           in ndarray-codec::ans
+//   - SumReduce                  in ndarray-codec::reduce
+//   - SoftmaxReduce              in ndarray::hpc::activations
+
+// ────────────────────────────────────────────────────────────────────
+// Plug-point 3: CurveOrder — per M:E-B
+// ────────────────────────────────────────────────────────────────────
+
+/// Space-filling curve that linearises a multi-dim consumer payload
+/// into 1D for codec processing. The codec sees only the 1D stream.
+///
+/// Concrete impls (each ~20-40 LoC):
+///   - RasterScan<const W, const H>  for cognitive cells
+///   - MortonOrder<const D>          for 3DGS in 3D
+///   - HilbertOrder<const D>         for splat in 3D (alternative)
+///   - TokenSequence                 for attention
+///   - LayerSequence                 for gradient
+pub trait CurveOrder<const N: usize> {
+    /// Total points on the curve.
+    fn len(&self) -> usize;
+    /// (i+1)-th neighbour of point i along the curve, or None at boundary.
+    fn next(&self, i: usize) -> Option<usize>;
+    /// Per-point coordinate (in consumer's native dimensionality).
+    fn coord(&self, i: usize) -> [i32; N];
+}
+
+// ────────────────────────────────────────────────────────────────────
+// Plug-point 4 (lower priority, M:E new): RdoMetric
+// ────────────────────────────────────────────────────────────────────
+
+pub trait RdoMetric {
+    type Distortion: Copy + PartialOrd;
+    fn distortion(&self, reconstructed: &[u8], original: &[u8]) -> Self::Distortion;
+    fn rate(&self, bits_used: usize) -> f32;
+    fn cost(&self, d: Self::Distortion, r: f32, lambda: f32) -> f32;
+}
+
+// Concrete impls (consumer crate):
+//   - PsnrMetric       for video
+//   - SsimMetric       for splat
+//   - LossDeltaMetric  for gradient
+//   - KlDivergence     for attention
+```
+
+**The trait surface is the contract.** Plan I (3 days, Phase 0)
+implements `PredictiveSignal` for cognitive cells as the reference
+consumer. Plan A4-design (1 day) commits the `Basis<T>` + `LinearReduce`
+shapes. Plans D / E / F each ship one `impl PredictiveSignal for ...`
+plus their `CurveOrder` / `Basis` / `RdoMetric` impls.
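+
+As an indication of how small a consumer-side impl can be, here is a sketch of `RasterScan` against the `CurveOrder` trait. The trait is copied from the block above so the example stands alone; the row-major layout and the `walk` helper are assumptions made for illustration.
+
+```rust
+pub trait CurveOrder<const N: usize> {
+    /// Total points on the curve.
+    fn len(&self) -> usize;
+    /// Successor of point i along the curve, or None at the boundary.
+    fn next(&self, i: usize) -> Option<usize>;
+    /// Per-point coordinate (in consumer's native dimensionality).
+    fn coord(&self, i: usize) -> [i32; N];
+}
+
+/// Row-major scan over a W x H grid (assumed layout for cognitive cells).
+pub struct RasterScan<const W: usize, const H: usize>;
+
+impl<const W: usize, const H: usize> CurveOrder<2> for RasterScan<W, H> {
+    fn len(&self) -> usize {
+        W * H
+    }
+
+    fn next(&self, i: usize) -> Option<usize> {
+        if i + 1 < W * H {
+            Some(i + 1)
+        } else {
+            None
+        }
+    }
+
+    fn coord(&self, i: usize) -> [i32; 2] {
+        [(i % W) as i32, (i / W) as i32] // [x, y]
+    }
+}
+
+/// Codec-side walk: follows `next` from point 0 and collects coordinates.
+/// It only sees the 1D order, never the grid shape.
+pub fn walk<C: CurveOrder<N>, const N: usize>(curve: &C) -> Vec<[i32; N]> {
+    let mut out = Vec::with_capacity(curve.len());
+    let mut cur = if curve.len() > 0 { Some(0) } else { None };
+    while let Some(i) = cur {
+        out.push(curve.coord(i));
+        cur = curve.next(i);
+    }
+    out
+}
+```
+
+A Morton or Hilbert order would change only `next` and `coord`; `walk` stays as it is, which is the "different curve, same kernel" property M:E-B describes.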
+
+---
+
+## 9. Falsifiability matrix
+
+Every load-bearing claim from the canon and from this doc has a
+test, a metric, and a pass condition. The matrix is the audit
+that decides whether each holy-grail claim is demonstrated.
+
+| Claim | Source | Test | Metric | Pass condition |
+|-------|--------|------|--------|----------------|
+| M:H-1 / HG1 (4 loads → 1 codec) | canon | Plan G binary runs all 4 modes | 4 Lance columns emitted | All 4 emit successfully |
+| M:H-2 / H-2 (transform = optimizer) | canon | A4 + burn-codec ship | AdamPrecondBasis impls LinearReduce | Bench Adam-as-codec on BERT-glue |
+| M:H-3 / HG3 (bit-exact attention) | canon | Plan D ships | KV cache compresses + RULER score | ≥4× ratio, ≤0.5% accuracy loss |
+| M:H-4 / H-4 (Shannon-optimal grad) | canon | Plan F + signSGD bench | rANS frequency-table entropy match | Empirical entropy within 5% of H(p) |
+| M:H-5 / HG4 (ZeRO generalisation) | canon | Plan F + DeepSpeed bench | 8-16× compression at 16+ workers | ≥8× at ≤0.5% loss delta |
+| M:H-6 / HG2 (sub-1-bit/Gaussian) | canon + R-10 | Plan E + offline codebook | bits/Gaussian on Mip-NeRF 360 | Near: ≤4 bit; stretch: ≤1 bit |
+| M:H-7 / HG5 (Lance substrate) | canon | Plan H + Plan I land | 4-load Lance columns same schema | Schema check; per-load `read_codec_lance` |
+| M:H-8 / H-6 (64×64 universal) | canon | M:E-G `Ctu<const N>` | Compiles for N ∈ {16, 32, 64} | All 3 sizes pass codec tests |
+| M:H-9 / HG6 (splat3d × x265 one lib) | canon | Plan E + ndarray-codec | 1 binary, 1 dep tree | Binary size <10 MB; deps tree-clean |
+| M:H-NEW-1 (single binary, 4 loads) | canon | Plan G binary | `codec-bench --mode {video,splat,kv,grad}` | Executes each in <60s on ref data |
+| M:H-NEW-2 (~2 KLoC stack) | canon + R-3 | LoC audit per PR | Generic-codec LoC | <1500 LoC; per-consumer <200 LoC |
+| R-1 (LinearReduce shape correct) | this doc | Plan A7 builds against the trait | Trait isn't subclassed by A7 | A7 uses public trait surface only |
+| R-2 (bit 14 consumer-typed) | this doc | Plan A8 ships ConsumerProfile demux | All 4 profile decoders run | 4 profile-specific tests pass |
+| R-3 (LoC envelope) | this doc | LoC audit per PR | Cumulative generic LoC | <1500 after A4-A8 |
+| R-4 (Plan G thresholds) | this doc | `codec-bench --threshold` flag | Ratio + quality + LoC | All 4 thresholds clear |
+| R-5 (DCT crossover at 64) | this doc | A4-impl bench at varying N | Per-block vs batched dispatch time | Crossover within [32, 96] empirically |
+| R-6 (SSD via VNNI ≥30×) | this doc | `batched_ssd_search` micro-bench | Cycles per 16×16 ME candidate | ≤4 cycles per 256-d dot (VNNI) |
+| R-7 (tropical-GEMM ≥10×) | this doc | Plan A6 partition bench | Per-CTU partition RDO time | ≥10× over naive recursive on Zen 4 |
+| R-8 (Plan G is confidence gate) | this doc | Phase order | Plan G ships before A7 | A7 PR doesn't merge until Plan G binary green |
+| R-9 (topology-free) | this doc | grep audit | Codec body has no spatial-semantic refs | `grep -rE 'North\|East\|West\|South' src/hpc/codec/*.rs` returns only test/doc |
+| R-10 (4 bit/Gaussian floor) | this doc | Plan E bench | bits/Gaussian on Mip-NeRF 360 | ≤4 bits/Gaussian without offline codebook |
+| R-11 (4K 60fps SIMD-batched) | this doc | Plan G video latency assert | Per-leaf encode time | ≤210 ns/leaf on Sapphire Rapids |
+| R-12 (per-CTU flush) | this doc | A8 frame-header parse + decode | First-CTU latency | First CTU decodable before frame complete |
+| R-13 (Option A per-shard) | this doc | Plan F on BERT-glue | 8× compression + accuracy | Holds; else escalate to Option B |
+
+**Every row of this matrix is a test.** Plan G's bench harness binary
+emits a JSON report containing the actual measurement for each row;
+the merge job for Phase 2 consumer PRs reads that report and gates on
+pass-fail.
+
+---
+
+## 10. Sequencing diagram (canonical)
+
+```text
+                                     T+0
+            ┌─────────────────────────────────────────────┐
+            │            PHASE 0 — substrate gates         │
+            │                                             │
+            │   Plan H — extract ndarray-codec (3d)        │
+            │   Plan I — PredictiveSignal trait (3d)       │
+            │   Plan A4-design — Basis<T> + LinearReduce (1d)
+            │   Plan G — multi-domain bench (2w) ★ GATE    │
+            └─────────────────────┬───────────────────────┘
+                                  │
+                                  ▼
+                    Plan G binary green; thresholds testable
+                                  │
+                                  ▼
+        ╔════════════════════════════════════════════════╗
+        ║   Plan A7 — rANS  (1.5 w)  CRITICAL PATH       ║
+        ║   gates on Plan G; ships against R-1 trait     ║
+        ╚════════════════════════╦═══════════════════════╝
+                                 │
+              ┌──────┬───────┬──┴───┬──────┬──────────┐
+              ▼      ▼       ▼      ▼      ▼          ▼
+        Plan B   Plan A4  Plan A6  Plan A8  Plan C   (R-11 SIMD
+        (inter) (impl)   (RDO)   (stream) (EWA)      batch path
+        1 wk    1 wk     1 wk    1 wk     1 wk       lands in
+              └──────┴───────┴──┬───┴──────┘         each)
+                                ▼
+              ┌─────────────────────────────────┐
+              │  PHASE 1 closes — codec mech    │
+              │  complete; thresholds re-run    │
+              └────────────────┬────────────────┘
+                               │
+              ┌────────┬───────┴────────┐
+              ▼        ▼                ▼
+          Plan E   Plan D           Plan F (after D)
+          (3DGS    (attention       (federated SGD,
+          3 wk×2)   2 wk×2)          4 wk×2, R-13)
+              │        │                │
+              └────────┴──────┬─────────┘
+                              ▼
+                    Plan G runs all 4 thresholds
+                              │
+                              ▼
+                    HG1 / HG6 / M:H-NEW-1 demonstrated
+                    (or specific claims falsified)
+```
+
+★ **Gate semantics.** Plan G is a *blocking* gate: Plan A7 cannot
+merge until Plan G's bench-harness binary is green, i.e. it runs all 4
+modes end-to-end, even if compression ratios are still below threshold
+(those calibrate in Phase 1). The threshold pass/fail criteria bind on
+Phase 2 consumer PRs, not on Phase 1 codec PRs.
+
+---
+
+## 11. End-state recap and exit conditions
+
+**The end state, recapped from §1:**
+
+After ~10.5 weeks of trajectory work:
+
+1. One binary `codec-bench` runs four modes end-to-end (HG1 demonstrated).
+2. Generic codec LoC ≤1500 (R-3 / M:H-NEW-2 demonstrated).
+3. Each consumer ≤200 LoC of trait impl (R-3 demonstrated).
+4. Compression ratios meet R-4 thresholds for all four loads:
+   - Video: ≥0.95× x265 ultrafast at parity PSNR
+   - Splat: ≥30× over Inria PLY-trim at SSIM parity
+   - KV cache: ≥4× over FP16 raw at ≤0.5% RULER loss
+   - Gradient: ≥8× over signSGD at ≤0.5% loss delta
+5. `ndarray-codec` crate extracted (M:E-D / Plan H demonstrated).
+6. Three traits land at type-erased boundaries (Plan I + A4-design).
+7. CLAUDE.md "Architecture Rule" lists 5 categories (M:T-3 closed).
+
+**Exit conditions per claim:**
+
+- **M:H-1 met** when `codec-bench --mode {video, splat, kv, grad}`
+  emits 4 Lance columns within the LoC envelope.
+- **M:H-2 met** when the AdamPrecondBasis impl ships in burn-codec and
+  keeps BERT GLUE training within 5% of the stock Adam loss curve
+  using the same `LinearReduce` trait surface as A7 rANS.
+- **M:H-3 met** when Plan D ships and Llama-3 inference on RULER 64K
+  passes R-4 threshold.
+- **M:H-4 met** when Plan F + signSGD bench shows rANS frequency-table
+  entropy within 5% of empirical H(p) for ≥3 layer types.
+- **M:H-5 met** when 16-worker BERT fine-tune via Plan F clears 8×
+  compression at ≤0.5% loss delta on GLUE.
+- **M:H-6 met** when Mip-NeRF 360 garden scene compresses to ≤4
+  bits/Gaussian (near target per R-10).
+- **M:H-7 met** when each of the 4 loads writes to a Lance column
+  with identical schema (one read path serves all).
+- **M:H-8 met** when `Ctu<16>`, `Ctu<32>`, `Ctu<64>` all pass codec
+  tests in a single build.
+- **M:H-9 met** when one binary `codec-bench` ships <10 MB with all
+  4 modes wired (no `--feature splat` gating; everything compiled in).
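+
+The M:H-4 condition can be read as comparing the expected code length
+under a quantised rANS frequency table against the empirical entropy
+H(p). A minimal sketch of that check follows, assuming a table whose
+entries sum to `1 << table_bits`. The function names and table layout
+are illustrative and not taken from `ndarray-codec`.
+
+```rust
+/// Empirical Shannon entropy H(p) in bits/symbol from raw symbol counts.
+fn empirical_entropy(counts: &[u64]) -> f64 {
+    let total: u64 = counts.iter().sum();
+    counts
+        .iter()
+        .filter(|&&c| c > 0)
+        .map(|&c| {
+            let p = c as f64 / total as f64;
+            -p * p.log2()
+        })
+        .sum()
+}
+
+/// Expected bits/symbol when the same source is coded with a quantised
+/// frequency table (hypothetical layout: entries sum to 1 << table_bits).
+fn table_cost(counts: &[u64], freqs: &[u32], table_bits: u32) -> f64 {
+    let total: u64 = counts.iter().sum();
+    let scale = (1u64 << table_bits) as f64;
+    let mut bits = 0.0;
+    for (&c, &f) in counts.iter().zip(freqs) {
+        if c > 0 {
+            let p = c as f64 / total as f64;
+            bits -= p * (f as f64 / scale).log2();
+        }
+    }
+    bits
+}
+
+/// True when the table's expected cost is within 5% of H(p).
+fn within_5_percent(counts: &[u64], freqs: &[u32], table_bits: u32) -> bool {
+    table_cost(counts, freqs, table_bits) <= 1.05 * empirical_entropy(counts)
+}
+```
+
+For counts `[50, 25, 25]`, H(p) is 1.5 bits. A 12-bit table of
+`[2048, 1024, 1024]` matches it exactly, while a near-uniform table
+such as `[1366, 1365, 1365]` costs about 1.585 bits and misses the 5%
+bound.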
+
+**If any claim fails its exit condition**, the corresponding consumer
+PR scopes down (e.g., M:H-6 stretch sub-1-bit/Gaussian fails → ship
+4-bit/Gaussian as near-term reality, mark sub-1-bit as Plan E2). The
+falsifiability is the point; not every claim has to hold for the
+architecture to be valuable.
+
+**Worst-case fallback.** If even Plan G itself fails (e.g., trait
+shape from R-1 turns out wrong under bench pressure), the canon's
+M:E-A factoring is wrong. Recovery: revise traits before A7 ships.
+Cost: ~3 days extra in Phase 0. **This is exactly why Plan G goes
+first — to catch this before A7 wastes 1.5 weeks against the wrong
+shape.**
+
+---
+
+## 12. Compaction-preservation contract
+
+When this doc is summarised across context windows, preserve at minimum:
+
+1. **The five "merged well" items** from §3 (M:E-A / M:E-D / M:E-G /
+   M:E-I / M:E-F) — these are the canon's load-bearing synthesis.
+2. **The thirteen resolutions** R-1 through R-13 with their summary
+   commitments:
+   - R-1: `LinearReduce<Basis>` two-trait shape
+   - R-2: bit 15 universal, bit 14 consumer-typed
+   - R-3: ≤1500 LoC generic, ≤200 LoC per consumer
+   - R-4: 4 threshold pairs (video, splat, kv, grad)
+   - R-5: DCT crossover ~64 blocks
+   - R-6: SSD via VNNI ≥30× over SAD
+   - R-7: tropical-GEMM partition O(4^d) → O(d²)
+   - R-8: Plan G is confidence gate
+   - R-9: topology-FREE codec layer
+   - R-10: ~4 bits/Gaussian near target, ~1 bit stretch
+   - R-11: 210 ns/CTU SIMD-batched encode
+   - R-12: per-CTU flush default; per-bucket Plan F
+   - R-13: Option A (per-shard codebook) for Plan F v1
+3. **The trajectory** from §2 — Phase 0 → A7 → parallelise → Phase 2
+4. **The five-category architecture** including `ndarray-codec`
+5. **The four traits** as the canonical contracts:
+   `PredictiveSignal`, `Basis<T>`, `LinearReduce`, `CurveOrder<const N>`
+   (plus `RdoMetric` for A6)
+6. **Plan G as the gate** — A7 cannot merge until Plan G binary green
+7. **The falsifiability matrix in §9** — every claim has a test;
+   not every claim will pass; that's the design
+
+**Citation IDs in this doc** (R-1 .. R-13) are stable. Canon IDs
+(M:E-*, M:H-*, M:H-NEW-*, M:T-*, A:E-*, A:H-*, A:T-*, B:E-*, B:HG-*,
+B:D-*) remain stable per canon's §10. Append, never renumber.
+
+---
+
+## 13. The single load-bearing paragraph
+
+If you read nothing else:
+
+> *The merged canon committed to the right architectural synthesis
+> (M:E-A, M:E-D, M:E-G, M:E-I) but left the load-bearing contracts
+> unsigned. This doc commits them: `Basis<T>` + `LinearReduce` are
+> two traits not one (R-1); bit 14 of the leaf header is consumer-
+> typed and bit 15 universal (R-2); generic codec body ≤1500 LoC
+> with ≤200 LoC per consumer (R-3); four threshold pairs gate
+> Plan G's pass criteria (R-4); the trajectory is Plan G (2 wks) →
+> Plan A7 critical path (1.5 wks) → Phase 2 consumers parallel
+> (3 wks); end state is one binary, four loads, ~2 KLoC stack
+> demonstrating M:H-NEW-1 in ~10.5 weeks of wall-clock. Every claim
+> in §9 has a test; Plan G's bench-harness binary is the audit. The
+> falsifiability is the point.*
+
+---
+
+_Last edit: 2026-05-22 — companion to merged canon `bc9da4ad`.
+Edit when an R-N resolves to ship, when a falsifiability test pin
+shifts, or when an exit condition closes. Renumber only by appending._
diff --git a/.claude/knowledge/pr-x12-substrate-merged-canon.md b/.claude/knowledge/pr-x12-substrate-merged-canon.md
index b6948cca..81a20b67 100644
--- a/.claude/knowledge/pr-x12-substrate-merged-canon.md
+++ b/.claude/knowledge/pr-x12-substrate-merged-canon.md
@@ -13,6 +13,24 @@
 
 Two independent sessions reached the same architectural claim — *PR-X12 is the universal predictive-coder substrate that subsumes four industries* — through different routes. Each session surfaced angles the other missed. This doc is the **canonical fusion**, designed to be the single doc a fresh agent reads to inherit the entire claim.
 
+> **Post-merge resolutions index** (2026-05-22): the claims and tensions in this doc were further formalised into 13 numbered resolutions R-1..R-13. See `pr-x12-canon-resolutions-delta.md` for the canonical list. Cross-section pointers inline below:
+>
+> - §M:E-A (Mode-decide + reduce pipeline kernel) → **R-1** (`LinearReduce<T>` + `Basis<T>` trait split)
+> - §M:E-G (`Ctu<const N>`) and §M:E-J (header bits 14-15) → **R-2** (16-bit header layout pinned), **R-8** (Plan G arch-conditional gate)
+> - §M:E-H (D-STACK-13 bench harness as P0) → **R-4** (codec-bench in Plan G), **R-11** (latency assertions per arch)
+> - §M:H-NEW-2 (codec body LoC envelope ≤ 1500) → **R-3** (LoC audit rule, scope-fence definition)
+> - §M:H-6 (sub-1-bit basin + Gaussian-tail rANS) → **R-10** (commitment to sub-1-bit-per-token where source supports it)
+> - §M:E-D (codec breaks ndarray ↔ lance-graph cycle) → **R-7** (tropical-GEMM lives in lance-graph, called from codec — dep direction allowed)
+>
+> Perspective companions written 2026-05-22:
+> - `pr-x12-x265-blasgraph-gemm.md` — every codec inner loop as a GEMM
+> - `pr-x12-x266-3dgs-spacetime-upscaling.md` — Basis<T> + EWA splat → free space-time codec upscaling
+> - `pr-x12-woa-multiarch-orchestration.md` — how WoA / q2 / consumer crates inherit the substrate's per-arch dispatch
+> - `pr-x12-anti-neural-lookup-inversion.md` — lookup tables as frozen 1-layer NNs; the codec is the anti-neural codec
+> - `pr-x12-gguf-llm-weights-encoding.md` — the fifth load: GGUF attention/FFN tensors as Skip/Merge/Delta/Escape
+> - `pr-x12-bgz-jc-substrate-synergies.md` — **CRITICAL**: the PR-X12 substrate is *already implemented* in `lance-graph/crates/{bgz17,highheelbgz,bgz-tensor}`, formally proven in `lance-graph/crates/jc`. Skip/Merge/Delta/Escape ≡ Scent/Palette/ZeckBF17/Full. 4096-entry basin ≡ HHTL 16×16×16 lattice. bgz-hhtl-d ships LLM weight encoding at 343:1 on Qwen3-TTS-1.7B today. Two gaps identified: `jd-nd` (ndarray-side proof crate) and Cronbach/ICC encoding-reliability research crate.
+> - `pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md` — **substrate bindings**: cam_pq trains all bgz palettes (CAM bytes map onto HHTL bits 1:1); sigker provides Chen-Lyons signature uniqueness (arXiv:2006.14794, Hambly-Lyons 2010, CST 2021) as the formal-correctness bedrock cited by jc Pillar 11 (DEFERRED); dn_tree + merkle_tree are the online-update + integrity substrate for R-13 SharedClusterWide. **Seven new gaps catalogued (G-1..G-7), ~11-17 weeks of wiring** to fully bind. R-14 (formal correctness) + R-15 (signature basis) candidates surfaced.
+
 The merge is not a re-statement. **It is the new epiphanies that emerge only when both halves sit side by side.** They get their own §3.
 
 ### Identity-preservation rules
@@ -122,6 +140,8 @@ These are the insights that emerge **only when both docs sit next to each other*
 
 ### M:E-A — Mode-decide + reduce IS the universal pipeline kernel
 
+> [Formalised post-merge as **R-1**: `LinearReduce<T>` decomposes into `Basis<T>` (basis-as-data) + `Reducer<T>` (reduction operator). See `pr-x12-canon-resolutions-delta.md` §R-1.]
+
 A's E-4 (transform IS optimizer preconditioner) + B's E9 (splat3d × codec = same pipeline shifted 90°) combined:
 
 The *reduction operator* in B's "unified mode-decide+reduce trait" is **exactly the basis-times-source product** A's transform claim points at:
@@ -258,6 +278,19 @@ pub trait PredictiveSignal {
 
 ### M:E-J — The reserved header bits 14-15 carry causal-edge metadata for free
 
+> [Formalised post-merge as **R-2**: 16-bit header bit layout pinned —
+> bits 0-1 = `header_kind`, bits 2-13 = `basin_index`,
+> **bit 15 = UNIVERSAL "has inter-tier reference"** (identical across
+> all four consumers; A3-inter cross-tier link),
+> **bit 14 = CONSUMER-TYPED via the frame header's `ConsumerProfile`
+> tag** (cognitive: Pearl-rung high bit; video: reserved=0;
+> splat: LOD-cascade-source flag; gradient: worker-shard parity).
+> Leaf size (8/16/32/64) is encoded structurally via M:E-G's
+> `Ctu<const N>` at the type level, NOT in header bits 14-15. The
+> causal-tier reading below is the historical motivation for bit 14;
+> R-2 generalises it to the four-consumer demux. See
+> `pr-x12-canon-resolutions-delta.md` §R-2.]
+
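+A minimal sketch of the R-2 bit layout as pack/unpack helpers. The
+field positions follow the note above; `LeafHeader`, `consumer_bit`,
+and `inter_tier_ref` are illustrative names, and this is not the
+codec's actual header-packing code.
+
+```rust
+/// Hypothetical view of the 16-bit leaf header described in R-2.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub struct LeafHeader {
+    pub header_kind: u8,      // bits 0-1
+    pub basin_index: u16,     // bits 2-13 (12 bits, 0..4096)
+    pub consumer_bit: bool,   // bit 14, meaning set by ConsumerProfile
+    pub inter_tier_ref: bool, // bit 15, identical across consumers
+}
+
+impl LeafHeader {
+    /// Pack the four fields into one u16, masking each to its width.
+    pub fn pack(self) -> u16 {
+        debug_assert!(self.header_kind < 4);
+        debug_assert!(self.basin_index < 4096);
+        ((self.header_kind as u16) & 0b11)
+            | ((self.basin_index & 0x0FFF) << 2)
+            | ((self.consumer_bit as u16) << 14)
+            | ((self.inter_tier_ref as u16) << 15)
+    }
+
+    /// Inverse of `pack`.
+    pub fn unpack(raw: u16) -> Self {
+        Self {
+            header_kind: (raw & 0b11) as u8,
+            basin_index: (raw >> 2) & 0x0FFF,
+            consumer_bit: ((raw >> 14) & 1) == 1,
+            inter_tier_ref: ((raw >> 15) & 1) == 1,
+        }
+    }
+}
+```
+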
 A's E-15 (reserved bits 14-15 are inter-tier link) + A's T-22 (causal-edge v2 mantissa: Intervention=+6, Counterfactual=-6):
 
 Two reserved bits = 4 states. The natural 4-state encoding for cognitive content:
@@ -290,6 +323,8 @@ Merge of A's H-1..H-7 + B's HG1..HG6 + two new M:H-* claims that emerge from the
 
 **M:H-6** *(from B:HG2 alone)* — Sub-1-bit-per-Gaussian 3DGS compression. 30-60× over current state-of-the-art PLY-trim. A 1M-Gaussian scene = ~500 KB, streamable as video. **Most economically valuable single claim** — directly attacks the bandwidth bottleneck for cloud-rendered 3D content.
 
+> [Formalised post-merge as **R-10**: PR-X12 commits to sub-1-bit-per-token via Gaussian-tail rANS where the source distribution supports it (basin codebook + heavy-tailed residual). See `pr-x12-canon-resolutions-delta.md` §R-10 for the falsification path (Plan G entropy bench).]
+
 **M:H-7** *(merge of A:H-1 + B:HG5)* — Lance column substrate identity becomes ground truth. `SpoDistanceMatrices` at 611M lookups/sec serves as universal palette codebook lookup across all four loads. ndarray = hardware, ndarray-codec = compression substrate (new, per M:E-D), lance-graph = thinking, causal-edge = protocol, p64 = convergence. Five-category architecture.
 
 **M:H-8** *(from A:H-6 alone)* — 64×64 CTU is the right unit for both 4K video luma blocks and 7B-parameter LLM head dim × 16 heads. Convergent evolution from two unrelated industries arriving at the same arithmetic block size.
@@ -302,6 +337,8 @@ Merge of A's H-1..H-7 + B's HG1..HG6 + two new M:H-* claims that emerge from the
 
 **M:H-NEW-2** — `trait PredictiveSignal` + `trait LinearReduce<Basis>` + `trait CurveOrder<const N: usize>` factor the codec into three plug-points (per M:E-E + M:E-A + M:E-B). The codec body is `<150 LoC of generic glue. Domain consumers ship `<200 LoC` of trait impls. **Total stack for all four industries: ~2 KLoC.** Compared to ~50 KLoC per-domain implementations elsewhere. The 25× code-density delta is the architectural payoff that justifies the eight sub-cards.
 
+> [Formalised post-merge as **R-3**: the LoC envelope is `≤ 1500 lines of generic codec body` (revised upward from `<150` for realism after counting glue), enforced via an explicit scope-fence audit rule in CI. The substrate (`ndarray::hpc::blas_level2` etc.) is excluded from the budget. See `pr-x12-canon-resolutions-delta.md` §R-3 for the exact audit definition.]
+
 ---
 
 ## 5. Unified integration plan (canonical sequencing)
diff --git a/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md b/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md
new file mode 100644
index 00000000..0da19ed7
--- /dev/null
+++ b/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md
@@ -0,0 +1,345 @@
+# PR-X12 — WoA Orchestration & Multi-Arch Dispatch Lens
+
+> Date: 2026-05-22
+> Status: **perspective doc** — examines how the orchestration crates (`woa-rs`, `woa`, `q2`, `surrealdb`, `MedCare-rs`, `smb-office-rs`) consume the PR-X12 substrate, and how PR-X12's per-arch dispatch decisions (R-4, R-5, R-11) generalise to the entire HPC stack.
+>
+> Premise: PR-X12 is not just a codec project. It's the **per-arch dispatch contract** that every consumer above `ndarray` will inherit. The codec is the first non-trivial test of whether that contract holds.
+
+---
+
+## 0. Thesis
+
+**Every consumer crate dispatches kernels across {Intel SPR, AMD Zen 4-5, ARM Graviton 3-4, Apple Silicon, NVIDIA Hopper-Blackwell} via the same `ndarray::hpc` capability traits.** PR-X12's per-arch DCT crossover (R-5) and latency assertion (R-11) aren't codec-specific — they're the canonical shape of how any consumer crate gates fast-paths. If the codec's per-arch story is wrong, the entire HPC consumer ecosystem inherits the bug.
+
+---
+
+## 1. The orchestration problem PR-X12 must solve
+
+In a real deployment, a `woa-rs` agent processing a request might:
+
+1. Receive a video stream (codec: PR-X12)
+2. Run perception model on extracted frames (`burn`/`candle`)
+3. Query graph state (`lance-graph::blasgraph` tropical-GEMM)
+4. Update node-local cache (`surrealdb`)
+5. Emit response stream (codec again)
+
+Steps 1, 2, 3, 5 all hit the `ndarray::hpc` BLAS layer. Each step has a per-arch fast-path: SPR uses AMX, Zen 4 uses VNNI+AVX-512, Graviton 3 uses SVE (SVE2 arrives with Graviton 4), Apple uses NEON/AMX, Hopper uses tensor cores. **None of the consumer crates know which fast-path is active.** They call `blas_level2::batched_gemm` and the substrate dispatches.
+
+This is what makes PR-X12's R-4 / R-11 architecture-conditional bench gates *substrate policy*, not codec policy. R-4 says "Plan G clears at most on 1 of: SPR / Zen 4 / Graviton 3 / Apple M-class," and R-11 adds latency assertions. That same gate structure applies to:
+
+- `burn` model serving (forward pass per arch)
+- `candle` quantized inference (q4/q8 per arch)
+- `lance-graph::blasgraph` graph queries (tropical-GEMM per arch)
+- `surrealdb` HNSW search (vector dist per arch)
+- `MedCare-rs` DICOM transform (DCT + wavelet per arch)
+- `smb-office-rs` OCR + layout (conv + attention per arch)
+
+Every one of these inherits the dispatch contract. PR-X12 is the first to make it visible.
+
+---
+
+## 2. WoA's place in the stack
+
+```text
+┌────────────────────────────────────────────────────┐
+│ WoA agent (woa-rs, woa)                            │
+│   Request orchestration, scheduling, transport     │
+└────────────────────┬───────────────────────────────┘
+                     │ async dispatch, no SIMD
+                     ▼
+┌────────────────────────────────────────────────────┐
+│ Domain consumer crates                             │
+│   ndarray-codec, burn, candle, lance-graph,        │
+│   surrealdb, MedCare-rs, smb-office-rs             │
+│   (Each: ~1-5K LoC of generic code + traits)       │
+└────────────────────┬───────────────────────────────┘
+                     │ capability traits, target_feature
+                     ▼
+┌────────────────────────────────────────────────────┐
+│ ndarray::hpc (the dispatch substrate)              │
+│   blas_level{1,2,3}, fft, cam_pq, activations,     │
+│   simd_int_ops, bf16_tile_gemm                     │
+│   (~15K LoC; PR-X12 ratchets at this layer)        │
+└────────────────────┬───────────────────────────────┘
+                     │ per-arch SIMD intrinsics
+                     ▼
+┌────────────────────────────────────────────────────┐
+│ Hardware: SPR / Zen / Graviton / Apple / Hopper    │
+└────────────────────────────────────────────────────┘
+```
+
+**WoA never touches `target_feature` directly.** Its job is async scheduling, transport (Q2 over QUIC), persistence (surrealdb), and policy. The SIMD dispatch happens one layer below, in the consumer crates calling `ndarray::hpc`.
+
+This separation is what makes R-3's LoC envelope (≤1500 LoC codec body) tractable. The codec crate doesn't dispatch — it calls the substrate. WoA doesn't dispatch — it calls the codec, which calls the substrate. Per-arch code lives once, in `ndarray::hpc`.
+
+---
+
+## 3. Per-arch dispatch as a substrate property
+
+The PR-X12 substrate (per merged canon §M:E-G, §M:E-H, R-4, R-5, R-11) implements per-arch dispatch via three mechanisms:
+
+### 3.1 Compile-time `target_feature`
+
+```rust
+// In ndarray::hpc::blas_level2::batched_gemm:
+
+#[cfg(target_arch = "x86_64")]
+mod x86_dispatch {
+    #[target_feature(enable = "avx512f,avx512bw,avx512vnni")]
+    pub unsafe fn batched_gemm_vnni(...) { /* VNNI path */ }
+
+    #[target_feature(enable = "amx-tile,amx-int8,amx-bf16")]
+    pub unsafe fn batched_gemm_amx(...) { /* AMX path */ }
+}
+
+#[cfg(target_arch = "aarch64")]
+mod arm_dispatch {
+    #[target_feature(enable = "sve2")]
+    pub unsafe fn batched_gemm_sve2(...) { /* SVE2 path */ }
+
+    #[target_feature(enable = "neon,fp16")]
+    pub unsafe fn batched_gemm_neon_fp16(...) { /* Apple Silicon */ }
+}
+```
+
+### 3.2 Runtime feature detection (cached at process start)
+
+```rust
+// In ndarray::hpc::capability:
+use std::sync::OnceLock;
+
+// Populated once at process start, before any kernel dispatch.
+pub static CAP: OnceLock<HwCaps> = OnceLock::new();
+
+pub struct HwCaps {
+    pub has_amx: bool,
+    pub has_vnni: bool,
+    pub has_sve2: bool,
+    pub has_neon_fp16: bool,
+    pub l1_cache_size: usize,
+    pub vec_width_bits: u16,
+    // ... more as new features land
+}
+
+pub fn batched_gemm(input: ...) {
+    // Panics with a clear message if startup detection was skipped.
+    let caps = CAP.get().expect("HwCaps must be initialised at process start");
+    if caps.has_amx { unsafe { batched_gemm_amx(input) } }
+    else if caps.has_vnni { unsafe { batched_gemm_vnni(input) } }
+    else if caps.has_sve2 { unsafe { batched_gemm_sve2(input) } }
+    // ...
+    else { batched_gemm_scalar(input) }
+}
+
+### 3.3 Per-arch tunable crossover (R-5 generalised)
+
+Some operations have a "small N: scalar, large N: SIMD" crossover that varies per arch:
+
+```rust
+const DCT_BATCH_CROSSOVER: usize = match Arch::CURRENT {
+    Arch::SapphireRapids => 64,   // AMX wins above this
+    Arch::IceLakeServer => 32,    // AVX-512 narrower; lower crossover
+    Arch::Zen4 => 96,             // Zen 4 double-pumps AVX-512 over 256-bit datapaths
+    Arch::AppleM3 => 256,         // NEON is 128-bit; only worth it at large N
+    Arch::GravitonV3 => 128,      // SVE (256-bit on Graviton 3); mid-range
+    Arch::Generic => usize::MAX,  // Always scalar fallback
+};
+
+pub fn dct_apply<const N: usize>(input: &[i16], output: &mut [i16]) {
+    if N >= DCT_BATCH_CROSSOVER {
+        unsafe { dct_gemm_path(input, output) }
+    } else {
+        dct_butterfly_path(input, output)
+    }
+}
+```
+
+R-5 commits these crossovers as **bench-tunable constants**, not hand-guessed numbers. Plan G's codec-bench includes a calibration sub-target that emits the right `const` values per arch via build script.
+
+---
+
+## 4. The latency budget split — codec / orchestration / network
+
+A WoA agent processing a video stream end-to-end has three latency contributors:
+
+```text
+Total latency  =  T_codec  +  T_orchestration  +  T_network
+```
+
+PR-X12 (R-11) commits a budget on `T_codec`:
+
+| Stage | Budget (per encode) | Source |
+|---|---|---|
+| Codec encode | ≤ 0.5× wall-clock for 1080p @ 30 fps | R-11 |
+| Codec decode | ≤ 0.25× wall-clock for 1080p @ 30 fps | R-11 |
+| Block-level ME | ≤ 10 µs per CTU on SPR | R-11 spec, calibrated by codec-bench |
+| Tropical-GEMM RDO | ≤ 50 µs per CTU on SPR | derived from R-7 cost analysis |
+| Basis::apply (DCT) | ≤ 2 µs per 32×32 block on SPR | derived from R-5 |
+
+**WoA's contract:** if any of these budgets is violated on a supported arch, the consumer can either accept the slowdown or refuse to schedule the request. WoA has visibility into per-arch dispatch quality via the substrate's metrics endpoint:
+
+```rust
+// In ndarray::hpc::metrics (signature only):
+pub fn stage_latency_p99(stage: StageId) -> Duration;
+```
+
+This is wired through to woa-rs's request scheduler. If `T_codec` p99 exceeds budget, woa-rs can:
+
+- Reroute to a different node (better hardware available)
+- Degrade the request (lower codec quality, smaller batch)
+- Fail fast with a clear error (don't tie up the client)
+
+**Without R-11's commitment to latency assertions in CI, this whole chain falls over.** The substrate-to-orchestrator contract is empty unless someone ratchets on it.
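+
+The three fallback options above can be sketched as a pure decision
+function. `ScheduleDecision`, `decide`, and the two boolean policy
+inputs are hypothetical, since the real woa-rs scheduler interface is
+not shown in this doc. The preference order (reroute, then degrade,
+then fail fast) is also an assumption.
+
+```rust
+use std::time::Duration;
+
+/// Hypothetical outcomes mirroring the options listed above.
+#[derive(Debug, PartialEq, Eq)]
+pub enum ScheduleDecision {
+    Proceed,
+    Reroute,
+    Degrade,
+    FailFast,
+}
+
+/// Decide what to do with a request given the observed p99 codec latency.
+/// `can_reroute` and `can_degrade` stand in for fleet and request policy.
+pub fn decide(
+    p99: Duration,
+    budget: Duration,
+    can_reroute: bool,
+    can_degrade: bool,
+) -> ScheduleDecision {
+    if p99 <= budget {
+        ScheduleDecision::Proceed
+    } else if can_reroute {
+        ScheduleDecision::Reroute
+    } else if can_degrade {
+        ScheduleDecision::Degrade
+    } else {
+        ScheduleDecision::FailFast
+    }
+}
+```
+
+Keeping the decision free of I/O means the policy can be unit-tested
+without a live metrics endpoint.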
+
+---
+
+## 5. Federated codebook policy (R-13) — the orchestration angle
+
+R-13 commits that the codec's 4096-entry basin codebook can be one of:
+
+- **Per-instance** (each PR-X12 encoder builds its own from the input stream)
+- **Federated** (a cluster of encoders shares a codebook, periodically updated)
+- **Per-domain pretrained** (a hand-curated codebook ships with the binary for {video, text, image, audio} domain segments)
+
+The orchestration layer (WoA / Q2) is where federation policy lives. Specifically:
+
+```rust
+// In q2 (transport / coordination):
+pub enum CodebookPolicy {
+    LocalEphemeral,                    // each encoder owns its codebook
+    SharedClusterWide { ttl: Duration }, // gossip protocol distributes
+    SharedRegional { region: Region },   // edge-tier sharing
+    PretrainedStatic { id: BlobId },     // immutable, served from CAS
+}
+
+impl WoaAgent {
+    fn select_codebook(&self, request: &Request) -> CodebookHandle {
+        match request.payload_class() {
+            PayloadClass::HumanText => self.pretrained("english-text-v3"),
+            PayloadClass::VideoFrame => self.shared_cluster_wide(),
+            PayloadClass::EphemeralBlob => self.local_ephemeral(),
+            // ...
+        }
+    }
+}
+```
+
+**R-13 says:** the codec layer exposes the basin-codebook as a swappable handle. The orchestration layer chooses which codebook to use per request. PR-X12 ships with the substrate hook; q2 owns the policy.
+
+**Why this matters for PR-X12 scope:** the basin-codebook is currently a hard-coded 4096-entry array per encoder. R-13 commits to making it swappable (replacing the array reference with a handle/trait) — this is a ~30-line change in the codec crate, not a 300-line rewrite. The federation logic itself lives in q2, outside PR-X12's body.
+
+This is a model for many features that look "out of scope" for PR-X12 but actually need a tiny anchor in PR-X12 to be reachable later:
+
+- Federated codebook → swap pointer to handle (R-13)
+- 3DGS scene anchor → add SceneAnchor header_kind (x266 doc)
+- GPU offload → add `Reducer::dispatch_target() -> DispatchTarget` (Plan E adjacent)
+- Speculative decode → add `Frame::is_speculative()` bit in header reserved field
+
+None of these are PR-X12 scope. All of them require ≤50 LoC of "anchor" in PR-X12. The discipline of M:H-NEW-2 + R-3's LoC envelope is what makes future anchoring possible without forking the codec.
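+
+The "swap pointer to handle" anchor for R-13 might look roughly like
+the sketch below. `BasinCodebook`, `LocalCodebook`, `Encoder`, and the
+`[i16; 4]` entry type are placeholders chosen for illustration; the
+actual codec types are not shown in this doc.
+
+```rust
+/// Hypothetical handle trait; the entry type is a placeholder.
+pub trait BasinCodebook {
+    fn entry(&self, basin_index: u16) -> Option<&[i16; 4]>;
+}
+
+/// Stand-in for the current shape: a table owned by one encoder.
+pub struct LocalCodebook {
+    entries: Vec<[i16; 4]>,
+}
+
+impl LocalCodebook {
+    pub fn new(entries: Vec<[i16; 4]>) -> Self {
+        Self { entries }
+    }
+}
+
+impl BasinCodebook for LocalCodebook {
+    fn entry(&self, basin_index: u16) -> Option<&[i16; 4]> {
+        self.entries.get(basin_index as usize)
+    }
+}
+
+/// The encoder holds a handle instead of the array itself, so the
+/// orchestration layer can swap codebooks per request.
+pub struct Encoder {
+    codebook: Box<dyn BasinCodebook>,
+}
+
+impl Encoder {
+    pub fn new(codebook: Box<dyn BasinCodebook>) -> Self {
+        Self { codebook }
+    }
+
+    pub fn swap_codebook(&mut self, codebook: Box<dyn BasinCodebook>) {
+        self.codebook = codebook;
+    }
+
+    /// Look up the reconstruction vector for a basin index.
+    pub fn reconstruct(&self, basin_index: u16) -> Option<[i16; 4]> {
+        self.codebook.entry(basin_index).copied()
+    }
+}
+```
+
+A federated or pretrained codebook would then be another
+`BasinCodebook` impl living outside the codec crate.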
+
+---
+
+## 6. Cross-arch determinism — the consumer's hardest requirement
+
+A WoA agent that runs on SPR in the data center and Apple Silicon at the edge must produce **the same answer** for the same input. Floating-point order-of-operations differs across SIMD widths, so naive parallel reductions break this.
+
+PR-X12's `LinearReduce<T>` abstraction (R-1, M:E-A) is the answer:
+
+```rust
+pub trait Reducer<T> {
+    fn reduce_pair(&self, lhs: T, rhs: T) -> T;
+}
+
+// For bit-exact reduction across archs:
+pub struct OrderedKahanReducer;
+
+impl Reducer<f32> for OrderedKahanReducer {
+    fn reduce_pair(&self, lhs: f32, rhs: f32) -> f32 {
+        // Kahan compensated sum, with explicit left-to-right order
+        // Same bit pattern on every arch
+        kahan_add(lhs, rhs)
+    }
+}
+```
+
+The codec uses `OrderedKahanReducer` for any sum that crosses a wire-format boundary — basin assignment, rate-distortion accumulation, transform coefficient sum. Same input → same bits, regardless of arch. Determinism is paid for in some throughput (Kahan is ~3× slower than naive sum), but it's a tunable choice per use site.
+
+**Without R-1's basis/reducer split, cross-arch determinism is a substrate-wide audit nightmare.** With it, the audit is per-use-site: grep for places that use `NaiveSimdReducer` on cross-wire-format paths, replace with `OrderedKahanReducer`.
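+
+One caveat when implementing this: Kahan summation carries a running
+compensation term, which a stateless pairwise combine has nowhere to
+store. A minimal sketch of an ordered, stateful accumulator follows,
+assuming the reducer is allowed to hold state. `KahanAcc` and
+`ordered_kahan_sum` are illustrative names, not the codec's API.
+
+```rust
+/// Compensated (Kahan) accumulator. `c` tracks the low-order bits
+/// lost by each addition.
+#[derive(Debug, Default, Clone, Copy)]
+pub struct KahanAcc {
+    sum: f32,
+    c: f32,
+}
+
+impl KahanAcc {
+    pub fn add(&mut self, x: f32) {
+        let y = x - self.c;          // apply the stored correction
+        let t = self.sum + y;        // low bits of y may be lost here
+        self.c = (t - self.sum) - y; // recover what was lost
+        self.sum = t;
+    }
+
+    pub fn value(self) -> f32 {
+        self.sum
+    }
+}
+
+/// Sum strictly left to right. The fixed scalar order is what keeps
+/// the result independent of SIMD width.
+pub fn ordered_kahan_sum(xs: &[f32]) -> f32 {
+    let mut acc = KahanAcc::default();
+    for &x in xs {
+        acc.add(x);
+    }
+    acc.value()
+}
+
+/// Plain left-to-right sum, for comparison.
+pub fn naive_sum(xs: &[f32]) -> f32 {
+    xs.iter().sum()
+}
+```
+
+For the input `[1.0, ε/2, ε/2]` with `ε = f32::EPSILON`, the naive sum
+returns `1.0` because each half-ulp addend is rounded away, while the
+compensated sum returns `1.0 + ε`.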
+
+---
+
+## 7. Failure modes and mitigations
+
+### 7.1 ABI drift between substrate and consumer
+
+If `ndarray::hpc::blas_level2::batched_gemm`'s signature changes, every consumer breaks. **Mitigation:** R-3's LoC envelope explicitly excludes the substrate API from "codec body LoC" — meaning the API gets the same review scrutiny as a public crate API. SemVer applies.
+
+### 7.2 Per-arch CI flake
+
+R-4 commits codec-bench to gating the merge on at most 1 arch. **Mitigation:** the bench passes on the canonical arch (SPR), and the other arches are *informational* on each PR but blocking on release tag. This is the standard "fast PR / slow release" gate pattern.
+
+### 7.3 Version skew across the WoA fleet
+
+A cluster running mixed PR-X12 versions could produce inconsistent codec output. **Mitigation:** the wire format header includes a version byte (one of M:E-J's reserved bits in future revisions); decoder rejects incompatible streams with a clean error. The federation gossip in q2 propagates the codec version as part of the node descriptor.
+
+### 7.4 Federated codebook poisoning
+
+If R-13's federated codebook is updated by a compromised node, the cluster compresses badly. **Mitigation:** codebook updates are signed; q2 ignores updates not signed by quorum. Out of PR-X12 scope (it's a transport/auth concern) but the substrate exposes the hook.
+
+---
+
+## 8. The consumer crates in detail
+
+Quick tour of what each crate inherits from PR-X12 substrate decisions:
+
+### 8.1 `burn` (model training/inference)
+
+Uses `blas_level3::gemm` for matrix multiply, `activations` for nonlinearities, `cam_pq` for KV cache compression. Per-arch dispatch via the same target_feature paths. Will benefit directly from PR-X12's R-4 / R-11 latency-assertion infrastructure when it lands (burn has wanted this for ~14 months).
+
+### 8.2 `candle` (quantized inference)
+
+Heavy user of `simd_int_ops` and `bf16_tile_gemm`. Most-affected consumer by R-5's per-arch crossover constants, because candle's q4/q8 paths have similar crossover decisions. Will likely adopt the same crossover-as-const pattern within the next 1-2 quarters.
+
+### 8.3 `lance-graph::blasgraph` (graph queries)
+
+Owner of tropical-GEMM (R-7); the codec is a consumer, not an owner, of that kernel. PR-X12's allowed dependency direction (`ndarray-codec → lance-graph::blasgraph`) was confirmed under R-7 only after careful audit; previously lance-graph could only consume `ndarray`, not be consumed by sibling crates. M:E-D clarifies that this dep direction is fine because both crates sit above ndarray and below woa/q2.
+
+### 8.4 `surrealdb` (vector + relational DB)
+
+Uses `cam_pq::hnsw_search` for vector lookups, `simd_int_ops` for filter expressions. Will inherit R-13's federated-codebook pattern for its own quantized vector indexes (long-discussed, not scheduled).
+
+### 8.5 `MedCare-rs` (medical imaging)
+
+The crate most likely to drive R-1's basis trait to its limits: medical imaging uses DCT, DWT (wavelet), and 3D Radon transforms, all of which want to be `Basis<T>` impls. Provides the second non-trivial test of the basis trait after PR-X12 ships. Federated-codebook policy (R-13) is *required* for medical imaging because PHI rules prohibit per-instance codebooks leaking patient-specific symbol distributions.
+
+### 8.6 `smb-office-rs` (office document OCR)
+
+Heavy user of conv (`activations::conv2d`) and attention (within `burn`-backed models). Less affected by PR-X12's specific resolutions; more affected by R-11's latency assertions, because office OCR is latency-sensitive for interactive use cases.
+
+### 8.7 `q2` (transport / coordination)
+
+Owns the federation policy (R-13), the codec version negotiation, and the per-arch capability gossip. q2 doesn't itself touch `ndarray::hpc` — it routes requests to consumers that do. q2's interaction with PR-X12 is at the orchestration layer: scheduling, codec version constraints, federated codebook policy.
+
+---
+
+## 9. What PR-X12 must NOT break
+
+In light of the above, the irreducible commitments PR-X12 must keep for the consumer ecosystem:
+
+1. **Substrate API stability** — `blas_level2::batched_gemm`, `cam_pq::kmeans`, `fft::dct_apply`, `activations::conv2d` keep their signatures across PR-X12 changes. Additions OK, breaks not OK.
+2. **Per-arch dispatch transparency** — consumers continue calling capability-trait methods; the substrate continues choosing the right SIMD path.
+3. **`Reducer<T>` ordered-sum guarantee** — any consumer using `OrderedKahanReducer` (or similar) continues to get bit-exact cross-arch reductions.
+4. **Latency-assertion CI infrastructure** — R-11's framework is consumer-callable for their own benches; not codec-private.
+5. **Codebook handle indirection** (R-13) — the codec ships with the handle pattern, consumers can swap codebooks without forking.
+
+If PR-X12 keeps these five things stable, the consumer crates inherit the win. If any one breaks, the cascade across burn/candle/lance-graph/surrealdb is weeks of remediation per affected crate.
+
+---
+
+## 10. Cross-references
+
+- **Substrate canon:** `pr-x12-substrate-merged-canon.md`
+- **Resolutions:** R-3, R-4, R-5, R-7, R-11, R-13 in `pr-x12-canon-resolutions-delta.md`
+- **GEMM lens:** `pr-x12-x265-blasgraph-gemm.md`
+- **Future capability lens:** `pr-x12-x266-3dgs-spacetime-upscaling.md`
+- **WoA-side architecture:** check `woa-rs` repo `docs/architecture.md` (not in this repo)
+- **Q2 transport:** see `q2` repo for codebook gossip protocol design
+- **Federation policy reading:** R-13 calls out the model; q2 will implement
+
+_Last edit: 2026-05-22._
diff --git a/.claude/knowledge/pr-x12-x265-blasgraph-gemm.md b/.claude/knowledge/pr-x12-x265-blasgraph-gemm.md
new file mode 100644
index 00000000..493b8cb0
--- /dev/null
+++ b/.claude/knowledge/pr-x12-x265-blasgraph-gemm.md
@@ -0,0 +1,281 @@
+# PR-X12 — x265 / HEVC through the BLAS-GEMM Lens
+
+> Date: 2026-05-22
+> Status: **perspective doc** — re-reads the HEVC/x265 design space as a sequence of GEMM operations. Companion to `pr-x12-substrate-merged-canon.md` and `pr-x12-canon-resolutions-delta.md`.
+>
+> Premise: every x265 inner loop has a GEMM form. HEVC was finalised in 2013 against hardware that made per-pixel butterflies the fast path; modern hardware (VNNI, AMX, BF16) inverts that ranking. PR-X12 is what x265 would have been with the right hardware floor.
+
+---
+
+## 0. The thesis in one sentence
+
+**x265 implements roughly nine inner loops: six have a GEMM form under the SSD/k-means/tropical reformulations (five dispatch to BLAS, while quantization keeps its dot-product form on an inline SIMD path, §2.3), and three stay non-GEMM in cheap per-byte paths.** PR-X12 is projected to spend ~80% of encode time inside BLAS calls; the HEVC reference spends ~30% in comparable kernels. The reframing is not metaphor; it is an algebraic identity per stage.
+
+---
+
+## 1. The nine HEVC primitives, classified
+
+| # | Primitive | HEVC native form | GEMM form | Where it lands |
+|---|---|---|---|---|
+| 1 | Motion estimation | SAD `Σ \|A-B\|` | SSD `\|\|A\|\|² - 2A·B + \|\|B\|\|²` → GEMV | `ndarray::hpc::blas_level2::batched_ssd_search` |
+| 2 | Forward transform | 4×4 / 8×8 / 16×16 / 32×32 DCT-II butterflies | Batched DCT as GEMM at N≥64 | `ndarray::hpc::fft::DctIIBasis<N>` + `bf16_tile_gemm` |
+| 3 | Quantization | Scalar divide + round | Dot product against quant matrix | Inline; uses existing `simd_int_ops` |
+| 4 | Mode decision (CTU split) | Recursive RDO, `O(4^d)` | Tropical-GEMM Bellman-Ford, `O(d²)` | `lance-graph::blasgraph::tropical_gemm` |
+| 5 | Basin assignment (palette / k-means) | Linear scan distance comparisons | Batched Hamming/L2 dist as GEMM | `ndarray::hpc::cam_pq::kmeans` |
+| 6 | Deblocking filter | 3×3 / 5×5 per-pixel separable conv | im2col + GEMM at block size ≥ 16 | `ndarray::hpc::activations` (existing conv path) |
+| 7 | rANS state advance | u32 state machine | Symbol-frequency lookup; **not GEMM** | `ndarray-codec::ans` |
+| 8 | Header bit-pack | u16 shift+mask | Not GEMM (per-leaf, ~5 ns) | `src/hpc/codec/mode.rs::pack_header` |
+| 9 | Stream framing / sync | Byte-level append | Not GEMM | `ndarray-codec::stream` |
+
+Stages 1, 2, 4, 5, and 6 (the inner-loop-cost-dominant ones) dispatch to GEMM. Stage 3 has a dot-product form but stays on the inline SIMD path (§2.3). Stages 7-9 are serial, per-byte work and stay that way. The boundary is sharp because GEMM amortises hardware fusion (AMX, VNNI) while state-machine code can't.
+
+---
+
+## 2. Per-stage detail — the algebraic moves
+
+### 2.1 Motion estimation: SAD → SSD (R-6)
+
+HEVC's reference encoder uses SAD because in 2007-2013, hand-tuned x86 `PSADBW` / `VPSADBW` was the fastest 16×16-block-difference primitive. SAD has no matrix structure: the absolute value inside the sum doesn't factor.
+
+SSD is algebraically richer:
+
+```text
+SSD(A, B)  = Σ_ij (A_ij - B_ij)²
+           = Σ A² - 2 Σ (A·B) + Σ B²
+           = ||A||² - 2·(A·B) + ||B||²
+                       ▲
+                       └── this is a GEMM/GEMV
+```
+
+For N motion-vector candidates against one reference block:
+
+- Candidate matrix `A_batch`: `(N × 256)` — 256 = 16×16 pixels per block
+- Reference vector `B`: 256-d
+- Middle term: `A_batch @ B` → `(N × 1)` GEMV
+- `||A_i||²` precomputed once per candidate window; `||B||²` once per reference
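+
+The setup above can be sketched in scalar Rust to show that the norm-plus-GEMV form gives the same numbers as a direct SSD. This is a minimal sketch, not the `batched_ssd_search` kernel: the function names are hypothetical, the arithmetic is plain `i64` rather than VNNI, and the candidate norms are recomputed inline instead of being cached per window.
+
+```rust
+/// Direct SSD between two equal-length blocks (reference implementation).
+fn ssd_direct(a: &[i8], b: &[i8]) -> i64 {
+    a.iter()
+        .zip(b)
+        .map(|(&x, &y)| {
+            let d = i64::from(x) - i64::from(y);
+            d * d
+        })
+        .sum()
+}
+
+/// SSD for every candidate row via ||A_i||² - 2·(A_i·B) + ||B||².
+/// `candidates` is row-major (n × block_len); the middle term is the GEMV.
+fn ssd_via_gemv(candidates: &[i8], block_len: usize, reference: &[i8]) -> Vec<i64> {
+    assert_eq!(reference.len(), block_len);
+    assert_eq!(candidates.len() % block_len, 0);
+    let dot = |x: &[i8], y: &[i8]| -> i64 {
+        x.iter().zip(y).map(|(&p, &q)| i64::from(p) * i64::from(q)).sum()
+    };
+    let ref_norm = dot(reference, reference); // ||B||², once per reference
+    candidates
+        .chunks_exact(block_len)
+        .map(|row| dot(row, row) - 2 * dot(row, reference) + ref_norm)
+        .collect()
+}
+```
+
+In a real kernel only the `dot(row, reference)` term would go through the batched GEMV; the two norm terms are cheap reductions that can be hoisted out of the search loop.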
+
+**On Cascade Lake+ with VNNI:** `VPDPBUSD` = 64 u8·i8→i32 ops/cycle (the instruction multiplies unsigned by signed bytes). One 256-elem dot product = 4 ops = ~4 cycles. Versus `VPSADBW` SAD path: ~128 cycles per 16×16. **Speedup: 30-50× depending on batch.**
+
+**On Sapphire Rapids with AMX:** `TDPBUSD` tile op = 256 u8·i8→i32 ops in one tile cycle. 16 candidates batched fits one AMX tile; throughput rises by another factor of 4.
+
+Net: motion estimation is ~50× faster than HEVC reference, *for the same wire-format semantics*. Same MV grid, same precision, same RDO. Only the matching criterion (SAD → SSD) and the substrate change; the SSD expansion itself is exact.
+
+### 2.2 Transform: per-block butterflies → batched DCT (R-5)
+
+HEVC ships Loeffler / Lengwehasatit 1D DCT-II butterflies — fast at single-block sizes (~80 ops per 32×32 transform), bad at batched dispatch. The Loeffler factoring is what made 2010-era CPUs (no SIMD GEMM at small sizes) able to encode HEVC at all.
+
+PR-X12 keeps the butterflies for small N and dispatches to BLAS GEMM at N ≥ 64:
+
+```text
+N = number of contiguous transform blocks
+
+if N <  64:  per-block butterfly (Loeffler) — fits L1, no batching overhead
+if N >= 64:  batched DCT as GEMM via DctIIBasis<N> + bf16_tile_gemm
+             ~256 cycles for 64 blocks (AMX) vs ~1280 cycles butterfly
+```
+
+Crossover (R-5) varies per arch: SPR=64, SKX/ICL=32, Zen 4=96, Apple Silicon=256.
+
+**The trait pattern (R-1):** `DctIIBasis<const N>` implements `Basis<i16>`. The basis is data (the cosine matrix, computed once at startup). Two consumers, the A4 transform path and Plan E's EWA splat rasterizer, both call `basis.apply(src, dst)`. **Same trait, two consumers.**
+
+### 2.3 Quantization: stays per-byte, doesn't need GEMM
+
+Scalar quantization is `q_ij = (coeff_ij * scale_ij) >> 15`. Per-coefficient cost is ~1 ns, so an entire 32×32 block (1024 coefficients) quantizes in ~1 µs scalar, with no batching benefit. It stays on the SIMD-batched i16 path (`simd_int_ops`), with no GEMM layer.
+
+### 2.4 Mode decision: recursive RDO → tropical-GEMM (R-7)
+
+HEVC's partition decision walks the quad-tree recursively, computing Lagrangian cost at each split:
+
+```text
+For each CTU at depth d:
+    for each of 4 children:
+        recursive RDO at depth d+1
+        compute mode + transform + quant + rate + distortion
+    combine via min(D + λ·R)
+
+Time: O(4^d) per CTU.  At d=4 levels (64×64 → 8×8, PR-X12): 4^4 = 256 cost factor over an 85-node tree.
+```
+
+Tropical-semiring reformulation: the (min, +) algebra has a GEMM. Build the 85-node DAG with edge weights `W[parent, child] = ΔRDO`, then iterate `D ← min(D, D ⊗ W)`, where `(D ⊗ W)[j] = min_i (D[i] + W[i, j])` is one tropical-GEMM step. Repeat for d iterations.
+
+```text
+Naive recursive:  O(4^4) = 256 ops × |nodes| = ~22 K ops/CTU
+Tropical-GEMM:    O(d²) × |nodes| = 16 × 85 = ~1.4 K ops/CTU
+                  ~16× speedup
+```
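+
+The relaxation step itself is small enough to sketch in full. The code below is an illustrative dense (min, +) matrix-vector step with hypothetical names, not the `lance-graph::blasgraph` kernel, and it assumes no-edge entries are encoded as `f32::INFINITY`.
+
+```rust
+/// One (min, +) matrix-vector step:
+/// out[j] = min(d[j], min over i of (d[i] + w[i][j])).
+/// `w[i][j]` is the edge cost i → j; f32::INFINITY marks "no edge".
+fn tropical_step(d: &[f32], w: &[Vec<f32>]) -> Vec<f32> {
+    let n = d.len();
+    (0..n)
+        .map(|j| (0..n).fold(d[j], |best, i| best.min(d[i] + w[i][j])))
+        .collect()
+}
+
+/// Repeat the step until costs stop changing or `max_iters` is reached.
+fn tropical_relax(mut d: Vec<f32>, w: &[Vec<f32>], max_iters: usize) -> Vec<f32> {
+    for _ in 0..max_iters {
+        let next = tropical_step(&d, w);
+        if next == d {
+            break; // fixed point: no cost improved this round
+        }
+        d = next;
+    }
+    d
+}
+```
+
+Swapping `min` for `+` and `+` for `×` in `tropical_step` recovers an ordinary matrix-vector product, which is why the same tiling strategy applies to both.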
+
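+For 4K @ 60 fps, a 3840×2160 frame holds about 2K 64×64 CTUs, or roughly 130K 8×8 leaves. Taking the ~130K leaf count as the unit of partition work, this is the difference between **4 ms and 64 ms per frame just for partition RDO**; the 16× ratio holds whichever unit is counted. At 60 fps's 16.67 ms budget, naive RDO at that cost doesn't fit.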
+
+**Dep direction:** the tropical-GEMM kernel lives in `lance-graph::blasgraph` (it's been the cognitive-side substrate for years). Post-Plan-H, `ndarray-codec → lance-graph::blasgraph` is allowed because both are sibling crates above `ndarray` hardware.
+
+### 2.5 Basin assignment: k-means as batched dist + argmin
+
+For each cell, find the nearest of 4096 basin centroids:
+
+```text
+distances[c] = ||cell - centroid_c||²   for c in 0..4096
+basin = argmin(distances)
+```
+
+Both the distance computation and the argmin are batched primitives:
+
+- **Distance computation:** if cells are i8 fingerprints, batched Hamming distance via XOR + `VPOPCNTDQ` (Ice Lake+). If cells are f32/bf16, batched L2 via `_mm512_sub_ps` followed by a squared accumulate (`_mm512_fmadd_ps`).
+- **Across 4096 centroids:** matrix form. `dist = ||cells||² ⊕ ||centroids||² − 2 · (cells @ centroids^T)`. Same SSD identity as ME, scaled to codebook size.
+
+`cam_pq::kmeans` already ships this in `src/hpc/`. The codec's basin-assign step is a thin wrapper.
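+
+For the Hamming case, the distance-plus-argmin step can be sketched in a few lines of scalar Rust. This is a sketch under stated assumptions: `Fp` is a hypothetical 64-bit stand-in for the real `Fingerprint` type, and the function names are illustrative rather than the `cam_pq` API.
+
+```rust
+/// Hypothetical 64-bit fingerprint; the real `Fingerprint` type may be wider.
+type Fp = u64;
+
+/// Index of the centroid with the smallest Hamming distance to `cell`.
+/// Ties resolve to the lowest index. Returns None for an empty codebook.
+fn nearest_basin(cell: Fp, centroids: &[Fp]) -> Option<u16> {
+    assert!(centroids.len() <= 1 << 12, "codebook is at most 4096 entries");
+    centroids
+        .iter()
+        .enumerate()
+        // XOR marks differing bits; count_ones is the scalar popcount.
+        .min_by_key(|&(_, &c)| (cell ^ c).count_ones())
+        .map(|(i, _)| i as u16)
+}
+
+/// Assign every cell to its nearest basin.
+fn assign_basins(cells: &[Fp], centroids: &[Fp]) -> Vec<Option<u16>> {
+    cells.iter().map(|&c| nearest_basin(c, centroids)).collect()
+}
+```
+
+A vectorised version would replace `count_ones` with a wide popcount over many centroids at once; the argmin structure stays the same.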
+
+### 2.6 Deblocking filter: per-pixel conv → im2col GEMM (only at scale)
+
+3×3 / 5×5 separable filters at block edges. For a single CU's deblocking pass (~64 edge pixels), per-pixel conv wins. For batched deblocking across many CUs in a frame, im2col + GEMM wins by ~3-5× on AMX-class hardware.
+
+x265's deblocking is one of the few stages that explicitly has per-block-size branches; PR-X12 keeps the same structure but dispatches the batched form through `ndarray::hpc::activations`.
+
+### 2.7 rANS: stays as state machine
+
+Not a GEMM. State machine that reads symbols, looks up `(freq, cumfreq)`, advances u32 state. ~10 ns/symbol on modern x86. Per-frame rebuild of the frequency table is the only batchable step (a sum-reduce, trivially SIMD).
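+
+To make the serial dependency concrete, here is a toy rANS round trip. It is a deliberately simplified sketch and not the `ndarray-codec::ans` implementation: the frequency table is a fixed hypothetical one, the state is a `u64` rather than the `u32` named above, and there is no renormalisation, so it only suits short messages.
+
+```rust
+/// Toy rANS over a fixed frequency table, without renormalisation: the whole
+/// message must fit in one u64 state, so this only suits short inputs.
+const FREQ: [u64; 3] = [2, 1, 1]; // symbol frequencies, summing to TOTAL
+const CUM: [u64; 3] = [0, 2, 3]; // exclusive prefix sums of FREQ
+const TOTAL: u64 = 4;
+
+/// Encode symbols in reverse so that decoding yields them in forward order.
+fn rans_encode(symbols: &[usize]) -> u64 {
+    symbols.iter().rev().fold(1u64, |x, &s| {
+        (x / FREQ[s]) * TOTAL + CUM[s] + (x % FREQ[s])
+    })
+}
+
+/// Decode `n` symbols; each step depends on the state left by the previous
+/// one, which is why this loop cannot be batched like a GEMM.
+fn rans_decode(mut x: u64, n: usize) -> Vec<usize> {
+    let mut out = Vec::with_capacity(n);
+    for _ in 0..n {
+        let slot = x % TOTAL;
+        // Find the symbol whose [CUM, CUM + FREQ) interval holds `slot`.
+        let s = (0..FREQ.len()).rev().find(|&i| CUM[i] <= slot).unwrap();
+        x = FREQ[s] * (x / TOTAL) + slot - CUM[s];
+        out.push(s);
+    }
+    out
+}
+```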
+
+### 2.8 Header bit-pack / stream framing
+
+Per-leaf, 5-30 ns. No GEMM. Lives in `mode.rs::pack_header` / `pack_leaf` and the future `stream.rs`.
+
+---
+
+## 3. Why HEVC's 2013 design space was BLAS-impoverished
+
+The HEVC spec was finalised in early 2013, against the following hardware:
+
+- **No VNNI** — Cascade Lake shipped 2019. `VPDPBUSD` is six years after HEVC was frozen.
+- **No AMX** — Sapphire Rapids shipped 2023. Ten years after the spec.
+- **No bfloat16.** AVX512_BF16 first appeared on Cooper Lake (2020), and AMX BF16 tiles on SPR. HEVC's transform precision was set to fit i16 because i16 GEMM on Sandy Bridge SSE4 was the only practical option.
+- **No `VPOPCNTDQ`** — Ice Lake 2019. HEVC's palette mode (SCC profile) was frozen with the assumption that 64-entry palettes were the cap, because larger palettes would have needed Hamming-distance GEMM that didn't exist.
+
+**The HEVC team made the right choices for 2013 hardware.** Per-pixel butterflies were faster than batched GEMM at small sizes. SAD via `VPSADBW` was faster than SSD via any 2013-era integer SIMD. 64-entry palettes were the largest size where the linear-scan k-means inner loop fit L1 budget.
+
+**Every one of those choices is now obsolete.** The PR-X12 substrate isn't a redesign of HEVC's wire format — it's HEVC's wire format with the inner loops swapped out for what 2026 hardware actually wants.
+
+---
+
+## 4. The reframing: PR-X12 IS x265 done as BLAS
+
+| Aspect | HEVC reference | PR-X12 |
+|---|---|---|
+| Wire format | 16-bit header + per-mode tail | **same** |
+| Mode taxonomy | Skip / Merge / Delta / Escape | **same** |
+| Quad-tree partition | 64×64 CTU → 8×8 leaf | **same**, `Ctu<const N>` runtime-flex (M:E-G) |
+| Palette / basin codebook | 64 entries max | 4096 entries (12-bit, full HHTL Leaf tree) |
+| RDO criterion | `D + λ·R` Lagrangian | **same** |
+| RDO solver | recursive `O(4^d)` | tropical-GEMM `O(d²)` (R-7) |
+| ME criterion | SAD | SSD (R-6): a different metric whose GEMV expansion is exact |
+| Transform | per-block Loeffler | batched DCT GEMM at N≥64 (R-5) |
+| Entropy coder | CABAC | rANS — better Shannon-efficiency, simpler state |
+| In-loop deblocking | per-pixel conv | im2col GEMM at batch (existing infra) |
+
+**The wire-format semantics are unchanged.** A PR-X12 stream is not bit-compatible with an HEVC-spec decoder, because rANS replaces CABAC and the palette grows to 4096 entries. Everything else carries over, because the semantic primitives (Skip/Merge/Delta/Escape, quad-tree CTU, RDO Lagrangian, DCT-II basis) are identical.
+
+**What changed is the implementation.** Each inner loop is now a BLAS call.
+
+---
+
+## 5. What lands in `ndarray::hpc::blas_level2` (the codec's BLAS surface)
+
+The codec uses, but does not own, these four primitives:
+
+```rust
+// R-6: ME via SSD identity
+pub fn batched_ssd_search(
+    candidates: &[i8],          // (N × 256) row-major, so length N * 256
+    n_candidates: usize,
+    reference: &[i8; 256],
+    out_distances: &mut [u32],  // length N
+);
+
+// R-5: batched DCT-II via GEMM
+pub fn batched_dct_ii<const N: usize>(
+    blocks: &[i16],             // (M blocks × N×N) row-major
+    n_blocks: usize,
+    out: &mut [i16],            // output coefficients
+);
+
+// R-7: tropical-GEMM partition (lives in blasgraph, called from codec)
+pub fn tropical_partition_rdo(
+    edge_weights: &[f32; 85],
+    out_min_costs: &mut [f32; 85],
+);
+
+// k-means basin assignment (uses existing cam_pq)
+pub fn kmeans_predict_batched(
+    cells: &[Fingerprint],
+    centroids: &[Fingerprint; 4096],
+    out_basin_idx: &mut [u16],
+);
+```
+
+**Codec layer:** ~30-50 LoC per stage to wrap the BLAS call into the predict/A6/A4 flow. **BLAS layer:** no new kernels. All four entry points either already exist or are thin compositions over existing infrastructure (`bf16_tile_gemm`, `cam_pq`, `simd_int_ops`).
+
+This is what makes R-3's ≤1500 generic-codec-LoC ceiling reachable. Most of the heavy lifting is already in `blas_level2`; the codec adds wrappers and orchestration, not new BLAS code.
+
+---
+
+## 6. The "blasgraph synergy" claim made precise
+
+Earlier docs cited "blasgraph + MKL synergies" loosely. Quantified:
+
+**Of nine codec inner loops, five dispatch to BLAS (quantization has a GEMM form but stays inline, §2.3):**
+
+| Loop | BLAS primitive | Existing infra |
+|---|---|---|
+| ME | SSD via VNNI GEMV | `blas_level2` after R-6 lands |
+| Transform | Batched DCT GEMM | `bf16_tile_gemm` + `DctIIBasis<N>` |
+| Quant | Stays per-byte | n/a |
+| Mode decision | Tropical-GEMM | `lance-graph::blasgraph` |
+| Basin assign | Hamming/L2 batched dist | `cam_pq::kmeans` |
+| Deblocking | im2col GEMM | `activations` (existing conv path) |
+| rANS | Stays state-machine | n/a |
+| Header | Stays per-byte | n/a |
+| Framing | Stays per-byte | n/a |
+
+**On SPR with all five BLAS-dispatch paths active**, projected estimate (to be calibrated during Plan G):
+
+- ~80% of total encode time spent inside BLAS calls
+- ~15% in rANS + header + framing (the per-byte paths)
+- ~5% in quantize + scalar housekeeping
+
+**HEVC reference encoder on the same SPR:** ~30% of encode time in GEMM-shaped kernels (mostly deblocking and ME bookkeeping); the rest is per-pixel butterflies + recursive RDO + SAD. For the remaining ~70% of the time, the wide SIMD and tile units sit mostly idle.
+
+**The 50× ME speedup, 16× partition RDO speedup, and ~5× transform speedup (per the §2.2 cycle counts) compose** because they sit in different stages of the encode pipeline: each removes time from a separate stage, so the savings add up rather than overlap. That is what puts end-to-end encode at 4K @ 60 fps within reach of a single SPR socket.
+
+---
+
+## 7. Plan G video lane: the falsifier
+
+Per R-4, the video lane of `codec-bench` clears `≥0.95× x265 ultrafast ratio at PSNR ±0.1 dB on Big Buck Bunny 1080p`. The R-11 latency assertion adds: total encode time for the clip must complete within (clip duration × 0.5).
+
+**The hidden falsifier in §6's BLAS-synergy claim:** if Plan G's video lane profile shows <60% time-in-BLAS, the BLAS reframing is decorative. Worse, it signals a critical bug: the per-byte stages (rANS / header / framing) are dominating, which means SIMD-batched-encode (R-11) didn't actually land on the codec hot path.
+
+**Suggested Plan G profile assert:** `perf stat -e cycles,instructions,L1-dcache-load-misses` over the encode, with a sub-test breaking down cycles per stage. If the BLAS-dispatch stages don't sum to ≥60% of cycles, the abstraction is wrong somewhere.
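+
+The per-stage check could look roughly like the sketch below. It assumes the profiler output has already been reduced to one cycle count per stage; `StageCycles` and both functions are hypothetical names, not part of the existing R-11 framework.
+
+```rust
+/// Cycle count for one encode stage (hypothetical record, not an existing type).
+struct StageCycles {
+    name: &'static str,
+    cycles: u64,
+    dispatches_to_blas: bool,
+}
+
+/// Fraction of all cycles spent in BLAS-dispatch stages (0.0 when empty).
+fn blas_fraction(stages: &[StageCycles]) -> f64 {
+    let total: u64 = stages.iter().map(|s| s.cycles).sum();
+    if total == 0 {
+        return 0.0;
+    }
+    let blas: u64 = stages
+        .iter()
+        .filter(|s| s.dispatches_to_blas)
+        .map(|s| s.cycles)
+        .sum();
+    blas as f64 / total as f64
+}
+
+/// Ok(fraction) when the floor is met, otherwise an error naming the
+/// non-BLAS stages so the report points at what dominated.
+fn check_blas_floor(stages: &[StageCycles], floor: f64) -> Result<f64, String> {
+    let frac = blas_fraction(stages);
+    if frac >= floor {
+        return Ok(frac);
+    }
+    let others: Vec<&str> = stages
+        .iter()
+        .filter(|s| !s.dispatches_to_blas)
+        .map(|s| s.name)
+        .collect();
+    Err(format!(
+        "time-in-BLAS {:.2} is below floor {:.2}; non-BLAS stages: {}",
+        frac,
+        floor,
+        others.join(", ")
+    ))
+}
+```
+
+Calling `check_blas_floor(&stages, 0.60)` from the bench harness would turn the falsifier into a hard CI failure instead of a number someone has to read.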
+
+This is the kind of test that catches "we wrote the code but it's not actually using the GEMM path because the dispatcher fell through to scalar" — a class of bug that ate weeks of PR #134 / #175 SIMD work and only surfaced in CI.
+
+---
+
+## 8. What this lens unlocks for x266 / next-gen codecs
+
+The next document (`pr-x12-x266-3dgs-spacetime-upscaling.md`) asks what's possible if the substrate isn't x265-compatible, that is, if we *replace* in-loop filters with 3DGS space-time interpolation. The answer becomes obvious once the codec is read as a GEMM pipeline: the in-loop filter is just another GEMM stage, and replacing it with a different GEMM (one whose output is a 3DGS-rendered reference frame) adds no architectural complexity; it only requires shipping a different `Basis<T>` impl.
+
+That's the bridge to the next doc.
+
+---
+
+## 9. Cross-references
+
+- **R-N citations:** `pr-x12-canon-resolutions-delta.md`
+- **Architecture canon:** `pr-x12-substrate-merged-canon.md`
+- **Mechanical spec:** `pr-x12-codec-x265-design.md` (what's getting reimplemented)
+- **Next lens:** `pr-x12-x266-3dgs-spacetime-upscaling.md`
+- **In-tree code:**
+  - `src/hpc/blas_level1.rs`, `blas_level2.rs`, `blas_level3.rs` — host for `batched_ssd_search`, `batched_dct_ii`
+  - `src/hpc/cam_pq.rs` — k-means basin assignment
+  - `src/hpc/bf16_tile_gemm.rs` — AMX-class GEMM dispatch
+  - `src/hpc/codec/{ctu,mode,predict}.rs` — codec wire format
+
+_Last edit: 2026-05-22._
diff --git a/.claude/knowledge/pr-x12-x266-3dgs-spacetime-upscaling.md b/.claude/knowledge/pr-x12-x266-3dgs-spacetime-upscaling.md
new file mode 100644
index 00000000..14ba0f2d
--- /dev/null
+++ b/.claude/knowledge/pr-x12-x266-3dgs-spacetime-upscaling.md
@@ -0,0 +1,328 @@
+# PR-X12 — x266 / Next-Gen Codec via 3DGS Space-Time Upscaling
+
+> Date: 2026-05-22
+> Status: **speculative perspective doc** — explores what becomes possible when the codec substrate (PR-X12) is extended one step beyond HEVC compatibility, into territory that subsumes both AI-frame-interpolation and AI-super-resolution as codec-native deterministic operations. Companion to `pr-x12-x265-blasgraph-gemm.md`.
+>
+> Status caveat: nothing in this document is committed as PR-X12 scope. It's the future shape that PR-X12's substrate makes obvious. Plan E + Plan G prerequisites must land first.
+>
+> Premise: in-loop reference-frame reconstruction in HEVC is a 2D pixel-grid render. In a 3DGS-augmented codec, it's a re-rasterization from a 3D Gaussian scene model. Same trait (`Basis<T>`), different impl. Resolution and frame rate become a decoder choice at playback time, no longer fixed by the encoder at capture.
+
+---
+
+## 0. One-sentence thesis
+
+**HEVC's in-loop filter is a `Basis<T>::apply` call whose output happens to be a 2D pixel array.** Replace that with an EWA-splat `Basis<T>::apply` whose output is a 2D rasterization of a 3D Gaussian scene at a parameter-controlled (resolution, time), and *the same encoder produces a free space-time upscalable bitstream* — no AI frame interpolation, no neural super-resolution, just deterministic re-rasterization from a scene model that already lives in the wire format.
+
+---
+
+## 1. The capability gap PR-X12 closes
+
+Current state-of-the-art for high-quality playback at non-native rate:
+
+- **Frame interpolation:** DAIN, RIFE, FILM — learned optical flow models that hallucinate intermediate frames. Per-frame inference cost ~30-100 ms on a GPU. Non-deterministic across model versions. No codec integration.
+- **Super-resolution:** ESRGAN, Real-ESRGAN, DLSS (learned upscalers). Per-frame cost similar. Same non-determinism and integration gap.
+- **Codec-native upscaling:** lanczos/bicubic — deterministic but low-quality; H.266/VVC adds Reference Picture Resampling, but it's still a 2D resample, not a 3D-scene-aware reconstruction.
+
+**The PR-X12 substrate exposes a third option:** ship a 3D scene model in the bitstream, and let the decoder render at arbitrary (res, fps). The 3D scene model is *the reference frame*, not a precomputed 2D image. This isn't novel as research (3DGS papers from 2023-2025) — it's novel as a *codec primitive*, because no codec has been able to express "the in-loop filter is a basis swap, swap it" cleanly. PR-X12 can.
+
+---
+
+## 2. 3DGS as a `Basis<T>` impl — the trait shape
+
+Recall (R-1, M:E-A): `LinearReduce<T>` decomposes into `Basis<T>` + reduction. The basis is data; the reduction is the inner loop. The codec's transform path calls `basis.apply(src, dst)`; Plan E's EWA splat rasterizer calls the same.
+
+```rust
+pub trait Basis<T: Copy> {
+    /// Apply this basis to a source array, writing into a destination.
+    /// For DCT: src = pixel block, dst = coefficient block.
+    /// For EWA: src = 3DGS scene params, dst = rasterized 2D pixel frame.
+    fn apply<R: Reducer<T>>(
+        &self,
+        src: &[T],
+        dst: &mut [T],
+        params: &Self::Params,  // basis-specific: viewport, time, etc.
+        reducer: R,
+    );
+
+    type Params;
+}
+
+// Existing (R-1, R-5):
+impl<const N: usize> Basis<i16> for DctIIBasis<N> {
+    type Params = ();
+    fn apply<R: Reducer<i16>>(&self, src: &[i16], dst: &mut [i16], _: &(), r: R) {
+        // batched DCT-II via bf16_tile_gemm at N >= 64
+    }
+}
+
+// Future (Plan E, then x266):
+impl Basis<f16> for EwaSplatBasis {
+    type Params = ViewportTime;  // camera pose + timestamp
+    fn apply<R: Reducer<f16>>(
+        &self,
+        scene: &[f16],                  // 3DGS scene params as flat f16 (5-7 KB per cell, see §6),
+                                        // so the signature matches the trait's `src: &[T]`
+        out_frame: &mut [f16],          // 2D pixel buffer at target res
+        vp: &ViewportTime,              // (W, H, t, pose), chosen by decoder
+        r: R,
+    ) {
+        // Rasterize 3DGS scene at (W, H, t)
+        // Same per-tile GEMM pattern as ndarray-image's existing EWA path
+    }
+}
+
+struct ViewportTime {
+    width: u32,
+    height: u32,
+    time_ms: u64,           // frame timestamp; 3DGS scene is continuous in t
+    camera_pose: Mat3x4f,   // identity for monoscopic; non-trivial for VR
+}
+```
+
+**The crucial property:** the codec body (`ndarray-codec`) doesn't know whether it's calling `DctIIBasis` or `EwaSplatBasis`. It dispatches via the trait. The bitstream header (`Ctu` header bits, see M:E-J) selects which basis is in play.
+
+This is exactly the kind of substrate flexibility R-1 was designed to provide. Without R-1, this paragraph is fantasy; with R-1, it's a 6-week engineering effort to land Plan E and wire the trait.
+
+---
+
+## 3. The encoder problem: fitting a Gaussian scene model to a clip
+
+Encoding a video clip with a 3DGS scene anchor means: given N input frames at known camera pose (or estimated pose), find a 3DGS scene S such that rendering S at each frame's (pose, time) reproduces the input frames to within a target PSNR.
+
+This is a standard 3DGS fitting problem (Kerbl et al. 2023, Mip-Splatting 2024). The relevant fact for PR-X12:
+
+```text
+Input: N frames @ (1080p, 24 fps) for 10 seconds = 240 frames
+Output: scene S = ~100K-500K anisotropic Gaussians
+        ~28 bytes per Gaussian (position 3×f16, scale 3×f16,
+        rotation quaternion 4×f16, color 3×f16, opacity 1×f16 = 28 B)
+        + quantized SH coefficients for view-dependent color (~8-16 B)
+        Total: ~40-50 bytes per Gaussian × 200K = 8-10 MB per scene anchor
+
+Compare to:
+        240 frames × ~3 MB (one uncompressed Bbb 1080p 4:2:0 8-bit frame) = ~720 MB raw
+        HEVC encoded @ ~5 Mbps = ~6.3 MB for the whole clip
+```
+
+So the scene-anchor encoding is the same order of magnitude as standard HEVC encoding *for one anchor period*. The win comes from:
+
+1. **Re-rasterization is free** — render at 4K, 8K, 60 fps, 120 fps, all from the same 8 MB scene model
+2. **Anchor periods stretch** — if motion is low, one anchor lasts 10+ seconds; HEVC has to re-encode I-frames every ~2 sec for random-access seek
+3. **View interpolation** — for VR/stereo, render two views from one scene; HEVC needs to encode two streams
+
+**The encoder pipeline:**
+
+```text
+Anchor frame n:
+    1. Estimate camera pose from frame n+0 through n+anchor_period (~240 frames)
+    2. Initialize Gaussian cloud from frame n's depth estimate
+    3. Optimize cloud via gradient descent: minimize Σ |render(S, t_k) - frame_k|²
+       (This is k-means-like; uses cam_pq infrastructure)
+    4. Quantize to scene-anchor format (see §5)
+
+Per-frame delta n+1, n+2, ...:
+    Standard HEVC inter-prediction against the 3DGS-rendered ref frame.
+    The 3DGS-rendered ref is computed by the decoder too, so the delta is
+    in the same algebraic space as HEVC.
+```
+
+The clever part: **the decoder's reference frame for inter-prediction is the 3DGS render at the current frame's (pose, t)**. So the per-frame delta is small: most motion is already captured in the scene model.
+
+---
+
+## 4. The decoder problem: rasterizing at arbitrary (res, fps)
+
+The decoder receives:
+
+- Scene anchor: scene S (8-10 MB) at clip start, then refreshes every ~250 frames
+- Per-frame deltas: standard HEVC-like residual, against the 3DGS-rendered ref
+
+At playback time:
+
+```text
+For each output frame at (W_target, H_target, t_target):
+    1. Render scene S at (W_target, H_target, t_target) via EwaSplatBasis::apply
+       Output: ref_frame in pixel buffer
+    2. Decode per-frame delta against ref_frame
+    3. Apply standard HEVC in-loop filtering (deblock + SAO)
+    4. Emit pixel buffer
+```
+
+**Key observation:** step 1 is parametrised in (W, H, t). The encoder shipped a 1080p @ 24 fps clip; the decoder renders at 4K @ 60 fps by choosing different (W_target, H_target, t_target) tuples. The scene model is continuous in (W, H, t); the rasterizer interpolates.
+
+This is **codec-native space-time upscaling**, deterministic across decoder implementations because the math (EWA splat rasterization) is well-specified. Same scene model, same camera pose, same t → same pixels. No model versioning. No "frame interpolation v3 hallucinates differently than v2."
+
+**Cost per frame:** EWA splat raster of 200K Gaussians at 4K → ~5-15 ms on a modern GPU; ~50-100 ms on CPU. The GPU figure fits the 16.7 ms budget of 60 fps. The CPU figure works out to 10-20 fps, which misses real time even at 24 fps (41.7 ms budget) without further optimisation. R-11 latency assertion applies: Plan G's decoder lane must hit real-time at the target playback rate.
+
+---
+
+## 5. Wire format: scene-anchor frames + per-frame deltas
+
+Building on M:E-J's 16-bit header layout (header_kind ∈ {Skip, Merge, Delta, Escape}), x266 needs one new header_kind: **SceneAnchor**.
+
+```text
+HEVC-compatible PR-X12 header (16 bits, R-2):
+    bits 0-1:   header_kind {Skip, Merge, Delta, Escape}
+    bits 2-13:  basin_index (12 bits, M:E-J)
+    bit  14:    CONSUMER-TYPED (semantic per frame-header `ConsumerProfile`;
+                cognitive: Pearl-rung high bit; video: reserved=0;
+                splat: LOD-cascade-source flag; gradient: worker-shard parity)
+    bit  15:    UNIVERSAL "has inter-tier reference" (A3-inter); identical
+                across all four consumers
+    NOTE: leaf-size (8/16/32/64) is encoded structurally via `Ctu<const N>`
+    (M:E-G) at the type level, not via header bits.
+
+x266 extension (NOT in PR-X12 scope, future):
+    bits 0-1:   header_kind, still 2 bits, with code 11 shared by two kinds
+                  00 = Skip (HEVC-compatible)
+                  01 = Merge (HEVC-compatible)
+                  10 = Delta (HEVC-compatible)
+                  11 = Escape OR SceneAnchor (escape bit at byte boundary disambiguates)
+    bits 2-15:  basin_index (12 bits) + scene_anchor_id (2 bits) when in anchor mode
+```
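+
+The PR-X12 layout above (not the x266 extension) can be exercised with a small pack/unpack pair. This is a sketch under two assumptions that the text does not state: bit 0 is the least significant bit, and the kind codes follow the 00/01/10/11 order listed for the extension. The names are hypothetical and deliberately differ from `mode.rs::pack_header`.
+
+```rust
+/// The four header kinds held in bits 0-1.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum HeaderKind {
+    Skip = 0,
+    Merge = 1,
+    Delta = 2,
+    Escape = 3,
+}
+
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+struct LeafHeader {
+    kind: HeaderKind,
+    basin_index: u16,     // bits 2-13, valid range 0..4096
+    consumer_bit: bool,   // bit 14, meaning depends on ConsumerProfile
+    inter_tier_ref: bool, // bit 15, universal
+}
+
+/// Pack into 16 bits; None if basin_index does not fit in 12 bits.
+fn pack_sketch(h: LeafHeader) -> Option<u16> {
+    if h.basin_index >= (1 << 12) {
+        return None;
+    }
+    Some(
+        (h.kind as u16)
+            | (h.basin_index << 2)
+            | (u16::from(h.consumer_bit) << 14)
+            | (u16::from(h.inter_tier_ref) << 15),
+    )
+}
+
+/// Inverse of `pack_sketch`; every 16-bit value decodes to some header.
+fn unpack_sketch(bits: u16) -> LeafHeader {
+    let kind = match bits & 0b11 {
+        0 => HeaderKind::Skip,
+        1 => HeaderKind::Merge,
+        2 => HeaderKind::Delta,
+        _ => HeaderKind::Escape,
+    };
+    LeafHeader {
+        kind,
+        basin_index: (bits >> 2) & 0x0FFF,
+        consumer_bit: (bits >> 14) & 1 == 1,
+        inter_tier_ref: (bits >> 15) & 1 == 1,
+    }
+}
+```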
+
+**Anchor frame payload** (after the 16-bit header):
+
+```text
+SceneAnchorFrame:
+    scene_id: u8                        // which anchor in the GOP
+    num_gaussians: u24                  // typically 50K - 500K
+    cam_pose_keyframes: u8              // number of pose anchors
+    [GaussianRecord; N]:                // 40-50 bytes each, quantized
+        position: [u16; 3]              // q15 fixed-point per axis
+        scale_log: [u8; 3]              // log-quantized
+        rot_quat: [u8; 4]               // quantized to 8-bit
+        sh_coeffs: [u8; 27]             // 9 coefs per channel × 3 channels, q7
+        opacity: u8
+    pose_keyframes: [(t_ms: u32, Mat3x4f); cam_pose_keyframes]
+```
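+
+The per-record byte count can be checked mechanically. The sketch below mirrors the field list above as a packed Rust struct; the `repr(C, packed)` layout and the helper function are assumptions for illustration, since the doc does not specify an in-memory representation.
+
+```rust
+/// One quantized Gaussian, laid out exactly as the field list above.
+/// `packed` removes the trailing pad byte that u16 alignment would add.
+#[allow(dead_code)]
+#[repr(C, packed)]
+#[derive(Clone, Copy)]
+struct GaussianRecord {
+    position: [u16; 3],  // q15 fixed-point per axis, 6 bytes
+    scale_log: [u8; 3],  // log-quantized, 3 bytes
+    rot_quat: [u8; 4],   // 8-bit quaternion, 4 bytes
+    sh_coeffs: [u8; 27], // 9 coefficients × 3 channels, 27 bytes
+    opacity: u8,         // 1 byte
+}
+
+/// Size of one record on the wire under this layout.
+const RECORD_BYTES: usize = std::mem::size_of::<GaussianRecord>();
+
+/// Bytes used by the Gaussian array of one anchor (headers and poses excluded).
+fn anchor_payload_bytes(num_gaussians: usize) -> usize {
+    num_gaussians * RECORD_BYTES
+}
+```
+
+Under this layout a record is 41 bytes, at the low end of the 40-50 byte range, and 200K Gaussians come to 8.2 MB before pose keyframes.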
+
+Per-frame deltas after the anchor are standard HEVC-like, with one difference: the reference frame is derived by rasterizing the anchor scene at the frame's (pose, t), not by decoding a prior I-frame.
+
+**Bitstream compatibility:** an HEVC-spec decoder that doesn't understand `SceneAnchor` headers can at best fall back to displaying the inter-frame deltas as zero-padded blocks (visibly broken, but it won't crash). A PR-X12 decoder with `EwaSplatBasis` loaded plays back at native quality.
+
+---
+
+## 6. Bandwidth math: when does this beat HEVC?
+
+Rough rule (calibrated against published 3DGS papers):
+
+```text
+Clip:   10 seconds, 1080p, 30 fps, modest motion (e.g. Bbb sample)
+
+HEVC reference (5 Mbps avg, hardware encoded):
+    bytes = 5 × 10⁶ × 10 / 8 = 6.25 MB
+
+PR-X12 + 3DGS anchor (single anchor for the clip):
+    anchor: 200K Gaussians × 40 B = 8 MB
+    deltas: ~300 frames × 1 KB avg = 300 KB
+    Total: 8.3 MB
+
+→ HEVC wins by ~25% for native (1080p, 30 fps) playback.
+
+BUT for 4K @ 60 fps playback:
+    HEVC: re-encode at 4K/60fps target = 4 (res) × 2 (fps) × 6.25 = 50 MB
+            (4× pixel scaling × 2× framerate scaling × 6.25 MB native bitrate;
+             or super-res upscaling at decode = 6.25 MB + neural inference)
+    PR-X12 + 3DGS: same 8.3 MB
+            decoder rasterizes at (4K, 60 fps); the math is in the scene
+
+→ PR-X12 wins by ~6× for high-resolution playback,
+   AND playback is deterministic (no neural model versioning).
+```
+
+**Where the crossover sits:** PR-X12 + 3DGS becomes a win when the playback target (W × H × fps) exceeds the encode target by ~1.3× (the point at which HEVC's re-encoded size crosses the fixed 8.3 MB PR-X12 budget). At 1× (native), HEVC is a hair cheaper. At 8× pixel-bandwidth (4K@60 from 1080p@30), PR-X12 dominates by ~6×.
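+
+The crossover arithmetic is easy to reproduce. The sketch below hard-codes the two sizes from the worked example and assumes, as the example does, that HEVC size scales linearly with pixel count and frame rate; the constant and function names are hypothetical.
+
+```rust
+/// Native HEVC size from the example: 5 Mbps × 10 s / 8.
+const HEVC_NATIVE_MB: f64 = 6.25;
+/// Scene anchor plus deltas from the example: 8 MB + 0.3 MB.
+const SCENE_TOTAL_MB: f64 = 8.3;
+
+/// HEVC size when re-encoded at a scaled target (linear-scaling assumption).
+fn hevc_size_mb(pixel_scale: f64, fps_scale: f64) -> f64 {
+    HEVC_NATIVE_MB * pixel_scale * fps_scale
+}
+
+/// How many times larger the HEVC stream is than the fixed scene stream.
+fn hevc_to_scene_ratio(pixel_scale: f64, fps_scale: f64) -> f64 {
+    hevc_size_mb(pixel_scale, fps_scale) / SCENE_TOTAL_MB
+}
+
+/// Combined (pixels × fps) scale at which both streams are the same size.
+fn crossover_scale() -> f64 {
+    SCENE_TOTAL_MB / HEVC_NATIVE_MB
+}
+```
+
+With these inputs `crossover_scale()` is about 1.33 and `hevc_to_scene_ratio(4.0, 2.0)` is about 6.0, matching the ~1.3× and ~6× figures quoted above. Real encoders scale sub-linearly with resolution, so the true crossover would sit somewhat higher.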
+
+This matches the intuition that **3DGS is a scene model**, not a frame model — its compression ratio improves with resolution, while HEVC's degrades.
+
+---
+
+## 7. The "free upscaling" insight — why this isn't AI
+
+Critics will read §6 and say "this is just AI upscaling rebranded." The distinction is sharper than it sounds.
+
+**AI upscaling** (DLSS, ESRGAN, Real-ESRGAN, RIFE, DAIN, FILM):
+- Input: 2D pixel array at low res
+- Model: learned NN with millions of parameters; non-deterministic across versions
+- Output: 2D pixel array at high res, with hallucinated detail
+- Failure mode: hallucinates wrong detail (e.g. wrong text on a sign)
+- Latency: per-frame ~30-100 ms on a GPU
+- Codec integration: zero
+
+**PR-X12 + 3DGS rasterization** (this doc):
+- Input: 3D Gaussian scene + camera pose
+- Model: closed-form EWA splat formula (Zwicker et al. 2001, refined in 3DGS papers)
+- Output: 2D pixel array at any res, computed deterministically
+- Failure mode: misses detail that wasn't in the scene model — but never hallucinates
+- Latency: per-frame ~5-15 ms on a GPU; ~50-100 ms on CPU
+- Codec integration: full, basis trait dispatch
+
+The 3DGS scene captures the actual 3D geometry of what was in front of the camera. Rasterizing at higher resolution doesn't invent detail — it *samples the 3D scene more finely*. If the encoder couldn't fit a detail (e.g. the text on a small sign), the decoder can't recover it. That's a **failure of completeness**, not a failure of fidelity. Compare to AI upscaling, which has both modes and can't tell you which is happening.
+
+For high-stakes video (legal evidence, medical imaging, scientific recording), this distinction matters. PR-X12 + 3DGS is **legally and scientifically defensible** in a way no learned upscaler can be.
+
+---
+
+## 8. PR-X12 prerequisites
+
+Nothing in this doc is in PR-X12 scope. What it requires from PR-X12:
+
+| Requirement | Source | Status |
+|---|---|---|
+| `Basis<T>` trait with parametric `apply` | R-1, M:E-A | landed in concept; implementation in Plan A4 |
+| EWA splat rasterizer as `Basis<f16>` impl | Plan E | scheduled |
+| Codec body decoupled from specific basis | M:H-NEW-2 LoC envelope | enforced via R-3 audit |
+| Header byte stable across basis swaps | R-2, M:E-J bits 0-1 | landed |
+| Plan G video lane validates per-arch latency | R-4, R-11 | scheduled |
+| Federated codebook policy for scene anchors | R-13 | landed |
+
+The path to x266-like capability is:
+
+1. Land PR-X12 (HEVC-compatible, no 3DGS). Plan A4 → Plan H.
+2. Plan E ships EWA splat as `Basis<f16>`.
+3. New crate `ndarray-codec-scene` (or extension within `ndarray-codec`) adds `SceneAnchor` header kind + scene encoder/decoder pipelines.
+4. Bench against AI upscaling pipelines (RIFE / Real-ESRGAN) on quality and latency.
+5. Standardise the wire format extension (separate spec, not HEVC-compatible).
+
+Conservative estimate: **24-36 months from PR-X12 merge**, assuming Plan E lands on schedule and 3DGS encoder math is taken from existing research (no novel algorithms required).
+
+---
+
+## 9. Falsifiers
+
+What kills this path? Be specific:
+
+**F-1: Encoder math doesn't converge for general video.** 3DGS papers focus on static scenes with controlled camera motion. Real video has occlusion, transparency, fast motion. If 3DGS fitting can't hit PSNR ≥ 35 dB on motion-heavy clips (e.g. sports footage) within reasonable encode time, the substrate is decorative. **Mitigation:** restrict scope to slow-camera-motion content (talking heads, drone footage, security cameras); HEVC stays the fallback for sports.
+
+**F-2: Decoder rasterization too slow.** If EwaSplatBasis::apply can't hit real-time at 4K @ 60 fps on a 2026-class CPU, the codec is server-side only. **Mitigation:** PR-X12's R-11 latency assertion catches this in CI; if the CPU path fails, the codec emits a GPU-required flag in the bitstream.
+
+**F-3: Wire format ossifies.** If HEVC stays dominant and x266 adoption is slow (the H.266/VVC story so far — 2020 release, still <5% market share in 2026), the SceneAnchor extension never sees a standards body. **Mitigation:** ship it as a non-standard extension first, in an open-source decoder; let market traction force standardisation.
+
+**F-4: Patents.** 3DGS-as-codec-primitive may sit in a patent thicket. Some 3DGS rendering optimisations (tile binning, depth sorting) are likely patented. **Mitigation:** the basis trait is general; if Gaussian splats are patented, swap to another basis (TensoRF, NeRF compaction, point cloud + bilinear) — same architecture, different math.
+
+None of these falsifiers invalidate PR-X12 itself. They only constrain the post-PR-X12 path.
+
+---
+
+## 10. Why this lens matters now, for PR-X12 scoping
+
+The temptation in scoping PR-X12 is to optimise for HEVC compatibility only — strip out anything that doesn't directly serve the x265-replacement story. **The basis trait (R-1) and the EWA-splat schedule (Plan E) survive that pruning** because they were independently motivated. This doc makes the case that they were also the right call by another measure: they're the substrate that lets x266 happen at all.
+
+Concretely:
+
+- **Do not** weaken `Basis<T>` to be DCT-only "for now." The generality has zero LoC cost (the trait is the same) and unlocks 3DGS later.
+- **Do** keep Plan E on the roadmap even if Plan H/codec-fast-path pressure tries to defer it. EWA splat is the first non-DCT basis and validates the trait shape.
+- **Do** keep the codec body free of basis-specific code. M:H-NEW-2's "ratchet on codec LoC at the basis boundary" already enforces this; the x266 lens is why it matters.
+
+---
+
+## 11. Cross-references
+
+- **Substrate dependencies:** R-1, R-2, R-3, R-11, R-13 in `pr-x12-canon-resolutions-delta.md`
+- **Basis trait architecture:** §M:E-A in `pr-x12-substrate-merged-canon.md`
+- **EWA splat planning:** Plan E in `pr-arc-inventory.md`
+- **Codec foundation:** `pr-x12-codec-x265-design.md`
+- **GEMM lens:** `pr-x12-x265-blasgraph-gemm.md`
+- **Bandwidth comparison reading list:** 3DGS (Kerbl et al. SIGGRAPH 2023), Mip-Splatting (Yu et al. 2024), 4DGS (Wu et al. 2024)
+
+_Last edit: 2026-05-22._
+_Status: speculative — explores what's possible after PR-X12 lands; not in PR-X12 scope._
diff --git a/.gitignore b/.gitignore
index bf2b8312..286c990a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -10,3 +10,6 @@ target/
 
 # Claude Code: agent isolation worktrees (temporary, per-agent)
 .claude/worktrees/
+
+# Claude Code: per-user permission overrides (survives branch switches)
+.claude/settings.local.json