From 484bbb8f9a1190fe38913dead2d4a7ee4c1e514e Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Fri, 22 May 2026 07:32:09 +0000
Subject: [PATCH 1/6] docs(codec): cross-stack mapping + integration plan +
 debt inventory
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Lands `.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md`
(~576 lines) as the survives-compaction companion to the as-shipped
`pr-x12-codec-x265-design.md`. Generalises the codec spec across the
four industries it implicitly unifies (HEVC video, Gaussian splat,
cognitive shaders / sparse attention, gradient compression for LLM
training).

# Why this doc

Triggered by the PR #195 review session. PR-X12's mode taxonomy
(Skip/Merge/Delta/Escape) shows up under different names in four
disconnected research communities; each treats its own corner as
the central knob (HEVC RDO, ZeRO buckets, splat sparsity reg,
attention pruning). Without explicit naming of the unification,
future agents will rediscover each corner independently and
reimplement what the codec already provides.

# Structure (citable by section number)

- §0 — the big claim (PR-X12 is the gradient-quantisation substrate
  GenAI training has been missing for two years)
- §1 — four-axis mapping table (x265 / splat / cognitive / gradient)
- §2-§7 — deep mappings (mode taxonomy, CTU quad-tree, palette/basin
  codebook, transform basis, rANS, λ-RDO)
- §8 — 15 numbered epiphanies (E-1..E-15)
- §9 — 7 holy grail claims (H-1..H-7)
- §10 — integration plan per sub-card (A4/A6/A7/A8) + 3 new PRs
  (splat, cognitive, gradient consumers)
- §11 — exploration paths ranked by confidence (15 entries)
- §12 — technical debt inventory (23 items T-1..T-23 across codec,
  ndarray substrate, lance-graph cognitive, cross-repo, PR #195)
- §13 — 6 blasgraph/MKL synergies the HEVC team couldn't reach in
  2013 (block-matched ME, batched DCT, partition tree as tropical-
  GEMM, CABAC replacement, learned deblocking, k-means at frame rate)
- §14 — cross-references (design docs, rules, code paths)
- §15 — how to use this doc (read order per use case)

# Holy grail claims (load-bearing, citable)

- H-1: PR-X12 + cam_pq is the HEVC SCC codec 2013 hardware couldn't
  build (4096-entry codebook at 60 fps)
- H-2: The transform IS the optimiser (DCT-II ↔ Adam ↔ KFAC ↔
  learned conv all share `Δ' = B·Δ`) — most underrated mapping
- H-3: CTU quad-tree is the universal hierarchical-attention
  substrate (Mistral sliding window is depth-3 leaf processing)
- H-4: rANS + k-means achieves Shannon-optimal lossless gradient
  compression (every published scheme is a special case)
- H-5: PR-X12 generalises ZeRO (Merge captures parameter sharing
  that ZeRO's 3-bucket scheme can't)
- H-6: 64×64 CTU is the right unit for both 4K video and 7B LLMs
- H-7: The codec is the substrate; everything else is a renaming

# Technical debt (23 items)

Severity gradient P0/P1/P2/P3. Notable open items:
- T-1, T-2: PR #195 CodeRabbit findings (BASIN_NONE collision +
  unwrap_or non-bijection) — actionable today
- T-10..T-15: ndarray substrate (HPC graduation incomplete, no
  `Transform` trait yet, NEON tile-GEMM stub)
- T-20..T-23: cross-repo coordination (cognitive-substrate-
  convergence-v1.md cross-ref, causal-edge v2 metadata wiring)

# Board hygiene

AGENT_LOG.md prepended with the session entry per CLAUDE.md rule
"every PR adding a finding/agent run must PREPEND an AGENT_LOG
entry in the same commit". Entry cites the doc structure, the 15
epiphanies + 7 holy grails by ID, and the 23 debt items.

# No code changes

Pure architecture / knowledge artifact. `cargo check --lib` clean
(nothing to recheck).

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
---
 .claude/board/AGENT_LOG.md                    |  73 +++
 ...r-x12-codec-cognitive-substrate-mapping.md | 576 ++++++++++++++++++
 2 files changed, 649 insertions(+)
 create mode 100644 .claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md

diff --git a/.claude/board/AGENT_LOG.md b/.claude/board/AGENT_LOG.md
index cf9133a9..9a7c5ba2 100644
--- a/.claude/board/AGENT_LOG.md
+++ b/.claude/board/AGENT_LOG.md
@@ -28,6 +28,79 @@
 ## Entries (append below; newest first)
 
 
+## 2026-05-22T18:00 — PR-X12 cross-stack architecture session (opus 4.7)
+
+**Branch:** `claude/continue-ndarray-x0Oaw`
+**Triggered by:** PR #195 review (A2 mode bit-pack + A3-intra prediction kernel)
+**Verdict:** SHIP — survives-compaction architecture doc landed.
+
+**Output:** `.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md` (~576 lines)
+— cross-stack mapping (x265 ↔ Gaussian splat ↔ cognitive shaders ↔ blasgraph/MKL ↔ gradient optimisation) — companion to the as-shipped `pr-x12-codec-x265-design.md`, generalising the codec spec across the rest of the stack.
+
+**Structure (citable by section number):**
+- §0 — the big claim (PR-X12 is the gradient-quantisation substrate GenAI training has been missing for two years)
+- §1 — four-axis mapping table (x265 / splat / cognitive / gradient)
+- §2-§7 — deep mappings (mode taxonomy, CTU quad-tree, palette/basin codebook, transform basis, rANS, λ-RDO)
+- §8 — 15 numbered epiphanies (E-1..E-15)
+- §9 — 7 holy grail claims (H-1..H-7)
+- §10 — integration plan per sub-card (A4/A6/A7/A8) + 3 new PRs (splat, cognitive, gradient consumers)
+- §11 — exploration paths ranked by confidence (15 entries across high/medium/speculative/watch)
+- §12 — technical debt inventory (codec-side, ndarray substrate, lance-graph cognitive, cross-repo, PR #195 specific) — 23 numbered items T-1..T-23
+- §13 — 6 blasgraph/MKL synergies the HEVC team couldn't reach in 2013
+- §14 — cross-references (design docs, rules, code paths)
+- §15 — how to use this doc (read order per use case)
+
+**Key epiphanies (citation form):**
+- **E-1**: Skip/Merge/Delta/Escape IS ZeRO's compression policy (with Merge = LoRA-group sharing that ZeRO doesn't have)
+- **E-2**: CTU quad-tree IS Mistral's sliding-window attention hierarchy
+- **E-3**: K-means at frame rate is the HEVC SCC unlock — 2013-era hardware couldn't, our stack can
+- **E-4**: Transform basis IS the optimiser's preconditioner (DCT-II ↔ Adam ↔ KFAC ↔ learned conv all share `Δ' = B·Δ`)
+- **E-5**: rANS + k-means = Shannon-optimal lossless gradient compression
+- **E-6**: λ-RDO is the universal training objective (same `λ·D + R` across codec, ZeRO, splat, attention)
+- **E-7..E-11**: 5 blasgraph/MKL synergies x265 couldn't reach (block-matched ME via i8gemm, partition tree as tropical-GEMM, CABAC replacement with tiny transformer, deblocking as learned conv, palette k-means as GEMM)
+- **E-12..E-15**: invariants pinned (wire codes = enum discriminants; basin codebook IS rANS frequency table; PR-X12 is the cross-domain unification PR; reserved header bits 14-15 are the inter-tier link)
+
+**Key holy grail claims (load-bearing):**
+- **H-1**: PR-X12 + cam_pq is HEVC SCC done right with 4096-entry codebook at 60 fps
+- **H-2**: The transform IS the optimiser (most underrated mapping)
+- **H-3**: CTU quad-tree is the universal hierarchical-attention substrate
+- **H-4**: rANS + k-means achieves Shannon-optimal lossless gradient compression
+- **H-5**: PR-X12 generalises ZeRO (Merge is the bucket ZeRO doesn't have)
+- **H-6**: 64×64 CTU is the right unit for both 4K video and 7B LLMs (convergent evolution)
+- **H-7**: The codec is the substrate; everything else is a renaming
+
+**Technical debt inventory (citable as T-N):**
+- T-1, T-2: PR #195 open CodeRabbit findings (BASIN_NONE collision + unwrap_or non-bijection)
+- T-3..T-9: codec-side P2/P3 (A3 first-fit vs RDO, lossy fallback signalling, inter-tier readiness)
+- T-10..T-15: ndarray substrate (HPC graduation incomplete, no `Transform` trait yet, NEON tile-GEMM stub, no Result-returning encode API)
+- T-16..T-19: lance-graph cognitive layer (cross-repo dep direction, tropical-GEMM not wired, GridLake A2 derive missing)
+- T-20..T-23: cross-repo coordination (branch-name aliasing, convergence-v1 cross-ref, causal-edge v2 metadata, architecture boundary note for A6)
+
+**Integration plan (per sub-card):**
+- A4 transform → 1 week, ship `Transform` trait + DCT-II + Identity, batched dispatch to bf16_tile_gemm at ≥64 blocks
+- A6 RDO → 1 week, λ-weighted Lagrangian, replaces predict_intra first-fit when λ>0
+- A7 rANS → 1.5 weeks, per-frame frequency from shared k-means pass
+- A8 stream → 1 week, wire-format spec including ZeRO-compatible framing
+- A3-inter → 0.5 weeks (extends Merge to 5-candidate via 3-bit MergeDir; uses reserved header bits 14-15)
+- New PR splat consumer → 0.5 weeks after A4+A6
+- New PR cognitive consumer → 1 week after A4+A6+A7
+- New PR gradient compression (burn/candle) → 2 weeks cross-repo
+
+**No code changes this session** — pure architecture doc.
+
+**Verification:**
+- `cargo check --lib` → clean (no code touched)
+- Doc cross-references confirmed against actual file paths + line numbers
+- `pr-x12-codec-x265-design.md` cross-reference preserved
+- Pinned 23 technical debt items with severity gradient (P0/P1/P2/P3)
+
+**Why this doc is load-bearing:**
+PR-X12 sits at the intersection of four industries that each treat their own corner as the central knob (HEVC RDO, ZeRO bucket choice, splat sparsity reg, attention pruning). Without explicit naming of the unification, downstream agents will rediscover each corner independently and reimplement what the codec already provides. This doc names the unification + pins the citation numbering so future PR descriptions can reference "H-2" or "E-9" by stable identifier.
+
+**Commit:** TBD (pending push).
+
+---
+
 ## 2026-05-21T16:00 — substrate-graduation batch 3 (opus 4.7)
 
 **Branch:** `claude/continue-ndarray-x0Oaw`
diff --git a/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md b/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md
new file mode 100644
index 00000000..5fefc278
--- /dev/null
+++ b/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md
@@ -0,0 +1,576 @@
+# PR-X12 Codec — Cognitive Substrate Mapping & Holy Grail
+
+> Date: 2026-05-22  
+> Author: cross-stack architecture session triggered by PR #195 review  
+> Scope: ndarray codec ↔ Gaussian splat ↔ cognitive shaders ↔ blasgraph/MKL ↔ gradient optimization  
+> Status: **survives compaction** — load-bearing claim mapping + integration plan + debt inventory  
+> Companion to: `pr-x12-codec-x265-design.md` (the as-shipped HEVC-analog spec) — this doc is the *generalisation* of that spec across the rest of the stack
+
+---
+
+## 0. The big claim — read this first
+
+PR-X12 is implicitly building the **gradient-quantization substrate that GenAI training has been missing for two years**. Skip / Merge / Delta / Escape is more honest than ZeRO's ad-hoc compression bucket count; the CTU quad-tree is more general than Mistral's hardcoded sliding-window attention; the per-frame basin codebook is what HEVC SCC (screen-content coding) could have been if k-means at frame rate had existed in 2013.
+
+Read in reverse: every "efficient transformer", "gradient compression" and "neural codec" paper from 2023-2025 has been rediscovering corners of the HEVC/x265 design space under different names. Once A4 (transform) + A6 (RDO) + A7 (rANS) land on top of the A1/A2/A3-intra foundation in PR #195, PR-X12 becomes a **drop-in replacement for SGD's parameter-update step that's also a video codec, also a Gaussian-splat compressor, also an attention-pruning policy** — all from one shared kernel.
+
+The codec is the substrate. The other three are renamings.
+
+---
+
+## 1. The four-axis mapping
+
+Same operator, four names, twelve years apart. The codec column is the most precise because it has bit-exact spec hardening behind it; cite the codec name when in doubt.
+
+| Operation | x265 / HEVC | Gaussian splat | Cognitive shader / attention | Gradient optimizer |
+|---|---|---|---|---|
+| Reference unit | CTU (64×64 luma block) | Tile (16×16 splat group, Kerbl 2023) | Cognitive frame / one attention layer | One parameter tensor |
+| Subdivision | CU quad-tree (64→32→16→8) | Anisotropic tile subdivision under view-frustum density | Attention-head hierarchy (heads → MQA group → sliding window) | Per-layer → per-block → per-row → per-parameter |
+| Codebook entry | Palette index (HEVC SCC, ≤ 64) | Splat archetype (color cluster, scale cluster) | Cognitive prototype / basin / vocabulary token | Optimizer state cluster (LoRA rank) |
+| **Skip** | Cell exactly matches motion-predicted ref (`δ=0`) | Splat unchanged across frame | Token attends to its own slot only | Parameter frozen this step |
+| **Merge** | Cell inherits motion vector from N/E/W/S neighbour | Splat inherits position/scale/SH from neighbour splat | Token inherits attention from neighbour / LoRA group share | Parameter follows neighbour's gradient |
+| **Delta** | Cell stores 8-bit residual after motion compensation | Splat stores 8-bit per-axis perturbation (Δx, Δy, Δz, ΔSH_k, Δscale) | Token has small attention perturbation | 8-bit quantized SGD step |
+| **Escape** | Cell stores full original value (fallback) | Splat stores full f32 covariance + SH coefficients | Token needs full attention slot | Full-precision parameter update |
+| Transform | DCT-II on 4×4/8×8/16×16/32×32 residuals | DCT/wavelet on SH-coefficient residuals | Frequency-domain attention basis | Preconditioner (Adam diagonal / KFAC block) |
+| Entropy coder | CABAC (context-adaptive binary arithmetic) | (not yet standardised) | (typically Huffman or byte-pair) | (typically none — full f32) |
+| RDO | λ · D + R minimisation over mode tree | Sparsity regulariser (L0 + view-PSNR) | Attention sparsity × downstream loss | Compression ratio × loss-degradation |
+
+**The bet of PR-X12**: one Rust module, one set of types, four use cases. Each use case independently has a $100M+ industry around it. The codec name is precise; the other three are loose because their communities didn't have a 12-year-hardened spec to anchor against.
+
+---
+
+## 2. Mode taxonomy — deep mapping
+
+### 2.1 The 2-bit mode field is load-bearing
+
+```text
+bit  15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
+    ┌──┬──┬──┬──┬──────────────────────┐
+    │ 0│ 0│M1│M0│      basin_idx (12)  │   ← 16-bit header
+    └──┴──┴──┴──┴──────────────────────┘
+       reserved   mode      vocabulary index
+```
+
+Two bits = four modes. Not more, not fewer. The four modes form a **monotone cost ladder** (non-strict: Merge and Delta tie at 3 bytes):
+
+```
+Skip   (2 bytes total) ⊂  free
+Merge  (3 bytes)       ⊂  borrow from neighbour
+Delta  (3 bytes)       ⊂  store quantized perturbation
+Escape (6 bytes)       ⊂  store full precision
+```
+
+Monotone cost ordering is what lets `predict_intra` use a "first-fit cheapest" decision tree without explicit RDO. PR-X12 A6 RDO refines this with λ-weighted distortion, but the cost ordering itself is invariant.
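+
+The header layout and cost ladder above can be sketched as follows. This is a minimal illustration, not the code in `mode.rs`: the helper names `pack_header`, `unpack_header` and `sketch_packed_len` are hypothetical, and rejecting headers with non-zero reserved bits is an assumed policy.
+
+```rust
+/// Wire codes for the 2-bit mode field; the discriminant is the on-wire value.
+#[derive(Clone, Copy, Debug, PartialEq, Eq)]
+#[repr(u8)]
+pub enum CellMode {
+    Skip = 0b00,
+    Merge = 0b01,
+    Delta = 0b10,
+    Escape = 0b11,
+}
+
+/// Low 12 bits of the header carry the basin index.
+const BASIN_MASK: u16 = (1 << 12) - 1;
+
+/// Hypothetical helper: mode in bits 12-13, basin in bits 0-11, bits 14-15 zero.
+pub fn pack_header(mode: CellMode, basin_idx: u16) -> Option<u16> {
+    if basin_idx > BASIN_MASK {
+        return None; // index does not fit in 12 bits
+    }
+    Some(((mode as u16) << 12) | basin_idx)
+}
+
+/// Hypothetical helper: inverse of `pack_header`; rejects set reserved bits.
+pub fn unpack_header(header: u16) -> Option<(CellMode, u16)> {
+    if header >> 14 != 0 {
+        return None;
+    }
+    let mode = match (header >> 12) & 0b11 {
+        0b00 => CellMode::Skip,
+        0b01 => CellMode::Merge,
+        0b10 => CellMode::Delta,
+        _ => CellMode::Escape,
+    };
+    Some((mode, header & BASIN_MASK))
+}
+
+/// Total bytes per cell (header included), copied from the cost ladder.
+pub fn sketch_packed_len(mode: CellMode) -> usize {
+    match mode {
+        CellMode::Skip => 2,
+        CellMode::Merge | CellMode::Delta => 3,
+        CellMode::Escape => 6,
+    }
+}
+```
+
+For example, `pack_header(CellMode::Delta, 0x123)` yields `Some(0x2123)`, and `unpack_header(0x2123)` recovers `(CellMode::Delta, 0x123)`. Because the enum discriminants equal the wire codes, packing needs only a cast and a shift; the `match` exists only on the decode side.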
+
+### 2.2 Why "four" is the correct count, not three or five
+
+- **Three** (Skip/Delta/Escape) loses the parameter-sharing optimisation. Merge captures spatial coherence: when a parameter's gradient matches a neighbour's gradient, you encode the relationship not the value. This is **LoRA-as-codec**: low-rank update ≡ Merge-mode parameter group.
+- **Five** (Skip/Merge/Delta/Escape/Predict) — HEVC originally had an "Intra-mode-predicted" extra mode that turned out to be subsumable into Merge with the right neighbour predictor. The spec collapsed to four after 2010-2012 testing; we inherit the lesson for free.
+
+### 2.3 Why the mode discriminants match the wire codes
+
+`CellMode::Skip = 0b00`, `Merge = 0b01`, `Delta = 0b10`, `Escape = 0b11` — pinned by `cell_mode_discriminants_match_wire_codes` in `ctu.rs`. This eliminates a translation table at the encoder/decoder boundary. The same Rust enum is the in-memory representation **and** the on-wire byte. Cf. data-flow Rule §2 (Reasoning = owned Copy microcopies).
+
+---
+
+## 3. CTU quad-tree — three hierarchies in one structure
+
+### 3.1 Spatial hierarchy (HEVC's original use)
+
+```
+depth 0 (64×64 CTU)   ↔ one BlockedGrid L1 block (`ctu.rs:236`)
+depth 1 (32×32 split) ↔ one CU at split-level 1
+depth 2 (16×16 split) ↔ ...
+depth 3 (8×8 leaf)    ↔ leaf CU (smallest cognitively-meaningful unit)
+```
+
+85 nodes maximum, `MAX_QUAD_TREE_NODES = 1 + 4 + 16 + 64 = 85` (`ctu.rs:55`). Arena-allocated, no `Box<dyn>`, indices are `u16`. Cache-friendly: the entire tree for one CTU fits in 85 × ~16 = 1360 bytes of headers + payload, about 22 64-byte cache lines and comfortably inside L1.
+
+### 3.2 Attention hierarchy (rediscovery as transformer)
+
+```
+depth 0 ↔ one attention layer
+depth 1 ↔ one attention head (32×32 = 1024 attention slots)
+depth 2 ↔ one multi-query-attention group (4 heads sharing KV)
+depth 3 ↔ one 8-token sliding window
+```
+
+Mistral / Llama4 sliding-window attention is **exactly depth-3 leaf processing** — they just don't think of it as a quad-tree. The split/merge ops in `Ctu::split` / `Ctu::merge` are the dynamic-shape-attention prune/expand decisions every "efficient transformer" paper from 2023-2025 rediscovers.
+
+**Holy grail claim H-3**: The quad-tree is the unification point. The same `Ctu` carrier serves the codec hot path and the attention hot path. Once we commit to that, swapping basis (raw → DCT → learned) and swapping entropy coder (raw u8 → rANS) gives us a video codec + a sparse-attention substrate + a gradient-compression substrate from one set of types.
+
+### 3.3 Gradient-update hierarchy (the optimizer mapping)
+
+```
+depth 0 ↔ "should this parameter tensor be touched this step?"
+depth 1 ↔ "which block of this tensor needs update?"
+depth 2 ↔ "which row of that block?"
+depth 3 ↔ "which parameter within that row?"
+```
+
+Translates one-to-one to the codec decision tree:
+- Skip = parameter frozen
+- Merge = parameter shares update with neighbour (LoRA group)
+- Delta = 8-bit quantized gradient
+- Escape = full f32 gradient
+
+This is **what DeepSpeed-ZeRO does informally** with `bf16_compress`, `int8_compress`, and `fp32_keep` buckets — except they have three modes, no quad-tree, no transform, no entropy coder. PR-X12 generalises ZeRO's compression scheme.
+
+---
+
+## 4. Palette / basin codebook — what HEVC SCC tried and missed
+
+### 4.1 The 12-bit basin = 4096-entry vocabulary
+
+`MAX_BASIN_IDX = (1 << 12) - 1 = 4095` (`mode.rs:71`). Each `LeafCu` carries a 12-bit index into the per-frame basin codebook. For:
+
+- **Video**: 4096 palette entries per GOP, 64× HEVC SCC's 64-entry cap
+- **Splats**: 4096 splat archetypes (colour clusters × scale clusters × view-direction clusters) — covers a non-toy scene
+- **Cognitive**: 4096 prototype embeddings per attention layer — comfortably above typical BPE token vocabulary slices
+- **Gradients**: 4096 optimizer-state clusters — captures Adam's per-parameter `(m, v)` after k-means quantization
+
+### 4.2 K-means at frame rate is the unlock HEVC SCC could not reach
+
+HEVC SCC's hardware baseline was **frozen in 2013** with HEVC version 1 (the SCC extension itself was finalised in 2016). At that time:
+- AVX-512 had not shipped (announced 2013; first silicon was Knights Landing in 2016, then Skylake-SP in 2017)
+- VPOPCNTDQ (per-lane popcount, the Hamming-distance hot kernel) didn't exist (Ice Lake 2019)
+- VNNI (i8 GEMM for distance assignment) didn't exist (Cascade Lake 2019)
+- AMX tile ops (bf16 / int8 batched GEMM for centroid update) didn't exist (Sapphire Rapids 2023)
+
+The SCC team had to cap palette at 64 entries rebuilt per-CTU because that was the only thing the 2013-era hardware could afford. With our current stack:
+- `crate::hpc::cam_pq::kmeans` (already shipping) does 4096-entry k-means via AVX-512 distance GEMM
+- Distance assignment via VNNI i8 GEMM hits ~200M assignments/sec at 32D
+- The codebook fits in L2 cache (4096 × 32 = 128 KB at u8 quantisation, too large for a typical 32-48 KB L1d)
+- Frame-rate k-means: **rebuild a 4096-entry codebook every 16 frames at 60 fps**, fully amortised
+
+**Holy grail claim H-1**: PR-X12 + cam_pq gives you the screen-content video codec HEVC SCC was trying to be in 2013 — without retrofitting, just by composing existing modules. Cite this when somebody asks "why is the basin field 12 bits and not 8 like HEVC SCC".
+
+### 4.3 BASIN_NONE sentinel collision (PR #195 open issue)
+
+`BASIN_NONE = MAX_BASIN_IDX = 4095` (`mode.rs:79`) — basin 4095 is ambiguous on the wire (real basin vs "no basin" sentinel). Fix in PR #195: `MAX_BASIN_IDX = 4094`, `BASIN_NONE = 4095`. Costs one codebook entry (irrelevant for k-means usage). Flagged by CodeRabbit, not yet pushed. **Tracked in §12 below.**
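+
+A minimal sketch of the proposed fix, assuming the constants are plain `u16` values. The helpers `basin_to_wire` and `basin_from_wire` are hypothetical names used only to show that the mapping becomes a bijection once the sentinel has its own code:
+
+```rust
+/// Reserve the all-ones 12-bit code as the "no basin" sentinel.
+pub const BASIN_NONE: u16 = (1 << 12) - 1; // 4095, never a real basin
+/// Largest index that refers to a real codebook entry.
+pub const MAX_BASIN_IDX: u16 = BASIN_NONE - 1; // 4094
+
+/// Hypothetical helper: map an optional basin to its 12-bit wire code.
+/// Out-of-range indices are reported, not clamped, so no two inputs
+/// share a wire code.
+pub fn basin_to_wire(basin: Option<u16>) -> Result<u16, u16> {
+    match basin {
+        None => Ok(BASIN_NONE),
+        Some(idx) if idx <= MAX_BASIN_IDX => Ok(idx),
+        Some(idx) => Err(idx),
+    }
+}
+
+/// Hypothetical helper: inverse mapping, valid for any 12-bit wire code.
+pub fn basin_from_wire(code: u16) -> Option<u16> {
+    if code == BASIN_NONE {
+        None
+    } else {
+        Some(code)
+    }
+}
+```
+
+With this split, `basin_from_wire(basin_to_wire(b)?)` returns `b` for every valid input, which is the round-trip property the current `4095 == 4095` overlap breaks.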
+
+---
+
+## 5. Transform basis — DCT-II ↔ optimizer preconditioner ↔ wavelet ↔ learned
+
+### 5.1 The transform is interchangeable; only the basis differs
+
+A4 will introduce a residual transform applied to Delta-mode 8-bit perturbations before quantization. The choice of basis is the central knob:
+
+| Basis | Use case | Why this one |
+|---|---|---|
+| Identity | A3 (current) | Reference; lets RDO compare against baseline |
+| DCT-II 4×4/8×8 | Video / image (A4 default) | 12 years of HEVC tuning; separable; FFT-amortizable |
+| Learned conv 5×5 | Splat compression | Trained for per-scene PSNR; conditioned on view |
+| Adam preconditioner | Gradient compression | Per-parameter √(v) scaling, applied lazily as a basis |
+| KFAC block-diagonal | LLM gradient compression | Captures cross-parameter correlation per layer |
+
+**All five share the same arithmetic shape**: `Δ' = B · Δ` where `B` is a small dense matrix (4×4, 8×8, or block-diagonal). PR-X12 A4 should define the basis as a generic over a `Transform` trait, then ship DCT-II as the default. Other bases follow without touching `predict.rs` or `mode.rs`.
+
+### 5.2 Holy grail claim H-2: the transform IS the optimizer
+
+In the gradient-compression interpretation, the transform basis IS the optimizer's preconditioning matrix. Adam uses `B = diag(1/√v)`; KFAC uses block-diagonal Kronecker factors. PR-X12 lets you swap optimizer-as-codec at the basis layer, with zero code changes elsewhere.
+
+This is **the most underrated** of the four mappings. Optimizer research treats preconditioner choice as the central knob; codec research treats transform basis as the central knob; nobody has noticed they're the same knob.
+
+### 5.3 The DCT-II / GEMM tradeoff (for downstream batched encode)
+
+Single 32-point 1-D DCT-II via butterflies: ~80 multiplies, so ~2.6K for the 32 rows of a 32×32 block. The same row pass via GEMM (`C = A @ DCT_BASIS`): 32³ ≈ 32K multiply-accumulates. **Per-block, butterfly wins by ~13×**. But:
+
+- For a 4K frame with ~1024 CUs, batched GEMM amortises hardware fusion
+- AMX tile GEMM (already in `crate::hpc::bf16_tile_gemm`) does 256 blocks/cycle in bf16
+- Crossover at ~64 blocks: above that, GEMM wins; below, butterfly wins
+
+→ A4 should ship both, dispatch at the worker level: per-CTU butterfly, per-frame batched GEMM.
+
+---
+
+## 6. rANS entropy coding ↔ gradient compression
+
+### 6.1 rANS is the entropy coder gradient-compression has been hand-rolling
+
+The "Asymmetric Numeral Systems" coder (Duda 2013) reaches compression rates close to arithmetic coding at roughly table-lookup cost, with cache-friendly arithmetic. PR-X12 A7 spec: rANS with per-frame frequency tables.
+
+Gradient-compression literature has been **manually implementing rANS-equivalents** for five years:
+- *Top-K sparsification* = rANS with all but top-K coded as Skip
+- *Random-K sparsification* = rANS with random subset coded as Skip
+- *Sign-SGD* = rANS with 1-bit symbol alphabet
+- *PowerSGD* = rANS with low-rank approximation as the basis (≡ our Transform choice in §5)
+
+PR-X12 A7 generalises **every published gradient-compression scheme** through one configurable knob (the frequency table) and one swappable basis (the transform). No new theory; we get the synergy by naming.
+
+### 6.2 Per-frame frequency table from k-means
+
+The basin codebook's hit-frequency distribution IS the rANS frequency table. Build them together:
+- k-means → 4096 centroids
+- assignment pass → per-centroid hit count
+- rANS table → normalised hit counts
+
+This is a **single pass** through the data; the codebook construction amortises the frequency-table construction. HEVC CABAC pays a separate context-modeling pass; we save it.
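+
+The last step (hit counts to rANS table) can be sketched as below. This is an illustration under stated assumptions, not the A7 design: `normalise_hits` is a hypothetical name, the table is assumed to sum to a power of two (`1 << scale_bits`), and the sum repair simply adjusts the largest entry, where a production coder would likely distribute the rounding error more carefully.
+
+```rust
+/// Hypothetical helper: scale per-centroid hit counts so they sum to
+/// `1 << scale_bits`, keeping every observed centroid at frequency >= 1
+/// so that it stays encodable.
+pub fn normalise_hits(hits: &[u32], scale_bits: u32) -> Option<Vec<u32>> {
+    if scale_bits > 31 {
+        return None; // frequencies must fit in u32
+    }
+    let target: u64 = 1u64 << scale_bits;
+    let total: u64 = hits.iter().map(|&h| u64::from(h)).sum();
+    let observed = hits.iter().filter(|&&h| h > 0).count() as u64;
+    if total == 0 || observed > target {
+        return None; // nothing to code, or table too small for the alphabet
+    }
+    // Floor-scale each count; bump observed symbols that rounded to zero.
+    let mut freqs: Vec<u32> = hits
+        .iter()
+        .map(|&h| match h {
+            0 => 0,
+            _ => (u64::from(h) * target / total).max(1) as u32,
+        })
+        .collect();
+    let mut sum: u64 = freqs.iter().map(|&f| u64::from(f)).sum();
+    // Repair the sum by adjusting the currently largest entry.
+    while sum != target {
+        let i = (0..freqs.len()).max_by_key(|&i| freqs[i])?;
+        if sum < target {
+            freqs[i] += (target - sum) as u32;
+            sum = target;
+        } else {
+            // Never push an observed symbol below frequency 1.
+            let take = (sum - target).min(u64::from(freqs[i]) - 1);
+            freqs[i] -= take as u32;
+            sum -= take;
+        }
+    }
+    Some(freqs)
+}
+```
+
+The input is exactly the per-centroid hit count produced by the assignment pass, so no second pass over the data is needed to build the table.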
+
+### 6.3 Holy grail claim H-4: rANS + k-means = optimal lossless compression of gradients
+
+**Shannon optimal** rate = `H(p)` bits per symbol, where `p` is the per-symbol probability. rANS codes the k-means index stream losslessly at a rate that asymptotically approaches `H(p)`, for **any** real-valued distribution that admits a codebook; the only loss is the k-means quantisation itself. Gradient distributions are heavy-tailed (most parameters get tiny updates; a few get huge updates), which is exactly the regime where k-means + rANS shines.
+
+---
+
+## 7. λ-RDO — the central training objective
+
+### 7.1 Rate × Distortion minimisation is the loss function
+
+`λ · D + R` minimisation:
+- **R** = bits used (codec) / sparsity (training) / attention slots (inference)
+- **D** = pixel error (codec) / PSNR loss (splat) / downstream loss (training) / softmax KL (attention)
+- **λ** = the rate/distortion tradeoff knob = the user-tunable compression ratio
+
+This is **the same objective** that:
+- HEVC pays to choose Skip vs Merge vs Delta vs Escape
+- ZeRO pays to choose fp32 vs bf16 vs int8 vs skip
+- Splat-compression pays to prune or merge Gaussians
+- Sparse-attention pays to drop tokens
+
+→ PR-X12 A6 RDO ships **one** implementation. Downstream consumers tune λ for their domain. This is the highest-leverage sub-card after A1/A2.
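+
+A minimal sketch of the shared objective, assuming each candidate mode arrives with a precomputed distortion and rate. `Candidate` and `choose_mode` are hypothetical names for illustration; the shipped A6 API may look different.
+
+```rust
+/// Hypothetical carrier for one mode candidate: `distortion` is D in the
+/// consumer's own units, `rate_bits` is R in bits.
+#[derive(Clone, Copy, Debug)]
+pub struct Candidate<M> {
+    pub mode: M,
+    pub distortion: f32,
+    pub rate_bits: f32,
+}
+
+/// Lagrangian cost in the form used throughout this doc: λ · D + R.
+fn cost<M>(c: &Candidate<M>, lambda: f32) -> f32 {
+    lambda * c.distortion + c.rate_bits
+}
+
+/// Pick the candidate with the lowest cost; ties go to the earliest entry,
+/// so listing candidates cheapest-first keeps the cheaper mode on a tie.
+pub fn choose_mode<M: Copy>(cands: &[Candidate<M>], lambda: f32) -> Option<M> {
+    cands
+        .iter()
+        .min_by(|a, b| cost(a, lambda).total_cmp(&cost(b, lambda)))
+        .map(|c| c.mode)
+}
+```
+
+With candidates at 16, 24 and 48 bits (the Skip, Delta and Escape byte counts from §2.1), a small λ selects the cheapest mode regardless of error, while a large λ pushes the choice toward the zero-distortion mode. The domain only enters through how `distortion` is measured.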
+
+---
+
+## 8. Epiphanies
+
+Numbered for citation. Each one survives compaction.
+
+**E-1** — *The codec's mode taxonomy is the gradient-compression policy.* Skip/Merge/Delta/Escape is what ZeRO's `int8_compress` family approximates with three modes. Four is correct because Merge captures parameter sharing (LoRA group structure); three doesn't.
+
+**E-2** — *The CTU quad-tree is the attention hierarchy.* Mistral's sliding-window attention is depth-3 leaf processing. Multi-query attention is a depth-2 merge. The split/merge ops in `Ctu::split` / `Ctu::merge` are the dynamic-shape-attention prune/expand operations.
+
+**E-3** — *K-means at frame rate is the HEVC SCC unlock.* The 2013 SCC team capped palette at 64 entries because their hardware budget couldn't afford more. AVX-512 + AMX make 4096-entry codebooks at 60 fps trivial. PR-X12 already has this via `cam_pq::kmeans`.
+
+**E-4** — *The transform basis IS the optimizer's preconditioner.* DCT-II ↔ Adam ↔ KFAC ↔ learned conv are all `Δ' = B·Δ`. PR-X12 A4 should define `Transform` as a trait, not hardcode DCT-II.
+
+**E-5** — *rANS + k-means achieves Shannon-optimal lossless gradient compression.* Every published gradient-compression scheme (Top-K, Sign-SGD, PowerSGD) is a special case of rANS with a particular frequency table and basis choice. PR-X12 A7 generalises them all.
+
+**E-6** — *λ-RDO is the universal training objective.* The same loss `λ·D + R` drives codec mode-decision, ZeRO bucket assignment, splat pruning, and sparse-attention dropout. One implementation, four use cases.
+
+**E-7** — *Block-matching motion estimation is i8gemm.* HEVC ME does SAD; reformulated as SSD (`||A||² - 2A·B + ||B||²`) the middle term is a GEMM. AVX-512 VNNI gives ~50× speedup over hand-tuned SAD. The x265 team didn't have VNNI in 2013.
+
+**E-8** — *CTU mode-decision is a graph shortest-path problem.* x265 spends ~30% CPU on the 85-node partition tree RDO. Tropical-semiring GEMM (which `lance-graph::blasgraph` already provides) solves all partitions in parallel via batched Bellman-Ford. `O(4^d)` → `O(d²)` per CTU.
+
+**E-9** — *CABAC context modeling is a tiny transformer.* HEVC's context-adaptive arithmetic coding is a hand-tuned sliding window of past symbols. A 64-hidden × 256-vocab × 64-history transformer does it better and compresses ~5-10% more. The x265 team had no training infrastructure; we do.
+
+**E-10** — *Deblocking + SAO is a learned conv.* x265's hand-tuned 4-tap filter is QP-conditional but coefficient-fixed. A QP-conditioned 5×5 conv does both deblocking and SAO with better PSNR per QP. One im2col + sgemm per frame.
+
+**E-11** — *Palette codebook training is k-means; k-means is GEMM.* Distance assignment = `argmin_k ||x - μ_k||²` (a distance GEMM). Centroid update = sparse GEMM. The whole codebook build is two GEMM calls per iteration. The HEVC SCC team didn't have MKL k-means.
+
+**E-12** — *The mode discriminants pin to the wire codes.* `CellMode::Skip = 0b00` etc. eliminates a translation table at the encoder/decoder boundary. Same Rust enum is in-memory **and** on-wire. Data-flow rule §2 invariant.
+
+**E-13** — *The basin codebook is also the rANS frequency table.* Single pass through data builds both: assignment counts → normalised frequencies → rANS table. HEVC CABAC pays a separate context-modeling pass; we don't.
+
+**E-14** — *PR-X12 is the cross-domain unification PR.* One Rust module, four orthogonal industries. Each industry has $100M+ around it independently. The codec name is the most precise because it has the bit-exact spec behind it.
+
+**E-15** — *The 16-bit header's reserved high bits (14, 15) are the cross-tier link.* PR-X12 A3-inter (deferred) needs a pointer to the parent-tier CTU. Two bits = 4-way selector across L1/L2/L3/L4 cascade. The bit allocation is **already there**, just not used in A3-intra.
+
+---
+
+## 9. Holy grail material
+
+The mappings above are dense but specific. The holy grail claims are general and load-bearing.
+
+**H-1** — *PR-X12 + cam_pq is the screen-content video codec HEVC SCC was trying to be in 2013.* Frame-rate k-means at 4096 entries was infeasible in 2013-era hardware; our stack does it natively. Cite this when somebody asks "why 12-bit basin field".
+
+**H-2** — *The transform IS the optimizer.* DCT-II / Adam preconditioner / KFAC / learned conv are the same `Δ' = B·Δ` operator. PR-X12 A4 unifies the three industries that each treat their preconditioner choice as the central knob.
+
+**H-3** — *The CTU quad-tree is the universal hierarchical-attention substrate.* Mistral's sliding window, multi-query attention, mixture-of-experts gating — all are special cases of depth-3 leaf processing on the quad-tree. The split/merge ops are the dynamic-attention shape-change operations.
+
+**H-4** — *rANS + k-means achieves Shannon-optimal lossless gradient compression.* Every published gradient-compression scheme is a special case. PR-X12 A7 ships them all from one configurable knob.
+
+**H-5** — *PR-X12 generalises ZeRO.* ZeRO's bf16/int8/fp32 buckets cover only the precision ladder that the codec's Delta and Escape modes span. Merge is the mode ZeRO doesn't have: PR-X12 adds the spatial-coherence dimension (Merge captures inter-parameter correlation) that ZeRO leaves on the table.
+
+**H-6** — *The 64×64 CTU is the right unit for both 4K video and 7B-parameter transformers.* HEVC chose 64×64 for the video luma block; LLaMA-7B chose a 4096-wide hidden dimension (4096 = 64 × 64, split into 32 heads of 128). The arithmetic units land at the same size by independent convergent evolution. CTU is the canonical name.
+
+**H-7** — *The codec is the substrate; everything else is a renaming.* The strongest claim, the load-bearing one. After A4-A8 ship, downstream consumers (splat-compression, attention pruning, gradient quantization) configure λ + basis + frequency-table and inherit the entire codec for free. This is the architectural payoff that justifies the eight sub-cards of PR-X12.
+
+---
+
+## 10. Integration plan — per-sub-card + new PRs
+
+### 10.1 PR-X12 A2/A3-intra — current state (PR #195)
+
+**Status**: Open, two CodeRabbit follow-ups pending.
+
+**My P0+P1 findings (all fixed)**:
+- ✅ Merge wrapping-cast alias (P0) → `fits_i8` guard at `predict.rs:158`
+- ✅ Escape allocator collision (P1) → `Option<&mut u32>` cursor
+- ✅ NESW/NEWS doc mismatch (P1) → explicit slot table
+
+**Still open on PR #195**:
+- 🟡 `BASIN_NONE == MAX_BASIN_IDX == 4095` ambiguity → shrink MAX_BASIN_IDX to 4094
+- 🟡 `pack_leaf` `unwrap_or` fallbacks make serialisation non-bijective → switch to the `?` operator
+
+Track in §12 below.
+
+### 10.2 PR-X12 A4 — transform
+
+**Owner**: TBD (specialist with linalg-core background; PR-X10 A1 author preferred).  
+**Depends on**: A1 (Ctu carrier), A2 (mode/header pack).  
+**Scope**:
+- Define `Transform` trait with `apply(&[i8; N]) -> [i8; N]`, `invert(...)` symmetric methods.
+- Ship DCT-II 4×4 + 8×8 as the default impl (cite HEVC tables; can use `crate::hpc::fft` infrastructure).
+- Ship Identity as a degenerate impl for RDO baseline comparison.
+- Reserve trait slots for `LearnedConv5x5` (A4-follow-up), `AdamPreconditioner` (gradient use case), `KFACBlock` (LLM use case).
+- Batched form: `apply_batch(&[[i8; N]]) -> [[i8; N]]` dispatching to `bf16_tile_gemm` when batch ≥ 64.
+
+**Holy grail tie-in**: H-2 (transform IS optimizer). This is the most important A-card after A1 because it determines whether downstream consumers can swap basis without touching the codec hot path.
+
+**Estimated effort**: 1 week (DCT-II is well-documented; batched dispatch is the harder half).
+
+### 10.3 PR-X12 A6 — RDO
+
+**Owner**: TBD (codec-architect; needs RDO experience).  
+**Depends on**: A2, A3-intra, A4 (for transform-aware distortion).  
+**Scope**:
+- λ-weighted Lagrangian: `cost(mode) = D(mode) + λ · R(mode)`
+- `D(mode)` = reconstruction error against the original cell value
+- `R(mode)` = `packed_byte_len(mode)` (already shipping) × 8 bits
+- λ knob exposed as `RdoConfig::lambda: f32`, user-tunable
+- Replace `predict_intra`'s first-fit policy with RDO when λ > 0; preserve first-fit when λ == 0 (zero-cost baseline)
+
+**Holy grail tie-in**: H-7 (codec is substrate). RDO is the policy layer; without it, downstream consumers can't tune the rate/distortion tradeoff for their domain.
+
+**Estimated effort**: 1 week.
+
+### 10.4 PR-X12 A7 — rANS entropy coder
+
+**Owner**: TBD (sentinel-qa for the unsafe bitwise primitives).  
+**Depends on**: A2 (gives the symbols to encode), A6 (gives the frequencies).  
+**Scope**:
+- Per-frame frequency table built jointly with k-means assignment pass (E-13)
+- 64-bit rANS state, byte-stream output
+- Cache-line-friendly state structure (no `Box<dyn>`)
+- Decoder mirror (the half of the spec gradient-compression needs)
+
+**Holy grail tie-in**: H-4 (Shannon-optimal gradient compression). Highest-impact A-card for downstream gradient-compression consumers.
+
+**Estimated effort**: 1.5 weeks (rANS has subtle invariants; reference Duda 2013 + Charles Bloom's blog posts).
+
+### 10.5 PR-X12 A8 — stream framing
+
+**Owner**: product-engineer (API surface design).  
+**Depends on**: A1-A7.  
+**Scope**: Frame headers, CTU markers, per-frame basin codebook serialisation, per-frame rANS frequency table serialisation. Wire format spec.
+
+**Holy grail tie-in**: H-5 (PR-X12 generalises ZeRO). The wire format is what lets distributed training nodes exchange compressed gradients without ad-hoc framing.
+
+**Estimated effort**: 1 week.
+
+### 10.6 PR-X12 A3-inter — cross-tier prediction
+
+**Owner**: cascade-architect.  
+**Depends on**: A3-intra (current), `BlockedGrid` L2/L3 access surface.  
+**Scope**:
+- Add parent-tier `LeafCu` reference to `IntraContext` (`IntraContext` should be renamed `PredictContext`).
+- Use the reserved bits 14-15 in the 16-bit header (E-15) to indicate parent-tier link presence.
+- New mode? Or extend Merge to allow cross-tier neighbours?
+  - **Recommend extending Merge**: parent-tier neighbour is the 5th merge candidate, slot index 4. `MergeDir` becomes 3-bit. Costs 1 bit per Merge leaf; saves a new mode.
+
+**Holy grail tie-in**: H-3 (universal hierarchical attention). Inter-tier prediction is what makes the quad-tree a true hierarchical substrate vs four independent quad-trees.
+
+**Estimated effort**: 0.5 weeks (after A3-intra solidifies).
+
+### 10.7 New PR — Gaussian splat consumer (PR-X12+splat3d)
+
+**Path**: Wire PR-X12's `LeafCu` API into `crate::hpc::splat3d::tile::TileBinning` as an alternative compression layer for splat parameters.
+
+**Approach**:
+1. Define `SplatCodec` trait wrapping PR-X12's encode/decode.
+2. Apply per splat-parameter axis (Δx, Δy, Δz, Δscale, ΔSH_k).
+3. Bench against raw f32 storage; the target is ~10× compression at < 0.5 dB PSNR loss (a goal, not yet measured).
+
+**Estimated effort**: 0.5 weeks after A4+A6 ship.
+
+### 10.8 New PR — Cognitive shader consumer (PR-X12+nars)
+
+**Path**: Wire PR-X12 into `crate::hpc::nars::Belief` storage as a quantised attention-slot codec.
+
+**Approach**:
+1. Each `Belief.fingerprint` field becomes a CTU; quad-tree subdivides by attention head.
+2. Mode decision = attention-sparsity decision.
+3. Gradient updates flow through `predict_intra` (LoRA → Merge, Adam step → Delta, full update → Escape).
+
+**Estimated effort**: 1 week after A4+A6+A7 ship.
+
+### 10.9 New PR — Gradient compression consumer (cross-repo)
+
+**Path**: Standalone `burn-codec` integration (or candle equivalent) using PR-X12 as the gradient-bucket compressor.
+
+**Approach**:
+1. Reuse PR-X12 codec; downstream is just configuration.
+2. Per-layer codebook (4096 entries) trained at warmup.
+3. Per-step encoder pass replaces ZeRO's bucket assignment.
+
+**Estimated effort**: 2 weeks (cross-repo coordination overhead).
+
+---
+
+## 11. Exploration paths — ranked by confidence
+
+### 11.1 High confidence (do these)
+
+1. **A4 transform as `Transform` trait, not hardcoded DCT-II** — unlocks H-2 directly. Cost: 1 day of trait design. Payoff: every downstream consumer can swap basis.
+2. **A6 RDO with user-tunable λ** — unlocks H-7. Cost: 1 week. Payoff: configurable rate/distortion per domain.
+3. **A7 rANS with shared k-means frequency table** — unlocks H-4 + saves a pass. Cost: 1.5 weeks. Payoff: Shannon-optimal compression for all four use cases.
+4. **PR-X12 + cam_pq integration** — wire the 4096-entry codebook builder into the codec module. Cost: 2-3 days. Payoff: H-1 fully realised.
+5. **Block-matched ME via i8gemm (E-7)** — replaces hand-tuned SAD with VNNI. Cost: 1 week. Payoff: ~50× speedup, validates the BLAS-substrate claim.
+
+### 11.2 Medium confidence (worth prototyping)
+
+6. **CABAC replacement with tiny transformer (E-9)** — 64-hidden × 256-vocab transformer as context model. ~5-10% better compression. Cost: 1-2 weeks (model training + integration). Payoff: bleeding-edge compression ratio; differentiates PR-X12 from any existing codec.
+7. **CTU partition as tropical-GEMM (E-8)** — solve all 85-node partitions in parallel via Bellman-Ford-as-GEMM. Cost: 1 week (requires lance-graph::blasgraph access). Payoff: O(4^d) → O(d²) per CTU; huge win for batch encode.
+8. **Splat consumer prototype (10.7)** — proves H-7 for the splat case. Cost: 0.5 weeks after A4+A6. Payoff: validates cross-domain claim with a concrete demo.
+9. **Inter-tier prediction with reserved bits 14-15 (E-15)** — use the already-reserved bits to add a cross-tier link without growing the header. Cost: 0.5 weeks. Payoff: completes A3, unblocks H-3 fully.
+10. **Deblocking + SAO as learned conv (E-10)** — train a small per-QP conv kernel. Cost: 1 week (training infra + integration). Payoff: better PSNR at every QP.
+
+### 11.3 Speculative (research bets)
+
+11. **Gradient compression integration with burn/candle** — proves H-5 (PR-X12 generalises ZeRO). Cost: 2+ weeks cross-repo. Payoff: if it works, the codec becomes the universal gradient compressor.
+12. **Per-tier codebook hierarchy** — basin codebooks at L1/L2/L3/L4, each refining the parent. Aligns with cascade-search hierarchy. Cost: unclear (research-stage). Payoff: enables cross-tier merge mode without per-frame full codebook rebuild.
+13. **CTU split decision via reinforcement learning** — train a tiny policy network on real workloads. Replaces the heuristic RDO with a learned policy. Cost: research time. Payoff: ~5-10% better compression on out-of-distribution data.
+14. **Cognitive prototype "vocabulary" trained once, frozen for inference** — analog of HEVC's "intra-prediction modes" being a fixed table. Cost: training one good codebook. Payoff: stable inference path; no per-frame k-means.
+15. **WebGPU implementation** — once the codec is stable, port the SIMD hot paths to WGSL for browser-side encode. Cost: a different organism. Payoff: client-side compressed attention transmission.
+
+### 11.4 Watch but do not pursue (yet)
+
+- Quantum-aware codec basis (Φ_q transform over qubit gates). Interesting; requires hardware we don't have.
+- Differential-privacy-aware rANS (privacy bits as part of the rate model). Niche.
+- Per-channel codebook for multi-spectral inputs. Cleaner once we have a 4-spectral input that needs it.
+
+---
+
+## 12. Technical debt — codec side + stack side
+
+### 12.1 Codec-side (PR-X12)
+
+**Severity P0** (block):
+- None currently.
+
+**Severity P1** (fix before next sub-card):
+- *T-1*: `BASIN_NONE == MAX_BASIN_IDX` collision (`mode.rs:79`). Fix: `MAX_BASIN_IDX = 4094, BASIN_NONE = 4095`. Costs 1 codebook entry. **Flagged by CodeRabbit on PR #195, not yet merged.**
+- *T-2*: `pack_leaf` `unwrap_or` fallbacks (`mode.rs:194-210`). Make encode bijective: `leaf.merge_dir?` etc. A malformed `LeafCu` should produce a `None` return, not a silent rewrite. **Flagged by CodeRabbit on PR #195, not yet merged.**
+
+**Severity P2** (fix in follow-up):
+- *T-3*: A3-intra currently scans NEWS without RDO; replace with λ-weighted RDO when A6 lands. Today's first-fit policy is the right default for λ=0 but suboptimal for typical λ.
+- *T-4*: Lossy Escape fallback emits `CellMode::Delta` rather than returning `Result`. Documented but surprising; revisit when A6 forces explicit error paths.
+- *T-5*: `IntraContext::neighbours: [Option<&LeafCu>; 4]` is per-cell allocation-free but doesn't cover inter-tier. A3-inter will need to extend this.
+- *T-6*: The 2-bit `MergeDir` will become 3-bit when inter-tier lands. Wire format break — plan now, in A3-inter design.
+- *T-7*: No SIMD-batched form of `predict_intra` yet. Scalar reference is fine for A3-intra ship; batched form via `crate::simd_soa::MultiLaneColumn` is a follow-up.
+
+**Severity P3** (cosmetic / docs):
+- *T-8*: A2 doc cross-references need update once A4 transform module lands.
+- *T-9*: The `_reserved: ()` field in `IntraConfig` is a future-compat slot; document the constraint loudly so consumers don't construct via struct literal expecting fields.
+
+### 12.2 ndarray substrate debt (cross-cutting)
+
+- *T-10*: HPC graduation incomplete — `framebuffer`, `ocr_simd`, `ocr_felt`, `audio` still in `src/hpc/` despite low-dep status. Next batch should clear these before A4 starts (clean foundation).
+- *T-11*: No `Transform` trait yet at the linalg-core level. PR-X12 A4 should propose this in a shared module so it's reusable beyond the codec.
+- *T-12*: `simd_caps` graduated but the cap-detection cost is paid per-call; LazyLock cache exists but agent #5 review flagged AMX detection is bypassed. Fix before A6 RDO uses caps for dispatch.
+- *T-13*: `bf16_tile_gemm` is x86_64-only; ARM NEON tile-GEMM path is a stub. A4 batched-DCT dispatch will lean on this; should land NEON impl first.
+- *T-14*: No `Result`-returning encode API today. `predict_intra` is `-> LeafCu` (infallible); A6 RDO will want `Result<LeafCu, CodecError>`. Plan the migration in A6 design.
+- *T-15*: K-means in `cam_pq` operates on f32; codec wants u8 for basin assignments. Need a u8-mode k-means wrapper or quantize-after-cluster. Tracked but not yet specified.
+
+### 12.3 lance-graph cognitive layer debt
+
+- *T-16*: PR-X12 lives in ndarray but its primary cognitive consumer (`nars::Belief`) lives in lance-graph (via `crate::hpc::nars`). The cross-repo dependency direction means lance-graph's `Belief` storage migration will pull an ndarray version bump. Plan a coordinated release.
+- *T-17*: `lance-graph::blasgraph` tropical-semiring GEMM exists but is not yet wired to the codec partition-tree path. E-8 (CTU partition as shortest path) needs this; should be in the A6 RDO design discussion.
+- *T-18*: `lance-graph::crates/sigker` (path-signature primitives) is gated on Hambly-Lyons 2010 certification. Not directly codec-related but the signature-PDE primitives could become a transform basis option in A4 (W1.5-#6 / -#7 from `td-simd-tier-audit.md`).
+- *T-19*: GridLake (PR-X1 / PR-X2) shipped `MultiLaneColumn` carrier but A2 derive macro (`#[derive(SoA)]`) never landed — PR-X2 shipped the declarative `soa_struct!` instead. The codec batched encode path will want `derive(SoA)` on `LeafCu`; track in follow-up.
+
+### 12.4 Cross-repo coordination debt
+
+- *T-20*: ndarray's HPC graduation (PR #192/#193/#194 + the four-module batch I just pushed) and lance-graph's CausalEdge64 v2 migration are concurrent. The two are independent but share the `claude/continue-ndarray-x0Oaw` branch name across repos; aliases are convenient but make per-repo CI nontrivial. Stable once both stabilise.
+- *T-21*: cognitive-substrate-convergence-v1.md (cross-repo locked spec) names the CausalEdge64 v2 layout (Option F). The codec's `LeafCu` overlaps in scope but doesn't share the layout. Worth a cross-reference note from convergence-v1 → this doc.
+- *T-22*: causal-edge crate's v2 mantissa encoding (±6 for Intervention/Counterfactual) doesn't yet hook into the codec's basin codebook. Once it does, the Skip/Merge/Delta/Escape modes can carry causal-direction metadata for free in the reserved header bits.
+- *T-23*: The architecture rule from `CLAUDE.md` ("ndarray = hardware, lance-graph = thinking") suggests the codec belongs in ndarray. But the codec's *policy* layer (RDO, k-means basin training) is arguably thinking, not hardware. Worth an explicit boundary note in A6 design: "codec mechanism in ndarray; codec policy split between ndarray (math) and lance-graph (cognitive λ choice)."
+
+### 12.5 PR #195 specific debt (track separately, this is in flight)
+
+- 🟡 *T-PR195-1*: CodeRabbit finding on `BASIN_NONE` (T-1 above).
+- 🟡 *T-PR195-2*: CodeRabbit finding on `unwrap_or` (T-2 above).
+- ⚪ *T-PR195-3*: PR body open question #2 ("lossy Escape fallback") — author resolved by documenting loudly; revisit when A6 lands.
+
+---
+
+## 13. blasgraph / MKL synergies x265 inventors couldn't reach
+
+Six places where blasgraph + MKL change the algorithmic complexity, not just constants. Cross-reference from chat session 2026-05-22.
+
+### 13.1 Block-matched ME → batched i8gemm (E-7)
+
+Classical ME: SAD over a 32×32 window. Reformulate as SSD via `||A - B||² = ||A||² - 2A·B + ||B||²`; the cross term `A·B` is a GEMM. AVX-512 VNNI `i8gemm_i32` does a whole CTU's motion candidates in one call. **~50× over hand-tuned NEON/AVX2 SAD.**
+
+### 13.2 Batched DCT-II via MKL sgemm (E-7-variant)
+
+Per-block butterfly wins for single 32×32. Per-frame batched `C = A_batch @ DCT_BASIS` wins for ≥ 64 blocks via AMX tile fusion. Trades latency for throughput. A4 should ship both.
+
+### 13.3 CTU partition mode-decision as tropical-GEMM (E-8)
+
+x265 spends ~30% CPU on recursive partition RDO. Reformulate: each partition is a node in an 85-node DAG, edges = split/merge transitions, weights = ΔRDO. Optimal partition = shortest path. blasgraph's tropical-semiring (min-plus) GEMM (`D[i][j] ← min(D[i][j], min_k(D[i][k] + W[k][j]))`) solves all partitions in **one batched matrix-relax**. `O(4^d)` → `O(d²)` per CTU.
+
+### 13.4 CABAC context modeling → tiny transformer (E-9)
+
+64-hidden × 256-vocab × 64-history transformer replaces HEVC's hand-tuned sliding-window context. Trains in hours on a single GPU; better compression ratio. The x265 team had no training infra in 2013.
+
+### 13.5 Deblocking + SAO as learned conv (E-10)
+
+x265's QP-conditional 4-tap filter has fixed coefficients per QP band. Replace with a learned 5×5 conv kernel, QP encoded as embedding. One im2col + sgemm per frame; better PSNR.
+
+### 13.6 Palette codebook training as MKL k-means (E-11, H-1)
+
+The big one. 4096-entry codebook per 16-frame GOP at 60 fps via AVX-512 distance GEMM + VNNI assignment. Already shipping in `crate::hpc::cam_pq::kmeans`. The HEVC SCC team capped at 64 entries because 2013 hardware couldn't afford more.
+
+---
+
+## 14. Cross-references
+
+### Design docs (read in this order for a fresh session)
+
+1. `.claude/knowledge/pr-x12-codec-x265-design.md` — original codec spec
+2. **this doc** — cross-stack mapping + holy grail + integration plan
+3. `.claude/knowledge/pr-x10-linalg-core-design.md` — linalg substrate (distance kernels, basis transforms)
+4. `.claude/knowledge/cognitive-shader-foundation.md` — cognitive consumer side
+5. `.claude/knowledge/pr-x4-design.md` — splat cascade (Gaussian splat consumer)
+6. `.claude/knowledge/pr-x1-design.md` + `pr-x2-design.md` — GridLake substrate (batched encode path)
+
+### Rules pinned
+
+- `.claude/rules/data-flow.md` — three patterns, one rule (no `&mut self` during compute)
+- CLAUDE.md hard rules — `unsafe` + `// SAFETY:`, `cargo clippy -D warnings`, doctest examples
+- W1a consumer contract (`.claude/knowledge/vertical-simd-consumer-contract.md`) — every SIMD-touching public fn
+
+### Code references (current as of 2026-05-22)
+
+- `src/hpc/codec/ctu.rs` — A1 carrier + quad-tree (shipped, PR #170+#191)
+- `src/hpc/codec/mode.rs` — A2 bit-pack (shipped, PR #195, BASIN_NONE fix pending)
+- `src/hpc/codec/predict.rs` — A3-intra decision tree (shipped, PR #195, fits_i8 fix landed)
+- `src/hpc/codec/mod.rs` — module surface
+- `src/hpc/cam_pq.rs` — k-means substrate for §4.2 / H-1
+- `src/hpc/bf16_tile_gemm.rs` — AMX batched GEMM for §5.3 / §13.2
+
+### Currently in flight
+
+- **PR #195** (this review session): A2 mode + A3-intra. Two CodeRabbit findings open (T-1, T-2).
+- HPC module graduation (#192/#193/#194 + new batch on `claude/continue-ndarray-x0Oaw`): substrate cleanup, no codec impact.
+
+---
+
+## 15. How to use this doc
+
+**Read for a fresh session**: §0 (the big claim) + §1 (the table) + §8 (epiphanies) + §12 (debt). ~10 minutes.
+
+**Read before designing A4/A6/A7/A8**: the relevant integration plan in §10, plus §11 (exploration paths) for the option space, plus the corresponding holy grail claim in §9.
+
+**Read before merging PR #195**: §12.5 (PR-specific debt) — confirm T-PR195-1 + T-PR195-2 are addressed or explicitly deferred.
+
+**Read before any downstream consumer integration (splat, attention, gradient)**: §6 + §10.7-9 (consumer-specific paths). The codec mechanism is general; the consumer-side configuration is where domain knowledge lives.
+
+**Cite from a PR description**: cite by epiphany number (E-N) or holy grail number (H-N). The numbering is stable across edits to this doc.
+
+---
+
+_Last edit: 2026-05-22. Edit this doc when a new epiphany lands, a holy grail claim is falsified, or technical debt graduates from open → resolved. Renumber only by appending — never reuse a retired number._

From 4529c01f2c528e4ab80e309616c1a22404b2daeb Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Fri, 22 May 2026 08:12:54 +0000
Subject: [PATCH 2/6] =?UTF-8?q?docs(codec):=20PR-X12=20merged-canon=20?=
 =?UTF-8?q?=E2=80=94=20synthesise=20session=20A=20+=20session=20B?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

621-line synthesis doc consolidating two parallel architecture threads
on PR-X12's predictive-coder substrate:

- Session A: pr-x12-codec-cognitive-substrate-mapping.md (cognitive
  shaders ↔ gradient ↔ HEVC mapping, this branch)
- Session B: pr-x12-cross-domain-synergies.md (x265 ⇄ 3DGS ⇄ attention
  ⇄ SGD, merged to master via PR #195 commit 01c77cc)

Both originals retained for archeology; merged-canon supersedes them
for new content. Structure:

- §1 overlap matrix (epiphanies, holy grails, plans, debt — unique vs
  shared between A and B)
- §2 architectural claim with two-route falsifiability
- §3 ten merge-only epiphanies (M:E-A..M:E-J) — the high-density
  section: universal LinearReduce<Basis> kernel, Morton+raster as
  space-filling curve, PR #195 CodeRabbit findings generalise to all
  four domains, ndarray-codec as 5th architectural category, three
  domain-specific plug-points only
- §4 unified holy grail list (A:H + B:HG + M:H + 2 new M:H-NEW)
- §5 unified plan with resolved critical path (B's A7-first wins,
  front-load A4-design)
- §6 sequencing DAG
- §7 unified technical debt (P0/P1/P2/P3 reranked across both docs)
- §8 resolves 4 disagreements between A and B (each with winner +
  rationale)
- §9 read order for the 4-doc family
- §10 compaction-preservation contract (8 invariants)

Key merge-only outputs:

- Three load-bearing traits: PredictiveSignal, LinearReduce<Basis>,
  CurveOrder<const N>
- M:H-NEW-1: single Rust binary consumes video|splat|gradient|kv-cache
  → compressed Lance column (the falsifiability test, Plan G builds it)
- M:H-NEW-2: ~1.5 KLoC generic + ~200 LoC per domain = ~2 KLoC for
  four industries (25× code-density vs per-domain impls)

Three new integration plans:

- Plan G: multi-domain bench harness (2 weeks, MUST ship before A7)
- Plan H: extract `ndarray-codec` crate (3 days, resolves dep-cycle)
- Plan I: trait PredictiveSignal in protocol (3 days, bit-pattern
  isomorphism with splat without code merge)

Citation IDs preserved from both originals: cite as (A:E-N), (B:E-N),
(M:E-X). No code changes; pure synthesis. Cargo check clean.
---
 .../pr-x12-substrate-merged-canon.md          | 621 ++++++++++++++++++
 1 file changed, 621 insertions(+)
 create mode 100644 .claude/knowledge/pr-x12-substrate-merged-canon.md

diff --git a/.claude/knowledge/pr-x12-substrate-merged-canon.md b/.claude/knowledge/pr-x12-substrate-merged-canon.md
new file mode 100644
index 00000000..7ff8d1ca
--- /dev/null
+++ b/.claude/knowledge/pr-x12-substrate-merged-canon.md
@@ -0,0 +1,621 @@
+# PR-X12 — Substrate Merged Canon
+
+> Date: 2026-05-22  
+> Status: **MERGED CANON** — synthesises two parallel sessions' findings into one doc  
+> Supersedes (for new content; keep originals for archeology):  
+>   - `pr-x12-codec-cognitive-substrate-mapping.md` (session A: opus 4.7 main thread, this branch)  
+>   - `pr-x12-cross-domain-synergies.md` (session B: parallel thread, PR #195 branch, commit `01c77ccc`)  
+> Sister doc: `pr-x12-codec-x265-design.md` (the mechanical spec, untouched)
+
+---
+
+## 0. Why this merge exists
+
+Two independent sessions reached the same architectural claim — *PR-X12 is the universal predictive-coder substrate that subsumes four industries* — through different routes. Each session surfaced angles the other missed. This doc is the **canonical fusion**, designed to be the single doc a fresh agent reads to inherit the entire claim.
+
+The merge is not a re-statement. **Its value lies in the new epiphanies that emerge only when both halves sit side by side.** They get their own §3.
+
+### Identity-preservation rules
+
+- Both originals' citation IDs survive. Cite by `(A:E-N)` for session-A epiphany N, `(B:E-N)` for session-B epiphany N, `(M:E-X)` for new merge-only epiphany X (lettered `A` through `J` in §3). Same for holy grails (`A:H-*`, `B:HG-*`, `M:H-*`) and debt (`A:T-*`, `B:D-CODEC-*`, `B:D-STACK-*`, `M:T-*`).
+- **Numbering stable across edits**: append-only, never reuse retired IDs.
+
+---
+
+## 1. Side-by-side overlap & unique-angle inventory
+
+The two docs overlap ~30% on the surface claim and diverge sharply on emphasis. The matrix below maps every load-bearing item.
+
+### 1.1 Epiphanies — the union
+
+| Concept | Session A | Session B | Status |
+|---|---|---|---|
+| Skip/Merge/Delta/Escape ≡ ZeRO buckets + LoRA | E-1 | (implicit in § 2 + E8) | **A more explicit on ZeRO; B more explicit on LoRA** |
+| CTU quad-tree ≡ attention hierarchy | E-2 | (implicit in § 3.1) | A more explicit |
+| K-means at frame rate = HEVC SCC unlock | E-3 | — | **A-unique** (2013-hardware history framing) |
+| Transform basis IS optimizer preconditioner | E-4 | — | **A-unique** (DCT-II ↔ Adam ↔ KFAC ↔ learned conv = same `Δ' = B·Δ`) |
+| rANS + k-means = Shannon-optimal grad compression | E-5 | E6 (rANS L1-cache scaling) | A more theoretical; B more pragmatic |
+| λ-RDO is universal training objective | E-6 | (implicit, § 5 Plan B) | A more explicit |
+| Block-matched ME via i8gemm | E-7 | — | **A-unique** |
+| CTU partition as tropical-GEMM | E-8 | — | **A-unique** |
+| CABAC → tiny transformer | E-9 | — | **A-unique** |
+| Deblocking + SAO as learned conv | E-10 | — | **A-unique** |
+| Palette codebook ≡ MKL k-means | E-11 | (implicit § 5 Plan E) | A more explicit |
+| Mode discriminants pin wire codes | E-12 | (in D-CODEC-10) | A frames as invariant; B as future-debt |
+| Basin codebook IS rANS frequency table | E-13 | (implicit § 5 Plan A) | A more explicit |
+| Reserved header bits 14-15 are inter-tier link | E-15 | (D-CODEC-10) | **A-unique** (concrete bit-budget plan) |
+| **MergeDir is topology, not direction** | — | E1 | **B-unique** (carrier-agnostic claim) |
+| **`predict_intra` encodes attention sinks** | (implicit in §3.2) | E2 | **B more explicit** — names Streaming-LLM, H2O, SnapKV |
+| **`escape_next` IS all-reduce slot allocator** | (implicit T-PR195-1) | E3 | **B-unique** — the bug-as-feature reframing |
+| **Fingerprint ≡ 3DGS first-6-floats after basin** | — | E4 | **B-unique** — bit-level identity claim |
+| **Morton sort ≡ HEVC raster scan** | — | E5 | **B-unique** — 1D-along-curve equivalence |
+| **splat3d × codec = same pipeline shifted 90°** | — | E9 | **B-unique** — mode-decide+reduce shared kernel |
+| **Lossy Escape IS the PSNR knob** | T-4 (debt) | E10 (feature) | A frames as debt; B frames as feature |
+
+### 1.2 Holy grail claims — the union
+
+| Claim | Session A | Session B |
+|---|---|---|
+| PR-X12 + cam_pq = HEVC SCC done right | H-1 | (implicit § 7 HG5) |
+| Transform IS optimizer | **H-2** | — |
+| CTU quad-tree = universal hierarchical attention | H-3 | (subsumed by HG3) |
+| rANS + k-means = Shannon-optimal | H-4 | (subsumed by § 7) |
+| PR-X12 generalises ZeRO | H-5 | HG4 (federated SGD 8-16×) |
+| 64×64 CTU right for both 4K video and 7B LLMs | H-6 | — |
+| Codec is the substrate, rest is renaming | H-7 | HG1 (one codec, four loads) |
+| Sub-1-bit/Gaussian 3DGS compression | — | **HG2** |
+| Bit-exact attention with tunable accuracy floor | — | **HG3** |
+| Lance substrate identity becomes ground truth | (T-16 cross-ref) | **HG5** |
+| splat3d × x265 = one library | — | **HG6** |
+
+### 1.3 Integration plans — the union
+
+| Plan | Session A | Session B | Effort |
+|---|---|---|---|
+| A4 transform (`Transform` trait + DCT-II + Identity) | §10.2 | (D-CODEC-3 indirectly) | 1 week |
+| A6 RDO (λ-weighted) | §10.3 | Plan B-indirect, § 5 RDO | 1 week |
+| **A7 rANS** | §10.4 | **Plan A** ← *critical path per B* | 1.5 weeks |
+| A8 stream framing | §10.5 | D-CODEC-4 | 1 week |
+| A3-inter cross-tier prediction | §10.6 | **Plan B** | 0.5-1 week |
+| Splat consumer integration | §10.7 | **Plan E** (more detailed) | 0.5-3 weeks |
+| Cognitive shader consumer (NARS) | §10.8 | **Plan D** (attention codec) | 1-2 weeks |
+| Gradient compression (burn/candle) | §10.9 | **Plan F** | 2-4 weeks |
+| **EWA SYRK-batched (3DGS perf)** | — | **Plan C** | 1 week |
+| **Carrier-agnostic topology trait** | — | X1 | sprint |
+| **AMX TDPBF16PS for batched EWA sandwich** | — | X8 | sprint |
+
+### 1.4 Technical debt — the union
+
+A's 23 items (T-1..T-23) and B's 23 items (D-CODEC-1..10 + D-STACK-1..13) overlap on:
+- PR #195 CodeRabbit findings (A:T-1, A:T-2; no direct B item, but acknowledged in B's §6 under D-CODEC-6)
+- A3-inter not yet shipped (A:T-5, A:T-6 ≡ B:D-CODEC-1)
+- No SIMD-batched encode (A:T-7 ≡ B:D-CODEC-8)
+- Cross-repo dep direction problem (A:T-16, A:T-17 ≡ B:D-STACK-6, D-STACK-12)
+- No `Transform` trait yet (A:T-11 ≡ B's Plan A→C dependency)
+
+But B surfaces things A missed:
+- **B:D-STACK-1** — `BlockedGrid` 64×64 vs splat3d tile 16×16 mismatch. **Real P1 issue.**
+- **B:D-STACK-7** — `lance-graph-contract/src/splat.rs` is **sacred, do not touch even if bit patterns rhyme**. Architectural invariant.
+- **B:D-STACK-11** — AVX-512 mandatory in `.cargo/config.toml` conflicts with multi-architecture federated SGD.
+- **B:D-STACK-13** — No multi-domain benchmark harness — HG1 is unproven without it.
+
+And A surfaces things B missed:
+- **A:T-22** — causal-edge v2 metadata (Intervention/Counterfactual mantissa) can flow through reserved header bits for free.
+- **A:T-19** — GridLake `#[derive(SoA)]` macro never shipped; the codec batched-encode path will want it.
+- **A:T-3** — first-fit Merge policy needs an RDO replacement, with first-fit kept as the λ=0 default.
+
+---
+
+## 2. Synthesis — what both docs collectively prove
+
+When both halves sit side by side, the **architectural claim sharpens past either individual doc**:
+
+> PR-X12 is not a video codec, not a gradient compressor, not an attention sparsifier, not a splat compressor. **It is a `trait PredictiveSignal` that all four implement.** The codec mechanism is one piece of generic glue (~1 KLoC); each domain ships ~200 LoC of trait impl. The total stack code for all four industries is ~2 KLoC, versus ~50 KLoC of per-domain implementations elsewhere.
+
+This claim survives only because both docs independently converged on it from different routes (mode-coding semantics in A, primitive-mapping matrix in B). One source could be mistaken; agreement between two independent routes is the cross-check that makes the claim actionable.
+
+---
+
+## 3. New epiphanies — side-by-side only
+
+These are the insights that emerge **only when both docs sit next to each other**. None appear in either original. Cite as `M:E-A` through `M:E-J`.
+
+### M:E-A — Mode-decide + reduce IS the universal pipeline kernel
+
+A's E-4 (transform IS optimizer preconditioner) + B's E9 (splat3d × codec = same pipeline shifted 90°) combined:
+
+The *reduction operator* in B's "unified mode-decide+reduce trait" is **exactly the basis-times-source product** A's transform claim points at:
+- Alpha-composite (3DGS) = `α @ src` (degenerate basis)
+- rANS-encode = `freq_table @ symbols`
+- Sum-reduce (SGD all-reduce) = `1ᵀ @ src` (constant basis)
+- Softmax attention = `softmax(QKᵀ) @ V`
+
+All four reductions are **matrix-vector products with different basis choices**. The `Transform` trait isn't just optimizer preconditioning — it's the universal reduce-op. **A6 RDO + A4 transform + A7 rANS + the splat3d alpha-composite all share one trait surface**. The split between "transform" and "reduce" is artificial; they're the same operator with different basis matrices.
+
+→ Action: design `trait LinearReduce<Basis>` as the unifying surface. Issue in A4 design phase, not later. Mark as the **load-bearing trait of PR-X12** alongside `PredictiveSignal`.
+
+### M:E-B — Morton ≡ raster scan = 1D-along-curve predictive coder
+
+B's E5 (Morton sort = HEVC raster scan) combined with A's CTU-quad-tree-is-attention claim:
+
+Both 3DGS Morton/Hilbert traversal and HEVC z-order raster scan are *space-filling curves*. After the curve, every signal is **a 1D stream of locally-coherent values**. The CTU partition machinery is genuinely 1D-aware (8-neighbour matters; full 2D distance doesn't).
+
+→ This means the CTU code is **dimensionally generic** if we pre-sort by a space-filling curve. 3DGS at depth 3 = 8 cells per leaf = an 8-cell window along a Morton curve. Cognitive cells at depth 3 = 8 cells per leaf along a raster scan. **Same kernel, different curve.**
+
+→ Action: factor out `trait CurveOrder<const N: usize>` — depth-3 leaves are always N=8 windows along the curve. The codec is dimension-agnostic at the type level; the curve choice is a runtime detail.
+
+### M:E-C — The CodeRabbit PR #195 findings generalise to every domain
+
+A's T-1 (BASIN_NONE collision) and T-2 (unwrap_or non-bijection) seem like local cleanups. Combined with B's four-load framing (§2), they reveal:
+
+- **BASIN_NONE collision** → "highest valid attention-vocabulary token collides with no-token sentinel" — same bug, same fix, different domain. Will fire in every consumer that uses the full 4096-entry codebook.
+- **unwrap_or non-bijection** → "malformed gradient becomes silently-zero gradient." Same bug shape, *very different impact*: a zeroed gradient = a frozen parameter = silent training corruption with no error signal.
+
+**The PR #195 fixes are not local: they are the right design pattern for all four loads.** Defensive `unwrap_or` is acceptable in a video codec, where decode always produces a valid (if wrong) pixel; in gradient compression it is a silent loss-function corruption.
+
+→ Action: when reviewing future consumer PRs, audit for *both* fixes by default. Add a clippy lint or test pattern that catches the `unwrap_or` shape in encode paths.
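+
+The bug shape and its fix can be sketched as follows. This is an illustration, not the code from PR #195: `pack_residual_lossy`, `pack_residual_checked`, `pack_all`, and `PackError` are hypothetical names, and the residual is assumed to be an `i32` that must fit in an `i8`.
+
+```rust
+#[derive(Debug, PartialEq, Eq)]
+pub enum PackError {
+    /// The residual does not fit the 8-bit wire field.
+    ResidualOutOfRange(i32),
+}
+
+/// The shape to audit for: an out-of-range residual silently becomes 0,
+/// so two different inputs (0 and 300) encode to the same byte.
+pub fn pack_residual_lossy(r: i32) -> i8 {
+    i8::try_from(r).unwrap_or(0)
+}
+
+/// The fixed shape: every accepted input has a unique encoding, and
+/// everything else is reported to the caller.
+pub fn pack_residual_checked(r: i32) -> Result<i8, PackError> {
+    i8::try_from(r).map_err(|_| PackError::ResidualOutOfRange(r))
+}
+
+/// Propagating with `?` keeps the whole encode path honest.
+pub fn pack_all(residuals: &[i32]) -> Result<Vec<i8>, PackError> {
+    let mut out = Vec::with_capacity(residuals.len());
+    for &r in residuals {
+        out.push(pack_residual_checked(r)?);
+    }
+    Ok(out)
+}
+```
+
+A test pattern for the audit could assert that no two distinct in-range inputs share an encoding and that every out-of-range input returns `Err`.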
+
+### M:E-D — The third crate that breaks ndarray ↔ lance-graph cycle IS the codec
+
+A's T-16/T-17 (cross-repo dep direction problem) + B's D-STACK-6/D-STACK-12 (Lance substrate as ground truth) both flag the architectural tension: ndarray is dep-bottom; lance-graph-as-substrate would require ndarray → lance-graph, breaking the layering rule.
+
+The resolution is **already implicit** in the merged claim: after PR-X12 stabilises, extract `crate::hpc::codec::*` into a sibling crate `ndarray-codec`. Both `ndarray` and `lance-graph` then depend on it. The codec lives at the dep-bottom layer not as "ndarray hardware" but as **its own architectural category**.
+
+→ Action: add a **fifth category** to the architecture rule in CLAUDE.md:
+```
+- ndarray = hardware (SIMD, Palette, Base17, SpoDistanceMatrices, read_bgz7_file)
+- ndarray-codec = compression substrate (Ctu, LeafCu, predict_intra, rANS) ← NEW
+- lance-graph = thinking (NarsTruth, NarsEngine, TripleModel, AutocompleteCache)
+- causal-edge = protocol (CausalEdge64, NarsTables, forward/learn)
+- p64 = convergence highway (both repos meet here)
+```
+
+→ Plan **Integration H** (below): extract the codec crate. Pre-condition for HG5.
+
+### M:E-E — `Transform` trait is the **only** domain-specific surface
+
+A's E-4 (DCT-II ↔ Adam ↔ KFAC ↔ learned conv) + B's E1 (MergeDir is topology-free):
+
+Once you accept that `MergeDir`'s 4-way alphabet is topology-free (B), and the `Transform` basis is the universal `Δ' = B·Δ` operator (A), the question "what's domain-specific in the codec?" has a precise answer: **only the Transform impl + the curve order + the escape payload type.**
+
+- Mode decision (Skip/Merge/Delta/Escape) = domain-agnostic
+- Basin codebook k-means = domain-agnostic (just need the right metric)
+- rANS = domain-agnostic
+- Stream framing = domain-agnostic
+- Transform basis = **domain-specific** (Identity / DCT-II / Adam / KFAC / SH-spectral / learned)
+- Curve order = **domain-specific** (raster / Morton / token-seq / layer-seq)
+- Escape payload type = **domain-specific** (u64 / SH-coefficients / f16 vector / f32 grad)
+
+**Three plug-points, otherwise generic.** This is the cleanest factorisation. Splat consumers ship one `impl Transform for SHCoeffBasis`; that's it.
+
+→ Action: every per-domain consumer PR (Plan D, E, F) must touch only those three surfaces. If a consumer needs to modify the mode decision or the rANS, that's a sign the abstraction is wrong — escalate.
+
+### M:E-F — Sequencing resolution: B's A7-first wins, but only by one knight-move
+
+A's §10 sequenced A4 transform first ("unlocks H-2"); B's §10 sequenced A7 rANS first ("without it the codec is academic"). Side-by-side, B is right:
+
+- A4 with Identity transform = no change to compression ratio (transform is a 1-2× refinement)
+- A7 alone = 3× → 8-10× compression ratio (rANS is the entropy floor)
+- Therefore: ship A7 with Identity transform first → 8-10× ratio; then ship A4 (DCT-II) → 12-15×; then A6 RDO → 18-20×.
+
+**But** A's H-2 (transform IS optimizer) only becomes citable once A4 ships with the trait shape. If we ship A7 first with a hardcoded identity transform and only refactor to add the `Transform` trait when A4 lands, the trait design happens **at A4 time**, without the demand pressure that comes from a trait already in use.
+
+→ Resolution: ship A7 first **but only after A4's `Transform` trait surface is designed** (not implemented — just the trait shape committed to). A4's design phase is one day; A7's implementation is 1.5 weeks. Front-load the design.
+
+### M:E-G — `Ctu<const N: usize>` is the right block-size answer
+
+B's D-STACK-1 (BlockedGrid 64×64 vs splat3d 16×16 mismatch) is real and P1. The resolution is **type-level**:
+
+```rust
+use core::num::NonZeroU16;
+
+/// One CTU with an N×N footprint. `CtuArena<N>` is assumed to be defined elsewhere in the codec.
+pub struct Ctu<const N: usize = 64> {
+    block_row: u16,
+    block_col: u16,
+    tier: NonZeroU16,
+    split_depth: u8,
+    arena: CtuArena<N>,
+}
+
+type CtuVideo = Ctu<64>;     // Cognitive cells, HEVC
+type CtuSplat = Ctu<16>;     // 3DGS tiles
+type CtuHead = Ctu<8>;       // LLM attention heads
+```
+
+Const-generic over N. `MAX_QUAD_TREE_NODES` becomes a const fn. Block-size mismatch dissolves into a type-level configuration. **No code duplication; no runtime branching.**
+
+→ Action: introduce as part of A4 (since A4 will need basis sizing). Mark as **prerequisite for Plan E** (3DGS codec).
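+
+A sketch of the "`MAX_QUAD_TREE_NODES` becomes a const fn" step, assuming the quad-tree halves the side length at each level and stops at 8×8 leaves. `max_quad_tree_nodes`, `LEAF_SIDE`, and `CtuShape` are hypothetical names used only for illustration.
+
+```rust
+/// Assumed smallest leaf side length.
+pub const LEAF_SIDE: usize = 8;
+
+/// Total nodes in a full quad-tree over an n×n block: 1 + 4 + 16 + ...
+/// with one level per halving until the side drops below `LEAF_SIDE`.
+pub const fn max_quad_tree_nodes(n: usize) -> usize {
+    let mut side = n;
+    let mut nodes_at_level = 1;
+    let mut total = 0;
+    while side >= LEAF_SIDE {
+        total += nodes_at_level;
+        nodes_at_level *= 4;
+        side /= 2;
+    }
+    total
+}
+
+/// Marker type standing in for `Ctu<N>`; the node bound is resolved
+/// per instantiation at compile time.
+pub struct CtuShape<const N: usize>;
+
+impl<const N: usize> CtuShape<N> {
+    pub const MAX_NODES: usize = max_quad_tree_nodes(N);
+}
+```
+
+Using the bound as an array length inside the generic struct (for example `[Node; max_quad_tree_nodes(N)]`) would need the unstable `generic_const_exprs` feature at the time of writing, so an associated const plus heap or fixed-maximum storage is the conservative choice.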
+
+### M:E-H — D-STACK-13 (multi-domain bench harness) is the highest-leverage debt
+
+Across all 46 numbered debt items in both docs, exactly one leaves the architecture claim unfalsifiable until code exists: B's D-STACK-13. Without a single-binary four-loads benchmark, *the entire architectural claim is unproven*. Every other debt item degrades performance or correctness; this one degrades **confidence**.
+
+→ The bench harness is the **proof** of HG1 / HG6 / M:H-NEW-1. Build it **before** A7 ships. Two weeks of bench-harness work front-loaded saves six months of "we implemented A7 and then realised the trait shape was wrong."
+
+→ Action: **Integration Plan G** (below): the bench harness is sub-card A0 of PR-X12. Renumber the sub-cards. A0 = bench harness. A1-A8 as before.
+
+### M:E-I — `lance-graph-contract/src/splat.rs` and `Fingerprint` are isomorphic but must not fold
+
+B's D-STACK-7 says: never touch `lance-graph-contract/src/splat.rs`, even if bit patterns rhyme.  
+B's E4 says: `Fingerprint` ≡ 3DGS-first-6-floats bit-level identity exists.
+
+The resolution: ship a shared **trait**, not shared **code**.
+
+```rust
+// Lives in ndarray-codec (the new crate per M:E-D), or in a *protocol* crate
+pub trait PredictiveSignal {
+    type Basin;
+    type Residual;
+    type Escape;
+    
+    fn nearest_basin(&self, codebook: &[Self::Basin]) -> (u16, Self::Residual);
+    fn fits_delta(residual: &Self::Residual) -> bool;
+    fn pack_residual(residual: &Self::Residual) -> u8;
+}
+```
+
+`impl PredictiveSignal for Fingerprint` lives in `lance-graph::cognitive`. `impl PredictiveSignal for GaussianSplat` lives in `ndarray::hpc::splat3d`. **Bit-pattern identity is proven by both impls; code identity is rejected by the architecture rule.** The trait isomorphism gives us what E4 wants without violating D-STACK-7.
+
+→ This is the cleanest resolution of the apparent contradiction between the two docs.
+
+### M:E-J — The reserved header bits 14-15 carry causal-edge metadata for free
+
+A's E-15 (reserved bits 14-15 are inter-tier link) + A's T-22 (causal-edge v2 mantissa: Intervention=+6, Counterfactual=-6):
+
+Two reserved bits = 4 states. The natural 4-state encoding for cognitive content:
+- 00 = Observation (rung 1)
+- 01 = Intervention (rung 2)
+- 10 = Counterfactual (rung 3)
+- 11 = Reserved / inter-tier link (the original E-15 use)
+
+**Pearl-rung causal direction rides in the codec's wire format for free**, with the inter-tier link as the 4th state. The cognitive consumer (Plan D) doesn't need to extend `LeafCu`; it gets causal metadata via the 16-bit header's high 2 bits.
+
+→ Action: document this in A3-inter design (Plan B). Don't ship until the cognitive consumer needs it; reserve the bits explicitly so the wire format doesn't pin them to a different meaning later.
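+
+The 4-state encoding above can be sketched as a pair of pure functions over the 16-bit header. This assumes the two reserved bits sit at positions 14 and 15 and that the low 14 bits must pass through untouched. `Rung`, `set_rung`, and `rung_of` are hypothetical names, not part of `LeafCu`.
+
+```rust
+#[derive(Clone, Copy, Debug, PartialEq, Eq)]
+#[repr(u8)]
+pub enum Rung {
+    Observation = 0b00,
+    Intervention = 0b01,
+    Counterfactual = 0b10,
+    /// Reserved / inter-tier link (the original E-15 use).
+    InterTierLink = 0b11,
+}
+
+const RUNG_SHIFT: u32 = 14;
+const RUNG_MASK: u16 = 0b11 << RUNG_SHIFT;
+
+/// Write the rung into bits 14-15, leaving bits 0-13 unchanged.
+pub fn set_rung(header: u16, rung: Rung) -> u16 {
+    (header & !RUNG_MASK) | ((rung as u16) << RUNG_SHIFT)
+}
+
+/// Read the rung back from bits 14-15.
+pub fn rung_of(header: u16) -> Rung {
+    match header >> RUNG_SHIFT {
+        0b00 => Rung::Observation,
+        0b01 => Rung::Intervention,
+        0b10 => Rung::Counterfactual,
+        _ => Rung::InterTierLink,
+    }
+}
+```
+
+Because all four bit patterns decode to a named state, a reader never sees an invalid rung, which is what makes reserving the bits explicitly cheap.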
+
+---
+
+## 4. Unified holy grail list (canonical)
+
+Merge of A's H-1..H-7 + B's HG1..HG6 + two new M:H-* claims that emerge from the merge.
+
+### Combined load-bearing claims
+
+**M:H-1** *(merge of A:H-7 + B:HG1)* — The codec is the substrate; all four loads (video, 3DGS, attention, gradient) are renamings sharing one trait surface. ~2 KLoC of generic glue + ~200 LoC per domain consumer = the entire stack for all four.
+
+**M:H-2** *(from A:H-2 alone)* — The transform basis IS the optimizer's preconditioning matrix. The most underrated single claim. Resolves the disconnect between codec research (where "transform" is central) and ML research (where "preconditioner" is central) — same operator, two names, no cross-citation in either literature.
+
+**M:H-3** *(merge of A:H-3 + B:HG3)* — Bit-exact attention with tunable accuracy floor via Skip/Merge/Delta/Escape over (Q,K) palette. The accuracy knob is a single `u8` threshold. Subsumes Streaming-LLM, H2O, SnapKV as configuration cases.
+
+**M:H-4** *(from A:H-4 alone, with B:E6 nuance)* — rANS + k-means achieves Shannon-optimal lossless gradient compression. Every published gradient-compression scheme (QSGD, signSGD, PowerSGD, Top-K, Random-K) is a special case with a particular frequency table and basis. Confirmed by B:E6 (rANS's L1-cache throughput dominance).
+
+**M:H-5** *(merge of A:H-5 + B:HG4)* — PR-X12 generalises ZeRO with Merge providing the inter-parameter correlation dimension ZeRO can't capture. Federated SGD at 8-16× compression with zero accuracy loss when worker count > 16 (Merge becomes dominant).
+
+**M:H-6** *(from B:HG2 alone)* — Sub-1-bit-per-Gaussian 3DGS compression. 30-60× over current state-of-the-art PLY-trim. A 1M-Gaussian scene = ~500 KB, streamable as video. **Most economically valuable single claim** — directly attacks the bandwidth bottleneck for cloud-rendered 3D content.
+
+**M:H-7** *(merge of A:H-1 + B:HG5)* — Lance column substrate identity becomes ground truth. `SpoDistanceMatrices` at 611M lookups/sec serves as universal palette codebook lookup across all four loads. ndarray = hardware, ndarray-codec = compression substrate (new, per M:E-D), lance-graph = thinking, causal-edge = protocol, p64 = convergence. Five-category architecture.
+
+**M:H-8** *(from A:H-6 alone)* — 64×64 CTU is the right unit for both 4K video luma blocks and 7B-parameter LLM head dim × 16 heads. Convergent evolution from two unrelated industries arriving at the same arithmetic block size.
+
+**M:H-9** *(from B:HG6 alone)* — splat3d × x265 = one library: compress, stream, decode, render 3D scenes in real-time on a single core. **Demo-worthy** — pick one Mip-NeRF 360 scene, compress with PR-X12 at A8 land time, stream via WebRTC, decode + render via splat3d. Single Rust binary; ~5 MB total binary size.
+
+### M:H-NEW — claims born from the merge
+
+**M:H-NEW-1** — The same Rust binary consumes (4K video frames | 1M-Gaussian 3DGS scene | 7B-LLM gradient stream | attention KV cache) and emits a compressed Lance column. One CLI. One codec. Four loads. **This is the falsifiability test** — build it (Plan G, the bench harness), prove HG1/H-7 by demonstration, not by argument.
+
+**M:H-NEW-2** — `trait PredictiveSignal` + `trait LinearReduce<Basis>` + `trait CurveOrder<const N: usize>` factor the codec into three plug-points (per M:E-E + M:E-A + M:E-B). The codec body is `<150 LoC` of generic glue. Domain consumers ship `<200 LoC` of trait impls. **Total stack for all four industries: ~2 KLoC.** Compared to ~50 KLoC per-domain implementations elsewhere. The 25× code-density delta IS the architectural payoff that justifies the eight sub-cards.
+
+---
+
+## 5. Unified integration plan (canonical sequencing)
+
+Replaces both A:§10 and B:§5 plan lists. Critical path resolved per M:E-F.
+
+### Phase 0 — substrate (must ship before consumer PRs)
+
+**Plan G** *(new — from M:E-H)*: Multi-domain benchmark harness  
+**Effort:** 1 worker × 2 weeks  
+**Output:** `crates/codec-bench/` — single binary that ingests video / 3DGS / KV cache / gradient stream and emits compressed Lance columns + ratio + reconstruction error.  
+**Why first:** unfalsifiable architecture claim becomes falsifiable. Drives trait design.  
+**Pre-condition for:** every other plan.
+
+**Plan A4-design** *(from M:E-F resolution)*: `Transform` trait shape committed (not implemented)  
+**Effort:** 1 worker × 1 day  
+**Output:** PR introducing `trait Transform { fn apply(&[i8;N])->[i8;N]; fn invert(...); }` + Identity default impl + DCT-II stub.  
+**Why now:** A7 design will reference the trait; cheaper to commit the shape upfront.
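+
+The inline signature above elides `self` and the const parameter. One compilable reading of it, as a sketch under assumptions: `N` is a const generic on the trait, blocks are `[i8; N]`, and `invert` mirrors `apply`. `PrevDelta` is a stand-in basis added only to show a non-trivial reversible impl; it is not DCT-II and not part of the plan.
+
+```rust
+/// Hypothetical trait shape: a reversible map over an N-sample block.
+pub trait Transform<const N: usize> {
+    fn apply(&self, block: &[i8; N]) -> [i8; N];
+    fn invert(&self, coeffs: &[i8; N]) -> [i8; N];
+}
+
+/// Default impl: passes the block through unchanged.
+pub struct Identity;
+
+impl<const N: usize> Transform<N> for Identity {
+    fn apply(&self, block: &[i8; N]) -> [i8; N] {
+        *block
+    }
+    fn invert(&self, coeffs: &[i8; N]) -> [i8; N] {
+        *coeffs
+    }
+}
+
+/// Stand-in basis (NOT DCT-II): first differences with wrapping
+/// arithmetic, so `invert(apply(x)) == x` for every `i8` input.
+pub struct PrevDelta;
+
+impl<const N: usize> Transform<N> for PrevDelta {
+    fn apply(&self, block: &[i8; N]) -> [i8; N] {
+        let mut out = *block;
+        for i in 1..N {
+            out[i] = block[i].wrapping_sub(block[i - 1]);
+        }
+        out
+    }
+    fn invert(&self, coeffs: &[i8; N]) -> [i8; N] {
+        let mut out = *coeffs;
+        for i in 1..N {
+            out[i] = coeffs[i].wrapping_add(out[i - 1]);
+        }
+        out
+    }
+}
+```
+
+Committing to the round-trip property (`invert(apply(x)) == x`) as a test at design time would give A7 something concrete to code against before any real basis exists.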
+
+**Plan H** *(new — from M:E-D)*: Extract `ndarray-codec` crate  
+**Effort:** 1 worker × 3 days  
+**Output:** `crate::hpc::codec::*` moves to sibling `ndarray-codec` crate. Both `ndarray` and `lance-graph` depend on it.  
+**Why now:** Resolves cross-repo dep tension before HG5 / HG7 consumers land. Lower cost while the codec is still small (~1.5 KLoC).  
+**Architectural impact:** Update CLAUDE.md "Architecture Rule" to add the codec as 5th category.
+
+**Plan I** *(new — from M:E-I)*: `trait PredictiveSignal` in protocol crate  
+**Effort:** 1 worker × 3 days  
+**Output:** Shared trait + impls for `Fingerprint` (cognitive), `GaussianSplat` (3DGS), `AttentionSlot` (KV cache), `GradientWeight` (SGD).  
+**Why now:** Resolves the "sacred splat.rs" + "Fingerprint ≡ 3DGS-first-6-floats" tension via trait isomorphism (D-STACK-7 + E4).
+
+### Phase 1 — codec mechanism completion
+
+**Plan A** *(from B:Plan A — critical path)*: A7 rANS  
+**Effort:** 1 worker × 1.5 weeks  
+**Output:** `src/hpc/codec/ans.rs` (or in `ndarray-codec` after Plan H) — encoder + decoder + parity test.  
+**Compression unlock:** 3× → 8-10×.
+
+**Plan B** *(from B:Plan B + A:§10.6)*: A3-inter cross-tier neighbour scan  
+**Effort:** 1 worker × 3-5 days  
+**Output:** `IntraContext` (rename `PredictContext`) gains parent-tier + child-tier slots. Uses A's E-15 reserved bits 14-15. Reserves M:E-J causal-edge metadata encoding.  
+**Compression unlock:** +20-30% for hierarchical content (3DGS LOD cascades, attention layer-merge).
+
+**Plan A4-impl** *(from A:§10.2)*: A4 transform implementation  
+**Effort:** 1 worker × 1 week  
+**Output:** DCT-II 4×4/8×8 impls + batched dispatch to `bf16_tile_gemm` at ≥64 blocks.  
+**Depends on:** Plan A4-design (ships in Phase 0).  
+**Compression unlock:** +30-50% for spectrally-smooth content.
+
+**Plan A6** *(from A:§10.3 + B's RDO note)*: λ-RDO  
+**Effort:** 1 worker × 1 week  
+**Output:** `RdoConfig::lambda: f32` + `trait RdoMetric` (M:E new) for per-domain rate-distortion shape (PSNR / MSE / downstream-loss / KL).  
+**Compression unlock:** Configurable rate/distortion tradeoff.
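+
+A sketch of how `RdoConfig::lambda` and `trait RdoMetric` could combine, assuming the usual Lagrangian cost J = D + λ·R with rate measured in bytes. Only `RdoConfig::lambda` and `RdoMetric` are named above; `Mse`, `Candidate`, `cost`, and `pick` are hypothetical.
+
+```rust
+/// Per-domain distortion shape (PSNR, MSE, downstream loss, ...).
+pub trait RdoMetric {
+    fn distortion(&self, original: &[f32], reconstructed: &[f32]) -> f32;
+}
+
+/// Mean squared error, one possible metric.
+pub struct Mse;
+
+impl RdoMetric for Mse {
+    fn distortion(&self, original: &[f32], reconstructed: &[f32]) -> f32 {
+        let sum: f32 = original
+            .iter()
+            .zip(reconstructed)
+            .map(|(o, r)| (o - r) * (o - r))
+            .sum();
+        sum / original.len().max(1) as f32
+    }
+}
+
+pub struct RdoConfig {
+    pub lambda: f32,
+}
+
+/// One candidate encoding of a leaf: its wire cost and its reconstruction.
+pub struct Candidate<'r> {
+    pub name: &'static str,
+    pub rate_bytes: u32,
+    pub recon: &'r [f32],
+}
+
+/// Lagrangian cost J = D + lambda * R.
+fn cost<M: RdoMetric>(cfg: &RdoConfig, metric: &M, original: &[f32], c: &Candidate<'_>) -> f32 {
+    metric.distortion(original, c.recon) + cfg.lambda * c.rate_bytes as f32
+}
+
+/// Pick the candidate with the lowest cost; `None` if the slice is empty.
+pub fn pick<'c, 'r, M: RdoMetric>(
+    cfg: &RdoConfig,
+    metric: &M,
+    original: &[f32],
+    candidates: &'c [Candidate<'r>],
+) -> Option<&'c Candidate<'r>> {
+    candidates.iter().min_by(|a, b| {
+        cost(cfg, metric, original, a).total_cmp(&cost(cfg, metric, original, b))
+    })
+}
+```
+
+With a small λ the exact but more expensive candidate wins; with a large λ the cheaper, lossier one does. That is the single-knob behaviour the plan describes.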
+
+**Plan A8** *(from A:§10.5 + B:Plan A8)*: Stream framing  
+**Effort:** 1 worker × 1 week  
+**Output:** Frame headers, CTU markers, per-frame basin codebook serialisation, per-frame rANS frequency table.
+
+### Phase 2 — consumer integrations
+
+**Plan C** *(from B:Plan C — splat performance)*: EWA SYRK-batched  
+**Effort:** 1 worker × 1 week  
+**Output:** `crate::hpc::splat3d::spd3` swaps per-Gaussian loop for batched `cblas_ssyrk`. Backend dispatch (native / intel-mkl / openblas).  
+**Why parallel to Phase 1:** No codec dependency; can ship anytime.
+
+**Plan E** *(from B:Plan E — most impactful consumer)*: 3DGS coefficient codec  
+**Effort:** 2 workers × 3 weeks  
+**Output:** `crate::hpc::splat3d::codec` — Morton-sort, per-asset palette, mode-code, rANS.  
+**Depends on:** Plan A (rANS), Plan B (A3-inter), Plan I (`PredictiveSignal`).  
+**Unlock:** M:H-6 (sub-1-bit/Gaussian), M:H-9 (splat3d × x265 demo).
+
+**Plan D** *(from B:Plan D — attention codec)*: Attention KV cache compression  
+**Effort:** 2 workers × 2 weeks  
+**Output:** `crates/attention-codec/` consuming `ndarray-codec`.  
+**Depends on:** Plan A, Plan I.  
+**Unlock:** M:H-3 (bit-exact attention with knob).
+
+**Plan F** *(from B:Plan F — gradient compression)*: Federated SGD  
+**Effort:** 2 workers × 4 weeks  
+**Output:** `crates/grad-codec/`. Generalised `&mut u32` allocator across worker pools (per B:E3).  
+**Depends on:** Plan A, Plan B, Plan I, multi-arch dispatch (resolves D-STACK-11).  
+**Unlock:** M:H-5 (ZeRO generalisation, 8-16× compression).
+
+### Phase 3 — exploration / research
+
+(Both docs' X-paths merged, ranked by confidence)
+
+| Path | Source | Effort | Status |
+|---|---|---|---|
+| Carrier-agnostic 4-neighbour topology trait | B:X1 | sprint | **Subsumed by Plan I + M:E-B** |
+| Hierarchical motion estimation as cross-tier prediction | B:X2 | sprint | After Plan B |
+| CABAC vs rANS for attention KV cache | B:X3 | bench | Pre-A7 (informs Plan A) |
+| SH coefficient intra-prediction in spectral space | B:X4 | research | After Plan E |
+| Mode-coded LoRA | B:X5 + A:E8 | research | After Plan D — Qwen3.5-7B controlled experiment |
+| Unified mode-decide + reduce trait | B:X6 + M:E-A | sprint | **Promoted to Plan I extension** |
+| Lance column-substrate as universal palette codebook | B:X7 + M:E-D | sprint | After Plan H |
+| AMX TDPBF16PS for batched EWA sandwich | B:X8 | sprint | After Plan C |
+| CABAC replacement with tiny transformer | A:E-9 | 1-2 weeks | After Plan A — bleeding-edge compression |
+| CTU partition as tropical-GEMM | A:E-8 | 1 week | After Plan A6 (cross-repo: needs lance-graph::blasgraph) |
+| Deblocking + SAO as learned conv | A:E-10 | 1 week | Optional refinement; not on critical path |
+| Block-matched ME via i8gemm | A:E-7 | 1 week | Pre-shipped (already in scope) |
+| Palette codebook training as MKL k-means | A:E-11 | shipped | Already in `cam_pq::kmeans` |
+
+---
+
+## 6. Sequencing diagram
+
+```
+              ┌──────────────────────────────────────┐
+              │   Plan G (multi-domain bench)        │
+              │   2 weeks — UNFALSIFIABILITY GATE    │
+              └──────────────────┬───────────────────┘
+                                 ▼
+              ┌─────────────────────────────────────┐
+              │  Plan H (extract ndarray-codec)     │  ← parallel
+              │  3 days — DEP-CYCLE RESOLUTION       │
+              └─────────────────┬───────────────────┘
+                                ▼
+              ┌─────────────────────────────────────┐
+              │  Plan I (PredictiveSignal trait)    │  ← parallel
+              │  3 days — TRAIT ISOMORPHISM          │
+              └─────────────────┬───────────────────┘
+                                ▼
+              ┌─────────────────────────────────────┐
+              │  Plan A4-design (Transform trait    │  ← parallel
+              │  shape, 1 day)                       │
+              └─────────────────┬───────────────────┘
+                                ▼
+            ╔═══════════════════════════════════════╗
+            ║  Plan A (A7 rANS) — CRITICAL PATH     ║
+            ║  1.5 weeks — COMPRESSION FLOOR        ║
+            ╚═══════════════════╦═══════════════════╝
+                                ▼
+            ┌──────┬─────┬─────┬─────┐
+            ▼      ▼     ▼     ▼     ▼
+       Plan B   A4   A6   A8   Plan C (EWA SYRK, parallel)
+       (inter) (xfm)(RDO)(stream)
+            └──────┬─────┴─────┴─────┘
+                   ▼
+            ┌──────┴──────┐
+            ▼             ▼
+       Plan E (3DGS)  Plan D (attention)
+                          │
+                          ▼
+                     Plan F (gradient SGD)
+                          │
+                          ▼
+                ┌─────────────────┐
+                │ M:H-1 .. M:H-9  │
+                │ All unlocked    │
+                └─────────────────┘
+```
+
+**Critical path: Plan G → Plan A**. Without the bench harness, A7 ships blind. Without A7, no compression claim is testable.
+
+**Parallel paths post-A7**: A4 / A6 / A8 + Plan C can ship in parallel. Plan E / D / F gate on Plan I.
+
+---
+
+## 7. Unified technical debt (canonical)
+
+Combines A's T-1..T-23 + B's D-CODEC-1..10 + B's D-STACK-1..13 + new M:T-* items, deduplicated and re-ranked.
+
+### P0 (must address before claim)
+
+- **M:T-1**: No multi-domain bench harness — *the* unfalsifiability gate (per M:E-H). Source: B:D-STACK-13.
+- **B:D-CODEC-2**: A7 rANS unwritten — without it, ratio claim is academic.
+- **B:D-STACK-7**: `lance-graph-contract/src/splat.rs` sacred file — never touch even if bit patterns rhyme (per M:E-I).
+
+### P1 (must address before consumer PRs land)
+
+- **A:T-1, A:T-2**: PR #195 CodeRabbit findings (BASIN_NONE collision + unwrap_or non-bijection). Generalise per M:E-C.
+- **B:D-CODEC-1**: A3-inter unwritten — gates hierarchical compression. Source: A:T-5/T-6.
+- **B:D-CODEC-3**: λ-RDO unwritten — gates accuracy/compression knob.
+- **B:D-CODEC-5**: Basin codebook not built — gates every consumer. Resolved by Plan I + `cam_pq` integration.
+- **B:D-STACK-1**: BlockedGrid 64×64 vs splat3d 16×16 mismatch — resolved by M:E-G (`Ctu<const N>`).
+- **B:D-STACK-2**: Basin codebook lookup has no SIMD path — gates encoder throughput at ~10⁵ CTU/sec.
+- **B:D-STACK-3**: `MergeDir` wire-pinned to 4-way — gates topology generalisation per B:X1 / M:E-B.
+- **B:D-STACK-6, B:D-STACK-12**: Cross-repo dep direction — resolved by Plan H. Source: A:T-16, T-17.
+- **B:D-STACK-11**: AVX-512 mandatory in `.cargo/config.toml` — gates multi-arch federated SGD (Plan F).
+- **M:T-2** *(new)*: No `trait LinearReduce<Basis>` yet — gates M:E-A unification. Plan I extension.
+- **M:T-3** *(new)*: Architecture rule in CLAUDE.md lacks "codec" 5th category — gates M:H-7 / M:E-D. One-line edit.
+
+### P2 (fix in follow-up)
+
+- **A:T-3**: A3-intra first-fit policy replaced by RDO when A6 lands.
+- **A:T-7**: No SIMD-batched encode — deferred until reference + reconstruction parity (also B:D-CODEC-8).
+- **A:T-13**: `bf16_tile_gemm` NEON impl stub — gates A4 batched dispatch on ARM.
+- **A:T-14**: No `Result`-returning encode API — needed by A6 RDO (also B:D-CODEC-9).
+- **A:T-15**: K-means u8 mode wrapper needed.
+- **A:T-19**: GridLake `#[derive(SoA)]` macro never shipped — wanted for batched encode path.
+- **A:T-22**: causal-edge v2 mantissa metadata can ride reserved bits — opportunity, not debt, per M:E-J.
+- **B:D-CODEC-7**: NEWS topology hard-coded — resolved by M:E-B trait.
+- **B:D-STACK-4**: `Fingerprint` 64-bit only — type-side; resolved by Plan I trait.
+- **B:D-STACK-5**: splat3d and codec don't yet share kernel — resolved by M:E-A `LinearReduce` trait.
+- **B:D-STACK-9**: Per-frame codebook lifetime varies per load — document discipline in Plan I.
+
+### P3 (cosmetic / docs)
+
+- A:T-4, T-8, T-9, T-10 (HPC graduation residue, doc cross-refs).
+- B:D-CODEC-6 (lossy Escape docstring — feature per B:E10 / A:T-4).
+- B:D-CODEC-10 (mode 2-bit cap, future "mode 5" upgrade path) — *Note: M:E-J consumes the reserved bits for Pearl-rung metadata, not mode 5.*
+- B:D-STACK-8 (no backend dispatch in codec yet).
+- B:D-STACK-10 (multi-week dependency tracking in coordinator agent).
+- A:T-23 (architecture rule update; the rule edit itself is tracked as P1 under M:T-3).
+
+### M:T new items (merge-only)
+
+- **M:T-1**: No multi-domain bench harness (P0, see above).
+- **M:T-2**: No `LinearReduce<Basis>` trait (P1, see above).
+- **M:T-3**: Architecture rule needs 5th category (P1, see above).
+- **M:T-4**: Documentation cross-references between sister docs (this doc + both originals + pr-x12-codec-x265-design.md) need a navigation page. Low effort (1 hour); easy to forget.
+- **M:T-5**: PR-X12's sub-card numbering needs renumber (A0 = bench harness, A1-A8 as before). Two-line README change.
+- **M:T-6**: `trait CurveOrder<const N>` not yet designed (per M:E-B) — gates dimension-agnostic CTU partition.
+- **M:T-7**: `trait RdoMetric` per-domain shape — gates A6 design (per Phase 1 plan).
+
+---
+
+## 8. Resolved disagreements between the two docs
+
+A side-by-side comparison surfaced four points where A and B reached different conclusions. The merge resolves each:
+
+| Disagreement | Session A position | Session B position | **Merge resolution** |
+|---|---|---|---|
+| Critical path | A4 transform first | A7 rANS first | **B wins** — but front-load A4-design (M:E-F) |
+| Block size | (implicit) 64×64 fits all | (D-STACK-1) 64×64 conflicts with splat3d 16×16 | **B wins** — resolve via `Ctu<const N>` (M:E-G) |
+| Lossy Escape fallback | (T-4) debt | (E10) feature | **B wins** — promote to PSNR knob via λ-RDO |
+| Cross-repo dep direction | (T-16, T-17) note as debt | (D-STACK-6) propose third crate | **B wins, explicitly** — extract `ndarray-codec` (M:E-D, Plan H) |
+
+Plus one disagreement that the merge **upgrades into a new claim**:
+
+| Pseudo-disagreement | A side | B side | **Synthesis** |
+|---|---|---|---|
+| What's domain-specific in the codec? | Transform basis (H-2) | Curve order + Merge topology (E1, E5) + escape payload | **Both, plus a third surface** — M:E-E says exactly three plug-points; M:H-NEW-2 quantifies the code-density win |
+
+---
+
+## 9. How to navigate the four docs (read order)
+
+Future agents inheriting this work should read in this order:
+
+1. **`pr-x12-codec-x265-design.md`** (5 min) — the mechanical spec. What ships. What doesn't.
+2. **This doc** (`pr-x12-substrate-merged-canon.md`) (15 min) — the architectural claim + integration plan. Section 3 (merge-only epiphanies) is the highest-information density.
+3. **`pr-x12-codec-cognitive-substrate-mapping.md`** *(session A original)* — read **only** if you need A's framing for a specific claim (e.g., A:E-4 transform-IS-optimizer needs A's detailed argument). Otherwise superseded by §3 of this doc.
+4. **`pr-x12-cross-domain-synergies.md`** *(session B original)* — read **only** if you need B's framing (e.g., B:E5 Morton≡raster). Otherwise superseded.
+
+Cite using the canonical IDs:
+- Mechanism / module layout: cite `pr-x12-codec-x265-design.md` § X
+- Epiphanies: cite as `(A:E-N)`, `(B:E-N)`, or `(M:E-X)` per origin
+- Holy grails: `(A:H-N)`, `(B:HG-N)`, `(M:H-N)`, `(M:H-NEW-N)`
+- Plans: by letter — Plan A, B, C, D, E, F (B's) + Plan G, H, I (merge-only)
+- Debt: `(A:T-N)`, `(B:D-CODEC-N)`, `(B:D-STACK-N)`, `(M:T-N)`
+
+---
+
+## 10. Compaction-preservation contract
+
+When this doc is summarised across context windows, preserve at minimum:
+
+1. **The five-category architecture** with codec added (M:E-D / M:T-3).
+2. **The three plug-points** for domain specialisation (M:E-E): Transform basis, Curve order, Escape payload.
+3. **The critical path**: Plan G (bench) → Plan A4-design → Plan A (A7 rANS).
+4. **The sacred file invariant** (B:D-STACK-7): `lance-graph-contract/src/splat.rs` must never fold even if bit patterns rhyme.
+5. **The three new traits**: `PredictiveSignal` (M:E-I, Plan I), `LinearReduce<Basis>` (M:E-A, M:T-2), `CurveOrder<const N>` (M:E-B, M:T-6).
+6. **The four-load count**: cognitive cells, 3D Gaussian splats, transformer attention, distributed-SGD gradients. Not five (no quantum yet), not three (gradient compression is its own load even though it overlaps with attention).
+7. **The PR #195 fixes generalise** (M:E-C): BASIN_NONE collision + unwrap_or non-bijection patterns will fire in every consumer; audit by default.
+8. **Citation IDs stable**: don't renumber.
+
+---
+
+## 11. Cross-references
+
+### Sibling knowledge docs (read in order)
+
+- `pr-x12-codec-x265-design.md` — mechanical spec (companion, untouched)
+- `pr-x12-codec-cognitive-substrate-mapping.md` — session A original
+- `pr-x12-cross-domain-synergies.md` — session B original (merged to master via PR #195 in commit `01c77ccc`)
+
+### Adjacent design docs
+
+- `pr-x10-linalg-core-design.md` — linalg substrate (distance kernels, will host `LinearReduce` basis impls)
+- `pr-x1-design.md` + `pr-x2-design.md` — GridLake substrate (`MultiLaneColumn` — batched encode path's carrier)
+- `pr-x3-cognitive-grid-design.md` — `BlockedGrid` (CTU's parent type; needs const-generic refactor per M:E-G)
+- `pr-x4-design.md` — splat cascade (consumer of M:E-G + Plan E)
+- `cognitive-substrate-convergence-v1.md` (in lance-graph repo) — cross-repo locked spec; needs cross-reference back from §11
+
+### Hard rules (must respect)
+
+- `data-flow.md` — no `&mut self` during compute
+- `vertical-simd-consumer-contract.md` — W1a contract
+- CLAUDE.md "Architecture Rule" — to be amended per M:E-D / M:T-3
+
+### Code references (as of 2026-05-22)
+
+- `src/hpc/codec/ctu.rs` — A1 (shipped)
+- `src/hpc/codec/mode.rs` — A2 (PR #195, BASIN_NONE fix pending, see A:T-1)
+- `src/hpc/codec/predict.rs` — A3-intra (PR #195, fits_i8 fix landed)
+- `src/hpc/cam_pq.rs` — k-means substrate
+- `src/hpc/bf16_tile_gemm.rs` — AMX path (NEON stub per A:T-13)
+- `src/simd_soa.rs` — `MultiLaneColumn` (batched encode carrier per A:T-19)
+
+### In flight
+
+- **PR #195** (A2 mode + A3-intra): two CodeRabbit findings open (A:T-1, A:T-2). Tracked also in B:D-CODEC's adjacent notes. This doc lives in a separate branch (`claude/continue-ndarray-x0Oaw`) and will land independently.
+
+---
+
+## 12. The single load-bearing sentence
+
+If you read nothing else:
+
+> *PR-X12 is a `PredictiveSignal` + `LinearReduce<Basis>` + `CurveOrder<const N>` factorisation that ships ~1.5 KLoC of generic codec glue plus ~200 LoC per domain consumer, compressing four industries' content (video, 3DGS, attention, gradients) through one Lance-backed wire format with a single λ-RDO knob per consumer — the codec is the substrate, everything else is a 200-line renaming, and the bench harness (Plan G) is the falsifiability proof.*
+
+That's the merged claim. Sections 1-11 elaborate, justify, sequence, and document the debt.
+
+---
+
+_Last edit: 2026-05-22 — merged canon from session A (this branch) and session B (PR #195 branch commit `01c77ccc`). Edit this doc when an M:* item resolves, a new merge-only epiphany lands, or a debt item graduates from open → resolved. Renumber only by appending — never reuse a retired ID._

From 4d0ec964038abd8636d84286f9b3d9043d600231 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Fri, 22 May 2026 10:29:15 +0000
Subject: [PATCH 3/6] docs(codec): address PR #196 CodeRabbit + Codex review
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two CodeRabbit MD040/MD027 findings + one Codex P2 stale-fact finding,
all verified valid against current code.

CodeRabbit (markdownlint):
- pr-x12-codec-cognitive-substrate-mapping.md: 4 untyped code fences
  → ```text (Skip/Merge/Delta block at line 56; depth-0..3 quad-tree
  blocks for spatial/attention/gradient hierarchies at lines 80/91/104)
- pr-x12-substrate-merged-canon.md: collapse `>   - ` to `> - `
  (lines 6-7, MD027 multi-space-after-blockquote), 2 untyped fences
  → ```text (architecture-rule list at line 165, sequencing diagram
  at line 411), and "delta IS" → "delta is" stylistic

Codex P2 (stale doc vs shipped code):
- Doc cited `MAX_BASIN_IDX = 4095` and framed BASIN_NONE collision
  as "pending / not yet merged / not yet pushed" — but PR #195 commit
  24232985 already shipped `MAX_BASIN_IDX = 4094` with `BASIN_NONE
  = 4095` reserved sentinel, plus bijective `pack_leaf` via `?`
  operator with 3 regression tests. Updated:
  - §4.1 line 125: `MAX_BASIN_IDX = 4094` + sentinel-range explainer
  - §4.3 heading + body: "(resolved in PR #195)" with commit cite
  - §"Still open on PR #195" block: → "Resolved in PR #195 follow-up"
  - §12 debt-inventory T-1, T-2: marked ~~RESOLVED~~ with commit ref

No content edits beyond what the findings asked for; citation IDs
unchanged.
---
 ...r-x12-codec-cognitive-substrate-mapping.md | 26 +++++++++----------
 .../pr-x12-substrate-merged-canon.md          | 10 +++----
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md b/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md
index 5fefc278..954b28cc 100644
--- a/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md
+++ b/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md
@@ -53,7 +53,7 @@ bit  15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
 
 Two bits = four modes. Not more, not fewer. The four modes form a **strict cost lattice**:
 
-```
+```text
 Skip   (2 bytes total) ⊂  free
 Merge  (3 bytes)       ⊂  borrow from neighbour
 Delta  (3 bytes)       ⊂  store quantized perturbation
@@ -77,7 +77,7 @@ Monotone cost ordering is what lets `predict_intra` use a "first-fit cheapest" d
 
 ### 3.1 Spatial hierarchy (HEVC's original use)
 
-```
+```text
 depth 0 (64×64 CTU)   ↔ one BlockedGrid L1 block (`ctu.rs:236`)
 depth 1 (32×32 split) ↔ one CU at split-level 1
 depth 2 (16×16 split) ↔ ...
@@ -88,7 +88,7 @@ depth 3 (8×8 leaf)    ↔ leaf CU (smallest cognitively-meaningful unit)
 
 ### 3.2 Attention hierarchy (rediscovery as transformer)
 
-```
+```text
 depth 0 ↔ one attention layer
 depth 1 ↔ one attention head (32×32 = 1024 attention slots)
 depth 2 ↔ one multi-query-attention group (4 heads sharing KV)
@@ -101,7 +101,7 @@ Mistral / Llama4 sliding-window attention is **exactly depth-3 leaf processing**
 
 ### 3.3 Gradient-update hierarchy (the optimizer mapping)
 
-```
+```text
 depth 0 ↔ "should this parameter tensor be touched this step?"
 depth 1 ↔ "which block of this tensor needs update?"
 depth 2 ↔ "which row of that block?"
@@ -122,7 +122,7 @@ This is **what DeepSpeed-ZeRO does informally** with `bf16_compress`, `int8_comp
 
 ### 4.1 The 12-bit basin = 4096-entry vocabulary
 
-`MAX_BASIN_IDX = (1 << 12) - 1 = 4095` (`mode.rs:71`). Each `LeafCu` carries a 12-bit index into the per-frame basin codebook. For:
+`MAX_BASIN_IDX = (1 << 12) - 2 = 4094` (`mode.rs:79`), with `BASIN_NONE = 4095` reserved as the absent-basin sentinel — the 12-bit header field encodes `0..=4094` for real basins plus `4095` for "no basin assigned". Each `LeafCu` carries a 12-bit index into the per-frame basin codebook. For:
 
 - **Video**: 4096 palette entries per GOP — orders of magnitude more than HEVC SCC's 64-entry cap
 - **Splats**: 4096 splat archetypes (colour clusters × scale clusters × view-direction clusters) — covers a non-toy scene
@@ -145,9 +145,9 @@ The SCC team had to cap palette at 64 entries rebuilt per-CTU because that was t
 
 **Holy grail claim H-1**: PR-X12 + cam_pq gives you the screen-content video codec HEVC SCC was trying to be in 2013 — without retrofitting, just by composing existing modules. Cite this when somebody asks "why is the basin field 12 bits and not 8 like HEVC SCC".
 
-### 4.3 BASIN_NONE sentinel collision (PR #195 open issue)
+### 4.3 BASIN_NONE sentinel collision (resolved in PR #195)
 
-`BASIN_NONE = MAX_BASIN_IDX = 4095` (`mode.rs:79`) — basin 4095 is ambiguous on the wire (real basin vs "no basin" sentinel). Fix in PR #195: `MAX_BASIN_IDX = 4094`, `BASIN_NONE = 4095`. Costs one codebook entry (irrelevant for k-means usage). Flagged by CodeRabbit, not yet pushed. **Tracked in §12 below.**
+Original bug: `BASIN_NONE = MAX_BASIN_IDX = 4095` (pre-fix `mode.rs`) — basin 4095 was ambiguous on the wire (real basin vs "no basin" sentinel). Shipped fix in PR #195 (commit `24232985`): `MAX_BASIN_IDX = (1 << 12) - 2 = 4094`, `BASIN_NONE = 4095` reserved. Costs one codebook entry (irrelevant for k-means usage). Originally flagged by CodeRabbit; merged.
 
 ---
 
@@ -300,11 +300,11 @@ The mappings above are dense but specific. The holy grail claims are general and
 - ✅ Escape allocator collision (P1) → `Option<&mut u32>` cursor
 - ✅ NESW/NEWS doc mismatch (P1) → explicit slot table
 
-**Still open on PR #195**:
-- 🟡 `BASIN_NONE == MAX_BASIN_IDX == 4095` ambiguity → shrink MAX_BASIN_IDX to 4094
-- 🟡 `pack_leaf` `unwrap_or` fallbacks → switch to `?` operator (non-bijective serialisation)
+**Resolved in PR #195 follow-up (commit `24232985`)**:
+- ✅ `BASIN_NONE == MAX_BASIN_IDX == 4095` ambiguity → `MAX_BASIN_IDX = 4094`, `BASIN_NONE = 4095` reserved
+- ✅ `pack_leaf` `unwrap_or` fallbacks → switched to `?` operator (bijective serialisation; 3 regression tests added)
 
-Track in §12 below.
+Both originally listed in §12 below; entries updated.
 
 ### 10.2 PR-X12 A4 — transform
 
@@ -451,8 +451,8 @@ Track in §12 below.
 - None currently.
 
 **Severity P1** (fix before next-sub-card):
-- *T-1*: `BASIN_NONE == MAX_BASIN_IDX` collision (`mode.rs:79`). Fix: `MAX_BASIN_IDX = 4094, BASIN_NONE = 4095`. Costs 1 codebook entry. **Flagged by CodeRabbit on PR #195, not yet merged.**
-- *T-2*: `pack_leaf` `unwrap_or` fallbacks (`mode.rs:194-210`). Make encode bijective: `leaf.merge_dir?` etc. Malformed `LeafCu` should be a None return, not a silent rewrite. **Flagged by CodeRabbit on PR #195, not yet merged.**
+- ~~*T-1*: `BASIN_NONE == MAX_BASIN_IDX` collision (`mode.rs:79`).~~ **RESOLVED** via PR #195 commit `24232985`: `MAX_BASIN_IDX = 4094, BASIN_NONE = 4095`. Costs 1 codebook entry.
+- ~~*T-2*: `pack_leaf` `unwrap_or` fallbacks (`mode.rs:194-210`).~~ **RESOLVED** via PR #195 commit `24232985`: switched to `?` operator; 3 regression tests added (`leaf_pack_rejects_malformed_{merge,delta,escape}_without_*`).
 
 **Severity P2** (fix in follow-up):
 - *T-3*: A3-intra currently scans NEWS without RDO; replace with λ-weighted RDO when A6 lands. Today's first-fit policy is the right default for λ=0 but suboptimal for typical λ.
diff --git a/.claude/knowledge/pr-x12-substrate-merged-canon.md b/.claude/knowledge/pr-x12-substrate-merged-canon.md
index 7ff8d1ca..b6948cca 100644
--- a/.claude/knowledge/pr-x12-substrate-merged-canon.md
+++ b/.claude/knowledge/pr-x12-substrate-merged-canon.md
@@ -3,8 +3,8 @@
 > Date: 2026-05-22  
 > Status: **MERGED CANON** — synthesises two parallel sessions' findings into one doc  
 > Supersedes (for new content; keep originals for archeology):  
->   - `pr-x12-codec-cognitive-substrate-mapping.md` (session A: opus 4.7 main thread, this branch)  
->   - `pr-x12-cross-domain-synergies.md` (session B: parallel thread, PR #195 branch, commit `01c77ccc`)  
+> - `pr-x12-codec-cognitive-substrate-mapping.md` (session A: opus 4.7 main thread, this branch)
+> - `pr-x12-cross-domain-synergies.md` (session B: parallel thread, merged via PR #195, commit `01c77ccc`)
 > Sister doc: `pr-x12-codec-x265-design.md` (the mechanical spec, untouched)
 
 ---
@@ -162,7 +162,7 @@ A's T-16/T-17 (cross-repo dep direction problem) + B's D-STACK-6/D-STACK-12 (Lan
 The resolution is **already implicit** in the merged claim: after PR-X12 stabilises, extract `crate::hpc::codec::*` into a sibling crate `ndarray-codec`. Both `ndarray` and `lance-graph` then depend on it. The codec lives at the dep-bottom layer not as "ndarray hardware" but as **its own architectural category**.
 
 → Action: add a **fifth category** to the architecture rule in CLAUDE.md:
-```
+```text
 - ndarray = hardware (SIMD, Palette, Base17, SpoDistanceMatrices, read_bgz7_file)
 - ndarray-codec = compression substrate (Ctu, LeafCu, predict_intra, rANS) ← NEW
 - lance-graph = thinking (NarsTruth, NarsEngine, TripleModel, AutocompleteCache)
@@ -300,7 +300,7 @@ Merge of A's H-1..H-7 + B's HG1..HG6 + two new M:H-* claims that emerge from the
 
 **M:H-NEW-1** — The same Rust binary consumes (4K video frames | 1M-Gaussian 3DGS scene | 7B-LLM gradient stream | attention KV cache) and emits a compressed Lance column. One CLI. One codec. Four loads. **This is the falsifiability test** — build it (Plan G, the bench harness), prove HG1/H-7 by demonstration, not by argument.
 
-**M:H-NEW-2** — `trait PredictiveSignal` + `trait LinearReduce<Basis>` + `trait CurveOrder<const N: usize>` factor the codec into three plug-points (per M:E-E + M:E-A + M:E-B). The codec body is `<150 LoC of generic glue. Domain consumers ship `<200 LoC` of trait impls. **Total stack for all four industries: ~2 KLoC.** Compared to ~50 KLoC per-domain implementations elsewhere. The 25× code-density delta IS the architectural payoff that justifies the eight sub-cards.
+**M:H-NEW-2** — `trait PredictiveSignal` + `trait LinearReduce<Basis>` + `trait CurveOrder<const N: usize>` factor the codec into three plug-points (per M:E-E + M:E-A + M:E-B). The codec body is `<150 LoC of generic glue. Domain consumers ship `<200 LoC` of trait impls. **Total stack for all four industries: ~2 KLoC.** Compared to ~50 KLoC per-domain implementations elsewhere. The 25× code-density delta is the architectural payoff that justifies the eight sub-cards.
 
 ---
 
@@ -408,7 +408,7 @@ Replaces both A:§10 and B:§5 plan lists. Critical path resolved per M:E-F.
 
 ## 6. Sequencing diagram
 
-```
+```text
               ┌──────────────────────────────────────┐
               │   Plan G (multi-domain bench)        │
               │   2 weeks — UNFALSIFIABILITY GATE    │

From a075e127e7904bd5aa2f1a175f7ce0cf2cffa989 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Fri, 22 May 2026 11:10:51 +0000
Subject: [PATCH 4/6] =?UTF-8?q?fix(codec):=20delete=20BASIN=5FNONE=20senti?=
 =?UTF-8?q?nel=20=E2=80=94=20restore=20full=204096-entry=20HHTL=20codebook?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PR #195's follow-up commit 24232985 resolved the BASIN_NONE/MAX_BASIN_IDX
collision by shrinking MAX_BASIN_IDX to 4094 with BASIN_NONE = 4095 as a
distinct sentinel. The shrink was the wrong fix — the sentinel itself
was the bug.

Reference (the authoritative source the code drifted from):

  src/hpc/ogit_bridge/assets/cognitive/entities/Leaf.ttl — the HHTL
  ontology defines the codebook as 16 Hips × 16 Twigs × 16 Leaves =
  4096 Leaves per Heel, every Leaf carrying a real basinSignature.
  No slot is reserved for absence; the ontology forbids "no basin".
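
As a rough illustration of the 16 × 16 × 16 arithmetic, the sketch below
splits a 12-bit basin index into three 4-bit coordinates and joins them
back. The nibble order (Hip in the high nibble, Leaf in the low nibble)
and the names `HhtlAddr`, `split_basin`, and `join_basin` are assumptions
made for illustration; they are not taken from Leaf.ttl or from the codec.

```rust
/// Hypothetical view of a 12-bit basin index as three 4-bit coordinates.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct HhtlAddr {
    pub hip: u8,
    pub twig: u8,
    pub leaf: u8,
}

/// Highest basin index, same value as `MAX_BASIN_IDX` in mode.rs.
pub const MAX_BASIN_IDX: u16 = (1 << 12) - 1; // 4095

/// Split a basin index into (hip, twig, leaf).
/// Returns `None` if the value does not fit in 12 bits.
pub fn split_basin(idx: u16) -> Option<HhtlAddr> {
    if idx > MAX_BASIN_IDX {
        return None;
    }
    Some(HhtlAddr {
        hip: ((idx >> 8) & 0xF) as u8,  // assumed: high nibble
        twig: ((idx >> 4) & 0xF) as u8, // assumed: middle nibble
        leaf: (idx & 0xF) as u8,        // assumed: low nibble
    })
}

/// Inverse of `split_basin`. Returns `None` if any coordinate exceeds 15.
pub fn join_basin(addr: HhtlAddr) -> Option<u16> {
    if addr.hip > 15 || addr.twig > 15 || addr.leaf > 15 {
        return None;
    }
    Some(((addr.hip as u16) << 8) | ((addr.twig as u16) << 4) | (addr.leaf as u16))
}
```

Under these assumptions all 4096 indices map to distinct coordinates, so
no index is left over to stand for absence.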

What BASIN_NONE actually was: authoring-time epistemic uncertainty
("encoder hasn't decided yet for this cell") smuggled into wire-format
ontological uncertainty ("this cell exists but has no basin"). A
category error — Rust's idiom would have surfaced this as Option<u16>
in the encoder's transient scratch state and a plain u16 on the
persistent record. The sentinel value bypassed that hygiene by hiding
optionality inside a magic u16, invisible to the typechecker.
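
The two-types-for-two-lifecycles idea can be sketched as follows. This is
a minimal illustration, not the codec's actual encoder state: the names
`ScratchLeaf`, `WireLeaf`, and `commit` are hypothetical, and the range
check against `MAX_BASIN_IDX` is an assumption about where validation
would live.

```rust
/// Highest basin index, same value as `MAX_BASIN_IDX` in mode.rs.
pub const MAX_BASIN_IDX: u16 = (1 << 12) - 1; // 4095

/// Hypothetical encoder scratch state: the basin may still be undecided.
#[derive(Debug, Default, Clone, Copy)]
pub struct ScratchLeaf {
    pub basin: Option<u16>, // None = encoder has not decided yet
}

/// Hypothetical persistent record: the basin is always a real index.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct WireLeaf {
    pub basin: u16,
}

impl ScratchLeaf {
    /// Convert scratch state into a wire record.
    /// Fails while the basin is undecided or outside the 12-bit range,
    /// so an undecided leaf cannot be serialised by accident.
    pub fn commit(self) -> Option<WireLeaf> {
        let basin = self.basin?;
        (basin <= MAX_BASIN_IDX).then_some(WireLeaf { basin })
    }
}
```

Because `WireLeaf` has no optional field, the typechecker rejects any code
path that tries to persist an undecided basin; no magic value is needed.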

Confirmation that the sentinel was load-bearing for nothing: full audit
of BASIN_NONE producers across src/ found zero — no LeafCu::skip(...),
no basin_idx: BASIN_NONE anywhere. The only references were the const
definition, a re-export, one regression test that constructed it, and
predict.rs:is_no_basin (a predicate with no callers that emit
BASIN_NONE leaves). Defensive infrastructure for a feature the
ontology forbids and the implementation never used.

Changes:

- mode.rs: MAX_BASIN_IDX = (1 << 12) - 1 = 4095 (full 12-bit range,
  4096 real basins). BASIN_NONE deleted. The 2 sentinel-distinctness
  regression tests replaced with a full-range assertion test and a
  plain MAX_BASIN_IDX header round-trip test.
  Docs reframed around the HHTL ontology + the Option<u16>
  scratch-state contract.
- predict.rs: is_no_basin deleted; its only doctest+test deleted.
  BASIN_NONE import dropped.
- mod.rs: BASIN_NONE and is_no_basin removed from re-exports.
- pr-x12-codec-cognitive-substrate-mapping.md §4.1/§4.3/§10.1/§12:
  framing updated. §4.3 now records the full arc (collision → shrink
  → delete) as design archeology rather than as an open issue.
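
To show what "full 12-bit range" means for the header, here is a small
round-trip sketch. It is not the real `pack_header` / `unpack_header`:
the names `HeaderMode`, `pack_header_sketch`, and `unpack_header_sketch`
are hypothetical, and both the mode numbering and the position of the two
mode bits (directly above the basin field) are assumptions. Only the
12-bit basin mask `0x0FFF` comes from mode.rs.

```rust
/// Hypothetical numbering of the four modes (assumed, not from mode.rs).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HeaderMode {
    Skip = 0,
    Merge = 1,
    Delta = 2,
    Escape = 3,
}

/// 12-bit basin field mask, as in mode.rs.
const BASIN_FIELD_MASK: u16 = 0x0FFF;
/// Assumed position of the 2-bit mode field.
const MODE_SHIFT: u16 = 12;

/// Pack (mode, basin) into a 16-bit header.
/// Returns `None` if the basin does not fit in 12 bits.
pub fn pack_header_sketch(mode: HeaderMode, basin: u16) -> Option<u16> {
    if basin > BASIN_FIELD_MASK {
        return None;
    }
    Some(((mode as u16) << MODE_SHIFT) | basin)
}

/// Unpack a header produced by `pack_header_sketch`.
pub fn unpack_header_sketch(header: u16) -> (HeaderMode, u16) {
    let mode = match (header >> MODE_SHIFT) & 0b11 {
        0 => HeaderMode::Skip,
        1 => HeaderMode::Merge,
        2 => HeaderMode::Delta,
        _ => HeaderMode::Escape,
    };
    (mode, header & BASIN_FIELD_MASK)
}
```

In this sketch every basin value from 0 through 4095 round-trips under
every mode, and no value needs special-casing on decode.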

Validation:
- cargo check --features codec --lib → clean
- cargo clippy --features codec --lib -- -D warnings → clean
- cargo test --features codec --lib hpc::codec → 54 tests pass
- cargo test --features codec --doc hpc::codec → 13 doctests pass

Net diff: -77, +38 lines. Pure subtraction modulo doc reframe.
---
 ...r-x12-codec-cognitive-substrate-mapping.md | 21 ++++--
 src/hpc/codec/mod.rs                          |  4 +-
 src/hpc/codec/mode.rs                         | 67 ++++++-------------
 src/hpc/codec/predict.rs                      | 23 -------
 4 files changed, 38 insertions(+), 77 deletions(-)

diff --git a/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md b/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md
index 954b28cc..7a3bad7f 100644
--- a/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md
+++ b/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md
@@ -122,7 +122,7 @@ This is **what DeepSpeed-ZeRO does informally** with `bf16_compress`, `int8_comp
 
 ### 4.1 The 12-bit basin = 4096-entry vocabulary
 
-`MAX_BASIN_IDX = (1 << 12) - 2 = 4094` (`mode.rs:79`), with `BASIN_NONE = 4095` reserved as the absent-basin sentinel — the 12-bit header field encodes `0..=4094` for real basins plus `4095` for "no basin assigned". Each `LeafCu` carries a 12-bit index into the per-frame basin codebook. For:
+`MAX_BASIN_IDX = (1 << 12) - 1 = 4095` (`mode.rs:79`). The full 12-bit range addresses 4096 real basins — every `LeafCu` carries an index into a fully-populated per-Heel codebook. No slot is reserved as a sentinel: the HHTL ontology (`Heel > Hip > Twig > Leaf`, see `src/hpc/ogit_bridge/assets/cognitive/entities/Leaf.ttl`) defines the codebook as `16 Hips × 16 Twigs × 16 Leaves = 4096 Leaves per Heel`, every Leaf carrying a real `basinSignature`. Authoring-time uncertainty ("not yet decided") stays in the encoder's `Option<u16>` scratch state and never leaks onto the wire. For:
 
 - **Video**: 4096 palette entries per GOP — orders of magnitude more than HEVC SCC's 64-entry cap
 - **Splats**: 4096 splat archetypes (colour clusters × scale clusters × view-direction clusters) — covers a non-toy scene
@@ -145,9 +145,16 @@ The SCC team had to cap palette at 64 entries rebuilt per-CTU because that was t
 
 **Holy grail claim H-1**: PR-X12 + cam_pq gives you the screen-content video codec HEVC SCC was trying to be in 2013 — without retrofitting, just by composing existing modules. Cite this when somebody asks "why is the basin field 12 bits and not 8 like HEVC SCC".
 
-### 4.3 BASIN_NONE sentinel collision (resolved in PR #195)
+### 4.3 BASIN_NONE — the sentinel that shouldn't have existed
 
-Original bug: `BASIN_NONE = MAX_BASIN_IDX = 4095` (pre-fix `mode.rs`) — basin 4095 was ambiguous on the wire (real basin vs "no basin" sentinel). Shipped fix in PR #195 (commit `24232985`): `MAX_BASIN_IDX = (1 << 12) - 2 = 4094`, `BASIN_NONE = 4095` reserved. Costs one codebook entry (irrelevant for k-means usage). Originally flagged by CodeRabbit; merged.
+Arc summary (for archeology):
+
+- **Original state** (PR #195 first push): `BASIN_NONE = MAX_BASIN_IDX = 4095` — same value used both as "max valid basin" and "no basin assigned" sentinel. Ambiguous on the wire. CodeRabbit flagged.
+- **First fix** (PR #195 follow-up commit `24232985`): shrunk `MAX_BASIN_IDX` to 4094, kept `BASIN_NONE = 4095` distinct. Resolved the ambiguity but sacrificed one codebook entry and propagated the sentinel into the wire format.
+- **Reference check**: the HHTL ontology (`Leaf.ttl`) requires 4096 real Leaves per Heel, every Leaf a real basin. The sentinel itself was the wrong design, not just its value.
+- **Final fix**: `BASIN_NONE` and `is_no_basin` deleted entirely; `MAX_BASIN_IDX` restored to `(1 << 12) - 1 = 4095`. Authoring-time "not yet decided" uncertainty stays as an `Option<u16>` in encoder scratch state and is resolved before any leaf reaches the wire. The wire format only ever sees committed basins.
+
+**The deeper read**: `BASIN_NONE` was authoring-time epistemic uncertainty (encoder mid-decision) smuggled into wire-format ontological uncertainty (cell exists but has no basin) — a category error. Rust's idiom would have surfaced this as `Option<u16>` in the encoder's transient state and `u16` on the persistent record. The sentinel value bypassed that hygiene by hiding optionality inside a magic `u16` value. The cleanup restores the two-types-for-two-lifecycles separation.
 
 ---
 
@@ -300,9 +307,9 @@ The mappings above are dense but specific. The holy grail claims are general and
 - ✅ Escape allocator collision (P1) → `Option<&mut u32>` cursor
 - ✅ NESW/NEWS doc mismatch (P1) → explicit slot table
 
-**Resolved in PR #195 follow-up (commit `24232985`)**:
-- ✅ `BASIN_NONE == MAX_BASIN_IDX == 4095` ambiguity → `MAX_BASIN_IDX = 4094`, `BASIN_NONE = 4095` reserved
-- ✅ `pack_leaf` `unwrap_or` fallbacks → switched to `?` operator (bijective serialisation; 3 regression tests added)
+**Resolved across PR #195 + follow-up**:
+- ✅ `pack_leaf` `unwrap_or` fallbacks (PR #195 commit `24232985`) → switched to `?` operator (bijective serialisation; 3 regression tests added)
+- ✅ `BASIN_NONE` sentinel removed entirely (follow-up to PR #195): `MAX_BASIN_IDX = 4095` restores full 4096-entry HHTL codebook; authoring-time uncertainty lives in encoder `Option<u16>` scratch, not the wire format. See §4.3 for the arc.
 
 Both originally listed in §12 below; entries updated.
 
@@ -451,7 +458,7 @@ Both originally listed in §12 below; entries updated.
 - None currently.
 
 **Severity P1** (fix before next-sub-card):
-- ~~*T-1*: `BASIN_NONE == MAX_BASIN_IDX` collision (`mode.rs:79`).~~ **RESOLVED** via PR #195 commit `24232985`: `MAX_BASIN_IDX = 4094, BASIN_NONE = 4095`. Costs 1 codebook entry.
+- ~~*T-1*: `BASIN_NONE == MAX_BASIN_IDX` collision.~~ **RESOLVED** via two-step landing: PR #195 commit `24232985` shrunk `MAX_BASIN_IDX` to 4094 with `BASIN_NONE = 4095` as a distinct sentinel; the follow-up cleanup deleted `BASIN_NONE` and `is_no_basin` entirely and restored `MAX_BASIN_IDX = 4095`. Reference: the HHTL ontology (`Leaf.ttl`) defines 4096 real Leaves per Heel — no slot reserved for absence. See §4.3.
 - ~~*T-2*: `pack_leaf` `unwrap_or` fallbacks (`mode.rs:194-210`).~~ **RESOLVED** via PR #195 commit `24232985`: switched to `?` operator; 3 regression tests added (`leaf_pack_rejects_malformed_{merge,delta,escape}_without_*`).
 
 **Severity P2** (fix in follow-up):
diff --git a/src/hpc/codec/mod.rs b/src/hpc/codec/mod.rs
index 2f23c294..9a46a24a 100644
--- a/src/hpc/codec/mod.rs
+++ b/src/hpc/codec/mod.rs
@@ -32,7 +32,7 @@ pub mod predict;
 pub use ctu::{CellMode, MergeDir, MAX_QUAD_TREE_NODES, MAX_SPLIT_DEPTH};
 pub use ctu::{Ctu, CtuArena, CtuPartition, LeafCu, MaxSplitDepthReached, MergeError, NodeIdx};
 pub use mode::{
-    pack_header, pack_leaf, pack_merge_dir, packed_byte_len, unpack_header, unpack_leaf, unpack_merge_dir, BASIN_NONE,
+    pack_header, pack_leaf, pack_merge_dir, packed_byte_len, unpack_header, unpack_leaf, unpack_merge_dir,
     MAX_BASIN_IDX,
 };
-pub use predict::{is_no_basin, predict_intra, IntraConfig, IntraContext};
+pub use predict::{predict_intra, IntraConfig, IntraContext};
diff --git a/src/hpc/codec/mode.rs b/src/hpc/codec/mode.rs
index 4d756812..c21b5528 100644
--- a/src/hpc/codec/mode.rs
+++ b/src/hpc/codec/mode.rs
@@ -59,40 +59,25 @@ use super::ctu::{CellMode, LeafCu, MergeDir};
 // Header pack / unpack (16-bit)
 // ════════════════════════════════════════════════════════════════════
 
-/// Maximum encodable real `basin_idx`. Equal to `(1 << 12) - 2 = 4094`
-/// so that the all-ones 12-bit pattern (`0xFFF = 4095`) is reserved as
-/// the [`BASIN_NONE`] sentinel — without that reservation, basin 4095
-/// would round-trip ambiguously with "no basin assigned".
+/// Maximum encodable `basin_idx`. Equal to `(1 << 12) - 1 = 4095`,
+/// the full 12-bit range. Every value `0..=MAX_BASIN_IDX` addresses a
+/// real basin in the per-Heel codebook — there is no reserved sentinel.
 ///
-/// The on-wire 12-bit field still holds any value `0..=0xFFF`; only the
-/// encoder's *valid-basin* range is restricted to `0..=MAX_BASIN_IDX`.
-/// [`BASIN_NONE`] is encodable in the header field too (when an encoder
-/// emits a "no basin" record), but it must never appear as a real basin
-/// codebook index.
+/// The HHTL ontology (`Heel > Hip > Twig > Leaf`, see the entity TTLs
+/// under `src/hpc/ogit_bridge/assets/cognitive/entities/`) defines the
+/// codebook as `16 Hips × 16 Twigs × 16 Leaves = 4096 Leaves per Heel`,
+/// every Leaf carrying a real `basinSignature`. Absence is not a state:
+/// authoring-time "not yet decided" lives in the encoder's `Option<u16>`
+/// scratch state, never on the wire.
 ///
 /// ```
-/// use ndarray::hpc::codec::{BASIN_NONE, MAX_BASIN_IDX};
-/// assert_eq!(MAX_BASIN_IDX, (1 << 12) - 2);
-/// assert_eq!(MAX_BASIN_IDX, 4094);
-/// assert!(MAX_BASIN_IDX < BASIN_NONE);
+/// use ndarray::hpc::codec::MAX_BASIN_IDX;
+/// assert_eq!(MAX_BASIN_IDX, (1 << 12) - 1);
+/// assert_eq!(MAX_BASIN_IDX, 4095);
 /// ```
-pub const MAX_BASIN_IDX: u16 = (1 << 12) - 2; // 4094
-
-/// Tag inside the per-frame basin codebook for "no basin assigned"
-/// (encoder-side sentinel during mode decision). Equal to `0xFFF`
-/// (the all-ones 12-bit pattern) so it sits one slot above the highest
-/// real basin index ([`MAX_BASIN_IDX`]).
-///
-/// ```
-/// use ndarray::hpc::codec::{BASIN_NONE, MAX_BASIN_IDX};
-/// assert_eq!(BASIN_NONE, 4095);
-/// assert_eq!(BASIN_NONE, MAX_BASIN_IDX + 1);
-/// ```
-pub const BASIN_NONE: u16 = (1 << 12) - 1;
+pub const MAX_BASIN_IDX: u16 = (1 << 12) - 1; // 4095
 
 /// Private: 12-bit mask for the basin field of the packed header.
-/// Independent of [`MAX_BASIN_IDX`] so that [`BASIN_NONE`] (which sits
-/// in the 12-bit field but is not a real basin) still round-trips.
 const BASIN_FIELD_MASK: u16 = 0x0FFF;
 
 /// Pack `(mode, basin_idx)` into a 16-bit header.
@@ -442,26 +427,18 @@ mod tests {
     }
 
     #[test]
-    fn basin_none_distinct_from_max_basin_idx() {
-        // Regression for the BASIN_NONE/MAX_BASIN_IDX collision: the
-        // sentinel must sit one slot above the highest real basin so
-        // basin 4094 is unambiguously "a real basin" and 4095 is
-        // unambiguously "no basin assigned".
-        assert_eq!(MAX_BASIN_IDX, 4094);
-        assert_eq!(BASIN_NONE, 4095);
-        assert!(MAX_BASIN_IDX < BASIN_NONE);
+    fn max_basin_idx_fills_full_12bit_range() {
+        // The codec follows the HHTL ontology: 16 × 16 × 16 = 4096 Leaves
+        // per Heel, every value `0..=MAX_BASIN_IDX` is a real basin. No
+        // sentinel slot is reserved.
+        assert_eq!(MAX_BASIN_IDX, (1 << 12) - 1);
+        assert_eq!(MAX_BASIN_IDX, 4095);
     }
 
     #[test]
-    fn header_round_trips_max_basin_idx_and_basin_none_distinctly() {
-        // Both values fit in the 12-bit field; the encoder treats them
-        // as different. (Decoders that route on BASIN_NONE need to
-        // compare against the sentinel explicitly.)
-        let real = pack_header(CellMode::Skip, MAX_BASIN_IDX);
-        let none = pack_header(CellMode::Skip, BASIN_NONE);
-        assert_ne!(real, none);
-        assert_eq!(unpack_header(real), (CellMode::Skip, MAX_BASIN_IDX));
-        assert_eq!(unpack_header(none), (CellMode::Skip, BASIN_NONE));
+    fn header_round_trips_max_basin_idx() {
+        let h = pack_header(CellMode::Skip, MAX_BASIN_IDX);
+        assert_eq!(unpack_header(h), (CellMode::Skip, MAX_BASIN_IDX));
     }
 
     #[test]
diff --git a/src/hpc/codec/predict.rs b/src/hpc/codec/predict.rs
index 3014cb8a..83e6b65d 100644
--- a/src/hpc/codec/predict.rs
+++ b/src/hpc/codec/predict.rs
@@ -58,7 +58,6 @@
 //!   reference + reconstruction parity test pin the math.
 
 use super::ctu::{CellMode, LeafCu, MergeDir};
-use super::mode::BASIN_NONE;
 
 // ════════════════════════════════════════════════════════════════════
 // Inputs to the encoder mode decision
@@ -293,21 +292,6 @@ fn merge_dir_from_index(i: usize) -> MergeDir {
     }
 }
 
-/// Sanity-check sentinel: returns `true` iff the resolved basin index
-/// is the "no basin" marker. Encoders that compute basins lazily can
-/// short-circuit Skip/Merge/Delta and emit Escape directly when this
-/// fires.
-///
-/// ```
-/// use ndarray::hpc::codec::{is_no_basin, BASIN_NONE};
-/// assert!(is_no_basin(BASIN_NONE));
-/// assert!(!is_no_basin(0));
-/// ```
-#[inline]
-pub fn is_no_basin(basin_idx: u16) -> bool {
-    basin_idx == BASIN_NONE
-}
-
 // ════════════════════════════════════════════════════════════════════
 // Tests
 // ════════════════════════════════════════════════════════════════════
@@ -478,13 +462,6 @@ mod tests {
         assert_eq!(decoded, leaf);
     }
 
-    #[test]
-    fn is_no_basin_sentinel_round_trip() {
-        assert!(is_no_basin(BASIN_NONE));
-        assert!(!is_no_basin(0));
-        assert!(!is_no_basin(100));
-    }
-
     #[test]
     fn overflow_delta_does_not_alias_to_merge() {
         // Regression for the wrapping-cast Merge alias bug:

From 7fdcfb56ff633f1342443006c084279d82bc0165 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Fri, 22 May 2026 11:25:08 +0000
Subject: [PATCH 5/6] docs(codec): tighten basin-index wording to the HHTL
 ontology statement
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Drops the forensic call-outs and walks the docs back to the positive
design statement: the codebook is 16 × 16 × 16 = 4096 Leaves per Heel
(per Leaf.ttl), MAX_BASIN_IDX = 4095 addresses the full range. No
narrative arc; no archeology subsection.
---
 ...r-x12-codec-cognitive-substrate-mapping.md | 19 +++---------------
 src/hpc/codec/mode.rs                         | 20 +++++++++----------
 2 files changed, 12 insertions(+), 27 deletions(-)

diff --git a/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md b/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md
index 7a3bad7f..0c60e841 100644
--- a/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md
+++ b/.claude/knowledge/pr-x12-codec-cognitive-substrate-mapping.md
@@ -145,17 +145,6 @@ The SCC team had to cap palette at 64 entries rebuilt per-CTU because that was t
 
 **Holy grail claim H-1**: PR-X12 + cam_pq gives you the screen-content video codec HEVC SCC was trying to be in 2013 — without retrofitting, just by composing existing modules. Cite this when somebody asks "why is the basin field 12 bits and not 8 like HEVC SCC".
 
-### 4.3 BASIN_NONE — the sentinel that shouldn't have existed
-
-Arc summary (for archeology):
-
-- **Original state** (PR #195 first push): `BASIN_NONE = MAX_BASIN_IDX = 4095` — same value used both as "max valid basin" and "no basin assigned" sentinel. Ambiguous on the wire. CodeRabbit flagged.
-- **First fix** (PR #195 follow-up commit `24232985`): shrunk `MAX_BASIN_IDX` to 4094, kept `BASIN_NONE = 4095` distinct. Resolved the ambiguity but sacrificed one codebook entry and propagated the sentinel into the wire format.
-- **Reference check**: the HHTL ontology (`Leaf.ttl`) requires 4096 real Leaves per Heel, every Leaf a real basin. The sentinel itself was the wrong design, not just its value.
-- **Final fix**: `BASIN_NONE` and `is_no_basin` deleted entirely; `MAX_BASIN_IDX` restored to `(1 << 12) - 1 = 4095`. Authoring-time "not yet decided" uncertainty stays as an `Option<u16>` in encoder scratch state and is resolved before any leaf reaches the wire. The wire format only ever sees committed basins.
-
-**The deeper read**: `BASIN_NONE` was authoring-time epistemic uncertainty (encoder mid-decision) smuggled into wire-format ontological uncertainty (cell exists but has no basin) — a category error. Rust's idiom would have surfaced this as `Option<u16>` in the encoder's transient state and `u16` on the persistent record. The sentinel value bypassed that hygiene by hiding optionality inside a magic `u16` value. The cleanup restores the two-types-for-two-lifecycles separation.
-
 ---
 
 ## 5. Transform basis — DCT-II ↔ optimizer preconditioner ↔ wavelet ↔ learned
@@ -307,11 +296,10 @@ The mappings above are dense but specific. The holy grail claims are general and
 - ✅ Escape allocator collision (P1) → `Option<&mut u32>` cursor
 - ✅ NESW/NEWS doc mismatch (P1) → explicit slot table
 
-**Resolved across PR #195 + follow-up**:
-- ✅ `pack_leaf` `unwrap_or` fallbacks (PR #195 commit `24232985`) → switched to `?` operator (bijective serialisation; 3 regression tests added)
-- ✅ `BASIN_NONE` sentinel removed entirely (follow-up to PR #195): `MAX_BASIN_IDX = 4095` restores full 4096-entry HHTL codebook; authoring-time uncertainty lives in encoder `Option<u16>` scratch, not the wire format. See §4.3 for the arc.
+**Resolved in PR #195 follow-up (commit `24232985`)**:
+- ✅ `pack_leaf` `unwrap_or` fallbacks → switched to `?` operator (bijective serialisation; 3 regression tests added)
 
-Both originally listed in §12 below; entries updated.
+Originally listed in §12 below; entry updated.
 
 ### 10.2 PR-X12 A4 — transform
 
@@ -458,7 +446,6 @@ Both originally listed in §12 below; entries updated.
 - None currently.
 
 **Severity P1** (fix before next-sub-card):
-- ~~*T-1*: `BASIN_NONE == MAX_BASIN_IDX` collision.~~ **RESOLVED** via two-step landing: PR #195 commit `24232985` shrunk `MAX_BASIN_IDX` to 4094 with `BASIN_NONE = 4095` as a distinct sentinel; the follow-up cleanup deleted `BASIN_NONE` and `is_no_basin` entirely and restored `MAX_BASIN_IDX = 4095`. Reference: the HHTL ontology (`Leaf.ttl`) defines 4096 real Leaves per Heel — no slot reserved for absence. See §4.3.
 - ~~*T-2*: `pack_leaf` `unwrap_or` fallbacks (`mode.rs:194-210`).~~ **RESOLVED** via PR #195 commit `24232985`: switched to `?` operator; 3 regression tests added (`leaf_pack_rejects_malformed_{merge,delta,escape}_without_*`).
 
 **Severity P2** (fix in follow-up):
diff --git a/src/hpc/codec/mode.rs b/src/hpc/codec/mode.rs
index c21b5528..2b0aceef 100644
--- a/src/hpc/codec/mode.rs
+++ b/src/hpc/codec/mode.rs
@@ -60,15 +60,13 @@ use super::ctu::{CellMode, LeafCu, MergeDir};
 // ════════════════════════════════════════════════════════════════════
 
 /// Maximum encodable `basin_idx`. Equal to `(1 << 12) - 1 = 4095`,
-/// the full 12-bit range. Every value `0..=MAX_BASIN_IDX` addresses a
-/// real basin in the per-Heel codebook — there is no reserved sentinel.
+/// the full 12-bit range.
 ///
-/// The HHTL ontology (`Heel > Hip > Twig > Leaf`, see the entity TTLs
-/// under `src/hpc/ogit_bridge/assets/cognitive/entities/`) defines the
-/// codebook as `16 Hips × 16 Twigs × 16 Leaves = 4096 Leaves per Heel`,
-/// every Leaf carrying a real `basinSignature`. Absence is not a state:
-/// authoring-time "not yet decided" lives in the encoder's `Option<u16>`
-/// scratch state, never on the wire.
+/// The codebook follows the HHTL ontology (`Heel > Hip > Twig > Leaf`,
+/// see the entity TTLs under `src/hpc/ogit_bridge/assets/cognitive/entities/`):
+/// `16 Hips × 16 Twigs × 16 Leaves = 4096 Leaves per Heel`. Every value
+/// `0..=MAX_BASIN_IDX` addresses one Leaf, each carrying a real
+/// `basinSignature`.
 ///
 /// ```
 /// use ndarray::hpc::codec::MAX_BASIN_IDX;
@@ -428,9 +426,9 @@ mod tests {
 
     #[test]
     fn max_basin_idx_fills_full_12bit_range() {
-        // The codec follows the HHTL ontology: 16 × 16 × 16 = 4096 Leaves
-        // per Heel, every value `0..=MAX_BASIN_IDX` is a real basin. No
-        // sentinel slot is reserved.
+        // 16 Hips × 16 Twigs × 16 Leaves = 4096 Leaves per Heel
+        // (HHTL ontology, see Leaf.ttl). Every value 0..=MAX_BASIN_IDX
+        // addresses one Leaf.
         assert_eq!(MAX_BASIN_IDX, (1 << 12) - 1);
         assert_eq!(MAX_BASIN_IDX, 4095);
     }

From d272fea713110da25ea922fd827545f1175e7f24 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Fri, 22 May 2026 11:32:02 +0000
Subject: [PATCH 6/6] docs(codec): record the no-sentinel invariant as a
 maintainer comment
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two-layer rule sitting above MAX_BASIN_IDX:
  - encoder scratch state  → Option<u16>
  - wire format / persisted → plain u16, every value real

So the next editor doesn't reach for a magic value when what they
actually want is type-level optionality.
---
 src/hpc/codec/mode.rs | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/src/hpc/codec/mode.rs b/src/hpc/codec/mode.rs
index 2b0aceef..9dbe894f 100644
--- a/src/hpc/codec/mode.rs
+++ b/src/hpc/codec/mode.rs
@@ -59,6 +59,24 @@ use super::ctu::{CellMode, LeafCu, MergeDir};
 // Header pack / unpack (16-bit)
 // ════════════════════════════════════════════════════════════════════
 
+// ── Design invariant ────────────────────────────────────────────────
+//
+// The 12-bit basin field addresses exactly one of 4096 real Leaves
+// (HHTL ontology, see `src/hpc/ogit_bridge/assets/cognitive/entities/Leaf.ttl`:
+// 16 Hips × 16 Twigs × 16 Leaves per Heel, every Leaf carries a real
+// `basinSignature`). Do NOT reserve a value as a "no basin" /
+// "not yet decided" sentinel.
+//
+// Rationale: that's authoring-time uncertainty (encoder mid-decision)
+// leaking into wire-format ontological state. Keep optionality in the
+// type, not in a magic value:
+//
+//   - encoder scratch state → `Option<u16>` (Some = committed, None = TBD)
+//   - persisted / wire-format → plain `u16`, every value real
+//
+// The doubt must collapse to Some(basin) before the leaf is packed.
+// Once a leaf reaches the wire it has a basin, period.
+
 /// Maximum encodable `basin_idx`. Equal to `(1 << 12) - 1 = 4095`,
 /// the full 12-bit range.
 ///