docs(pr-x12): close 6 CodeRabbit nitpicks left open at PR #197 merge

claude · claude · commit 6f96a1472b2a · 2026-05-22T17:39:01.000Z
Per other-session feedback: six nitpick-level findings on PR #197 didn't block the merge but stayed unaddressed. Folding them into this PR.

1. GGUF Escape forward-ref to F-4
   gguf-llm-weights-encoding.md §2.4 said "Escape must be lossless ... This is an additional R-N candidate" with no pointer. F-4 in §10 already explains the mechanism (rANS bypass channel in A8, HEVC escape-coefficient precedent). Added an inline cross-ref so readers don't have to scroll to find the resolution.

2. Phone-class viability overclaim re KV cache
   gguf-llm-weights-encoding.md line 269 claimed "7B at PR-X12 is genuinely runnable on a phone-class device". Weight compression alone takes 7B from 4 GB to 3 GB, but KV cache at 8K context is ~1-2 GB independent of weight compression. Qualified the claim: PR-X12 weights are necessary but not sufficient; KV-cache lane (Plan D, M:H-3, R-4) is the second lever for full phone viability.

3. EncodingDomain::LLMWeights timing
   §11 implication #2 says "LLM lane lands post-PR-X12, but the harness must be lane-extensible"; implication #5 said "Reserve an EncodingDomain::LLMWeights discriminant ... now". Clarified: PR-X12 reserves the enum-discriminant *slot* now (forward-compat lock); the LLMWeights variant + decoder land post-PR-X12 without a wire-format break.

4. Per-arch `match Arch::CURRENT` const-eval
   woa-multiarch-orchestration.md §3.3's `const DCT_BATCH_CROSSOVER = match Arch::CURRENT { ... }` does not compile under stable Rust const-eval: `Arch::CURRENT` would need to be a const, and architecture-conditional const matches require build-script-emitted integers or `cfg!(target_feature = ...)`. Rewrote as pseudocode pointing at a `build.rs` mechanism (decision matrix → `OUT_DIR` generated const) with a `cfg!()` fallback shape.

5. G-8 / G-9 numbering collision
   cam-pq-sigker-dn-tree-substrate-bindings.md §5 labelled bgz-jc's two prior gaps as G-8 / G-9 (continuing cam-pq's own G-1..G-7), but bgz-jc-substrate-synergies.md §5 didn't use any G-N IDs, so the cross-doc reference was dangling and the namespace was implicitly shared without rules. Gave bgz-jc §5.1 / §5.2 explicit IDs G-1 / G-2 (canonical to that doc) and updated cam-pq to cite them as "bgz-jc G-1" / "bgz-jc G-2" with a namespace-isolation note.

6. "landed" terminology in x266 §8 prerequisites table
   The status column claimed "landed" / "landed in concept" for R-1 trait shape, R-2 header bytes, and R-13 federated codebook policy. None of these have shipping code: they are canon-fixed (the resolution doc commits the design) but implementation is scheduled in Plan A4 / A8 / F. Renamed status to "canon-fixed" + a glossary line distinguishing "canon-fixed" (doc commitment) from "scheduled" (plan card exists) from shipping code.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
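
For reference, the `cfg!()` fallback shape from item 4 can be sketched as compilable Rust. This is a minimal illustration, not the code in the doc: `Path` and `select_path` are hypothetical names introduced here, the two kernels are placeholder copies, and the thresholds (32, 128) are simply borrowed from the doc's decision matrix rather than bench-derived. It relies only on the fact that `cfg!` expands to a compile-time `bool`, so the `if`/`else` chain is a valid `const` initializer on stable Rust.

```rust
// Hypothetical sketch: names and thresholds are illustrative only.

/// Crossover picked at compile time from the features enabled for the
/// compilation target (not from runtime CPU detection).
pub const DCT_BATCH_CROSSOVER: usize = if cfg!(target_feature = "avx512f") {
    32
} else if cfg!(target_arch = "aarch64") {
    128
} else {
    usize::MAX // generic target: always take the scalar path
};

/// Which code path a batch of a given size takes (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Path {
    Scalar,
    Simd,
}

/// Pure decision function, kept separate so it can be checked in isolation.
pub const fn select_path(n: usize, crossover: usize) -> Path {
    if n >= crossover {
        Path::Simd
    } else {
        Path::Scalar
    }
}

/// Placeholder transform: both branches just copy the input, standing in
/// for the real scalar and SIMD kernels. Returns the path that was taken.
pub fn dct_apply<const N: usize>(input: &[i16; N], output: &mut [i16; N]) -> Path {
    let path = select_path(N, DCT_BATCH_CROSSOVER);
    match path {
        Path::Simd => output.copy_from_slice(input),
        Path::Scalar => {
            for (dst, src) in output.iter_mut().zip(input) {
                *dst = *src;
            }
        }
    }
    path
}
```

Because `N` and `DCT_BATCH_CROSSOVER` are both compile-time constants, the optimizer can typically drop the untaken branch. A `build.rs`-generated constant would replace only the `const` item; the dispatch code stays the same.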
diff --git a/.claude/knowledge/pr-x12-bgz-jc-substrate-synergies.md b/.claude/knowledge/pr-x12-bgz-jc-substrate-synergies.md
@@ -303,7 +303,7 @@ This is the doc-level value of PR-X12: bgz code + PR-X12 docs = a complete archi
 
 ## 5. Gaps — what doesn't exist yet
 
-### 5.1 `jd-nd` — the missing ndarray-side proof crate
+### 5.1 `jd-nd` — the missing ndarray-side proof crate (Gap **G-1**)
 
 The Explore search confirmed: `jd-nd` does not exist in `/home/user/ndarray/`. The math-proof infrastructure on the ndarray side lives ad-hoc inside `src/hpc/` modules (`deepnsm.rs`, `jina/runtime.rs`) as TODO comments.
 
@@ -335,7 +335,7 @@ ndarray/crates/jd-nd/
 
 **Why now:** R-11's latency CI needs a *correctness* twin. Latency that's fast but wrong is the worst outcome. jd-nd is the structural place for those proofs.
 
-### 5.2 Cronbach / ICC research crate
+### 5.2 Cronbach / ICC research crate (Gap **G-2**)
 
 `lance-graph/crates/lance-graph-codec-research/` exists per the Explore agent's report, **but its scope is FFT (rustfft) variants**, not Cronbach's α / ICC / encoding-reliability psychometrics.
 
diff --git a/.claude/knowledge/pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md b/.claude/knowledge/pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md
@@ -308,12 +308,14 @@ Updating the inventory from `pr-x12-bgz-jc-substrate-synergies.md` §7 with the
 
 **Total estimated gap-closing work: 8-12 weeks** across the seven items, all incremental on existing infrastructure. None of them require new research; all are wiring existing primitives into the codec.
 
-Two prior gaps from the earlier doc remain:
+Two prior gaps from the earlier doc remain (their canonical IDs are owned by `pr-x12-bgz-jc-substrate-synergies.md` §5; cross-referenced here):
 
-| Gap (prior) | Component | Cost |
+| Gap (cross-ref) | Component | Cost |
 |---|---|---|
-| **G-8** | `jd-nd` crate does not exist (ndarray-side proof crate) | 2-3 weeks skeleton + ongoing |
-| **G-9** | Cronbach/ICC encoding-reliability research crate not implemented | 1-2 weeks skeleton + 2-3 weeks PoC |
+| **bgz-jc G-1** (§5.1) | `jd-nd` crate does not exist (ndarray-side proof crate) | 2-3 weeks skeleton + ongoing |
+| **bgz-jc G-2** (§5.2) | Cronbach/ICC encoding-reliability research crate not implemented | 1-2 weeks skeleton + 2-3 weeks PoC |
+
+The G-1..G-7 IDs in §5 of *this* doc are local to the cam-pq / sigker / dn_tree binding; bgz-jc's G-1 / G-2 are a separate namespace owned by that doc. When citing cross-doc, prefix with the source (e.g., "bgz-jc G-1" vs "cam-pq G-1") to avoid the collision the previous G-8 / G-9 labelling implied.
 
 **Grand total: ~11-17 weeks** of substrate-binding + gap-closing work, parallel-able. PR-X12 codec body (~1500 LoC per R-3) is independent of this and can ship sooner.
 
diff --git a/.claude/knowledge/pr-x12-gguf-llm-weights-encoding.md b/.claude/knowledge/pr-x12-gguf-llm-weights-encoding.md
@@ -131,7 +131,7 @@ Crucially, the residual is **rANS-coded with a Gaussian-tail prior** (R-10). GGU
 
 For weights that are too extreme to fit any basin (the activation outliers that LLM.int8() and SmoothQuant fight over), encode as Escape + raw f16 value. ~3-5% of weights per layer, but they carry disproportionate information.
 
-The PR-X12 wire format already supports Escape as the lossy-fallback path (with the codec body warning per M:T new items). For LLM weights, Escape *must be lossless* — no truncation of outliers. This is an additional R-N candidate.
+The PR-X12 wire format already supports Escape as the lossy-fallback path (with the codec body warning per M:T new items). For LLM weights, Escape *must be lossless* — no truncation of outliers. This is an additional R-N candidate; see §10 falsifier **F-4** for the wire-format mechanism (rANS bypass channel in the A8 framing layer) and the HEVC-escape-coefficient precedent.
 
 ---
 
@@ -266,7 +266,9 @@ Per GEMM operation (e.g., compute attn_q @ x for batch):
 
 The CTU bitstream is read forward-only (rANS is a streaming codec) and the decoded weights live in L1/L2 cache just long enough to be GEMM'd. **No full-tensor dequantize buffer needed.** For a 4096 × 4096 attention projection, the dequantize buffer would be 32 MB (f16); PR-X12 streams in ~3-4 MB of bitstream, decodes to ~64 KB cache-resident windows, GEMMs each window, drops it.
 
-**Memory savings:** on a memory-constrained edge device (8 GB RAM), this turns "loads 4 GB model + needs 1 GB dequant scratch" into "loads 3 GB model + needs 64 KB scratch." A 7B model at PR-X12 is genuinely runnable on a phone-class device, where GGUF Q4 is borderline.
+**Memory savings (weights only):** on a memory-constrained edge device (8 GB RAM), this turns "loads 4 GB model + needs 1 GB dequant scratch" into "loads 3 GB model + needs 64 KB scratch."
+
+**Phone-class caveat — weights are not the only memory load.** The KV cache scales with context length and is independent of weight compression: for a 7B model at 8K context, KV cache is ~2 GB in fp16 / ~1 GB in int8, and grows linearly with context. PR-X12 weight compression alone takes a 7B from "borderline" to "easier" on phone-class hardware, but **the KV cache lane (Plan D, M:H-3, R-4) is the second lever** that has to compress for full phone-class viability at non-trivial context. Both lanes are needed; this lens only addresses the weights side.
 
 **Latency:** the streaming decode happens in the same loop body as the GEMM accumulate. On a modern arch with VNNI + AMX, the decode cost (~5-10 cycles per cell, branchless via R-1's lookup-table pattern) is hidden by GEMM latency. **Estimated overhead: < 5% versus pre-dequantized GEMM.**
 
@@ -345,7 +347,7 @@ Concrete implications:
 
 4. **Do** keep R-13's federated codebook policy. The LLM use case is the strongest motivation: per-model codebooks are 13 MB; without R-13, a hard-coded codebook would not work for arbitrary LLMs.
 
-5. **Reserve** an `EncodingDomain::LLMWeights` discriminant in the codec metadata header (separate from the 16-bit per-CTU header). The codec body doesn't read this — it just stamps the file with a domain tag so decoders know which basin codebook to load.
+5. **Reserve** the *enum-discriminant slot* for `EncodingDomain::LLMWeights` in the codec metadata header *now*, even though the actual LLM-lane decoder lands post-PR-X12 (per implication #2). The header reserves a fixed-size domain-tag field (separate from the 16-bit per-CTU header); the LLMWeights value of that field stays unimplemented in PR-X12, but the slot is forward-compatibility-locked so a future PR can add the variant without a wire-format break. The codec body doesn't read this — it stamps the file with a domain tag so decoders know which basin codebook to load.
 
 6. **Bench against AWQ at parity perplexity, not just Q4_K_M.** Q4_K_M is a conservative baseline; AWQ + GPTQ are the actual state of the art. If PR-X12 can match AWQ at smaller storage, the case is strong; if not, ship at "drop-in GGUF replacement" framing only.
 
diff --git a/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md b/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md
@@ -130,17 +130,29 @@ pub fn batched_gemm(input: ...) {
 
 ### 3.3 Per-arch tunable crossover (R-5 generalised)
 
-Some operations have a "small N: scalar, large N: SIMD" crossover that varies per arch:
+Some operations have a "small N: scalar, large N: SIMD" crossover that varies per arch. The snippet below is **pseudocode**: stable Rust const-eval cannot `match` on a runtime-detected `Arch::CURRENT` value in a `const` context. The real mechanism is a `build.rs` script that resolves the target arch at *compile time* (via `target_arch` / `target_feature` cfgs + a feature-detection probe) and emits the chosen integer as a generated `const` in `OUT_DIR`:
 
 ```rust
-const DCT_BATCH_CROSSOVER: usize = match Arch::CURRENT {
-    Arch::SapphireRapids => 64,   // AMX wins above this
-    Arch::IceLakeServer => 32,    // AVX-512 narrower; lower crossover
-    Arch::Zen4 => 96,             // Zen's AVX-512 emulation widens crossover
-    Arch::AppleM3 => 256,         // NEON's narrower; only worth at large N
-    Arch::GravitonV3 => 128,      // SVE2 mid-range
-    Arch::Generic => usize::MAX,  // Always scalar fallback
-};
+// Shape of the per-arch table (lives in a build-script-generated file
+// included via include!(concat!(env!("OUT_DIR"), "/arch_crossovers.rs"))):
+//
+//   pub const DCT_BATCH_CROSSOVER: usize = 64;  // emitted by build.rs
+//                                                // for Sapphire Rapids
+//
+// The build script's decision matrix:
+//   Sapphire Rapids (target_feature = "avx512f,amx-bf16")  → 64
+//   Ice Lake / Skylake-X (avx512f only)                    → 32
+//   Zen 4 (avx512f, no AMX)                                → 96
+//   Apple Silicon (target_arch = "aarch64" + NEON)         → 256
+//   Graviton 3 (aarch64 + SVE2)                            → 128
+//   Generic / no SIMD                                       → usize::MAX
+//
+// Build-script-free fallback: `cfg!` expands to a compile-time bool, so this
+// is already a valid `const` on stable Rust (coarser than the matrix above):
+//   const DCT_BATCH_CROSSOVER: usize = if cfg!(target_feature = "amx-bf16") { 64 }
+//                                       else if cfg!(target_feature = "avx512f") { 32 }
+//                                       else if cfg!(target_arch = "aarch64") { 128 }
+//                                       else { usize::MAX };
 
 pub fn dct_apply<const N: usize>(input: &[i16], output: &mut [i16]) {
     if N >= DCT_BATCH_CROSSOVER {
@@ -151,7 +163,7 @@ pub fn dct_apply<const N: usize>(input: &[i16], output: &mut [i16]) {
 }
 ```
 
-R-5 commits these crossovers as **bench-tunable constants**, not hand-guessed numbers. Plan G's codec-bench includes a calibration sub-target that emits the right `const` values per arch via build script.
+R-5 commits these crossovers as **bench-tunable constants** emitted by Plan G's codec-bench calibration sub-target into the per-arch `OUT_DIR` file — not hand-guessed numbers, not a runtime `match` on a synthetic `Arch` enum. The build script is the single source of truth for which integer compiles in.
 
 ---
 
diff --git a/.claude/knowledge/pr-x12-x266-3dgs-spacetime-upscaling.md b/.claude/knowledge/pr-x12-x266-3dgs-spacetime-upscaling.md
@@ -268,12 +268,14 @@ Nothing in this doc is in PR-X12 scope. What it requires from PR-X12:
 
 | Requirement | Source | Status |
 |---|---|---|
-| `Basis<T>` trait with parametric `apply` | R-1, M:E-A | landed in concept; implementation in Plan A4 |
+| `Basis<T>` trait with parametric `apply` | R-1, M:E-A | **canon-fixed** (R-1 trait shape committed); **implementation** scheduled in Plan A4 |
 | EWA splat rasterizer as `Basis<f16>` impl | Plan E | scheduled |
-| Codec body decoupled from specific basis | M:H-NEW-2 LoC envelope | enforced via R-3 audit |
-| Header byte stable across basis swaps | R-2, M:E-J bits 0-1 | landed |
+| Codec body decoupled from specific basis | M:H-NEW-2 LoC envelope | enforced via R-3 audit rule (doc commitment; CI check pending) |
+| Header byte stable across basis swaps | R-2, M:E-J bits 0-1 | **canon-fixed** (R-2 commits bits 0-1 = `header_kind`); wire-format implementation in Plan A8 |
 | Plan G video lane validates per-arch latency | R-4, R-11 | scheduled |
-| Federated codebook policy for scene anchors | R-13 | landed |
+| Federated codebook policy for scene anchors | R-13 | **canon-fixed** (R-13 commits Option A: per-shard codebook for Plan F v1); implementation in Plan F |
+
+**"Canon-fixed"** = the resolution doc commits the design; **"scheduled"** = the implementation has a named plan card. None of the above have shipping code today.
 
 The path to x266-like capability is: