Commit 6f96a14

docs(pr-x12): close 6 CodeRabbit nitpicks left open at PR #197 merge
Per other-session feedback: six nitpick-level findings on PR #197 didn't block the merge but stayed unaddressed. Folding them into this PR.

1. GGUF Escape forward-ref to F-4
   gguf-llm-weights-encoding.md §2.4 said "Escape must be lossless ... This is an additional R-N candidate" with no pointer. F-4 in §10 already explains the mechanism (rANS bypass channel in A8, HEVC escape-coefficient precedent). Added an inline cross-ref so readers don't have to scroll to find the resolution.

2. Phone-class viability overclaim re KV cache
   gguf-llm-weights-encoding.md line 269 claimed "7B at PR-X12 is genuinely runnable on a phone-class device". Weight compression alone takes 7B from 4 GB to 3 GB, but KV cache at 8K context is ~1-2 GB independent of weight compression. Qualified the claim: PR-X12 weights are necessary but not sufficient; KV-cache lane (Plan D, M:H-3, R-4) is the second lever for full phone viability.

3. EncodingDomain::LLMWeights timing
   §11 implication #2 says "LLM lane lands post-PR-X12, but the harness must be lane-extensible"; implication #5 said "Reserve an EncodingDomain::LLMWeights discriminant ... now". Clarified: PR-X12 reserves the enum-discriminant *slot* now (forward-compat lock); the LLMWeights variant + decoder land post-PR-X12 without a wire-format break.

4. Per-arch `match Arch::CURRENT` const-eval
   woa-multiarch-orchestration.md §3.3's `const DCT_BATCH_CROSSOVER = match Arch::CURRENT { ... }` does not compile under stable Rust const-eval: `Arch::CURRENT` would need to be a const, and architecture-conditional const matches require build-script-emitted integers or `cfg!(target_feature = ...)`. Rewrote as pseudocode pointing at a `build.rs` mechanism (decision matrix → `OUT_DIR` generated const) with a `cfg!()` fallback shape.

5. G-8 / G-9 numbering collision
   cam-pq-sigker-dn-tree-substrate-bindings.md §5 labelled bgz-jc's two prior gaps as G-8 / G-9 (continuing cam-pq's own G-1..G-7), but bgz-jc-substrate-synergies.md §5 didn't use any G-N IDs, so the cross-doc reference was dangling and the namespace was implicitly shared without rules. Gave bgz-jc §5.1 / §5.2 explicit IDs G-1 / G-2 (canonical to that doc) and updated cam-pq to cite them as "bgz-jc G-1" / "bgz-jc G-2" with a namespace-isolation note.

6. "landed" terminology in x266 §8 prerequisites table
   The status column claimed "landed" / "landed in concept" for R-1 trait shape, R-2 header bytes, and R-13 federated codebook policy. None of these have shipping code; they are canon-fixed (the resolution doc commits the design) but implementation is scheduled in Plan A4 / A8 / F. Renamed status to "canon-fixed" + a glossary line distinguishing "canon-fixed" (doc commitment) from "scheduled" (plan card exists) from shipping code.

https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u
1 parent 8415a62 commit 6f96a14

5 files changed

Lines changed: 41 additions & 23 deletions

.claude/knowledge/pr-x12-bgz-jc-substrate-synergies.md

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -303,7 +303,7 @@ This is the doc-level value of PR-X12: bgz code + PR-X12 docs = a complete archi
303303

304304
## 5. Gaps — what doesn't exist yet
305305

306-
### 5.1 `jd-nd` — the missing ndarray-side proof crate
306+
### 5.1 `jd-nd` — the missing ndarray-side proof crate (Gap **G-1**)
307307

308308
The Explore search confirmed: `jd-nd` does not exist in `/home/user/ndarray/`. The math-proof infrastructure on the ndarray side lives ad-hoc inside `src/hpc/` modules (`deepnsm.rs`, `jina/runtime.rs`) as TODO comments.
309309

@@ -335,7 +335,7 @@ ndarray/crates/jd-nd/
335335

336336
**Why now:** R-11's latency CI needs a *correctness* twin. Latency that's fast but wrong is the worst outcome. jd-nd is the structural place for those proofs.
337337

338-
### 5.2 Cronbach / ICC research crate
338+
### 5.2 Cronbach / ICC research crate (Gap **G-2**)
339339

340340
`lance-graph/crates/lance-graph-codec-research/` exists per the Explore agent's report, **but its scope is FFT (rustfft) variants**, not Cronbach's α / ICC / encoding-reliability psychometrics.
341341

.claude/knowledge/pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md

Lines changed: 6 additions & 4 deletions
@@ -308,12 +308,14 @@ Updating the inventory from `pr-x12-bgz-jc-substrate-synergies.md` §7 with the
308308

309309
**Total estimated gap-closing work: 8-12 weeks** across the seven items, all incremental on existing infrastructure. None of them require new research; all are wiring existing primitives into the codec.
310310

311-
Two prior gaps from the earlier doc remain:
311+
Two prior gaps from the earlier doc remain (their canonical IDs are owned by `pr-x12-bgz-jc-substrate-synergies.md` §5; cross-referenced here):
312312

313-
| Gap (prior) | Component | Cost |
313+
| Gap (cross-ref) | Component | Cost |
314314
|---|---|---|
315-
| **G-8** | `jd-nd` crate does not exist (ndarray-side proof crate) | 2-3 weeks skeleton + ongoing |
316-
| **G-9** | Cronbach/ICC encoding-reliability research crate not implemented | 1-2 weeks skeleton + 2-3 weeks PoC |
315+
| **bgz-jc G-1** (§5.1) | `jd-nd` crate does not exist (ndarray-side proof crate) | 2-3 weeks skeleton + ongoing |
316+
| **bgz-jc G-2** (§5.2) | Cronbach/ICC encoding-reliability research crate not implemented | 1-2 weeks skeleton + 2-3 weeks PoC |
317+
318+
The G-1..G-7 IDs in §5 of *this* doc are local to the cam-pq / sigker / dn_tree binding; bgz-jc's G-1 / G-2 are a separate namespace owned by that doc. When citing cross-doc, prefix with the source (e.g., "bgz-jc G-1" vs "cam-pq G-1") to avoid the collision the previous G-8 / G-9 labelling implied.
317319

318320
**Grand total: ~11-17 weeks** of substrate-binding + gap-closing work, parallel-able. PR-X12 codec body (~1500 LoC per R-3) is independent of this and can ship sooner.
319321

.claude/knowledge/pr-x12-gguf-llm-weights-encoding.md

Lines changed: 5 additions & 3 deletions
@@ -131,7 +131,7 @@ Crucially, the residual is **rANS-coded with a Gaussian-tail prior** (R-10). GGU
131131

132132
For weights that are too extreme to fit any basin (the activation outliers that LLM.int8() and SmoothQuant fight over), encode as Escape + raw f16 value. ~3-5% of weights per layer, but they carry disproportionate information.
133133

134-
The PR-X12 wire format already supports Escape as the lossy-fallback path (with the codec body warning per M:T new items). For LLM weights, Escape *must be lossless* — no truncation of outliers. This is an additional R-N candidate.
134+
The PR-X12 wire format already supports Escape as the lossy-fallback path (with the codec body warning per M:T new items). For LLM weights, Escape *must be lossless* — no truncation of outliers. This is an additional R-N candidate; see §10 falsifier **F-4** for the wire-format mechanism (rANS bypass channel in the A8 framing layer) and the HEVC-escape-coefficient precedent.
135135

136136
---
137137

@@ -266,7 +266,9 @@ Per GEMM operation (e.g., compute attn_q @ x for batch):
266266

267267
The CTU bitstream is read forward-only (rANS is a streaming codec) and the decoded weights live in L1/L2 cache just long enough to be GEMM'd. **No full-tensor dequantize buffer needed.** For a 4096 × 4096 attention projection, the dequantize buffer would be 32 MB (f16); PR-X12 streams in ~3-4 MB of bitstream, decodes to ~64 KB cache-resident windows, GEMMs each window, drops it.
268268

269-
**Memory savings:** on a memory-constrained edge device (8 GB RAM), this turns "loads 4 GB model + needs 1 GB dequant scratch" into "loads 3 GB model + needs 64 KB scratch." A 7B model at PR-X12 is genuinely runnable on a phone-class device, where GGUF Q4 is borderline.
269+
**Memory savings (weights only):** on a memory-constrained edge device (8 GB RAM), this turns "loads 4 GB model + needs 1 GB dequant scratch" into "loads 3 GB model + needs 64 KB scratch."
270+
271+
**Phone-class caveat — weights are not the only memory load.** The KV cache scales with context length and is independent of weight compression: for a 7B model at 8K context, KV cache is ~2 GB in fp16 / ~1 GB in int8, and grows linearly with context. PR-X12 weight compression alone takes a 7B from "borderline" to "easier" on phone-class hardware, but **the KV cache lane (Plan D, M:H-3, R-4) is the second lever** that has to compress for full phone-class viability at non-trivial context. Both lanes are needed; this lens only addresses the weights side.
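The KV-cache size quoted above depends on the attention layout, which the text does not pin down. A minimal sketch of the arithmetic, assuming a decoder-only transformer that stores one K and one V tensor per layer; the layer count, head counts, and head dimension below are illustrative assumptions for a 7B-class model, not figures from this document:

```rust
/// Bytes needed for the KV cache: two tensors (K and V) per layer,
/// each holding `ctx_len * kv_heads * head_dim` elements.
fn kv_cache_bytes(
    layers: u64,
    kv_heads: u64,
    head_dim: u64,
    ctx_len: u64,
    bytes_per_elem: u64,
) -> u64 {
    2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem
}

/// Convert a byte count to GiB for display.
fn gib(bytes: u64) -> f64 {
    bytes as f64 / (1u64 << 30) as f64
}

fn main() {
    // ASSUMED shapes (hypothetical 7B-class configs), 8K context.
    let ctx = 8192;
    // Full multi-head attention: 32 layers, 32 KV heads, head_dim 128.
    let mha_fp16 = kv_cache_bytes(32, 32, 128, ctx, 2);
    // Grouped-query attention: same model, but only 8 KV heads.
    let gqa_fp16 = kv_cache_bytes(32, 8, 128, ctx, 2);
    let gqa_int8 = kv_cache_bytes(32, 8, 128, ctx, 1);

    println!("MHA fp16: {:.2} GiB", gib(mha_fp16));
    println!("GQA fp16: {:.2} GiB", gib(gqa_fp16));
    println!("GQA int8: {:.2} GiB", gib(gqa_int8));
}
```

Under these assumed shapes the fp16 estimate ranges from 1 GiB (8 KV heads) to 4 GiB (32 KV heads), so a single headline number only holds for one specific layout. The linear dependence on `ctx_len` is what makes the cache, rather than the weights, the dominant term at long context.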
270272

271273
**Latency:** the streaming decode happens in the same loop body as the GEMM accumulate. On a modern arch with VNNI + AMX, the decode cost (~5-10 cycles per cell, branchless via R-1's lookup-table pattern) is hidden by GEMM latency. **Estimated overhead: < 5% versus pre-dequantized GEMM.**
272274

@@ -345,7 +347,7 @@ Concrete implications:
345347

346348
4. **Do** keep R-13's federated codebook policy. The LLM use case is the strongest motivation: per-model codebooks are 13 MB; without R-13, a hard-coded codebook would not work for arbitrary LLMs.
347349

348-
5. **Reserve** an `EncodingDomain::LLMWeights` discriminant in the codec metadata header (separate from the 16-bit per-CTU header). The codec body doesn't read this — it just stamps the file with a domain tag so decoders know which basin codebook to load.
350+
5. **Reserve** the *enum-discriminant slot* for `EncodingDomain::LLMWeights` in the codec metadata header *now*, even though the actual LLM-lane decoder lands post-PR-X12 (per implication #2). The header reserves a fixed-size domain-tag field (separate from the 16-bit per-CTU header); the LLMWeights value of that field stays unimplemented in PR-X12, but the slot is forward-compatibility-locked so a future PR can add the variant without a wire-format break. The codec body doesn't read this — it stamps the file with a domain tag so decoders know which basin codebook to load.
349351

350352
6. **Bench against AWQ at parity perplexity, not just Q4_K_M.** Q4_K_M is a conservative baseline; AWQ + GPTQ are the actual state of the art. If PR-X12 can match AWQ at smaller storage, the case is strong; if not, ship at "drop-in GGUF replacement" framing only.
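Implication 5 (reserve the discriminant slot now, implement the variant later) can be sketched as follows. This is only an illustration of the forward-compatibility idea: the one-byte tag width, the numeric values, the `Generic` variant, and the `parse_domain_tag` / `DomainError` names are all hypothetical and not taken from the PR-X12 wire format.

```rust
/// Hypothetical domain tag stored in the codec metadata header.
/// Only the existence of an `LLMWeights` slot comes from the text;
/// the values and the `Generic` variant are assumptions.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EncodingDomain {
    Generic = 0,
    /// Slot reserved now; the decoder for it lands in a later PR.
    LLMWeights = 1,
}

#[derive(Debug, PartialEq, Eq)]
enum DomainError {
    /// The tag value is allocated but this build has no decoder for it.
    ReservedNotImplemented(EncodingDomain),
    /// The tag value is not allocated at all.
    Unknown(u8),
}

/// Decoder-side check: accept implemented domains, reject reserved ones
/// with a distinct error so old readers fail cleanly on newer files.
fn parse_domain_tag(tag: u8) -> Result<EncodingDomain, DomainError> {
    match tag {
        0 => Ok(EncodingDomain::Generic),
        1 => Err(DomainError::ReservedNotImplemented(
            EncodingDomain::LLMWeights,
        )),
        other => Err(DomainError::Unknown(other)),
    }
}

fn main() {
    for tag in [0u8, 1, 7] {
        println!("tag {tag}: {:?}", parse_domain_tag(tag));
    }
}
```

Because the numeric value is fixed before any decoder exists, a later PR only has to change the `1 =>` arm to return `Ok(...)`; files written in the meantime keep the same header layout.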
351353

.claude/knowledge/pr-x12-woa-multiarch-orchestration.md

Lines changed: 22 additions & 10 deletions
@@ -130,17 +130,29 @@ pub fn batched_gemm(input: ...) {
130130

131131
### 3.3 Per-arch tunable crossover (R-5 generalised)
132132

133-
Some operations have a "small N: scalar, large N: SIMD" crossover that varies per arch:
133+
Some operations have a "small N: scalar, large N: SIMD" crossover that varies per arch. The snippet below is **pseudocode**: stable Rust const-eval cannot `match` on a runtime-detected `Arch::CURRENT` value in a `const` context. The real mechanism is a `build.rs` script that resolves the target arch at *compile time* (via `target_arch` / `target_feature` cfgs + a feature-detection probe) and emits the chosen integer as a generated `const` in `OUT_DIR`:
134134

135135
```rust
136-
const DCT_BATCH_CROSSOVER: usize = match Arch::CURRENT {
137-
Arch::SapphireRapids => 64, // AMX wins above this
138-
Arch::IceLakeServer => 32, // AVX-512 narrower; lower crossover
139-
Arch::Zen4 => 96, // Zen's AVX-512 emulation widens crossover
140-
Arch::AppleM3 => 256, // NEON's narrower; only worth at large N
141-
Arch::GravitonV3 => 128, // SVE2 mid-range
142-
Arch::Generic => usize::MAX, // Always scalar fallback
143-
};
136+
// Shape of the per-arch table (lives in a build-script-generated file
137+
// included via include!(concat!(env!("OUT_DIR"), "/arch_crossovers.rs"))):
138+
//
139+
// pub const DCT_BATCH_CROSSOVER: usize = 64; // emitted by build.rs
140+
// // for Sapphire Rapids
141+
//
142+
// The build script's decision matrix:
143+
// Sapphire Rapids (target_feature = "avx512f,amx-bf16") → 64
144+
// Ice Lake / Skylake-X (avx512f only) → 32
145+
// Zen 4 (avx512f, no AMX) → 96
146+
// Apple Silicon (target_arch = "aarch64" + NEON) → 256
147+
// Graviton 3 (aarch64 + SVE2) → 128
148+
// Generic / no SIMD → usize::MAX
149+
//
150+
// Coarser fallback without a build script. cfg!() expands to a compile-time
151+
// boolean, so this chain is valid in a const initializer on stable Rust:
152+
// const DCT_BATCH_CROSSOVER: usize = if cfg!(target_feature = "amx-bf16") { 64 }
153+
// else if cfg!(target_feature = "avx512f") { 32 }
154+
// else if cfg!(target_arch = "aarch64") { 128 }
155+
// else { usize::MAX };
144156

145157
pub fn dct_apply<const N: usize>(input: &[i16], output: &mut [i16]) {
146158
if N >= DCT_BATCH_CROSSOVER {
@@ -151,7 +163,7 @@ pub fn dct_apply<const N: usize>(input: &[i16], output: &mut [i16]) {
151163
}
152164
```
153165

154-
R-5 commits these crossovers as **bench-tunable constants**, not hand-guessed numbers. Plan G's codec-bench includes a calibration sub-target that emits the right `const` values per arch via build script.
166+
R-5 commits these crossovers as **bench-tunable constants** emitted by Plan G's codec-bench calibration sub-target into the per-arch `OUT_DIR` file — not hand-guessed numbers, not a runtime `match` on a synthetic `Arch` enum. The build script is the single source of truth for which integer compiles in.
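The build-script mechanism described above could look roughly like this. It is a sketch under assumptions: `CARGO_CFG_TARGET_ARCH`, `CARGO_CFG_TARGET_FEATURE`, and `OUT_DIR` are environment variables Cargo sets for build scripts, but the `pick_crossover` and `render_const` helpers are hypothetical, the integers are copied from the decision matrix rather than bench-calibrated, and the Zen 4 row is omitted because the target-feature list alone does not distinguish it from other AVX-512 parts. Whether a given feature name (for example `sve2`) appears in the list depends on the toolchain.

```rust
// build.rs (sketch)
use std::{env, fs, path::PathBuf};

/// Map Cargo's view of the target to a crossover value.
/// `None` means "no SIMD path known": always use the scalar fallback.
fn pick_crossover(target_arch: &str, target_features: &str) -> Option<usize> {
    let has = |f: &str| target_features.split(',').any(|x| x == f);
    match target_arch {
        "x86_64" if has("amx-bf16") => Some(64),
        "x86_64" if has("avx512f") => Some(32),
        "aarch64" if has("sve2") => Some(128),
        "aarch64" => Some(256),
        _ => None,
    }
}

/// Render the generated source line. `usize::MAX` is emitted symbolically
/// so the value is correct for the target's pointer width, not the host's.
fn render_const(value: Option<usize>) -> String {
    let rhs = match value {
        Some(v) => v.to_string(),
        None => "usize::MAX".to_string(),
    };
    format!("pub const DCT_BATCH_CROSSOVER: usize = {rhs};\n")
}

fn main() {
    let arch = env::var("CARGO_CFG_TARGET_ARCH").unwrap_or_default();
    let features = env::var("CARGO_CFG_TARGET_FEATURE").unwrap_or_default();

    // Fall back to the temp dir so the sketch also runs outside Cargo.
    let out_dir = env::var("OUT_DIR")
        .map(PathBuf::from)
        .unwrap_or_else(|_| env::temp_dir());
    let dest = out_dir.join("arch_crossovers.rs");

    fs::write(&dest, render_const(pick_crossover(&arch, &features)))
        .expect("failed to write arch_crossovers.rs");
    println!("cargo:rerun-if-changed=build.rs");
}
```

The crate would then pull the constant in with `include!(concat!(env!("OUT_DIR"), "/arch_crossovers.rs"))`, as the comment in the snippet describes. A calibration step would replace the hard-coded integers in `pick_crossover` with measured values.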
155167

156168
---
157169

.claude/knowledge/pr-x12-x266-3dgs-spacetime-upscaling.md

Lines changed: 6 additions & 4 deletions
@@ -268,12 +268,14 @@ Nothing in this doc is in PR-X12 scope. What it requires from PR-X12:
268268

269269
| Requirement | Source | Status |
270270
|---|---|---|
271-
| `Basis<T>` trait with parametric `apply` | R-1, M:E-A | landed in concept; implementation in Plan A4 |
271+
| `Basis<T>` trait with parametric `apply` | R-1, M:E-A | **canon-fixed** (R-1 trait shape committed); **implementation** scheduled in Plan A4 |
272272
| EWA splat rasterizer as `Basis<f16>` impl | Plan E | scheduled |
273-
| Codec body decoupled from specific basis | M:H-NEW-2 LoC envelope | enforced via R-3 audit |
274-
| Header byte stable across basis swaps | R-2, M:E-J bits 0-1 | landed |
273+
| Codec body decoupled from specific basis | M:H-NEW-2 LoC envelope | enforced via R-3 audit rule (doc commitment; CI check pending) |
274+
| Header byte stable across basis swaps | R-2, M:E-J bits 0-1 | **canon-fixed** (R-2 commits bits 0-1 = `header_kind`); wire-format implementation in Plan A8 |
275275
| Plan G video lane validates per-arch latency | R-4, R-11 | scheduled |
276-
| Federated codebook policy for scene anchors | R-13 | landed |
276+
| Federated codebook policy for scene anchors | R-13 | **canon-fixed** (R-13 commits Option A: per-shard codebook for Plan F v1); implementation in Plan F |
277+
278+
**"Canon-fixed"** = the resolution doc commits the design; **"scheduled"** = the implementation has a named plan card. None of the above have shipping code today.
277279

278280
The path to x266-like capability is:
279281
