From 8415a627eaa932f74ad9c63bf79a3bb4fa8bb659 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 22 May 2026 17:26:46 +0000 Subject: [PATCH 1/5] =?UTF-8?q?docs(pr-x12):=20apply=204=20canon=20updates?= =?UTF-8?q?=20from=20cam-pq=20doc=20=C2=A76=20(R-7=20/=20R-13=20/=20R-14?= =?UTF-8?q?=20/=20R-15)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The substrate-binding doc pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md §6 proposed four canon updates that PR #197 marked "should be added to canon-resolutions-delta" but never actually applied. Closing that loop. 1. R-7 path correction (both canon files): tropical-GEMM kernel lives today at lance-graph::bgz17::scalar_sparse:: tropical_spmv, not in an abstract blasgraph namespace. The blasgraph name is the eventual abstraction (post-Plan-H); until then ndarray-codec depends on bgz17 directly. Cite the symbol, not the namespace, when wiring A6. 2. R-13 expansion (both canon files): Name the implementation primitives composed by the four policy modes (LocalEphemeral / SharedClusterWide / SharedRegional / PretrainedStatic): - cam_pq::CamCodebook for training (k-means + CAM-PQ) - bgz-tensor::Codebook4096 / bgz-hhtl-d for deployed encoding - dn_tree for online plastic updates (SharedClusterWide) - merkle_tree for integrity proof (Blake3-48-bit + xor_diff) - q2 (external) for gossip protocol PR-X12 contributes the wire format + CodebookHandle trait + Option A. No new substrate code required. 3. R-14 (NEW, both canon files): Formal-correctness layer via lance-graph::jc pillars: - Pillar 10 (Pflug-Pichler, jc::pflug): nested-distance Lipschitz on Sigma DN-trees proves CAM-PQ tree quantization preserves FreeEnergy within Lε. Active in default zero-dep build. - Pillar 11 (Hambly-Lyons, jc::hambly_lyons): signature uniqueness on tree-quotient. Active under --features hambly-lyons (PR #348, 2026-05-07); probe passes (forward<1e-9, converse>0.05, ratio≥1e6). 
R-4 quality floors inherit Pillar 10's Lipschitz bound. R-15 gates on Pillar 11. PR #350 corrects sigker::signature_kernel_pde's Goursat-PDE math bug; until then Pillar 11 uses signature_truncated. 4. R-15 (NEW, both canon files): SignatureBasis: Basis as fifth concrete Basis impl alongside DctIIBasis (video), EwaSplatBasis (3DGS), ShSpectralBasis (splat SH), HadamardBasis. Wraps sigker::signature_truncated (tensor-algebra, correct today) — NOT signature_kernel_pde (buggy until PR #350). Plan G gets a fifth lane: stream signal (audio waveform / time-series / gesture / handwriting). Quality floor inherits from Pillar 11 (R-14). Compression target ~10× over raw f32 path samples (calibrate during Plan G). Also updated the compaction-preservation contract, falsifiability matrix (+3 rows for R-14/R-15), and §0 to reflect R-1..R-15 inclusive. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u --- .../pr-x12-canon-resolutions-delta.md | 66 ++++++- .../pr-x12-substrate-canon-resolutions.md | 164 +++++++++++++++++- 2 files changed, 214 insertions(+), 16 deletions(-) diff --git a/.claude/knowledge/pr-x12-canon-resolutions-delta.md b/.claude/knowledge/pr-x12-canon-resolutions-delta.md index ad7b923f..fd6795b7 100644 --- a/.claude/knowledge/pr-x12-canon-resolutions-delta.md +++ b/.claude/knowledge/pr-x12-canon-resolutions-delta.md @@ -9,15 +9,16 @@ ## 0. What's actually new -The merged canon (`bc9da4ad`) argued the architecture; canon-resolutions makes it falsifiable. Five categories of novel content survive the delta filter: +The merged canon (`bc9da4ad`) argued the architecture; canon-resolutions makes it falsifiable. Six categories of novel content survive the delta filter: 1. **Concrete trait signatures** — R-1 (`Basis` + `LinearReduce` split), §8 surface (`PredictiveSignal`, `CurveOrder`, `RdoMetric`) 2. **Quantified budgets** — R-3 LoC envelope per sub-card / per consumer + audit rule; R-4 four Plan G thresholds; R-11 4K@60fps latency budget -3. 
**Math identities** — R-6 SSD-via-VNNI (`||A||² - 2A·B + ||B||²`), R-7 tropical-GEMM partition (`O(4^d) → O(d²)`) +3. **Math identities** — R-6 SSD-via-VNNI (`||A||² - 2A·B + ||B||²`), R-7 tropical-GEMM partition (`O(4^d) → O(d²)`, kernel at `bgz17::scalar_sparse::tropical_spmv`) 4. **Type-level invariants** — R-2 bit-15/bit-14 split, R-9 topology-FREE codec -5. **Phasing patterns** — R-8 confidence-gate framing, R-13 Option-A-then-B for federated codebook +5. **Phasing patterns** — R-8 confidence-gate framing, R-13 Option-A-then-B for federated codebook (primitives: `cam_pq` + `bgz-hhtl-d` + `dn_tree` + `merkle_tree`) +6. **Formal-correctness + stream lane (post-merge)** — R-14 (`jc::pflug` Pillar 10 + `jc::hambly_lyons` Pillar 11), R-15 (`SignatureBasis` as fifth Plan G lane) -Plus the synthesis layer: §9 falsifiability matrix (24 rows), §10 sequencing with named gates, §12 compaction-preservation contract. +Plus the synthesis layer: §9 falsifiability matrix (24+3 rows including R-14/R-15), §10 sequencing with named gates, §12 compaction-preservation contract. --- @@ -216,7 +217,9 @@ Tropical-semiring (+, min) formulation: At 4K 132K CTUs/frame: ~4 ms vs ~64 ms just for partition RDO. At 60 fps, the difference between fitting and missing budget. -**Dep direction:** `ndarray-codec → lance-graph::blasgraph` (tropical-GEMM kernels live in blasgraph). Allowed post-Plan-H because ndarray-codec is a sibling crate, not the bottom. +**Dep direction:** `ndarray-codec → lance-graph::blasgraph` (tropical-GEMM kernels nominally live in blasgraph). Allowed post-Plan-H because ndarray-codec is a sibling crate, not the bottom. + +**Actual kernel home (current):** `lance-graph::bgz17::scalar_sparse::tropical_spmv`. The `blasgraph` namespace is the eventual abstraction; until that lands, ndarray-codec depends on bgz17 directly. Cite the symbol when wiring A6, not the namespace. 
**Plan A6 (1 week) ships this.** λ-RDO knob scales edge weights; tropical-GEMM relaxation computes optimal mode tree. @@ -292,6 +295,16 @@ Pattern: ship simplest-that-works, measure, escalate. Don't pick best-in-theory Wire-format hook for Option A: `WorkerId: u16` + `CodebookHash: u64` in frame header. +**Implementation primitives** (already exist; PR-X12 only adds the wire format + `CodebookHandle` trait): + +| Concern | Crate / module | +|---|---| +| Codebook training (k-means + CAM-PQ) | `ndarray::hpc::cam_pq::CamCodebook` | +| Deployed encoding format | `lance-graph::bgz-tensor::Codebook4096` / `bgz-hhtl-d` | +| Online plastic updates (SharedClusterWide) | `ndarray::hpc::dn_tree` | +| Integrity proof (Blake3-48 Merkle root, xor_diff) | `ndarray::hpc::merkle_tree` | +| Gossip protocol | `q2` (external) | + ### 5.3 Streaming flush granularity (R-12) Per-CTU default. `FlushUnit` 2-bit tag in frame header: @@ -405,9 +418,48 @@ Citation IDs (R-1..R-13) stable. Canon IDs (M:E-*, M:H-*, M:H-NEW-*, M:T-*, A:E- --- -## 11. The single load-bearing paragraph (§13) +## 11. 
Formal-correctness layer (R-14) — post-merge addition + +The substrate-binding doc (`pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md`) surfaced two formal proofs in `lance-graph::jc` that the codec inherits without re-proving: + +| Pillar | Crate / module | What it proves | Status | +|---|---|---|---| +| **Pillar 10** (Pflug-Pichler) | `jc::pflug` | Nested-distance Lipschitz on Sigma DN-trees: CAM-PQ tree quantization preserves FreeEnergy within Lε | Active in default zero-dep build | +| **Pillar 11** (Hambly-Lyons) | `jc::hambly_lyons` | Signature uniqueness on tree-quotient: any path of bounded variation is uniquely determined by its full (untruncated) signature up to tree-like equivalence (Annals 171(1), arXiv:math/0507536) | Active under `--features hambly-lyons` (PR #348, 2026-05-07); probe passes (forward<1e-9, converse>0.05, ratio≥1e6) | + +R-4's quality-floor rows for video / KV / gradient inherit Pillar 10's Lipschitz bound. R-15's signature lane gates on Pillar 11. + +**Open work (G-4):** PR #350 corrects `sigker::signature_kernel_pde`'s known Goursat-PDE math bug; Pillar 11's probe deliberately uses `signature_truncated` (tensor-algebra) until PR #350 lands. Production-scale benchmarking pending. + +--- + +## 12. Stream-signal codec lane (R-15) — post-merge addition + +`SignatureBasis: Basis` is the fifth concrete `Basis` impl, complementing the four lanes in §1's table: + +```rust +// New: ndarray::hpc::signature (~1 wk, wraps sigker::signature_truncated) +impl Basis for SignatureBasis { + fn dim(&self) -> usize { /* truncated tensor-algebra dim */ } + fn apply(&self, path: &[f32], signature: &mut [f32]) { + // iterated-integral truncation via sigker::signature_truncated + } + fn invert(&self, _sig: &[f32], _path: &mut [f32]) { + unimplemented!("path-from-signature is unique only up to tree-like \ + equivalence per R-14 Pillar 11") + } +} +``` + +**Plan G gets a fifth lane: "stream signal"** — audio waveforms / time-series / gesture / handwriting paths. 
Codec is `SignatureBasis` + standard rANS over the four-mode taxonomy; quality floor inherits from Pillar 11 (R-14); compression target ~10× over raw f32 path samples (calibrate during Plan G). + +**Why `signature_truncated` not `signature_kernel_pde`:** the PDE form ships a known divergence bug (PR #350). The tensor-algebra path is correct today and is what Pillar 11 cites. + +--- + +## 13. The single load-bearing paragraph (canon-resolutions §13) -> *The merged canon committed to the right architectural synthesis (M:E-A, M:E-D, M:E-G, M:E-I) but left the load-bearing contracts unsigned. Canon-resolutions commits them: `Basis` + `LinearReduce` are two traits not one (R-1); bit 14 of the leaf header is consumer-typed and bit 15 universal (R-2); generic codec body ≤1500 LoC with ≤200 LoC per consumer (R-3); four threshold pairs gate Plan G's pass criteria (R-4); the trajectory is Plan G (2 wks) → Plan A7 critical path (1.5 wks) → Phase 2 consumers parallel (3 wks); end state is one binary, four loads, ~2 KLoC stack demonstrating M:H-NEW-1 in ~10.5 weeks of wall-clock. Every claim in §9 has a test; Plan G's bench-harness binary is the audit. The falsifiability is the point.* +> *The merged canon committed to the right architectural synthesis (M:E-A, M:E-D, M:E-G, M:E-I) but left the load-bearing contracts unsigned. Canon-resolutions commits them: `Basis` + `LinearReduce` are two traits not one (R-1); bit 14 of the leaf header is consumer-typed and bit 15 universal (R-2); generic codec body ≤1500 LoC with ≤200 LoC per consumer (R-3); four threshold pairs gate Plan G's pass criteria (R-4); the trajectory is Plan G (2 wks) → Plan A7 critical path (1.5 wks) → Phase 2 consumers parallel (3 wks); end state is one binary, four loads, ~2 KLoC stack demonstrating M:H-NEW-1 in ~10.5 weeks of wall-clock. Every claim in §9 has a test; Plan G's bench-harness binary is the audit. The falsifiability is the point. 
The substrate-binding follow-up (R-14, R-15) adds a formal-correctness layer via `jc` pillars and a fifth stream-signal lane via `SignatureBasis`.* --- diff --git a/.claude/knowledge/pr-x12-substrate-canon-resolutions.md b/.claude/knowledge/pr-x12-substrate-canon-resolutions.md index 5bd633ba..ab8ebb28 100644 --- a/.claude/knowledge/pr-x12-substrate-canon-resolutions.md +++ b/.claude/knowledge/pr-x12-substrate-canon-resolutions.md @@ -24,8 +24,11 @@ were raised in review: (R-5 through R-7 restorations) - **§6** — three pieces of detail from session B the merge underrepresented (R-8 through R-10 restorations) -- **§7** — three commitments missing from both originals and from the - merge (R-11 through R-13 new specs) +- **§7** — five commitments missing from both originals and from the + merge: R-11 through R-13 (latency, flush granularity, federated + codebook) plus R-14 (formal correctness via `jc` pillars) and R-15 + (`SignatureBasis` as fifth Plan G lane), the latter two + surfaced post-merge by the substrate-binding docs Then five integration pieces that make the resolutions actionable: @@ -36,9 +39,10 @@ Then five integration pieces that make the resolutions actionable: - **§11** — end-state + trajectory (think it from the end) - **§12** — compaction-preservation contract -Citation IDs: `R-1` through `R-13` for resolutions. Canon IDs (`M:E-*`, -`A:E-*`, `B:E-*`, `M:H-*`, `M:T-*`) remain stable; this doc adds, does -not renumber. +Citation IDs: `R-1` through `R-15` for resolutions (R-14, R-15 +appended post-merge from the substrate-binding doc; numbering remains +append-only). Canon IDs (`M:E-*`, `A:E-*`, `B:E-*`, `M:H-*`, `M:T-*`) +remain stable; this doc adds, does not renumber. Sister docs (read order): @@ -543,6 +547,14 @@ ships tropical-GEMM kernels. No new code in ndarray; cross-repo dep from ndarray-codec → lance-graph::blasgraph (after Plan H extraction, this is dep-allowed because ndarray-codec is a sibling, not the bottom). 
+**Actual kernel home (current).** The tropical-GEMM kernel lives today +at `lance-graph::bgz17::scalar_sparse::tropical_spmv` — NOT in an +abstract `blasgraph` namespace. The codec's tropical-GEMM call is +`bgz17::scalar_sparse::tropical_spmv(edge_weights, dag)`. The +`lance-graph::blasgraph` name above is the eventual abstraction layer +(post-Plan-H extraction); until that lands, ndarray-codec depends on +bgz17 directly. Cite the symbol, not the namespace, when wiring A6. + **Plan A6 RDO (1 week) ships this.** The λ-RDO knob (per A:§10.3) and the tropical-GEMM partition solver are the same kernel: λ scales the edge weights, the relaxation computes the optimal mode tree. @@ -935,10 +947,135 @@ empirically; v3 (research-grade) tries Option C. R-4 gradient threshold (8× compression at <0.5% loss delta). At that point, Plan F v1 escalates to Option B in a follow-up PR. +**Implementation primitives (current substrate, no new code required):** + +| Concern | Crate / module | +|---------|----------------| +| Codebook training (k-means + CAM-PQ) | `ndarray::hpc::cam_pq::CamCodebook` (`train_geometric` / `train_semantic` / `train_hybrid`) | +| Deployed encoding format (per-shard) | `lance-graph::bgz-tensor::Codebook4096` and the `bgz-hhtl-d` shared-palette variant | +| Online plastic updates (`SharedClusterWide`) | `ndarray::hpc::dn_tree` (quaternary plastic memory, partial-Hamming descent) | +| Integrity proof for distributed updates | `ndarray::hpc::merkle_tree` (Blake3-48-bit, 1 KB root, `xor_diff` panCAKES compression) | +| Gossip protocol (cluster-wide) | `q2` (external — implements the wire protocol) | + +The four policy modes (`LocalEphemeral` / `SharedClusterWide` / +`SharedRegional` / `PretrainedStatic`) compose these primitives +differently; the codec body exposes a `CodebookHandle` trait, and the +primitives plug in via that trait. 
**PR-X12 contributes the wire format ++ trait + Option A; the primitives above already exist.** + **Cite as R-13 in Plan F PR description.** --- +### R-14 — Formal correctness via `lance-graph::jc` pillars + +**Problem.** Canon and resolutions describe the codec's empirical +behaviour (R-4 thresholds, R-11 latency) but never name the formal +correctness proofs the substrate already carries. Without a citation, +"the codec is correct" is unverifiable; with citations, the codec +inherits machine-checked guarantees from existing crates. + +**Resolution.** Pin both pillars and what each proves. + +**Two formal proofs in `lance-graph::jc`:** + +- **Quantization correctness (Pillar 10, Pflug-Pichler):** + nested-distance Lipschitz on Sigma DN-trees. Proves that CAM-PQ tree + quantization preserves the FreeEnergy functional within a Lipschitz + factor Lε. **This is the proof PR-X12 cites for "wire-format + quantization is faithful."** Implementation: `jc::pflug` (active in + default build, zero-dep). +- **Path-signature correctness (Pillar 11, Hambly-Lyons):** + signature uniqueness on tree-quotient. Proves that any path of + bounded variation is uniquely determined by its full (untruncated) signature + up to tree-like equivalence (Annals of Mathematics 171(1):109–167, + arXiv:math/0507536). **This is the proof PR-X12 cites for the + `SignatureBasis` lane (R-15).** Implementation: + `jc::hambly_lyons` (active under `--features hambly-lyons`, since + PR #348 landed on 2026-05-07). + +**What the codec inherits.** Both pillars exist; the codec cites them +and does not reprove. R-4's "Quality floor" rows for video / KV / +gradient inherit Pillar 10's Lipschitz bound automatically. R-15's +signature-lane gates on Pillar 11. + +**Status.** + +- Pillar 10: active in default zero-dep build. +- Pillar 11: active under `--features hambly-lyons`; passes its probe + (forward < 1e-9, converse > 0.05, discrimination ratio ≥ 1e6 over + N=100 random pairs in d=3 at depth-2). 
+- Production-scale benchmarking + PR #350 (`signature_kernel_pde` + Goursat-PDE math correction) remain open — see Gap G-4 in + `pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md`. Pillar 11's + probe deliberately uses `signature_truncated` (tensor-algebra path), + not the buggy PDE form. + +**Falsifies if.** Pillar 10 ever flips state (a regression in the +Pflug-Pichler proof bound) — Plan G's video / KV / gradient quality +floors lose their formal underwriting and become empirical-only. + +**Cite as R-14 in any PR claiming "codec output is faithful to +input" or wiring `SignatureBasis` (R-15).** + +--- + +### R-15 — `SignatureBasis` as `Basis` impl + +**Problem.** R-1 commits the `Basis` shape; the canon lists three +concrete impls (`DctIIBasis` for video, `EwaSplatBasis` for 3DGS, +`ShSpectralBasis` for splat SH). No `Basis` impl targets +*streams* — audio waveforms, time-series, gesture/handwriting paths. +Plan G has only four lanes; path-structured signals are unaddressed. + +**Resolution.** Commit `SignatureBasis: Basis` +as the fifth concrete impl, wrapping the path-signature kernel from +the external `lance-graph::sigker` crate. + +```rust +// Concrete impl, lives in ndarray::hpc::signature (new module, ~1 wk) +impl Basis for SignatureBasis { + fn dim(&self) -> usize { /* truncated tensor-algebra dim at DEPTH */ } + fn apply(&self, path: &[f32], signature: &mut [f32]) { + // iterated-integral truncation against sigker::signature_truncated + } + fn invert(&self, _sig: &[f32], _path: &mut [f32]) { + // signature → path is many-to-one (tree-quotient); document as N/A + unimplemented!("signature inversion is N/A — path unique only up to \ + tree-like equivalence per R-14 / Pillar 11") + } +} +``` + +**Why `signature_truncated` and not `signature_kernel_pde`.** The +PDE form in sigker ships a known math bug (PR #350: Goursat-PDE form +diverges from the true kernel `I₀(2·√⟨u, v⟩)` at moderate inner +products). 
The tensor-algebra path (`signature_truncated`) is correct +today and is what jc Pillar 11 cites. R-15 wraps the truncated path; +the PDE form becomes available after PR #350 lands. + +**Plan G gets a fifth lane.** "Stream signal" mode: + +- Input: audio waveform / time-series / gesture stream +- Codec: `SignatureBasis` truncates path signature, residuals + go through standard rANS via the four-mode taxonomy +- Quality floor: signature-uniqueness preservation per Pillar 11 +- Compression target: ~10× over raw f32 path samples (estimate; + calibrate during Plan G) + +**Falsifies if.** `SignatureBasis` plus rANS fails to +reconstruct the path within ε under Pillar 11's discrimination ratio. +At that point, raise DEPTH or fall back to per-block DCT-II for the +stream lane. + +**Cost.** ~1 week wrapper around `sigker::signature_truncated` + +basis-trait plumbing + Plan G fifth-lane wiring. + +**Cite as R-15 in any PR adding a stream-signal codec lane or +wiring `SignatureBasis`.** + +--- + ## 8. The canonical contracts — concrete trait signatures All three plug-points (per M:E-E) get concrete signatures here. These @@ -1108,6 +1245,9 @@ that decides whether each holy-grail claim is demonstrated. 
| R-11 (4K 60fps SIMD-batched) | this doc | Plan G video latency assert | Per-CTU encode time | ≤210 ns/CTU on Sapphire Rapids | | R-12 (per-CTU flush) | this doc | A8 frame-header parse + decode | First-CTU latency | First CTU decodable before frame complete | | R-13 (Option A per-shard) | this doc | Plan F on BERT-glue | 8× compression + accuracy | Holds; else escalate to Option B | +| R-14 (Pillar 10 active) | this doc | `cargo test -p jc` (default features) | Pflug-Pichler Lipschitz bound | Pillar 10 probe green | +| R-14 (Pillar 11 active) | this doc | `cargo test -p jc --features hambly-lyons` | Signature uniqueness probe | forward < 1e-9, converse > 0.05, ratio ≥ 1e6 | +| R-15 (SignatureBasis lane) | this doc | Plan G stream-signal lane | path reconstruction under Pillar 11 | ε within discrimination ratio of Pillar 11 | **Every row of this matrix is a test.** Plan G's bench harness binary emits a JSON report containing the actual measurement for each row; @@ -1234,7 +1374,7 @@ When this doc is summarised across context windows, preserve at minimum: 1. **The five "merged well" items** from §3 (M:E-A / M:E-D / M:E-G / M:E-I / M:E-F) — these are the canon's load-bearing synthesis. -2. **The thirteen resolutions** R-1 through R-13 with their summary +2. 
**The fifteen resolutions** R-1 through R-15 with their summary commitments: - R-1: `LinearReduce` two-trait shape - R-2: bit 15 universal, bit 14 consumer-typed @@ -1242,13 +1382,19 @@ When this doc is summarised across context windows, preserve at minimum: - R-4: 4 threshold pairs (video, splat, kv, grad) - R-5: DCT crossover ~64 blocks - R-6: SSD via VNNI ≥30× over SAD - - R-7: tropical-GEMM partition O(4^d) → O(d²) + - R-7: tropical-GEMM partition O(4^d) → O(d²) via + `bgz17::scalar_sparse::tropical_spmv` - R-8: Plan G is confidence gate - R-9: topology-FREE codec layer - R-10: ~4 bits/Gaussian near target, ~1 bit stretch - R-11: 210 ns/CTU SIMD-batched encode - R-12: per-CTU flush default; per-bucket Plan F - - R-13: Option A (per-shard codebook) for Plan F v1 + - R-13: Option A (per-shard codebook) for Plan F v1; primitives are + `cam_pq` + `bgz-hhtl-d` + `dn_tree` + `merkle_tree` + - R-14: formal correctness via `jc::pflug` (Pillar 10) + + `jc::hambly_lyons` (Pillar 11, feature-gated) + - R-15: `SignatureBasis: Basis` as fifth Plan G lane + (stream signal) 3. **The trajectory** from §2 — Phase 0 → A7 → parallelise → Phase 2 4. **The five-category architecture** including `ndarray-codec` 5. **The four traits** as the canonical contracts: @@ -1258,7 +1404,7 @@ When this doc is summarised across context windows, preserve at minimum: 7. **The falsifiability matrix in §9** — every claim has a test; not every claim will pass; that's the design -**Citation IDs in this doc** (R-1 .. R-13) are stable. Canon IDs +**Citation IDs in this doc** (R-1 .. R-15) are stable. Canon IDs (M:E-*, M:H-*, M:H-NEW-*, M:T-*, A:E-*, A:H-*, A:T-*, B:E-*, B:HG-*, B:D-*) remain stable per canon's §10. Append, never renumber. 
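The tree-like-equivalence caveat behind R-14 / R-15 can be illustrated with a small depth-2 signature computation. This is only a sketch under stated assumptions: `Sig2`, `signature_depth2`, and `levy_area` are hypothetical names invented for this illustration, and they are not the API of `sigker::signature_truncated`, whose real signature this patch does not show. The sketch assumes a piecewise-linear path and accumulates levels 1 and 2 with Chen's identity.

```rust
/// Depth-2 truncated signature of a piecewise-linear path in R^dim.
/// Hypothetical illustration type; not part of sigker or ndarray.
#[derive(Debug, Clone, PartialEq)]
pub struct Sig2 {
    pub dim: usize,
    /// Level 1: total increment of the path (dim entries).
    pub level1: Vec<f64>,
    /// Level 2: iterated integrals, row-major (dim * dim entries).
    pub level2: Vec<f64>,
}

/// Accumulates the depth-2 signature over consecutive points.
/// For one straight segment with increment `delta`, the signature is
/// (delta, 0.5 * delta ⊗ delta); Chen's identity concatenates segments as
/// level2 += level1 ⊗ delta + 0.5 * delta ⊗ delta, then level1 += delta.
pub fn signature_depth2(points: &[Vec<f64>], dim: usize) -> Sig2 {
    let mut level1 = vec![0.0; dim];
    let mut level2 = vec![0.0; dim * dim];
    for pair in points.windows(2) {
        let delta: Vec<f64> = pair[1]
            .iter()
            .zip(&pair[0])
            .map(|(b, a)| b - a)
            .collect();
        // Level 2 must be updated before level 1, because it uses the
        // level-1 value accumulated up to the start of this segment.
        for i in 0..dim {
            for j in 0..dim {
                level2[i * dim + j] += level1[i] * delta[j] + 0.5 * delta[i] * delta[j];
            }
        }
        for i in 0..dim {
            level1[i] += delta[i];
        }
    }
    Sig2 { dim, level1, level2 }
}

/// Antisymmetric part of level 2: the signed area swept in the (i, j) plane.
pub fn levy_area(sig: &Sig2, i: usize, j: usize) -> f64 {
    0.5 * (sig.level2[i * sig.dim + j] - sig.level2[j * sig.dim + i])
}
```

Under these assumptions, a path that walks out along a segment and straight back (a tree-like excursion) produces an all-zero depth-2 signature, so it cannot be told apart from the constant path. This is the same reason the `invert` method in the R-15 sketch is left unimplemented. A closed loop that encloses area, such as a unit square, is still detected through a non-zero `levy_area`.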
From 6f96a1472b2afa658a3b8d1471742fb99952fd73 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 22 May 2026 17:39:01 +0000 Subject: [PATCH 2/5] docs(pr-x12): close 6 CodeRabbit nitpicks left open at PR #197 merge MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per other-session feedback: six nitpick-level findings on PR #197 didn't block the merge but stayed unaddressed. Folding them into this PR. 1. GGUF Escape forward-ref to F-4 gguf-llm-weights-encoding.md §2.4 said "Escape must be lossless ... This is an additional R-N candidate" with no pointer. F-4 in §10 already explains the mechanism (rANS bypass channel in A8, HEVC escape-coefficient precedent). Added an inline cross-ref so readers don't have to scroll to find the resolution. 2. Phone-class viability overclaim re KV cache gguf-llm-weights-encoding.md line 269 claimed "7B at PR-X12 is genuinely runnable on a phone-class device". Weight compression alone takes 7B from 4 GB to 3 GB, but KV cache at 8K context is ~1-2 GB independent of weight compression. Qualified the claim: PR-X12 weights are necessary but not sufficient; KV-cache lane (Plan D, M:H-3, R-4) is the second lever for full phone viability. 3. EncodingDomain::LLMWeights timing §11 implication #2 says "LLM lane lands post-PR-X12, but the harness must be lane-extensible"; implication #5 said "Reserve an EncodingDomain::LLMWeights discriminant ... now". Clarified: PR-X12 reserves the enum-discriminant *slot* now (forward-compat lock); the LLMWeights variant + decoder land post-PR-X12 without a wire-format break. 4. Per-arch `match Arch::CURRENT` const-eval woa-multiarch-orchestration.md §3.3's `const DCT_BATCH_CROSSOVER = match Arch::CURRENT { ... }` does not compile under stable Rust const-eval — `Arch::CURRENT` would need to be a const, and architecture-conditional const matches require build-script-emitted integers or `cfg!(target_feature = ...)`. 
Rewrote as pseudocode pointing at a `build.rs` mechanism (decision matrix → `OUT_DIR` generated const) with a `cfg!()` fallback shape. 5. G-8 / G-9 numbering collision cam-pq-sigker-dn-tree-substrate-bindings.md §5 labelled bgz-jc's two prior gaps as G-8 / G-9 (continuing cam-pq's own G-1..G-7), but bgz-jc-substrate-synergies.md §5 didn't use any G-N IDs, so the cross-doc reference was dangling and the namespace was implicitly shared without rules. Gave bgz-jc §5.1 / §5.2 explicit IDs G-1 / G-2 (canonical to that doc) and updated cam-pq to cite them as "bgz-jc G-1" / "bgz-jc G-2" with a namespace-isolation note. 6. "landed" terminology in x266 §8 prerequisites table The status column claimed "landed" / "landed in concept" for R-1 trait shape, R-2 header bytes, and R-13 federated codebook policy. None of these have shipping code — they are canon-fixed (the resolution doc commits the design) but implementation is scheduled in Plan A4 / A8 / F. Renamed status to "canon-fixed" + a glossary line distinguishing "canon-fixed" (doc commitment) from "scheduled" (plan card exists) from shipping code. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u --- .../pr-x12-bgz-jc-substrate-synergies.md | 4 +-- ...am-pq-sigker-dn-tree-substrate-bindings.md | 10 +++--- .../pr-x12-gguf-llm-weights-encoding.md | 8 +++-- .../pr-x12-woa-multiarch-orchestration.md | 32 +++++++++++++------ .../pr-x12-x266-3dgs-spacetime-upscaling.md | 10 +++--- 5 files changed, 41 insertions(+), 23 deletions(-) diff --git a/.claude/knowledge/pr-x12-bgz-jc-substrate-synergies.md b/.claude/knowledge/pr-x12-bgz-jc-substrate-synergies.md index e715d403..313833af 100644 --- a/.claude/knowledge/pr-x12-bgz-jc-substrate-synergies.md +++ b/.claude/knowledge/pr-x12-bgz-jc-substrate-synergies.md @@ -303,7 +303,7 @@ This is the doc-level value of PR-X12: bgz code + PR-X12 docs = a complete archi ## 5. 
Gaps — what doesn't exist yet -### 5.1 `jd-nd` — the missing ndarray-side proof crate +### 5.1 `jd-nd` — the missing ndarray-side proof crate (Gap **G-1**) The Explore search confirmed: `jd-nd` does not exist in `/home/user/ndarray/`. The math-proof infrastructure on the ndarray side lives ad-hoc inside `src/hpc/` modules (`deepnsm.rs`, `jina/runtime.rs`) as TODO comments. @@ -335,7 +335,7 @@ ndarray/crates/jd-nd/ **Why now:** R-11's latency CI needs a *correctness* twin. Latency that's fast but wrong is the worst outcome. jd-nd is the structural place for those proofs. -### 5.2 Cronbach / ICC research crate +### 5.2 Cronbach / ICC research crate (Gap **G-2**) `lance-graph/crates/lance-graph-codec-research/` exists per the Explore agent's report, **but its scope is FFT (rustfft) variants**, not Cronbach's α / ICC / encoding-reliability psychometrics. diff --git a/.claude/knowledge/pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md b/.claude/knowledge/pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md index 3ba29be3..e1deb77d 100644 --- a/.claude/knowledge/pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md +++ b/.claude/knowledge/pr-x12-cam-pq-sigker-dn-tree-substrate-bindings.md @@ -308,12 +308,14 @@ Updating the inventory from `pr-x12-bgz-jc-substrate-synergies.md` §7 with the **Total estimated gap-closing work: 8-12 weeks** across the seven items, all incremental on existing infrastructure. None of them require new research; all are wiring existing primitives into the codec. 
-Two prior gaps from the earlier doc remain: +Two prior gaps from the earlier doc remain (their canonical IDs are owned by `pr-x12-bgz-jc-substrate-synergies.md` §5; cross-referenced here): -| Gap (prior) | Component | Cost | +| Gap (cross-ref) | Component | Cost | |---|---|---| -| **G-8** | `jd-nd` crate does not exist (ndarray-side proof crate) | 2-3 weeks skeleton + ongoing | -| **G-9** | Cronbach/ICC encoding-reliability research crate not implemented | 1-2 weeks skeleton + 2-3 weeks PoC | +| **bgz-jc G-1** (§5.1) | `jd-nd` crate does not exist (ndarray-side proof crate) | 2-3 weeks skeleton + ongoing | +| **bgz-jc G-2** (§5.2) | Cronbach/ICC encoding-reliability research crate not implemented | 1-2 weeks skeleton + 2-3 weeks PoC | + +The G-1..G-7 IDs in §5 of *this* doc are local to the cam-pq / sigker / dn_tree binding; bgz-jc's G-1 / G-2 are a separate namespace owned by that doc. When citing cross-doc, prefix with the source (e.g., "bgz-jc G-1" vs "cam-pq G-1") to avoid the collision the previous G-8 / G-9 labelling implied. **Grand total: ~11-17 weeks** of substrate-binding + gap-closing work, parallel-able. PR-X12 codec body (~1500 LoC per R-3) is independent of this and can ship sooner. diff --git a/.claude/knowledge/pr-x12-gguf-llm-weights-encoding.md b/.claude/knowledge/pr-x12-gguf-llm-weights-encoding.md index eda384c5..e1fb0c91 100644 --- a/.claude/knowledge/pr-x12-gguf-llm-weights-encoding.md +++ b/.claude/knowledge/pr-x12-gguf-llm-weights-encoding.md @@ -131,7 +131,7 @@ Crucially, the residual is **rANS-coded with a Gaussian-tail prior** (R-10). GGU For weights that are too extreme to fit any basin (the activation outliers that LLM.int8() and SmoothQuant fight over), encode as Escape + raw f16 value. ~3-5% of weights per layer, but they carry disproportionate information. -The PR-X12 wire format already supports Escape as the lossy-fallback path (with the codec body warning per M:T new items). 
For LLM weights, Escape *must be lossless* — no truncation of outliers. This is an additional R-N candidate. +The PR-X12 wire format already supports Escape as the lossy-fallback path (with the codec body warning per M:T new items). For LLM weights, Escape *must be lossless* — no truncation of outliers. This is an additional R-N candidate; see §10 falsifier **F-4** for the wire-format mechanism (rANS bypass channel in the A8 framing layer) and the HEVC-escape-coefficient precedent. --- @@ -266,7 +266,9 @@ Per GEMM operation (e.g., compute attn_q @ x for batch): The CTU bitstream is read forward-only (rANS is a streaming codec) and the decoded weights live in L1/L2 cache just long enough to be GEMM'd. **No full-tensor dequantize buffer needed.** For a 4096 × 4096 attention projection, the dequantize buffer would be 32 MB (f16); PR-X12 streams in ~3-4 MB of bitstream, decodes to ~64 KB cache-resident windows, GEMMs each window, drops it. -**Memory savings:** on a memory-constrained edge device (8 GB RAM), this turns "loads 4 GB model + needs 1 GB dequant scratch" into "loads 3 GB model + needs 64 KB scratch." A 7B model at PR-X12 is genuinely runnable on a phone-class device, where GGUF Q4 is borderline. +**Memory savings (weights only):** on a memory-constrained edge device (8 GB RAM), this turns "loads 4 GB model + needs 1 GB dequant scratch" into "loads 3 GB model + needs 64 KB scratch." + +**Phone-class caveat — weights are not the only memory load.** The KV cache scales with context length and is independent of weight compression: for a 7B model at 8K context, KV cache is ~2 GB in fp16 / ~1 GB in int8, and grows linearly with context. PR-X12 weight compression alone takes a 7B from "borderline" to "easier" on phone-class hardware, but **the KV cache lane (Plan D, M:H-3, R-4) is the second lever** that has to compress for full phone-class viability at non-trivial context. Both lanes are needed; this lens only addresses the weights side. 
**Latency:** the streaming decode happens in the same loop body as the GEMM accumulate. On a modern arch with VNNI + AMX, the decode cost (~5-10 cycles per cell, branchless via R-1's lookup-table pattern) is hidden by GEMM latency. **Estimated overhead: < 5% versus pre-dequantized GEMM.** @@ -345,7 +347,7 @@ Concrete implications: 4. **Do** keep R-13's federated codebook policy. The LLM use case is the strongest motivation: per-model codebooks are 13 MB; without R-13, a hard-coded codebook would not work for arbitrary LLMs. -5. **Reserve** an `EncodingDomain::LLMWeights` discriminant in the codec metadata header (separate from the 16-bit per-CTU header). The codec body doesn't read this — it just stamps the file with a domain tag so decoders know which basin codebook to load. +5. **Reserve** the *enum-discriminant slot* for `EncodingDomain::LLMWeights` in the codec metadata header *now*, even though the actual LLM-lane decoder lands post-PR-X12 (per implication #2). The header reserves a fixed-size domain-tag field (separate from the 16-bit per-CTU header); the LLMWeights value of that field stays unimplemented in PR-X12, but the slot is forward-compatibility-locked so a future PR can add the variant without a wire-format break. The codec body doesn't read this — it stamps the file with a domain tag so decoders know which basin codebook to load. 6. **Bench against AWQ at parity perplexity, not just Q4_K_M.** Q4_K_M is a conservative baseline; AWQ + GPTQ are the actual state of the art. If PR-X12 can match AWQ at smaller storage, the case is strong; if not, ship at "drop-in GGUF replacement" framing only. diff --git a/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md b/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md index 0da19ed7..cc3045c1 100644 --- a/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md +++ b/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md @@ -130,17 +130,29 @@ pub fn batched_gemm(input: ...) 
{ ### 3.3 Per-arch tunable crossover (R-5 generalised) -Some operations have a "small N: scalar, large N: SIMD" crossover that varies per arch: +Some operations have a "small N: scalar, large N: SIMD" crossover that varies per arch. The snippet below is **pseudocode** — Rust's stable const-eval does not let `match` discriminate over a runtime-detected `Arch::CURRENT` value at `const` context. The real mechanism is a `build.rs` script that resolves the target arch at *compile time* (via `target_arch` / `target_feature` cfgs + a feature-detection probe) and emits the chosen integer as a generated `const` in `OUT_DIR`: ```rust -const DCT_BATCH_CROSSOVER: usize = match Arch::CURRENT { - Arch::SapphireRapids => 64, // AMX wins above this - Arch::IceLakeServer => 32, // AVX-512 narrower; lower crossover - Arch::Zen4 => 96, // Zen's AVX-512 emulation widens crossover - Arch::AppleM3 => 256, // NEON's narrower; only worth at large N - Arch::GravitonV3 => 128, // SVE2 mid-range - Arch::Generic => usize::MAX, // Always scalar fallback -}; +// Shape of the per-arch table (lives in a build-script-generated file +// included via include!(concat!(env!("OUT_DIR"), "/arch_crossovers.rs"))): +// +// pub const DCT_BATCH_CROSSOVER: usize = 64; // emitted by build.rs +// // for Sapphire Rapids +// +// The build script's decision matrix: +// Sapphire Rapids (target_feature = "avx512f,amx-bf16") → 64 +// Ice Lake / Skylake-X (avx512f only) → 32 +// Zen 4 (avx512f, no AMX) → 96 +// Apple Silicon (target_arch = "aarch64" + NEON) → 256 +// Graviton 3 (aarch64 + SVE2) → 128 +// Generic / no SIMD → usize::MAX +// +// Equivalent fallback if a future Rust stabilises const target-feature +// detection, then this can become a runtime-stable const: +// const DCT_BATCH_CROSSOVER: usize = if cfg!(target_feature = "amx-bf16") { 64 } +// else if cfg!(target_feature = "avx512f") { 32 } +// else if cfg!(target_arch = "aarch64") { 128 } +// else { usize::MAX }; pub fn dct_apply(input: &[i16], output: 
&mut [i16]) { if N >= DCT_BATCH_CROSSOVER { @@ -151,7 +163,7 @@ pub fn dct_apply(input: &[i16], output: &mut [i16]) { } ``` -R-5 commits these crossovers as **bench-tunable constants**, not hand-guessed numbers. Plan G's codec-bench includes a calibration sub-target that emits the right `const` values per arch via build script. +R-5 commits these crossovers as **bench-tunable constants** emitted by Plan G's codec-bench calibration sub-target into the per-arch `OUT_DIR` file — not hand-guessed numbers, not a runtime `match` on a synthetic `Arch` enum. The build script is the single source of truth for which integer compiles in. --- diff --git a/.claude/knowledge/pr-x12-x266-3dgs-spacetime-upscaling.md b/.claude/knowledge/pr-x12-x266-3dgs-spacetime-upscaling.md index 14ba0f2d..b22eb80a 100644 --- a/.claude/knowledge/pr-x12-x266-3dgs-spacetime-upscaling.md +++ b/.claude/knowledge/pr-x12-x266-3dgs-spacetime-upscaling.md @@ -268,12 +268,14 @@ Nothing in this doc is in PR-X12 scope. What it requires from PR-X12: | Requirement | Source | Status | |---|---|---| -| `Basis` trait with parametric `apply` | R-1, M:E-A | landed in concept; implementation in Plan A4 | +| `Basis` trait with parametric `apply` | R-1, M:E-A | **canon-fixed** (R-1 trait shape committed); **implementation** scheduled in Plan A4 | | EWA splat rasterizer as `Basis` impl | Plan E | scheduled | -| Codec body decoupled from specific basis | M:H-NEW-2 LoC envelope | enforced via R-3 audit | -| Header byte stable across basis swaps | R-2, M:E-J bits 0-1 | landed | +| Codec body decoupled from specific basis | M:H-NEW-2 LoC envelope | enforced via R-3 audit rule (doc commitment; CI check pending) | +| Header byte stable across basis swaps | R-2, M:E-J bits 0-1 | **canon-fixed** (R-2 commits bits 0-1 = `header_kind`); wire-format implementation in Plan A8 | | Plan G video lane validates per-arch latency | R-4, R-11 | scheduled | -| Federated codebook policy for scene anchors | R-13 | landed | +| Federated 
codebook policy for scene anchors | R-13 | **canon-fixed** (R-13 commits Option A: per-shard codebook for Plan F v1); implementation in Plan F | + +**"Canon-fixed"** = the resolution doc commits the design; **"scheduled"** = the implementation has a named plan card. None of the above have shipping code today. The path to x266-like capability is: From 1bb4561ff295c00f470f89eee6cc68e23e771be0 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 22 May 2026 18:03:52 +0000 Subject: [PATCH 3/5] docs(pr-x12): address CR + codex feedback on PR #198 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two open review threads on the second commit: 1. R-15 falsifiability row (substrate-canon-resolutions.md §9, line 1250) Both codex-connector and CodeRabbit flagged that the matrix row required "path reconstruction under Pillar 11" but R-15 itself marks invert() as unimplemented! (path-from-signature is many-to-one up to tree-like equivalence — Hambly-Lyons). The gate was operationally untestable. Reframed per CR's suggestion: "signature-space discrimination under Pillar 11 (forward-only — path inversion is N/A per R-15)" with the same probe criteria Pillar 11 uses (forward < 1e-9, converse > 0.05, discrimination ratio >= 1e6, or agreed DEPTH-specific floor). 2. build.rs host-vs-target semantics (woa-multiarch §3.3, line 133) CR pointed out that under cross-compilation Cargo runs build.rs on the HOST, so any "feature-detection probe" in build.rs reflects the build machine, not the target. My original wording ("target_arch / target_feature cfgs + a feature-detection probe") implied host CPU probing — wrong. Rewrote the pseudocode + surrounding text to explicitly use CARGO_CFG_TARGET_ARCH, CARGO_CFG_TARGET_FEATURE, target triple, and pre-recorded calibration artifacts. Added an explicit "do NOT probe the host CPU inside build.rs" warning citing Cargo's docs. The in-crate cfg!() fallback shape is still correct (cfg! 
in normal code reflects target cfgs); only build.rs's cfg!/#[cfg] reflects the host. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u --- .../pr-x12-substrate-canon-resolutions.md | 2 +- .../pr-x12-woa-multiarch-orchestration.md | 23 ++++++++++--------- 2 files changed, 13 insertions(+), 12 deletions(-) diff --git a/.claude/knowledge/pr-x12-substrate-canon-resolutions.md b/.claude/knowledge/pr-x12-substrate-canon-resolutions.md index ab8ebb28..26e99042 100644 --- a/.claude/knowledge/pr-x12-substrate-canon-resolutions.md +++ b/.claude/knowledge/pr-x12-substrate-canon-resolutions.md @@ -1247,7 +1247,7 @@ that decides whether each holy-grail claim is demonstrated. | R-13 (Option A per-shard) | this doc | Plan F on BERT-glue | 8× compression + accuracy | Holds; else escalate to Option B | | R-14 (Pillar 10 active) | this doc | `cargo test -p jc` (default features) | Pflug-Pichler Lipschitz bound | Pillar 10 probe green | | R-14 (Pillar 11 active) | this doc | `cargo test -p jc --features hambly-lyons` | Signature uniqueness probe | forward < 1e-9, converse > 0.05, ratio ≥ 1e6 | -| R-15 (SignatureBasis lane) | this doc | Plan G stream-signal lane | path reconstruction under Pillar 11 | ε within discrimination ratio of Pillar 11 | +| R-15 (SignatureBasis lane) | this doc | Plan G stream-signal lane | signature-space discrimination under Pillar 11 (forward-only — path inversion is N/A per R-15) | forward < 1e-9, converse > 0.05, ratio ≥ 1e6 (or agreed DEPTH-specific floor) | **Every row of this matrix is a test.** Plan G's bench harness binary emits a JSON report containing the actual measurement for each row; diff --git a/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md b/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md index cc3045c1..7e4b0c09 100644 --- a/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md +++ b/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md @@ -130,7 +130,7 @@ pub fn batched_gemm(input: ...) 
{ ### 3.3 Per-arch tunable crossover (R-5 generalised) -Some operations have a "small N: scalar, large N: SIMD" crossover that varies per arch. The snippet below is **pseudocode** — Rust's stable const-eval does not let `match` discriminate over a runtime-detected `Arch::CURRENT` value at `const` context. The real mechanism is a `build.rs` script that resolves the target arch at *compile time* (via `target_arch` / `target_feature` cfgs + a feature-detection probe) and emits the chosen integer as a generated `const` in `OUT_DIR`: +Some operations have a "small N: scalar, large N: SIMD" crossover that varies per arch. The snippet below is **pseudocode** — Rust's stable const-eval does not let `match` discriminate over a runtime-detected `Arch::CURRENT` value at `const` context. The real mechanism is a `build.rs` script that resolves the target from compile-time metadata Cargo exposes to build scripts — `CARGO_CFG_TARGET_ARCH`, `CARGO_CFG_TARGET_FEATURE`, the target triple, and any pre-recorded calibration artifact — and emits the chosen integer as a generated `const` in `OUT_DIR`. **Critically, do NOT probe the host CPU inside `build.rs`**: under cross-compilation Cargo runs `build.rs` on the *host* machine, so any runtime feature-detection there reflects the build machine and not the target. Cargo's docs are explicit: use `CARGO_CFG_*` env vars (which correctly reflect the target) rather than `cfg!`/`#[cfg]` checks (which reflect the host the script runs on). 
```rust // Shape of the per-arch table (lives in a build-script-generated file @@ -139,16 +139,17 @@ Some operations have a "small N: scalar, large N: SIMD" crossover that varies pe // pub const DCT_BATCH_CROSSOVER: usize = 64; // emitted by build.rs // // for Sapphire Rapids // -// The build script's decision matrix: -// Sapphire Rapids (target_feature = "avx512f,amx-bf16") → 64 -// Ice Lake / Skylake-X (avx512f only) → 32 -// Zen 4 (avx512f, no AMX) → 96 -// Apple Silicon (target_arch = "aarch64" + NEON) → 256 -// Graviton 3 (aarch64 + SVE2) → 128 -// Generic / no SIMD → usize::MAX +// The build script's decision matrix, driven entirely by Cargo's +// target-config env vars (no host CPU probing): +// CARGO_CFG_TARGET_FEATURE contains "amx-bf16" → 64 +// CARGO_CFG_TARGET_FEATURE contains "avx512f" → 32 (skylake-x/ice lake) +// CARGO_CFG_TARGET_FEATURE contains "avx512f", Zen-tuned target-cpu → 96 +// CARGO_CFG_TARGET_ARCH == "aarch64" + NEON-only → 256 +// CARGO_CFG_TARGET_ARCH == "aarch64" + SVE2 → 128 +// else → usize::MAX // -// Equivalent fallback if a future Rust stabilises const target-feature -// detection, then this can become a runtime-stable const: +// Equivalent in-crate fallback shape using cfg! (still target-resolved, +// since cfg! in normal (non-build-script) code uses target cfgs): // const DCT_BATCH_CROSSOVER: usize = if cfg!(target_feature = "amx-bf16") { 64 } // else if cfg!(target_feature = "avx512f") { 32 } // else if cfg!(target_arch = "aarch64") { 128 } @@ -163,7 +164,7 @@ pub fn dct_apply(input: &[i16], output: &mut [i16]) { } ``` -R-5 commits these crossovers as **bench-tunable constants** emitted by Plan G's codec-bench calibration sub-target into the per-arch `OUT_DIR` file — not hand-guessed numbers, not a runtime `match` on a synthetic `Arch` enum. The build script is the single source of truth for which integer compiles in. 
+R-5 commits these crossovers as **bench-tunable constants** emitted by Plan G's codec-bench calibration sub-target into the per-arch `OUT_DIR` file — not hand-guessed numbers, not a runtime `match` on a synthetic `Arch` enum, and never via host CPU probing under cross-compilation. The build script (driven by `CARGO_CFG_TARGET_*`) is the single source of truth for which integer compiles in. --- From 8b1d3764f080cdd24ed046deb7a54ef8813aff77 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 22 May 2026 18:10:50 +0000 Subject: [PATCH 4/5] =?UTF-8?q?docs(pr-x12):=20purge=20runtime-dispatch=20?= =?UTF-8?q?creep=20from=20woa-multiarch=20=C2=A73?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The previous §3 violated the project's W1a polyfill contract by describing runtime CPU dispatch as if it were the substrate's design: - §3.1 listed multiple #[target_feature(enable = ...)] fns coexisting in one mod, framed as "compile-time" but read like "available to select from at runtime". - §3.2 had an HwCaps OnceLock + caps.has_amx / has_vnni / has_sve2 if-else chain with `unsafe { batched_gemm_amx(...) }` runtime branches. This is exactly the pattern CLAUDE.md and the W1a consumer contract forbid. - §3.3 framed Arch::CURRENT + build.rs as one option among several runtime/host-probe approaches, which kept the "branching is OK" mental model alive even after the host-vs-target fix. Rewritten to match the actual project pattern (per CLAUDE.md's Repository Structure + W1a contract): - "All dispatch is polyfill." cfg(target_feature = ...) selects exactly one backend file (simd_avx512.rs / simd_neon.rs / simd_scalar.rs) to compile in. No runtime branching. - Target CPU is fixed at build time via .cargo/config.toml's target-cpu = x86-64-v4 (AVX-512 mandatory on x86_64) or via the target triple for non-x86 builds. WoA fleet ships per-arch binaries, not a fat binary that probes. 
- R-5's per-arch crossover constant is also part of the polyfill: one const per backend file, cfg-selected. build.rs may emit a refined override into OUT_DIR (compile-time, target-config-driven, never host CPU probing) but the selection mechanism is still cfg. - Added explicit negative-rule sentence at the top of §3 listing forbidden patterns: no HwCaps / CpuCaps runtime branching, no `if has_avx512 else ...` dispatch, no `unsafe { runtime_branch }`. The previous wording was hallucinated branching that contradicted the substrate's actual design. The substrate ships ONE path per binary; the cfg selects which path at build time. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u --- .../pr-x12-woa-multiarch-orchestration.md | 102 ++++++------------ 1 file changed, 35 insertions(+), 67 deletions(-) diff --git a/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md b/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md index 7e4b0c09..955b9ff1 100644 --- a/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md +++ b/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md @@ -74,97 +74,65 @@ This separation is what makes R-3's LoC envelope (≤1500 LoC codec body) tracta --- -## 3. Per-arch dispatch as a substrate property +## 3. Per-arch substrate via compile-time polyfill -The PR-X12 substrate (per merged canon §M:E-G, §M:E-H, R-4, R-5, R-11) implements per-arch dispatch via three mechanisms: +The PR-X12 substrate follows the project's W1a consumer contract (see `CLAUDE.md` and `.claude/knowledge/vertical-simd-consumer-contract.md`): **all dispatch is polyfill**. Per arch we ship a separate backend file with the same public surface, and `cfg(target_feature = ...)` selects exactly one to compile in. 
There is **no runtime CPU detection, no `HwCaps`/`CpuCaps` branching, no `if has_avx512 else …` dispatch, and no `unsafe { runtime_branch }` chain.** The target CPU is fixed at build time via `.cargo/config.toml` (`target-cpu=x86-64-v4` makes AVX-512 mandatory on x86_64) or via the target triple for non-x86 builds. One build, one path. -### 3.1 Compile-time `target_feature` +### 3.1 The polyfill primitive: cfg-selected per-arch files + +The pattern is the same one already shipping in `src/simd*.rs` (per `CLAUDE.md` Repository Structure): ```rust -// In ndarray::hpc::blas_level2::batched_gemm: +// src/simd.rs — consumer-facing surface, re-exports a single backend +#[cfg(target_feature = "avx512f")] +pub use crate::simd_avx512::*; -#[cfg(target_arch = "x86_64")] -mod x86_dispatch { - #[target_feature(enable = "avx512f,avx512bw,avx512vnni")] - pub unsafe fn batched_gemm_vnni(...) { /* VNNI path */ } +#[cfg(all(not(target_feature = "avx512f"), target_arch = "aarch64"))] +pub use crate::simd_neon::*; - #[target_feature(enable = "amx-tile,amx-int8,amx-bf16")] - pub unsafe fn batched_gemm_amx(...) { /* AMX path */ } -} +#[cfg(not(any(target_feature = "avx512f", target_arch = "aarch64")))] +pub use crate::simd_scalar::*; +``` -#[cfg(target_arch = "aarch64")] -mod arm_dispatch { - #[target_feature(enable = "sve2")] - pub unsafe fn batched_gemm_sve2(...) { /* SVE2 path */ } +Each backend file (`simd_avx512.rs`, `simd_neon.rs`, `simd_scalar.rs`) implements the same public functions with identical signatures. The W1a contract requires **all three backends + a parity test** before any new primitive lands. The codec body (`ndarray-codec`, see R-3) and downstream consumers (burn / candle / lance-graph / surrealdb / WoA fleet) call `ndarray::simd::*` directly — they never see or reason about which backend is active. The cfg substitutes one file at the use-site; consumer code is identical across architectures. 
- #[target_feature(enable = "neon,fp16")] - pub unsafe fn batched_gemm_neon_fp16(...) { /* Apple Silicon */ } -} -``` +### 3.2 Build-time CPU selection (not runtime detection) -### 3.2 Runtime feature detection (cached at process start) +Target CPU is decided once, at build time: -```rust -// In ndarray::hpc::capability: -pub static CAP: OnceLock<HwCaps> = OnceLock::new(); - -pub struct HwCaps { - pub has_amx: bool, - pub has_vnni: bool, - pub has_sve2: bool, - pub has_neon_fp16: bool, - pub l1_cache_size: usize, - pub vec_width_bits: u16, - // ... more as new features land -} +| Mechanism | Source | Effect | +|---|---|---| +| `.cargo/config.toml` `target-cpu=x86-64-v4` | repo policy | AVX-512 mandatory on x86_64 (per `CLAUDE.md`) | +| `--target aarch64-apple-darwin` | CI / fleet build matrix | NEON-fp16 backend compiles in | +| `--target aarch64-unknown-linux-gnu` + SVE2 target-feature | Graviton build | SVE2 backend compiles in | -pub fn batched_gemm(input: ...) { - let caps = CAP.get().unwrap(); - if caps.has_amx { unsafe { batched_gemm_amx(input) } } - else if caps.has_vnni { unsafe { batched_gemm_vnni(input) } } - else if caps.has_sve2 { unsafe { batched_gemm_sve2(input) } } - // ... - else { batched_gemm_scalar(input) } -} -``` +The WoA fleet ships **per-arch binaries**, not a fat binary that probes. Q2 distributes the right binary to each node based on the node's already-known architecture (declared at registration time, not detected per request). Cross-arch determinism (§6 below) is enforced because each binary embeds exactly one backend and the W1a parity test gates every primitive at the substrate layer. -### 3.3 Per-arch tunable crossover (R-5 generalised) +### 3.3 Per-arch tunable crossover (R-5) -Some operations have a "small N: scalar, large N: SIMD" crossover that varies per arch. The snippet below is **pseudocode** — Rust's stable const-eval does not let `match` discriminate over a runtime-detected `Arch::CURRENT` value at `const` context. 
The real mechanism is a `build.rs` script that resolves the target from compile-time metadata Cargo exposes to build scripts — `CARGO_CFG_TARGET_ARCH`, `CARGO_CFG_TARGET_FEATURE`, the target triple, and any pre-recorded calibration artifact — and emits the chosen integer as a generated `const` in `OUT_DIR`. **Critically, do NOT probe the host CPU inside `build.rs`**: under cross-compilation Cargo runs `build.rs` on the *host* machine, so any runtime feature-detection there reflects the build machine and not the target. Cargo's docs are explicit: use `CARGO_CFG_*` env vars (which correctly reflect the target) rather than `cfg!`/`#[cfg]` checks (which reflect the host the script runs on). +Some operations (DCT-II vs GEMM, basin-lookup width, etc.) have a "small N: scalar path, large N: SIMD path" crossover whose break-even N varies per backend. The crossover lives in the **same polyfill** as the SIMD primitives: a `cfg(target_feature = ...)`-selected `const`. ```rust -// Shape of the per-arch table (lives in a build-script-generated file -// included via include!(concat!(env!("OUT_DIR"), "/arch_crossovers.rs"))): +// src/hpc/dct_crossover.rs — one const per backend file, cfg-selected // -// pub const DCT_BATCH_CROSSOVER: usize = 64; // emitted by build.rs -// // for Sapphire Rapids -// -// The build script's decision matrix, driven entirely by Cargo's -// target-config env vars (no host CPU probing): -// CARGO_CFG_TARGET_FEATURE contains "amx-bf16" → 64 -// CARGO_CFG_TARGET_FEATURE contains "avx512f" → 32 (skylake-x/ice lake) -// CARGO_CFG_TARGET_FEATURE contains "avx512f", Zen-tuned target-cpu → 96 -// CARGO_CFG_TARGET_ARCH == "aarch64" + NEON-only → 256 -// CARGO_CFG_TARGET_ARCH == "aarch64" + SVE2 → 128 -// else → usize::MAX -// -// Equivalent in-crate fallback shape using cfg! (still target-resolved, -// since cfg! 
in normal (non-build-script) code uses target cfgs): -// const DCT_BATCH_CROSSOVER: usize = if cfg!(target_feature = "amx-bf16") { 64 } -// else if cfg!(target_feature = "avx512f") { 32 } -// else if cfg!(target_arch = "aarch64") { 128 } -// else { usize::MAX }; +// simd_avx512.rs: pub const DCT_BATCH_CROSSOVER: usize = 64; +// simd_neon.rs (Apple Silicon): pub const DCT_BATCH_CROSSOVER: usize = 256; +// simd_scalar.rs: pub const DCT_BATCH_CROSSOVER: usize = usize::MAX; pub fn dct_apply(input: &[i16], output: &mut [i16]) { if N >= DCT_BATCH_CROSSOVER { - unsafe { dct_gemm_path(input, output) } + dct_gemm_path(input, output) // calls into ndarray::simd::* } else { - dct_butterfly_path(input, output) + dct_butterfly_path(input, output) // also calls into ndarray::simd::* } } ``` -R-5 commits these crossovers as **bench-tunable constants** emitted by Plan G's codec-bench calibration sub-target into the per-arch `OUT_DIR` file — not hand-guessed numbers, not a runtime `match` on a synthetic `Arch` enum, and never via host CPU probing under cross-compilation. The build script (driven by `CARGO_CFG_TARGET_*`) is the single source of truth for which integer compiles in. +The integer `DCT_BATCH_CROSSOVER` comes from one of two places: +1. **Hand-tuned default**: a known-good number per backend, checked into the backend file. +2. **Plan G calibration override**: `build.rs` may consult `CARGO_CFG_TARGET_FEATURE` + a pre-recorded calibration artifact from `codec-bench` and emit a refined const into `OUT_DIR`, included by the backend file. This is still compile-time selection — the build script never probes the host CPU, only reads Cargo's target-config env vars. + +Either way the constant is **fixed in the compiled binary**. R-5 commits these crossovers as bench-tunable but compile-time-fixed; the `cfg(target_feature)`-selected backend file is the single source of truth. 
--- From 8d043389c5a6e291ca1113aeac8ca30b0c575402 Mon Sep 17 00:00:00 2001 From: Claude Date: Fri, 22 May 2026 18:16:08 +0000 Subject: [PATCH 5/5] docs(pr-x12): position AMX/asm as backend, codec as polyfill consumer MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per the architecture clarification: AMX/AVX-512/NEON/SVE2 intrinsics and asm are backend-layer implementation details. The polyfill (ndarray::simd::*, ndarray::hpc::*) is the consumer-facing surface. When the codec body writes encoding code (Skip/Merge/Delta/Escape, basin lookups, tropical-GEMM RDO, rANS, EWA splat), it is a consumer of its own polyfill — same as burn, candle, lance-graph, surrealdb, WoA. The codec does not know it is on AMX. It does not name a backend symbol. It does not branch on architecture. Three-layer diagram added at §3 making the boundary explicit: Consumers (codec + downstream) ↓ same Rust API everywhere Polyfill surface (src/simd.rs cfg-selected re-exports) ↓ cfg substitutes ONE backend file Backend (simd_avx512.rs / simd_neon.rs / simd_scalar.rs) — AMX bytecode, AVX-512 asm, NEON intrinsics live HERE — and only here. Consumers never reach in. Also added the escape hatch as documented: very-hot inner loops MAY drop below the polyfill into a backend-specific intrinsic, but only inside src/simd_.rs itself, cfg-gated, parity-tested against the other backends, with `// SAFETY:` + sentinel-qa audit per CLAUDE.md. It is the exception, not the model. No consumer crate (codec body included) is ever the right place for it. Cleaned up "dispatch" terminology across §0, §1, §2, §7.4, §8.1, §9: the word was leaking the runtime-branching frame into compile-time- only contexts. Reserved "dispatch" for async task scheduling (WoA's job) and for the explicit polyfill prohibition statement; everywhere else uses "polyfill" or "backend selection" to keep the compile-time nature unambiguous. 
Reducer::dispatch_target speculation renamed to backend_target with a "still cfg-selected, not runtime-branched" qualifier. Per-arch code lives once, inside src/simd_.rs, behind the polyfill surface. The WoA fleet ships per-arch binaries. One build, one backend, one path. https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u --- .../pr-x12-woa-multiarch-orchestration.md | 87 +++++++++++++------ 1 file changed, 61 insertions(+), 26 deletions(-) diff --git a/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md b/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md index 955b9ff1..e3f53e81 100644 --- a/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md +++ b/.claude/knowledge/pr-x12-woa-multiarch-orchestration.md @@ -1,15 +1,15 @@ # PR-X12 — WoA Orchestration & Multi-Arch Dispatch Lens > Date: 2026-05-22 -> Status: **perspective doc** — examines how the orchestration crates (`woa-rs`, `woa`, `q2`, `surrealdb`, `MedCare-rs`, `smb-office-rs`) consume the PR-X12 substrate, and how PR-X12's per-arch dispatch decisions (R-4, R-5, R-11) generalise to the entire HPC stack. +> Status: **perspective doc** — examines how the orchestration crates (`woa-rs`, `woa`, `q2`, `surrealdb`, `MedCare-rs`, `smb-office-rs`) consume the PR-X12 substrate, and how PR-X12's per-arch polyfill decisions (R-4, R-5, R-11) generalise to the entire HPC stack. > -> Premise: PR-X12 is not just a codec project. It's the **per-arch dispatch contract** that every consumer above `ndarray` will inherit. The codec is the first non-trivial test of whether that contract holds. +> Premise: PR-X12 is not just a codec project. It's the **per-arch polyfill contract** that every consumer above `ndarray` will inherit. The codec is the first non-trivial test of whether that contract holds. --- ## 0. 
Thesis

-**Every consumer crate dispatches kernels across {Intel SPR, AMD Zen 4-5, ARM Graviton 3-4, Apple Silicon, NVIDIA Hopper-Blackwell} via the same `ndarray::hpc` capability traits.** PR-X12's per-arch DCT crossover (R-5) and latency assertion (R-11) aren't codec-specific — they're the canonical shape of how any consumer crate gates fast-paths. If the codec's per-arch story is wrong, the entire HPC consumer ecosystem inherits the bug.
+**Every consumer crate calls the same `ndarray::simd::*` / `ndarray::hpc::*` polyfill surface, regardless of which arch the binary was built for.** The polyfill is a per-arch swap underneath, selected by `cfg(target_feature = ...)` at compile time (per §3 and the W1a contract). PR-X12's per-arch DCT crossover (R-5) and latency assertion (R-11) aren't codec-specific — they're the canonical shape of how any consumer crate's per-arch story bottoms out at the polyfill. If the codec's per-arch story is wrong, the entire HPC consumer ecosystem inherits the bug.

 ---

@@ -23,18 +23,18 @@ In a real deployment, a `woa-rs` agent processing a request might:
 4. Update node-local cache (`surrealdb`)
 5. Emit response stream (codec again)

-Steps 1, 2, 3, 5 all hit the `ndarray::hpc` BLAS layer. Each step has a per-arch fast-path: SPR uses AMX, Zen 4 uses VNNI+AVX-512, Graviton 3 uses SVE2, Apple uses NEON/AMX, Hopper uses tensor cores. **None of the consumer crates know which fast-path is active.** They call `blas_level2::batched_gemm` and the substrate dispatches.
+Steps 1, 2, 3, 5 all bottom out at `ndarray::simd::*` and `ndarray::hpc::*`. Each is a polyfill consumer — they call e.g. `blas_level2::batched_gemm` and get whatever backend the binary was compiled with. **None of the consumer crates know which backend is active**, and they MUST NOT: backend-specific symbols (AMX bytecode, AVX-512 asm, NEON intrinsics, SVE2 predicates) live exclusively inside `src/simd_.rs` and never reach a consumer's source. The fleet ships per-arch binaries (§3.2); each binary embeds one backend file via cfg.

-This is what makes PR-X12's R-4 / R-11 architecture-conditional bench gates *substrate policy*, not codec policy. R-4 says "Plan G clears at most on 1 of: SPR / Zen 4 / Graviton 3 / Apple M-class," and R-11 adds latency assertions. That same gate structure applies to:
+This is what makes PR-X12's R-4 / R-11 architecture-conditional bench gates *substrate policy*, not codec policy. R-4 says "Plan G clears on each of: SPR / Zen 4 / Graviton 3 / Apple M-class" (per-arch CI matrix), and R-11 adds per-arch latency assertions. That same gate structure applies to:

-- `burn` model serving (forward pass per arch)
-- `candle` quantized inference (q4/q8 per arch)
-- `lance-graph::blasgraph` graph queries (tropical-GEMM per arch)
-- `surrealdb` HNSW search (vector dist per arch)
-- `MedCare-rs` DICOM transform (DCT + wavelet per arch)
-- `smb-office-rs` OCR + layout (conv + attention per arch)
+- `burn` model serving (forward pass: same Rust, per-arch binary)
+- `candle` quantized inference (q4/q8: same Rust, per-arch binary)
+- `lance-graph::blasgraph` graph queries (tropical-GEMM: same Rust, per-arch binary)
+- `surrealdb` HNSW search (vector dist: same Rust, per-arch binary)
+- `MedCare-rs` DICOM transform (DCT + wavelet: same Rust, per-arch binary)
+- `smb-office-rs` OCR + layout (conv + attention: same Rust, per-arch binary)

-Every one of these inherits the dispatch contract. PR-X12 is the first to make it visible.
+Every one of these inherits the polyfill contract: identical consumer-facing Rust, one cfg-selected backend per build. PR-X12 is the first to make the parity-test obligation visible.

 ---

@@ -53,34 +53,65 @@ Every one of these inherits the dispatch contract. PR-X12 is the first to make i
 │  surrealdb, MedCare-rs, smb-office-rs              │
 │  (Each: ~1-5K LoC of generic code + traits)        │
 └────────────────────┬───────────────────────────────┘
-                     │ capability traits, target_feature
+                     │ same Rust API on every arch
                      ▼
 ┌────────────────────────────────────────────────────┐
-│  ndarray::hpc (the dispatch substrate)             │
+│  ndarray::hpc + ndarray::simd (polyfill substrate) │
 │  blas_level{1,2,3}, fft, cam_pq, activations,      │
 │  simd_int_ops, bf16_tile_gemm                      │
 │  (~15K LoC; PR-X12 ratchets at this layer)         │
 └────────────────────┬───────────────────────────────┘
-                     │ per-arch SIMD intrinsics
+                     │ cfg(target_feature = …) picks ONE
                      ▼
 ┌────────────────────────────────────────────────────┐
-│  Hardware: SPR / Zen / Graviton / Apple / Hopper   │
-└────────────────────────────────────────────────────┘
+│  Backend file (one per binary):                    │
+│    simd_avx512.rs → asm/intrinsics + AMX bytecode  │
+│    simd_neon.rs   → NEON / SVE2 intrinsics         │
+│    simd_scalar.rs → portable fallback              │
+└────────────────────┬───────────────────────────────┘
+                     ▼
+   Hardware: SPR / Zen / Graviton / Apple
 ```

-**WoA never touches `target_feature` directly.** Its job is async scheduling, transport (Q2 over QUIC), persistence (surrealdb), and policy. The SIMD dispatch happens one layer below, in the consumer crates calling `ndarray::hpc`.
+**WoA never touches `target_feature` directly.** Its job is async task scheduling, transport (Q2 over QUIC), persistence (surrealdb), and policy. Per-arch SIMD code lives exclusively inside the backend file (`simd_.rs`); the polyfill above swaps which file is compiled in via cfg.

-This separation is what makes R-3's LoC envelope (≤1500 LoC codec body) tractable. The codec crate doesn't dispatch — it calls the substrate. WoA doesn't dispatch — it calls the codec, which calls the substrate. Per-arch code lives once, in `ndarray::hpc`.
+This separation is what makes R-3's LoC envelope (≤1500 LoC codec body) tractable. The codec crate doesn't choose a backend — it calls the polyfill. WoA doesn't choose a backend — it calls the codec, which calls the polyfill. Per-arch code lives once, inside `src/simd_.rs`, behind the polyfill surface.

 ---

 ## 3. Per-arch substrate via compile-time polyfill

-The PR-X12 substrate follows the project's W1a consumer contract (see `CLAUDE.md` and `.claude/knowledge/vertical-simd-consumer-contract.md`): **all dispatch is polyfill**. Per arch we ship a separate backend file with the same public surface, and `cfg(target_feature = ...)` selects exactly one to compile in. There is **no runtime CPU detection, no `HwCaps`/`CpuCaps` branching, no `if has_avx512 else …` dispatch, and no `unsafe { runtime_branch }` chain.** The target CPU is fixed at build time via `.cargo/config.toml` (`target-cpu=x86-64-v4` makes AVX-512 mandatory on x86_64) or via the target triple for non-x86 builds. One build, one path.
+The PR-X12 substrate follows the project's W1a consumer contract (see `CLAUDE.md` and `.claude/knowledge/vertical-simd-consumer-contract.md`): **all dispatch is polyfill**. The stack has three layers, and only the bottom one is allowed to know about specific architectures:
+
+```text
+┌────────────────────────────────────────────────────────────┐
+│  Consumers — codec encode/decode bodies, downstream crates │
+│  (ndarray-codec, burn, candle, lance-graph, surrealdb,     │
+│   MedCare-rs, smb-office-rs, q2, WoA scheduler)            │
+│  Call ndarray::simd::* directly. Never name a backend.     │
+└────────────────────────┬───────────────────────────────────┘
+                         │ identical signatures everywhere
+                         ▼
+┌────────────────────────────────────────────────────────────┐
+│  Polyfill surface — src/simd.rs                            │
+│  cfg(target_feature = ...) re-exports exactly ONE backend  │
+│  to compile in. Same fn names, same types, every arch.     │
+└────────────────────────┬───────────────────────────────────┘
+                         │ cfg substitutes one file
+                         ▼
+┌────────────────────────────────────────────────────────────┐
+│  Backend — simd_avx512.rs / simd_neon.rs / simd_scalar.rs  │
+│  This is where AMX bytecode, AVX-512 asm/intrinsics,       │
+│  NEON loads, SVE2 predicates LIVE. Implementation detail.  │
+│  Consumers above never reach in here.                      │
+└────────────────────────────────────────────────────────────┘
+```
+
+There is **no runtime CPU detection, no `HwCaps`/`CpuCaps` branching, no `if has_avx512 else …` dispatch, and no `unsafe { runtime_branch }` chain.** The target CPU is fixed at build time via `.cargo/config.toml` (`target-cpu=x86-64-v4` makes AVX-512 mandatory on x86_64) or via the target triple for non-x86 builds. One build, one backend file compiled in, one path.

 ### 3.1 The polyfill primitive: cfg-selected per-arch files

-The pattern is the same one already shipping in `src/simd*.rs` (per `CLAUDE.md` Repository Structure):
+The pattern already shipping in `src/simd*.rs` (per `CLAUDE.md` Repository Structure):

 ```rust
 // src/simd.rs — consumer-facing surface, re-exports a single backend
@@ -94,7 +125,11 @@ pub use crate::simd_neon::*;
 pub use crate::simd_scalar::*;
 ```

-Each backend file (`simd_avx512.rs`, `simd_neon.rs`, `simd_scalar.rs`) implements the same public functions with identical signatures. The W1a contract requires **all three backends + a parity test** before any new primitive lands. The codec body (`ndarray-codec`, see R-3) and downstream consumers (burn / candle / lance-graph / surrealdb / WoA fleet) call `ndarray::simd::*` directly — they never see or reason about which backend is active. The cfg substitutes one file at the use-site; consumer code is identical across architectures.
+Each backend file implements the same public functions with identical signatures; **the actual AMX bytecode / AVX-512 asm / NEON intrinsics / SVE2 predicates are contained inside those files** and never escape. The W1a contract requires all three backends + a parity test before any new primitive lands.
+
+**The codec body is a consumer of this polyfill.** When `ndarray-codec` writes encoding code — Skip/Merge/Delta/Escape mode selection, basin lookups, tropical-GEMM RDO, rANS state-machine ticks, EWA splat composition — it calls `ndarray::simd::*` exactly the way `burn` / `candle` / `lance-graph` do. **The codec does not know it is on AMX.** It does not reach for `simd_avx512::*` directly, does not name a backend symbol, does not branch on architecture. The cfg at the polyfill layer picks the right backend at build time; the encoder is identical Rust across all architectures.
+
+**Escape hatch (rare).** A very small number of hot inner loops may need to drop below the polyfill into a backend-specific intrinsic for performance reasons that the polyfill surface genuinely cannot express. When that happens: the violation lives inside `src/simd_.rs` (where backend-specific code is already at home), is `cfg`-gated to that arch, is parity-tested against the other backends' equivalent, and gets a `// SAFETY:` + agent audit per `CLAUDE.md`'s sentinel-qa rule. **It is the exception, not the model.** No consumer crate — codec body included — is ever the right place for it.

 ### 3.2 Build-time CPU selection (not runtime detection)

@@ -154,7 +189,7 @@ PR-X12 (R-11) commits a budget on `T_codec`:
 | Tropical-GEMM RDO | ≤ 50 µs per CTU on SPR | derived from R-7 cost analysis |
 | Basis::apply (DCT) | ≤ 2 µs per 32×32 block on SPR | derived from R-5 |

-**WoA's contract:** if any of these are violated on a supported arch, the consumer can either accept the slowdown or refuse to schedule the request. WoA has visibility into per-arch dispatch quality via the substrate's metrics endpoint:
+**WoA's contract:** if any of these are violated on a supported arch, the consumer can either accept the slowdown or refuse to schedule the request. WoA has visibility into per-arch polyfill performance (which backend was compiled into the binary it's running, plus stage-latency telemetry) via the substrate's metrics endpoint:

 ```rust
 ndarray::hpc::metrics::stage_latency_p99(stage: StageId) -> Duration;
@@ -209,7 +244,7 @@ This is a model for many features that look "out of scope" for PR-X12 but actual

 - Federated codebook → swap pointer to handle (R-13)
 - 3DGS scene anchor → add SceneAnchor header_kind (x266 doc)
-- GPU offload → add `Reducer::dispatch_target() -> DispatchTarget` (Plan E adjacent)
+- GPU offload → add a `Reducer::backend_target() -> BackendTarget` hook to let consumers opt into a GPU polyfill at compile time (Plan E adjacent; still cfg-selected, not runtime-branched)
 - Speculative decode → add `Frame::is_speculative()` bit in header reserved field

 None of these are PR-X12 scope. All of them require ≤50 LoC of "anchor" in PR-X12. The discipline of M:H-NEW-2 + R-3's LoC envelope is what makes future anchoring possible without forking the codec.
@@ -271,7 +306,7 @@ Quick tour of what each crate inherits from PR-X12 substrate decisions:

 ### 8.1 `burn` (model training/inference)

-Uses `blas_level3::gemm` for matrix multiply, `activations` for nonlinearities, `cam_pq` for KV cache compression. Per-arch dispatch via the same target_feature paths. Will benefit directly from PR-X12's R-4 / R-11 latency-assertion infrastructure when it lands (burn has wanted this for ~14 months).
+Uses `blas_level3::gemm` for matrix multiply, `activations` for nonlinearities, `cam_pq` for KV cache compression. Per-arch polyfill via the same `cfg(target_feature)` mechanism — `burn` itself never names a backend. Will benefit directly from PR-X12's R-4 / R-11 latency-assertion infrastructure when it lands (burn has wanted this for ~14 months).
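
The claim that a consumer such as `burn` never names a backend can be pictured with a small self-contained sketch. It is an illustration under stated assumptions, not code from `ndarray`: the module `simd`, the constant `BACKEND_NAME`, the helper `lane_dot`, and the consumer function `dense_forward` are hypothetical names, and the "avx512" and "neon" backends are plain-Rust stand-ins rather than real intrinsics.

```rust
/// Hypothetical polyfill surface: exactly one `backend` module is compiled in.
pub mod simd {
    /// Shared stand-in kernel: accumulates into LANES partial sums, then reduces.
    fn lane_dot<const LANES: usize>(a: &[f32], b: &[f32]) -> f32 {
        assert_eq!(a.len(), b.len(), "dot: length mismatch");
        let mut acc = [0.0f32; LANES];
        for (ca, cb) in a.chunks(LANES).zip(b.chunks(LANES)) {
            for (lane, (x, y)) in ca.iter().zip(cb).enumerate() {
                acc[lane] += x * y;
            }
        }
        acc.iter().sum()
    }

    // Selected when building for x86_64 with AVX-512 enabled at compile time.
    #[cfg(all(target_arch = "x86_64", target_feature = "avx512f"))]
    mod backend {
        pub const BACKEND_NAME: &str = "avx512";
        /// A real backend would use 512-bit intrinsics here (assumption).
        pub fn dot(a: &[f32], b: &[f32]) -> f32 {
            super::lane_dot::<16>(a, b)
        }
    }

    // Selected when building for aarch64 with NEON.
    #[cfg(all(target_arch = "aarch64", target_feature = "neon"))]
    mod backend {
        pub const BACKEND_NAME: &str = "neon";
        /// A real backend would use 128-bit NEON intrinsics here (assumption).
        pub fn dot(a: &[f32], b: &[f32]) -> f32 {
            super::lane_dot::<4>(a, b)
        }
    }

    // Portable fallback for every other build.
    #[cfg(not(any(
        all(target_arch = "x86_64", target_feature = "avx512f"),
        all(target_arch = "aarch64", target_feature = "neon")
    )))]
    mod backend {
        pub const BACKEND_NAME: &str = "scalar";
        pub fn dot(a: &[f32], b: &[f32]) -> f32 {
            super::lane_dot::<1>(a, b)
        }
    }

    // Consumers only ever see these two names, whatever the build target.
    pub use self::backend::{dot, BACKEND_NAME};
}

/// Hypothetical consumer kernel: one dot product per weight row.
/// It calls the polyfill surface and contains no architecture-specific code.
pub fn dense_forward(weights: &[Vec<f32>], input: &[f32]) -> Vec<f32> {
    weights.iter().map(|row| simd::dot(row, input)).collect()
}
```

The consumer function is byte-for-byte identical on every target; only the `backend` module behind `simd` changes with the build flags. Because the stand-in backends sum in different lane orders, results on arbitrary floats can differ in the last bits across targets, which is the kind of gap a parity test is meant to surface.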
 ### 8.2 `candle` (quantized inference)

@@ -304,7 +339,7 @@ Owns the federation policy (R-13), the codec version negotiation, and the per-ar
 In light of the above, the irreducible commitments PR-X12 must keep for the consumer ecosystem:

 1. **Substrate API stability** — `blas_level2::batched_gemm`, `cam_pq::kmeans`, `fft::dct_apply`, `activations::conv2d` keep their signatures across PR-X12 changes. Additions OK, breaks not OK.
-2. **Per-arch dispatch transparency** — consumers continue calling capability-trait methods; the substrate continues choosing the right SIMD path.
+2. **Per-arch polyfill transparency** — consumers continue calling the `ndarray::simd::*` / `ndarray::hpc::*` surface unchanged across arches; cfg at the polyfill layer selects exactly one backend at build time. Consumers never name a backend symbol.
 3. **`Reducer` ordered-sum guarantee** — any consumer using `OrderedKahanReducer` (or similar) continues to get bit-exact cross-arch reductions.
 4. **Latency-assertion CI infrastructure** — R-11's framework is consumer-callable for their own benches; not codec-private.
 5. **Codebook handle indirection** (R-13) — the codec ships with the handle pattern, consumers can swap codebooks without forking.
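
Commitment 4 says the latency-assertion framework should be callable from a consumer's own benches. As a rough sketch of what such a check could look like, assuming nothing about the real R-11 implementation: the functions `sample_stage`, `p99`, and `check_stage_budget` below are hypothetical names built only on the standard library, and the nearest-rank percentile is one possible choice among several.

```rust
use std::time::{Duration, Instant};

/// Run `stage` `iters` times and record one wall-clock sample per run.
pub fn sample_stage<F: FnMut()>(iters: usize, mut stage: F) -> Vec<Duration> {
    (0..iters)
        .map(|_| {
            let start = Instant::now();
            stage();
            start.elapsed()
        })
        .collect()
}

/// Nearest-rank 99th percentile; returns None when there are no samples.
pub fn p99(samples: &[Duration]) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // rank = ceil(0.99 * n), expressed in integer arithmetic
    let rank = (sorted.len() * 99 + 99) / 100;
    Some(sorted[rank - 1])
}

/// Compare the observed p99 against a per-stage budget.
/// Ok carries the observed value; Err carries a message a CI job can print.
pub fn check_stage_budget(
    stage: &str,
    budget: Duration,
    samples: &[Duration],
) -> Result<Duration, String> {
    let observed = p99(samples).ok_or_else(|| format!("{stage}: no samples"))?;
    if observed <= budget {
        Ok(observed)
    } else {
        Err(format!("{stage}: p99 {observed:?} exceeds budget {budget:?}"))
    }
}
```

A consumer bench could collect samples with `sample_stage(10_000, || kernel())` and then call `check_stage_budget("Basis::apply (DCT)", Duration::from_micros(2), &samples)`, mirroring the 2 µs per-block budget quoted for SPR in the R-11 table. Since each binary embeds one backend, the same check would run once per arch in the CI matrix.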