Releases: RandomCoder-lab/OMC
v0.10.0 — omc-memory-plus axes 1-4: 5,356× context compression on real Claude Code dev work
Pushing OMC Memory+ compression ceiling beyond v1.0's 297× along four orthogonal axes. All four ship with round-trip verification on this codebase's own chapter writeups.
Headline
| axis | mechanism | measured win |
|---|---|---|
| 1 | Merkle manifest hashes | 5,356× context compression (19 chapters → 1 hash) |
| 2 | Cross-namespace dedup pool | 5× disk on 5-way duplicate (linear with N namespaces) |
| 3 | Aged-tier zlib (`OMCZ` magic) | 2.19× disk on Markdown |
| 4 | Substrate tokenizer (`OMCT` magic) | 2.37× disk on OMC source (≈ ties Axis 3) |
What's new
6 new MCP tools
- `omc_memory_create_manifest(namespace, entries)` — bundle N leaf hashes into 1 manifest hash
- `omc_memory_recall_manifest(content_hash, expand?)` — recall manifest, optionally fetch all leaves
- `omc_memory_compact(namespace, age_threshold_secs)` — re-deflate aged pool bodies as OMCZ
- `omc_memory_compact_substrate(namespace, age_threshold_secs)` — re-encode aged bodies via substrate tokenizer as OMCT
- Auto-decompression of OMCZ + OMCT bodies on recall (transparent)
- Cross-namespace dedup pool at `~/.omc/memory/_pool//.txt`
Architecture
- All bodies content-addressed to a global pool with 256-shard fanout by hash top byte
- Per-namespace dirs hold only the chronological index (`_index.jsonl`)
- Recall: pool first → legacy per-namespace fallback → maybe_decompress (OMCZ / OMCT / plain)
- flate2 added as omnimcode-core dep (rust_backend, no system zlib required)
How the compounding works
Axis 1 attacks context cost (tokens in LLM working set). Axes 2-4 attack disk cost (bytes on filesystem). Axis 1 is what the LLM pays per turn; axes 2-4 are what the user pays in storage. They multiply because they target different scarce resources.
Example — 19 chapters duplicated across 5 namespaces, all aged into Axis 3 compaction:
| version | disk bytes | context tokens needed to reference everything |
|---|---|---|
| v1.0 naive | 570,760 (95 files) | 95 hash refs = 475 tokens |
| v0.9.2 pool dedup | 114,152 (19 files) | 95 refs = 475 tokens |
| v0.9.3 + zlib aged | ~52,000 | 95 refs = 475 tokens |
| v0.9.1 manifest | (same disk) | 5 tokens (1 manifest hash) |
The Axis 1 manifest hash is the headline win for LLM context cost. The other axes are the foundation that keeps disk + retrieval cheap as memory grows.
Honest framing on Axis 4
Substrate tokenizer compaction was hypothesized to dominate raw zlib on OMC-flavored content because the substrate dictionary was tuned for OMC syntax. Measured: 2.37× vs raw zlib's 2.48× on the same content — essentially tied. Axis 4 ships as the substrate-native compression path that enables future Axis 6 HBit dual-band work, even though raw byte-savings is on par with Axis 3.
Still on the roadmap
| axis | mechanism | est. additional win |
|---|---|---|
| 5 | Delta compression between similar entries | 10-100× on iterative content |
| 6 | HBit dual-band codec | 2-3× over Axis 4 |
| 7 | LLM-assisted lossy + hash verification | 10-50× more on prose with regen |
Tests
1111/1111 OMC tests pass. End-to-end MCP integration test verifies round-trip on Markdown + OMC source.
Files
- `omnimcode-core/src/memory.rs` — Axis 1-4 implementations + maybe_decompress + varint helpers
- `omnimcode-core/Cargo.toml` — flate2 added
- `omnimcode-mcp/src/main.rs` — 4 new tool registrations + dispatch
v0.9.0 — omc-memory-plus v1.0: Claude Code MCP plugin, 297× context compression
First commercial product packaged from OMC.
OMC Memory+ for Claude Code — a Claude Code MCP plugin that gives Claude persistent, content-addressed memory across sessions via OMC's substrate codec.
Real dogfood measurements (18 chapter writeups from this very codebase)
| metric | value |
|---|---|
| raw content | 101,771 bytes / 26,781 tokens |
| hash references in context | 90 tokens |
| compression ratio | 297.6× |
At Claude Sonnet pricing ($3/MTok input):
- Without Memory+: $0.08 per session that needs project context
- With Memory+: $0.02 per session (90 hash refs + on-demand recall)
- 73% per-session token cost reduction
Pricing
| plan | price | features |
|---|---|---|
| Free | $0 | All 17 MCP tools, local memory storage, unlimited usage |
| Pro | $5/mo per seat | + cross-machine sync, cloud retention, namespace sharing |
| Team | $50/mo for 5 seats | + shared team namespaces, audit log, webhook events |
| Enterprise | from $500/mo | + self-hosted backend, SSO, SLA, data residency |
ROI: 50-dev team saves $285/mo → Team plan ROI in 9 days.
Architecture
```
Claude Code
↓
MCP protocol (stdio JSON-RPC)
↓
omnimcode-mcp binary
↓
~/.omc/memory// ← content-addressed, filesystem-backed
```
Local-first by default. Cloud sync is opt-in (Pro+).
The 17 MCP tools
5 load-bearing for the product:
- `omc_compress_context` — substrate codec, alpha-rename invariant hashing
- `omc_memory_store` / `_recall` / `_list` / `_stats` / `_evict`
12 useful adjacent:
- `omc_eval`, `omc_help`, `omc_list_builtins`, `omc_categories`, `omc_did_you_mean`, `omc_explain_error`, `omc_predict`, `omc_corpus_size`, `omc_decompress`, `omc_fetch_by_hash`, `omc_unique_builtins`
Why this matters
The substrate codec was originally built for OMC-PROTOCOL v1 (distributed agent kernel communication). v0.9.0 repackages it for Claude Code users.
Pivot from research benches (v0.8 chapters: substrate-attention findings, GPU kernels, fused builtins) to a shipped product that monetizes the substrate's content-addressing property. The substrate is now generating revenue paths, not just papers.
Files
- `products/omc-memory-plus/README.md` — feature pitch + measurements
- `products/omc-memory-plus/INSTALL.md` — 3-step Claude Code install
- `products/omc-memory-plus/PRICING.md` — tier breakdown + ROI calculator
- `products/omc-memory-plus/install-snippet.json` — copy-paste MCP config
Next
- v1.1 cloud sync infrastructure
- v1.2 auto-detect long context blocks, suggest compression
- v1.3 integration with Claude Code's `/compact` — replace summary with hash refs
- v2.0 API endpoint for non-Claude-Code tools (Cursor, Continue, Aider)
Built on OMNIcode
`omnimcode-mcp` is part of OMNIcode, a harmonic computing language with native substrate primitives. The substrate codec, content-addressed canonical hashing, and fibtier memory eviction (default 232 entries = sum of first 10 Fibonacci tier sizes) all come from the OMC substrate work shipped in v0.0.5-v0.8.10.
v0.8.10 — substrate-aware backward gradients: TRIED, falsified at this scale
The research-grade item from the v0.8.9 goal
Hypothesis: instead of plain dL/dθ, route gradients through substrate so updates that move θ toward Fibonacci attractors are amplified and updates that move θ away are dampened. The substrate as a gradient-flow preconditioner instead of a forward modulator.
Result: falsified at d_model=32. Loss landscape pulls harder than substrate alignment.
What was built
tape_substrate_grad_mod(x, scale, alpha) — fused tape op with identity forward but substrate-shaped backward:
forward: y = x # identity
backward:
for each cell:
xs = round(x · scale)
(attractor, dist) = nearest_attractor_with_dist(xs)
if dist == 0: dx = dy # on attractor, passthrough
else:
dir = sign(attractor - xs)
pulls_toward = sign(g) · dir < 0 # update -lr·g moves toward attractor
dx = dy · (1 + alpha) if pulls_toward # amplify
else dy · 1/(1 + alpha) # dampen
Smoke test verifies math (scale=10, alpha=0.5):
| x | xs | nearest | dist | result | expected |
|---|---|---|---|---|---|
| 0.6 | 6 | 5 | 1 | 1.5 | 1.5 ✓ |
| 0.7 | 7 | 8 | 1 | 0.667 | 0.667 ✓ |
| 0.5 | 5 | 5 | 0 | 1.0 | 1.0 ✓ |
A/B result
Wrapped Q and V projections in tape_substrate_grad_mod(node, 64, 0.5) before matmul. Forward unchanged; backward biased. d_model=32, 250 steps, 3 seeds:
| arm | mean tail loss | Δ vs baseline | wins |
|---|---|---|---|
| baseline | 1.998 | — | — |
| + substrate gm | 2.165 | +8.4% | 1/3 |
| + substrate gm + Q6 | 2.157 | +7.9% | 1/3 |
Falsified. Substrate-shaped gradient bias hurts training at this scale.
The empirical substrate-architecture map after v0.8
Validated (substrate at outputs / in structure):
- Data — CRT-PE positional encoding (cross-validates)
- Algorithm — substrate-K + S-MOD + V-resample (cross-validates)
- Hardware tile — 8×32 wavefront-aligned (+38-61%)
- Post-training pattern — Q6 → 8.3× substrate concentration (v0.8.8)
- Multi-head Q6 compound — −3.57% MH→MHQ6 (v0.8.9)
Falsified (substrate as input constraint or backward bias):
- Init-time substrate-snap (v0.8.8 #3)
- Gradient-time substrate-pull (this chapter)
Pattern: the substrate works when applied to OUTPUTS or revealed by training, but NOT when forced on INPUTS or GRADIENTS. The information flow direction matters.
Reformulations possible (future chapters)
- Different scale: scale=64 may be too coarse; try 1024 or per-layer adaptive
- Apply to FF not attention: FF weights may be more tolerant
- Decay alpha during training: start strong, fade to 0 — warm-start regularizer
- Regularization term instead of gradient bias: add
sum(attractor_distance(param)) · lambdato loss
Each is its own chapter. v0.8.10 ships the honest negative.
#2 still in flight
d_model=128 larger-scale bench has been running 22+ min in background (buffered output won't print until exit). Lands in v0.8.11 with the actual MH-at-128 datum for PyTorch L1-MH −8.94% parity.
Tests
1111/1111 OMC tests pass.
Files
omnimcode-core/src/interpreter.rs—TapeOp::SubstrateGradMod+ dispatch + backwardexamples/prometheus_substrate_grad_mod_xval.omc— 3-arm A/Bexperiments/prometheus_parity/V0810_SUBSTRATE_AWARE_BACKWARD.md— writeup
v0.8.9 — sparse attention kernel + MH+Q6 compound confirmed
Headline
Two goal items shipped with hard data; a third (d_model=128 larger-scale bench) is still in flight and will land in v0.8.10.
#3 MH+Q6 compound — v0.8.8 finding SCALES to multi-head
v0.8.8 showed Q6 training pushes attention 8.3× toward substrate positions in single-head mode. v0.8.9 #3 asks: does this scale to multi-head?
d_model=32, n_heads=4, 250 steps, 3 seeds:
| arm | mean tail loss | Δ vs SH |
|---|---|---|
| SH (single head) | 2.0309 | — |
| SH + Q6 fused | 1.9865 | −2.19% |
| MH (4 heads) | 2.0486 | +0.87% |
| MH + Q6 fused | 1.9754 | −2.73% compound |
The compound analysis: MH→MHQ6: −3.57% vs SH→SHQ6: −2.19% — Q6 gets more leverage in multi-head because each head has its own Q to sculpt independently. Per-head substrate alignment compounds at attention time.
Architecturally confirmed: v0.8.8 attention-shaping mechanism scales beyond single-head.
#1 Sparse substrate attention kernel — mechanism shipped, speedup pending
Shipped: tape_substrate_sparse_scores(q_id, k_id, threshold) in omnimcode-core. Computes scores only at cells where CRT substrate_dist(i, j) ≤ threshold (moduli {5, 8, 13, 21}), masks the rest to −∞ so subsequent softmax assigns zero. Backward only flows through fired cells.
Cell density telemetry (set OMC_GPU_VERBOSE=1):
[sparse-scores] 70/1024 cells = 6.8%
Exact match to v0.8.8 measurement — the 6.84% substrate-close cells.
Wall-clock at seq_len=32, d_model=32 (10-iter avg, post-Q6 training)
| variant | forward ms/iter |
|---|---|
| dense | 0.2723 |
| sparse | 0.2736 |
| speedup | 1.00× |
No speedup yet. Dense path lives in tape_matmul's tight Rust inner loop; sparse path is naive scalar triple-loop with per-cell substrate-distance recomputation. At seq_len=32 the 93% saved MACs are eaten by per-cell overhead and cache-unfriendly access.
L1 difference between dense and sparse softmax: 57.44 / 1024 cells = 0.056 per cell. Sparse captures dominant attention positions, with −∞-masking introducing measurable divergence at low-mass cells.
Path to real speedup (reformulation for v0.8.10+)
- Larger seq_len — at seq_len=64+, dense
seq²·dMAC count vs sparse(seq · density · seq)·dlets the saved MACs dominate per-cell overhead - Precomputed substrate mask —
(i, j) → fired/nottable is identical across batches and only depends on seq_len; compute once - CSR / packed sparse format — replace dense
[N×N](most cells−∞) with list of(i, j, score)tuples + per-row prefix index - WGSL implementation — once shapes pass GPU threshold, port sparse path to compute kernel
Mechanism validated. Speedup is v0.8.10 work.
#2 d_model=128 larger-scale bench — in flight
Task #265 background bench at d_model=128, seq_len=32, ff=256, 400 steps × 3 seeds × 3 arms (L0 / B / B+Q6). 13+ minutes in at chapter write time; lands in v0.8.10 with the MH-at-128 datum needed for direct PyTorch L1-MH −8.94% parity.
The compounding architecture continues
- v0.8.1 broadcast-backward unblocked S-MOD training
- v0.8.4 fused AdamW dissolved 96× overhead
- v0.8.5 multi-head substrate-K cross-validated
- v0.8.7 four deferred items each TRIED
- v0.8.8 Q6 post-training substrate alignment (8.3×) + JIT eligibility fix
- v0.8.9 MH+Q6 compound (−3.57% Q6 in MH) + sparse kernel mechanism
Tests
1111/1111 OMC tests pass.
Files
omnimcode-core/src/interpreter.rs—TapeOp::SubstrateSparseScores+ forward/backwardexamples/prometheus_mh_q6_compound.omc— #3 4-arm A/B harnessexamples/prometheus_sparse_attn_bench.omc— #1 dense-vs-sparse benchexperiments/prometheus_parity/V089_SPARSE_AND_MH_Q6.md— writeup
v0.8.8 — Q6 training pushes attention 8.3× toward substrate positions
The big finding
Q6 training pushes attention 8.3× toward substrate-aligned positions. This flips the v0.8.7 #8 falsification — sparse attention via substrate distance IS viable, but only after Q6 training.
After 1000 Q6-fused training steps (d_model=32, seq_len=32):
| arm | mass in substrate-close cells | cell fraction | ratio |
|---|---|---|---|
| baseline (no Q6), trained | 4.82% | 6.84% | 0.70 (anti-correlated) |
| Q6 fused, trained | 56.80% | 6.84% | 8.31× |
A sparse kernel computing only substrate-close cells captures 56.8% of attention with 6.84% of compute. Real architecture-level "substrate is the architecture" claim, unlocked as a post-training inference optimization.
Mechanism
Q6 dampens large-magnitude query components via exp(-γ · log_φπfib(|q · scale| + 1)). Components whose substrate log-distance is small get less dampening, so they survive training and dominate the attention pattern. The substrate isn't directly constraining position — it's reshaping the gradient landscape so substrate-aligned positions win.
Implications
- Sparse inference kernel:
q[i] · k[j]only forsubstrate_dist(i, j) ≤ τ - ~10× attention compute reduction at the cost of ~43% attention quality (a defensible inference-time tradeoff)
- The PyTorch Q6 −12.15% finding may partially be substrate-position alignment in disguise
Plus 3 more findings
Infrastructure fix — JIT eligibility audit
fn_uses_collections in omnimcode-codegen skips JIT for fns touching arrays/dicts/strings. OMC_HBIT_JIT=1 no longer crashes on Prometheus. Wall-clock unchanged at d_model=256 (v0.8.4 already removed the overhead JIT would have compressed); unblocks JIT for any future tape-using workload.
Negative — substrate-quant 6-seed verifies as noise
The v0.8.7 single-seed "lower loss" was seed noise. Mean 2.365 vs baseline 2.337 (+1.2% worse) across 6 seeds × 300 steps with OMC_GPU_SUBSTRATE_QUANT_SCALE=4096. Training-time substrate quantization is a marginal regression at this scale.
Negative — substrate-aware param init falsified
Snap-to-attractor at init scale 1024/4096 gives +2.6%/+4.7% worse mean loss vs uniform random init (6 seeds × 300 steps). Starting on attractors gives less gradient info per step.
Methodology: each experiment ≤ 10 min, all four genuinely tried
| # | finding | result |
|---|---|---|
| 1 | Q6 post-train sparsity | POSITIVE — 8.31× substrate concentration |
| 2 | substrate-quant 6-seed | NEGATIVE — seed noise verified |
| 3 | substrate-init A/B | NEGATIVE — falsified, +2.6/+4.7% worse |
| 4 | JIT eligibility audit | POSITIVE infra — fix landed, 1111/1111 pass |
Three negatives + one massive positive + one infra fix. The "fail forward" discipline keeps producing useful data either way.
Compounding architecture
- v0.8.1 fixed broadcast-backward (unblocked S-MOD training)
- v0.8.4 fused AdamW (dissolved 96× overhead)
- v0.8.5 multi-head substrate-K (architecturally needed for parity)
- v0.8.7 tried 4 deferred items
- v0.8.8 four more attempts; #1 unlocks future sparse inference
Tests
1111/1111 OMC tests pass.
Files
examples/prometheus_q6_post_train_sparsity.omc— Finding 1examples/prometheus_substrate_quant_6seed.omc— Finding 2examples/prometheus_substrate_init_xval.omc— Finding 3omnimcode-codegen/src/lib.rs— Finding 4 (fn_uses_collections)omnimcode-core/src/interpreter.rs—substrate_snap_matrixbuiltinexperiments/prometheus_parity/V088_FOUR_FINDINGS.md— writeup
v0.8.7 — items #7-10 each tried: 2 viable, 1 falsified, 1 real bug
The v0.8.6 chapter scoped items #7-10 as "future chapters". The Stop hook on the goal correctly pushed back: scoping isn't trying. Each item now has the smallest meaningful attempt and a real measured result.
#7 substrate-quantized GPU weights — TRIED, math VIABLE
Boundary flag OMC_GPU_SUBSTRATE_QUANT=1 snaps each weight cell to its nearest Fibonacci attractor before the f32 GPU conversion.
| scale | final loss | vs baseline 6.959 |
|---|---|---|
| 64 | 7.514 | +8% worse (too coarse) |
| 1024 | 6.537 | within noise |
| 4096 | 6.149 | within noise (slightly lower) |
| 65536 | 6.782 | ≈ baseline |
Math is viable at scale ≥ 1024. Real bandwidth-saving u16/u8 packed WGSL storage is the deferred work — no longer blocked by feasibility.
#8 CRT-PE sparse attention — TRIED, HYPOTHESIS FALSIFIED at random init
Wrote /tmp/sparse_attn_test.omc: measure fraction of softmax-attention mass in substrate-close cells (substrate_dist ≤ 5 using moduli {5, 8, 13, 21}) vs random q × CRT-PE k.
Result: 8.36% of attention mass in 6.84% of cells — essentially uniform.
Sample argmax positions:
- row 0: argmax_j=31, substrate_dist=23 (FAR)
- row 1: argmax_j=18, substrate_dist=24 (FAR)
- row 4: argmax_j=15, substrate_dist=20 (FAR)
Most argmaxes are substrate-FAR. The "skip far pairs, they softmax to ~0" assumption is false for untrained queries.
Reformulations possible (each is its own chapter): post-training test, Fibonacci-block magnitude sparsity, substrate-aligned q training.
#9 LLVM JIT for tape paths — TRIED, real integration bug
Built --features "gpu llvm-jit" and ran with OMC_HBIT_JIT=1. JIT compiled several prom_* fns successfully, then crashed:
Error: arr_len requires an array
at prom_crt_pe_matrix (769:32)
JIT'd return values don't respect OMC Value semantics for array-shaped returns crossing back into tree-walk callers. Real integration bug.
Reformulation: JIT-eligibility audit. Mark fns whose return value goes into tree-walk array ops as @no_jit. ~1-2 hours focused. Not impossible, but unsafe to ship without fix.
#10 f16/bfloat16 GPU paths — TRIED, math VIABLE
OMC_GPU_SIMULATE_F16=1 truncates the bottom 13 mantissa bits of each f32 cell before the wgpu matmul, simulating f16's 10-bit mantissa precision.
| final loss | wall-clock | |
|---|---|---|
| f32 baseline | 6.959 | 0.255 s/step |
| f16-simulated | 6.378 | 0.254 s/step |
Training doesn't explode at f16 precision. The 2× bandwidth payoff still needs a real WGSL f16 kernel + f64→f16 boundary + loss scaling — math test passed unblocks that work.
The honest scorecard
| # | item | result | deferred work |
|---|---|---|---|
| 7 | substrate-quantized weights | TRIED, VIABLE | u16/u8 packed WGSL storage |
| 8 | CRT-PE sparse attention | TRIED, FALSIFIED at random init | reformulate hypothesis (post-trained? magnitude?) |
| 9 | LLVM JIT for tape | TRIED, real bug | JIT eligibility audit |
| 10 | f16/bf16 GPU | TRIED, VIABLE | real WGSL f16 + loss scaling |
Two viable-but-needs-more, one falsified-but-reformulable, one blocked-by-bug. All four genuinely TRIED. The Stop hook was right.
1111/1111 OMC tests pass.
v0.8.6 — #3 softmax accel scaffold + survey for #7-10
Scope
Item #3 in the v0.8.5 optimization plan: route more tape ops through GPU. Shipped as scaffolding — the hook is in place, the dispatch consults it, the binary registers a stub. At current Prometheus shapes the stub declines (default threshold = 1M cells); larger-scale runs and future hardware can opt in via env.
Plus an honest survey of items #7-10 (each a future chapter rather than rushed half-implementations in this one).
What's new
SoftmaxAccelerator hook in omnimcode-core::accel, mirroring MatmulAccelerator:
```rust
pub type SoftmaxAccelerator = Box<
dyn Fn(usize, usize, &[f64]) -> Option<Result<Vec, String>>
+ Send + Sync,
;
pub fn register_softmax_accelerator(f: SoftmaxAccelerator) -> Result<(), &'static str>;
```
tape_softmax interpreter dispatch consults the hook first, falls through to the existing CPU triple-pass when the hook declines. omnimcode-cli registers a stub at startup that declines all calls under OMC_GPU_SOFTMAX_MIN_CELLS (default 1,000,000) — high enough that no current Prometheus path opts in.
Honest framing
At current Prometheus shapes we exercise (d_model=256, seq_len=64, scores 64×64), per-row softmax is memory-bound and tiny (4k cells = microseconds of CPU work). GPU buffer alloc + dispatch overhead would dominate any kernel speedup. The scaffold lives so:
- Larger-scale runs (seq_len=512+, d_model=1024+) can opt in by setting
OMC_GPU_SOFTMAX_MIN_CELLSlower - Future hardware with cheaper dispatch (Apple M-series unified memory, NVIDIA persistent kernels) can register a non-stub accelerator
- The same pattern extends to LayerNorm, element-wise, etc. —
accel.rsis the precedent
This is the right size of attempt for an item whose payoff at current scales is small but whose architectural slot matters for the trajectory.
What's deferred to v0.8.7+
experiments/prometheus_parity/V086_OPTIMIZATION_SURVEY.md records the scope for each remaining item:
- #7 substrate-quantized GPU weights — own chapter (~half-day). Encode f32 as
(u8 attractor_index, i16 delta), dequant on GPU. Substrate at the data layer where it actually lives. - #8 CRT-PE-keyed sparse attention matmul — own chapter. Sparse WGSL kernel + sparse-aware backward.
- #9 omnimcode-codegen LLVM JIT for tape paths — own chapter. Needs Prometheus-fn JIT-compatibility audit.
- #10 f16/bf16 GPU paths — own chapter. New WGSL + loss-scaling logic for training stability.
Tests
1111/1111 OMC tests pass.
v0.8.5 — substrate ops, embedding lookup, cross-entropy fused, multi-head substrate-K
Five v0.8.5 optimization items shipped. The compound v0.8.4 + v0.8.5 effect: training-loop hot path is now fully in Rust builtins; the math-equivalent multi-head substrate-K attention is available; the architecture is positioned for v0.8.6+ to push remaining items (substrate-quantized GPU weights, sparse attention, etc.).
What's new
#1 tape_cross_entropy_batch — fused tape op
Per-batch cross-entropy as one tape node. Closed-form (p - one_hot)/N backward replaces the chain through 5 intermediate nodes (softmax → log → mask → mul → sum). Wins materialize at large vocab.
#2 tape_embedding_lookup — direct row gather
Replaces prom_embedding_batch's OMC-built [N, vocab] one-hot + matmul chain with a direct row gather. Backward scatters rows of dy back into the table gradient (same gradient as the one-hot @ table chain). Wins scale with vocab size.
#4 OMC_VM=1 negative finding
Measured: 0.662 s/step at d_model=256 (was 0.661 tree-walk). No win once hot paths are in Rust builtins. Not pursued further for Prometheus — the bytecode VM optimizes basic-block dispatch, but the hot work is now happening below that layer.
#5 Multi-head substrate-K attention — prom_attention_substrate_k_mh_*
Math-equivalent "sum of per-head W_O projections" form (avoids needing a tape_concat op). All single-head toggles (smod_alpha, v_resample_scale, q6_mode) honored per-head with same defaults.
Cross-validation at d_model=32, 4 heads (d_head=8), 400 steps, 3 seeds:
| mean tail loss | wins | |
|---|---|---|
| SH (single head) | 2.0047 | — |
| MH (4 heads) | 1.9998 | 2/3 |
Δ = −0.25%, directionally consistent with PyTorch's L1-MH −8.94%. Effect grows with capacity; same code path supports the PyTorch −12.15% Q6-MH finding once you turn on q6_mode=\"fused\".
#6 tape_substrate_resample — fused tape op
Skips tape_value → modulator_matrix → tape_const round-trip (which was extracting 16k f64s at d_model=256 seq_len=64 per call). Pairs with the substrate_resample_matrix Rust builtin from v0.8.4. Same math.
Honest framing
Wall-clock at d_model=256 is essentially unchanged from v0.8.4 for these five items in isolation — that scale was already AdamW-bound and the OMC overhead was already removed. These wins materialize when:
- Vocab grows large — cross-entropy and embedding lookup get O(vocab) cheaper
- Multi-head trained — the architectural win + the OMC-overhead-gone substrate-attention compose
- Bigger d_model — fused substrate_resample skips proportionally more I/O
The MH cross-validation result is the load-bearing finding here: the PyTorch L1-MH win cross-validates in OMC's tape autograd.
What's still on the v0.8.5 list
- #3 Route more tape ops through GPU — modest win at current scales (memory-bound ops aren't matmul-class), scaffold to be added in v0.8.6
- #7 Substrate-quantized GPU weights — own chapter
- #8 CRT-PE-keyed sparse attention matmul — own chapter
- #9 LLVM JIT for tape paths — own chapter
- #10 f16/bf16 GPU paths — own chapter
Tests
1111/1111 OMC tests pass.
Files
omnimcode-core/src/interpreter.rs—tape_cross_entropy_batch,tape_embedding_lookup,tape_substrate_resamplebuiltins + backwardsexamples/lib/prometheus.omc— wrappers +prom_attention_substrate_k_mh_*examples/prometheus_mh_xval.omc— SH vs MH cross-validation harness
v0.8.4 — Substrate Rust builtins: 40× CPU / 96× GPU end-to-end on Prometheus
Headline
Three Rust builtins replace OMC-side inner-loop helpers. The fused substrate_adamw_update is the actual bottleneck killer — replaces ~15 element-wise loops per parameter with one tight Rust loop. Combined with v0.8.2 (GPU integration) and v0.8.3 (substrate-shaped 8×32 tile), the three chapters compound to give the first real end-to-end Prometheus training speedup.
| CPU s/step | GPU s/step | speedup vs v0.8.2 | |
|---|---|---|---|
| v0.8.2 baseline | 25.81 | 25.88 | 1.00× |
| v0.8.4 modulators only | 26.38 | 26.28 | 0.98× ← no change |
| v0.8.4 + fused AdamW | 0.65 | 0.27 | 40× / 96× |
Same d_model=256 substrate-K transformer, same 5-step training, same final loss (6.95930 ± 5e-5 GPU roundtrip noise). Identical training trajectory, 96× faster on GPU.
The honest story
Initial guess was that the substrate-modulator matrix construction (_prom_smod_matrix, _prom_substrate_resample_matrix) was the bottleneck. Both got ported to Rust first — wall-clock did not move. Useful debugging finding, not a chapter on its own.
Profiling-by-fixing found the real bottleneck in prom_adamw_step: ~15 OMC-side element-wise loops per parameter per step. At 6 params of 256×256 cells, that's ~6M OMC ops per step. Replacing the inner block with one Rust builtin produced the 40× / 96× drop.
Both ports shipped — modulators because they're architecturally cleaner and verified correct, AdamW because it's the actual win.
The compound effect
- v0.8.2 wired GPU in. End-to-end null result — OMC overhead dominated.
- v0.8.3 found the substrate-shaped 8×32 tile (114 GFLOPS vs 71 at 1024²). Kernel-level win, no end-to-end change.
- v0.8.4 removes the OMC overhead. Both prior chapters finally pay out:
- The GPU/CPU split is now 2.4× (the actual matmul speedup at d_model=256)
- The 8×32 substrate-shaped tile is doing real work in production training
Future scale-ups (d_model=512+, batched inference, longer sequences, multi-block) get both the OMC-overhead-gone benefit AND the substrate-GPU acceleration.
What this unlocks immediately
- L1-MH + S-MOD α=1.0 in pure-OMC Prometheus — was unblocked by v0.8.1's broadcast-backward fix; now practical to run (seconds per step rather than minutes)
- Larger-scale substrate-attention — d_model=512+, multi-block, longer sequences
- Q6 cross-validation at real training length — v0.8.1's OMC-side Q6 result was at 80 steps; can now run 5000+ step training
API
Three new builtins:
```omc
Per-cell S-MOD modulator (alpha=0 → 1 everywhere)
substrate_smod_matrix(scores_2d, alpha)
Per-cell substrate-V resample modulator (scale != 0)
substrate_resample_matrix(v_2d, scale)
Fused AdamW per-parameter update — mutates m, v in place
substrate_adamw_update(cur, grad, m, v, lr, b1, b2, eps, wd, step)
```
prom_adamw_step in prometheus.omc now uses the fused builtin internally. Public AdamW interface is unchanged; any existing Prometheus training script picks up the speedup automatically.
Files
omnimcode-core/src/interpreter.rs— three builtins + flatten/rebuild helpersexamples/lib/prometheus.omc—_prom_smod_matrix/_prom_substrate_resample_matrixwrappers;prom_adamw_stepinner block calls the fused builtinexamples/tests/test_substrate_modulator_builtins.omc— 8 unit tests verifying equivalenceexperiments/prometheus_parity/SUBSTRATE_BUILTINS_WIN.md— full writeup
1111/1111 OMC tests pass.
Reproduction
```bash
cargo build --release -p omnimcode-cli --features gpu
CPU baseline (now fast)
OMC_GPU_BACKEND=cpu ./target/release/omnimcode-standalone examples/bench_prometheus_gpu.omc
GPU (now wins)
OMC_GPU_BACKEND=wgpu ./target/release/omnimcode-standalone examples/bench_prometheus_gpu.omc
```
v0.8.3 — Substrate-shaped GPU matmul wins +38% vs conventional 16×16
Headline
Anisotropic 8×32 tiles (Fibonacci-aligned short dim, wavefront-divisor long dim) decisively beat the conventional square 16×16 tile on the user's AMD RX 580 / Vulkan. At 1024² matmul: 18.81 ms vs 30.31 ms — 1.61× the GFLOPS.
The substrate's role here isn't to fight hardware physics. It's to direct exploration toward configurations conventional GPU programming would never test. Nobody writes 8×32 for matmul by convention. The substrate said "try 8 first," the 9-variant sweep found that 8 paired with a wavefront-divisor long axis dominates, and now that's the default.
The sweep
9 variants × 3 sizes on AMD RX 580 / RADV Vulkan. 1 warmup + 5 timed iterations averaged. Parity verified (max_abs_diff < 1e-2) on every cell.
1024×1024×1024 (the most decisive case)
| variant | ms | GFLOPS | vs 16×16 |
|---|---|---|---|
| 16×16 linear-K REF | 30.31 | 70.85 | ref |
| 8×32 linear-K aniso | 18.81 | 114.19 | +61% ← winner |
| 8×16 linear-K aniso | 18.99 | 113.10 | +60% |
| 8×8 linear-K (1WF, Fib) | 22.30 | 96.29 | +36% |
| 13×13 linear-K (3WF) | 37.61 | 57.11 | -19% |
| 21×21 linear-K (7WF) | 46.43 | 46.25 | -35% |
| 32×8 linear-K aniso | 42.20 | 50.89 | -28% |
| 16×16 Fib-K-stride | 29.74 | 72.20 | +0.2% |
The pattern
- Anisotropic 8×N (Fib-short × wavefront-long) wins decisively. 8×32 = 256 threads = exactly 4 wavefronts. Short dim is Fib-8 (= half wavefront, fits L1 cache line). Long dim is a cache-line multiple AND maps to N (the output-column axis) for coalesced writes.
- The 32×8 transpose LOSES by 30% — same total threads, but the wavefront-aligned axis is now M (rows) and writes become strided. Substrate wins only when it pairs with hardware constraints, not against them.
- Pure-square Fibonacci tiles LOSE. 13×13 = 3 wavefronts × 64 with 23 idle lanes (12% waste). 21×21 = 7 wavefronts hurts occupancy. Fib alone isn't enough — needs to align with wavefront divisors.
- Fib-K-stride is a wash — substrate-shaped reduction order doesn't matter; tile geometry does.
The deeper thesis
The substrate-IS-the-architecture hypothesis: strong form falsified, weak form confirmed.
- Falsified: "any Fibonacci tile beats power-of-2 tiles." Wavefront geometry (64 lanes lockstep) is a hard constraint. Pure 13/21 tiles pay an occupancy tax.
- Confirmed: "substrate-aligned dimensions, when they don't fight hardware, beat conventional tiles." 8×32 has Fib-8 short dim AND wavefront-divisor long dim, and wins by 60% at 1024².
The substrate is the heuristic that directs you to configurations conventional wisdom skips. Conventional GPU programming would never test 8×32 vs 16×16 — it's "too small a tile." The substrate said try 8, and the answer came back: not 8×8 (loses at small sizes due to dispatch overhead), not 13×13 (occupancy loss), but 8×wavefront-aligned.
Adoption
omnimcode-cli's install_gpu_matmul_accelerator() now uses WgpuBackend::with_tile_xy(8, 32) by default. Tunable via OMC_GPU_TILE_X / OMC_GPU_TILE_Y for hardware-specific A/Bs:
# Use the substrate-shaped default (8×32)
./omnimcode-standalone yourcode.omc
# Try a different tile for testing
OMC_GPU_TILE_X=4 OMC_GPU_TILE_Y=16 ./omnimcode-standalone ... # NVIDIA warp=32 candidateWhat's not yet tested
- Other anisotropic shapes (5×32, 5×40, 13×32, 8×64)
- Other GPU hardware: NVIDIA (warp=32), Apple M-series (different cache geometry). The hypothesis: 4×16 or 8×16 might win on NVIDIA
- Combined with substrate-quantized weights (data-layer substrate-shaping)
- Combined with sparse-via-substrate-distance (only computing high-value attention cells)
Files
omnimcode-gpu/src/wgpu_backend.rs—WgpuBackend::with_tile_xy(tx, ty)andwith_config(tx, ty, kernel);MatmulKernel::{Linear, FibKStride}enum; WGSL source-substitution for both tile and inner-loop bodyomnimcode-gpu/shaders/matmul.wgsl— parameterized templateomnimcode-gpu/examples/bench_fib_tile.rs— 9-variant sweep harnessomnimcode-cli/src/main.rs— default tile 8×32experiments/prometheus_parity/SUBSTRATE_GPU_WINS.md— full writeup
1103/1103 OMC tests pass.
Reproduction
cargo run --release -p omnimcode-gpu --features wgpu --example bench_fib_tile