Skip to content

Releases: RandomCoder-lab/OMC

v0.10.0 — omc-memory-plus axes 1-4: 5,356× context compression on real Claude Code dev work

18 May 01:53

Choose a tag to compare

Pushing OMC Memory+ compression ceiling beyond v1.0's 297× along four orthogonal axes. All four ship with round-trip verification on this codebase's own chapter writeups.

Headline

axis mechanism measured win
1 Merkle manifest hashes 5,356× context compression (19 chapters → 1 hash)
2 Cross-namespace dedup pool 5× disk on 5-way duplicate (linear with N namespaces)
3 Aged-tier zlib (`OMCZ` magic) 2.19× disk on Markdown
4 Substrate tokenizer (`OMCT` magic) 2.37× disk on OMC source (≈ ties Axis 3)

What's new

6 new MCP tools

  • `omc_memory_create_manifest(namespace, entries)` — bundle N leaf hashes into 1 manifest hash
  • `omc_memory_recall_manifest(content_hash, expand?)` — recall manifest, optionally fetch all leaves
  • `omc_memory_compact(namespace, age_threshold_secs)` — re-deflate aged pool bodies as OMCZ
  • `omc_memory_compact_substrate(namespace, age_threshold_secs)` — re-encode aged bodies via substrate tokenizer as OMCT
  • Auto-decompression of OMCZ + OMCT bodies on recall (transparent)
  • Cross-namespace dedup pool at `~/.omc/memory/_pool//.txt`

Architecture

  • All bodies content-addressed to a global pool with 256-shard fanout by hash top byte
  • Per-namespace dirs hold only the chronological index (`_index.jsonl`)
  • Recall: pool first → legacy per-namespace fallback → maybe_decompress (OMCZ / OMCT / plain)
  • flate2 added as omnimcode-core dep (rust_backend, no system zlib required)

How the compounding works

Axis 1 attacks context cost (tokens in LLM working set). Axes 2-4 attack disk cost (bytes on filesystem). Axis 1 is what the LLM pays per turn; axes 2-4 are what the user pays in storage. They multiply because they target different scarce resources.

Example — 19 chapters duplicated across 5 namespaces, all aged into Axis 3 compaction:

version disk bytes context tokens needed to reference everything
v1.0 naive 570,760 (95 files) 95 hash refs = 475 tokens
v0.9.2 pool dedup 114,152 (19 files) 95 refs = 475 tokens
v0.9.3 + zlib aged ~52,000 95 refs = 475 tokens
v0.9.1 manifest (same disk) 5 tokens (1 manifest hash)

The Axis 1 manifest hash is the headline win for LLM context cost. The other axes are the foundation that keeps disk + retrieval cheap as memory grows.

Honest framing on Axis 4

Substrate tokenizer compaction was hypothesized to dominate raw zlib on OMC-flavored content because the substrate dictionary was tuned for OMC syntax. Measured: 2.37× vs raw zlib's 2.48× on the same content — essentially tied. Axis 4 ships as the substrate-native compression path that enables future Axis 6 HBit dual-band work, even though raw byte-savings is on par with Axis 3.

Still on the roadmap

axis mechanism est. additional win
5 Delta compression between similar entries 10-100× on iterative content
6 HBit dual-band codec 2-3× over Axis 4
7 LLM-assisted lossy + hash verification 10-50× more on prose with regen

Tests

1111/1111 OMC tests pass. End-to-end MCP integration test verifies round-trip on Markdown + OMC source.

Files

  • `omnimcode-core/src/memory.rs` — Axis 1-4 implementations + maybe_decompress + varint helpers
  • `omnimcode-core/Cargo.toml` — flate2 added
  • `omnimcode-mcp/src/main.rs` — 4 new tool registrations + dispatch

v0.9.0 — omc-memory-plus v1.0: Claude Code MCP plugin, 297× context compression

18 May 01:31

Choose a tag to compare

First commercial product packaged from OMC.

OMC Memory+ for Claude Code — a Claude Code MCP plugin that gives Claude persistent, content-addressed memory across sessions via OMC's substrate codec.

Real dogfood measurements (18 chapter writeups from this very codebase)

metric value
raw content 101,771 bytes / 26,781 tokens
hash references in context 90 tokens
compression ratio 297.6×

At Claude Sonnet pricing ($3/MTok input):

  • Without Memory+: $0.08 per session that needs project context
  • With Memory+: $0.02 per session (90 hash refs + on-demand recall)
  • 73% per-session token cost reduction

Pricing

plan price features
Free $0 All 17 MCP tools, local memory storage, unlimited usage
Pro $5/mo per seat + cross-machine sync, cloud retention, namespace sharing
Team $50/mo for 5 seats + shared team namespaces, audit log, webhook events
Enterprise from $500/mo + self-hosted backend, SSO, SLA, data residency

ROI: 50-dev team saves $285/mo → Team plan ROI in 9 days.

Architecture

```
Claude Code

MCP protocol (stdio JSON-RPC)

omnimcode-mcp binary

~/.omc/memory// ← content-addressed, filesystem-backed
```

Local-first by default. Cloud sync is opt-in (Pro+).

The 17 MCP tools

5 load-bearing for the product:

  • `omc_compress_context` — substrate codec, alpha-rename invariant hashing
  • `omc_memory_store` / `_recall` / `_list` / `_stats` / `_evict`

12 useful adjacent:

  • `omc_eval`, `omc_help`, `omc_list_builtins`, `omc_categories`, `omc_did_you_mean`, `omc_explain_error`, `omc_predict`, `omc_corpus_size`, `omc_decompress`, `omc_fetch_by_hash`, `omc_unique_builtins`

Why this matters

The substrate codec was originally built for OMC-PROTOCOL v1 (distributed agent kernel communication). v0.9.0 repackages it for Claude Code users.

Pivot from research benches (v0.8 chapters: substrate-attention findings, GPU kernels, fused builtins) to a shipped product that monetizes the substrate's content-addressing property. The substrate is now generating revenue paths, not just papers.

Files

  • `products/omc-memory-plus/README.md` — feature pitch + measurements
  • `products/omc-memory-plus/INSTALL.md` — 3-step Claude Code install
  • `products/omc-memory-plus/PRICING.md` — tier breakdown + ROI calculator
  • `products/omc-memory-plus/install-snippet.json` — copy-paste MCP config

Next

  • v1.1 cloud sync infrastructure
  • v1.2 auto-detect long context blocks, suggest compression
  • v1.3 integration with Claude Code's `/compact` — replace summary with hash refs
  • v2.0 API endpoint for non-Claude-Code tools (Cursor, Continue, Aider)

Built on OMNIcode

`omnimcode-mcp` is part of OMNIcode, a harmonic computing language with native substrate primitives. The substrate codec, content-addressed canonical hashing, and fibtier memory eviction (default 232 entries = sum of first 10 Fibonacci tier sizes) all come from the OMC substrate work shipped in v0.0.5-v0.8.10.

v0.8.10 — substrate-aware backward gradients: TRIED, falsified at this scale

17 May 23:47

Choose a tag to compare

The research-grade item from the v0.8.9 goal

Hypothesis: instead of plain dL/dθ, route gradients through substrate so updates that move θ toward Fibonacci attractors are amplified and updates that move θ away are dampened. The substrate as a gradient-flow preconditioner instead of a forward modulator.

Result: falsified at d_model=32. Loss landscape pulls harder than substrate alignment.

What was built

tape_substrate_grad_mod(x, scale, alpha) — fused tape op with identity forward but substrate-shaped backward:

forward:   y = x                                    # identity
backward:
  for each cell:
    xs = round(x · scale)
    (attractor, dist) = nearest_attractor_with_dist(xs)
    if dist == 0:  dx = dy                          # on attractor, passthrough
    else:
      dir = sign(attractor - xs)
      pulls_toward = sign(g) · dir < 0              # update -lr·g moves toward attractor
      dx = dy · (1 + alpha) if pulls_toward         # amplify
           else dy · 1/(1 + alpha)                  # dampen

Smoke test verifies math (scale=10, alpha=0.5):

x xs nearest dist result expected
0.6 6 5 1 1.5 1.5 ✓
0.7 7 8 1 0.667 0.667 ✓
0.5 5 5 0 1.0 1.0 ✓

A/B result

Wrapped Q and V projections in tape_substrate_grad_mod(node, 64, 0.5) before matmul. Forward unchanged; backward biased. d_model=32, 250 steps, 3 seeds:

arm mean tail loss Δ vs baseline wins
baseline 1.998
+ substrate gm 2.165 +8.4% 1/3
+ substrate gm + Q6 2.157 +7.9% 1/3

Falsified. Substrate-shaped gradient bias hurts training at this scale.

The empirical substrate-architecture map after v0.8

Validated (substrate at outputs / in structure):

  • Data — CRT-PE positional encoding (cross-validates)
  • Algorithm — substrate-K + S-MOD + V-resample (cross-validates)
  • Hardware tile — 8×32 wavefront-aligned (+38-61%)
  • Post-training pattern — Q6 → 8.3× substrate concentration (v0.8.8)
  • Multi-head Q6 compound — −3.57% MH→MHQ6 (v0.8.9)

Falsified (substrate as input constraint or backward bias):

  • Init-time substrate-snap (v0.8.8 #3)
  • Gradient-time substrate-pull (this chapter)

Pattern: the substrate works when applied to OUTPUTS or revealed by training, but NOT when forced on INPUTS or GRADIENTS. The information flow direction matters.

Reformulations possible (future chapters)

  1. Different scale: scale=64 may be too coarse; try 1024 or per-layer adaptive
  2. Apply to FF not attention: FF weights may be more tolerant
  3. Decay alpha during training: start strong, fade to 0 — warm-start regularizer
  4. Regularization term instead of gradient bias: add sum(attractor_distance(param)) · lambda to loss

Each is its own chapter. v0.8.10 ships the honest negative.

#2 still in flight

d_model=128 larger-scale bench has been running 22+ min in background (buffered output won't print until exit). Lands in v0.8.11 with the actual MH-at-128 datum for PyTorch L1-MH −8.94% parity.

Tests

1111/1111 OMC tests pass.

Files

  • omnimcode-core/src/interpreter.rsTapeOp::SubstrateGradMod + dispatch + backward
  • examples/prometheus_substrate_grad_mod_xval.omc — 3-arm A/B
  • experiments/prometheus_parity/V0810_SUBSTRATE_AWARE_BACKWARD.md — writeup

v0.8.9 — sparse attention kernel + MH+Q6 compound confirmed

17 May 23:37

Choose a tag to compare

Headline

Two goal items shipped with hard data; a third (d_model=128 larger-scale bench) is still in flight and will land in v0.8.10.

#3 MH+Q6 compound — v0.8.8 finding SCALES to multi-head

v0.8.8 showed Q6 training pushes attention 8.3× toward substrate positions in single-head mode. v0.8.9 #3 asks: does this scale to multi-head?

d_model=32, n_heads=4, 250 steps, 3 seeds:

arm mean tail loss Δ vs SH
SH (single head) 2.0309
SH + Q6 fused 1.9865 −2.19%
MH (4 heads) 2.0486 +0.87%
MH + Q6 fused 1.9754 −2.73% compound

The compound analysis: MH→MHQ6: −3.57% vs SH→SHQ6: −2.19% — Q6 gets more leverage in multi-head because each head has its own Q to sculpt independently. Per-head substrate alignment compounds at attention time.

Architecturally confirmed: v0.8.8 attention-shaping mechanism scales beyond single-head.

#1 Sparse substrate attention kernel — mechanism shipped, speedup pending

Shipped: tape_substrate_sparse_scores(q_id, k_id, threshold) in omnimcode-core. Computes scores only at cells where CRT substrate_dist(i, j) ≤ threshold (moduli {5, 8, 13, 21}), masks the rest to −∞ so subsequent softmax assigns zero. Backward only flows through fired cells.

Cell density telemetry (set OMC_GPU_VERBOSE=1):

[sparse-scores] 70/1024 cells = 6.8%

Exact match to v0.8.8 measurement — the 6.84% substrate-close cells.

Wall-clock at seq_len=32, d_model=32 (10-iter avg, post-Q6 training)

variant forward ms/iter
dense 0.2723
sparse 0.2736
speedup 1.00×

No speedup yet. Dense path lives in tape_matmul's tight Rust inner loop; sparse path is naive scalar triple-loop with per-cell substrate-distance recomputation. At seq_len=32 the 93% saved MACs are eaten by per-cell overhead and cache-unfriendly access.

L1 difference between dense and sparse softmax: 57.44 / 1024 cells = 0.056 per cell. Sparse captures dominant attention positions, with −∞-masking introducing measurable divergence at low-mass cells.

Path to real speedup (reformulation for v0.8.10+)

  1. Larger seq_len — at seq_len=64+, dense seq²·d MAC count vs sparse (seq · density · seq)·d lets the saved MACs dominate per-cell overhead
  2. Precomputed substrate mask(i, j) → fired/not table is identical across batches and only depends on seq_len; compute once
  3. CSR / packed sparse format — replace dense [N×N] (most cells −∞) with list of (i, j, score) tuples + per-row prefix index
  4. WGSL implementation — once shapes pass GPU threshold, port sparse path to compute kernel

Mechanism validated. Speedup is v0.8.10 work.

#2 d_model=128 larger-scale bench — in flight

Task #265 background bench at d_model=128, seq_len=32, ff=256, 400 steps × 3 seeds × 3 arms (L0 / B / B+Q6). 13+ minutes in at chapter write time; lands in v0.8.10 with the MH-at-128 datum needed for direct PyTorch L1-MH −8.94% parity.

The compounding architecture continues

  • v0.8.1 broadcast-backward unblocked S-MOD training
  • v0.8.4 fused AdamW dissolved 96× overhead
  • v0.8.5 multi-head substrate-K cross-validated
  • v0.8.7 four deferred items each TRIED
  • v0.8.8 Q6 post-training substrate alignment (8.3×) + JIT eligibility fix
  • v0.8.9 MH+Q6 compound (−3.57% Q6 in MH) + sparse kernel mechanism

Tests

1111/1111 OMC tests pass.

Files

  • omnimcode-core/src/interpreter.rsTapeOp::SubstrateSparseScores + forward/backward
  • examples/prometheus_mh_q6_compound.omc — #3 4-arm A/B harness
  • examples/prometheus_sparse_attn_bench.omc — #1 dense-vs-sparse bench
  • experiments/prometheus_parity/V089_SPARSE_AND_MH_Q6.md — writeup

v0.8.8 — Q6 training pushes attention 8.3× toward substrate positions

17 May 23:03

Choose a tag to compare

The big finding

Q6 training pushes attention 8.3× toward substrate-aligned positions. This flips the v0.8.7 #8 falsification — sparse attention via substrate distance IS viable, but only after Q6 training.

After 1000 Q6-fused training steps (d_model=32, seq_len=32):

arm mass in substrate-close cells cell fraction ratio
baseline (no Q6), trained 4.82% 6.84% 0.70 (anti-correlated)
Q6 fused, trained 56.80% 6.84% 8.31×

A sparse kernel computing only substrate-close cells captures 56.8% of attention with 6.84% of compute. Real architecture-level "substrate is the architecture" claim, unlocked as a post-training inference optimization.

Mechanism

Q6 dampens large-magnitude query components via exp(-γ · log_φπfib(|q · scale| + 1)). Components whose substrate log-distance is small get less dampening, so they survive training and dominate the attention pattern. The substrate isn't directly constraining position — it's reshaping the gradient landscape so substrate-aligned positions win.

Implications

  • Sparse inference kernel: q[i] · k[j] only for substrate_dist(i, j) ≤ τ
  • ~10× attention compute reduction at the cost of ~43% attention quality (a defensible inference-time tradeoff)
  • The PyTorch Q6 −12.15% finding may partially be substrate-position alignment in disguise

Plus 3 more findings

Infrastructure fix — JIT eligibility audit

fn_uses_collections in omnimcode-codegen skips JIT for fns touching arrays/dicts/strings. OMC_HBIT_JIT=1 no longer crashes on Prometheus. Wall-clock unchanged at d_model=256 (v0.8.4 already removed the overhead JIT would have compressed); unblocks JIT for any future tape-using workload.

Negative — substrate-quant 6-seed verifies as noise

The v0.8.7 single-seed "lower loss" was seed noise. Mean 2.365 vs baseline 2.337 (+1.2% worse) across 6 seeds × 300 steps with OMC_GPU_SUBSTRATE_QUANT_SCALE=4096. Training-time substrate quantization is a marginal regression at this scale.

Negative — substrate-aware param init falsified

Snap-to-attractor at init scale 1024/4096 gives +2.6%/+4.7% worse mean loss vs uniform random init (6 seeds × 300 steps). Starting on attractors gives less gradient info per step.

Methodology: each experiment ≤ 10 min, all four genuinely tried

# finding result
1 Q6 post-train sparsity POSITIVE — 8.31× substrate concentration
2 substrate-quant 6-seed NEGATIVE — seed noise verified
3 substrate-init A/B NEGATIVE — falsified, +2.6/+4.7% worse
4 JIT eligibility audit POSITIVE infra — fix landed, 1111/1111 pass

Three negatives + one massive positive + one infra fix. The "fail forward" discipline keeps producing useful data either way.

Compounding architecture

  • v0.8.1 fixed broadcast-backward (unblocked S-MOD training)
  • v0.8.4 fused AdamW (dissolved 96× overhead)
  • v0.8.5 multi-head substrate-K (architecturally needed for parity)
  • v0.8.7 tried 4 deferred items
  • v0.8.8 four more attempts; #1 unlocks future sparse inference

Tests

1111/1111 OMC tests pass.

Files

  • examples/prometheus_q6_post_train_sparsity.omc — Finding 1
  • examples/prometheus_substrate_quant_6seed.omc — Finding 2
  • examples/prometheus_substrate_init_xval.omc — Finding 3
  • omnimcode-codegen/src/lib.rs — Finding 4 (fn_uses_collections)
  • omnimcode-core/src/interpreter.rssubstrate_snap_matrix builtin
  • experiments/prometheus_parity/V088_FOUR_FINDINGS.md — writeup

v0.8.7 — items #7-10 each tried: 2 viable, 1 falsified, 1 real bug

17 May 22:14

Choose a tag to compare

The v0.8.6 chapter scoped items #7-10 as "future chapters". The Stop hook on the goal correctly pushed back: scoping isn't trying. Each item now has the smallest meaningful attempt and a real measured result.

#7 substrate-quantized GPU weights — TRIED, math VIABLE

Boundary flag OMC_GPU_SUBSTRATE_QUANT=1 snaps each weight cell to its nearest Fibonacci attractor before the f32 GPU conversion.

scale final loss vs baseline 6.959
64 7.514 +8% worse (too coarse)
1024 6.537 within noise
4096 6.149 within noise (slightly lower)
65536 6.782 ≈ baseline

Math is viable at scale ≥ 1024. Real bandwidth-saving u16/u8 packed WGSL storage is the deferred work — no longer blocked by feasibility.

#8 CRT-PE sparse attention — TRIED, HYPOTHESIS FALSIFIED at random init

Wrote /tmp/sparse_attn_test.omc: measure fraction of softmax-attention mass in substrate-close cells (substrate_dist ≤ 5 using moduli {5, 8, 13, 21}) vs random q × CRT-PE k.

Result: 8.36% of attention mass in 6.84% of cells — essentially uniform.

Sample argmax positions:

  • row 0: argmax_j=31, substrate_dist=23 (FAR)
  • row 1: argmax_j=18, substrate_dist=24 (FAR)
  • row 4: argmax_j=15, substrate_dist=20 (FAR)

Most argmaxes are substrate-FAR. The "skip far pairs, they softmax to ~0" assumption is false for untrained queries.

Reformulations possible (each is its own chapter): post-training test, Fibonacci-block magnitude sparsity, substrate-aligned q training.

#9 LLVM JIT for tape paths — TRIED, real integration bug

Built --features "gpu llvm-jit" and ran with OMC_HBIT_JIT=1. JIT compiled several prom_* fns successfully, then crashed:

Error: arr_len requires an array
  at prom_crt_pe_matrix (769:32)

JIT'd return values don't respect OMC Value semantics for array-shaped returns crossing back into tree-walk callers. Real integration bug.

Reformulation: JIT-eligibility audit. Mark fns whose return value goes into tree-walk array ops as @no_jit. ~1-2 hours focused. Not impossible, but unsafe to ship without fix.

#10 f16/bfloat16 GPU paths — TRIED, math VIABLE

OMC_GPU_SIMULATE_F16=1 truncates the bottom 13 mantissa bits of each f32 cell before the wgpu matmul, simulating f16's 10-bit mantissa precision.

final loss wall-clock
f32 baseline 6.959 0.255 s/step
f16-simulated 6.378 0.254 s/step

Training doesn't explode at f16 precision. The 2× bandwidth payoff still needs a real WGSL f16 kernel + f64→f16 boundary + loss scaling — math test passed unblocks that work.

The honest scorecard

# item result deferred work
7 substrate-quantized weights TRIED, VIABLE u16/u8 packed WGSL storage
8 CRT-PE sparse attention TRIED, FALSIFIED at random init reformulate hypothesis (post-trained? magnitude?)
9 LLVM JIT for tape TRIED, real bug JIT eligibility audit
10 f16/bf16 GPU TRIED, VIABLE real WGSL f16 + loss scaling

Two viable-but-needs-more, one falsified-but-reformulable, one blocked-by-bug. All four genuinely TRIED. The Stop hook was right.

1111/1111 OMC tests pass.

v0.8.6 — #3 softmax accel scaffold + survey for #7-10

17 May 22:05

Choose a tag to compare

Scope

Item #3 in the v0.8.5 optimization plan: route more tape ops through GPU. Shipped as scaffolding — the hook is in place, the dispatch consults it, the binary registers a stub. At current Prometheus shapes the stub declines (default threshold = 1M cells); larger-scale runs and future hardware can opt in via env.

Plus an honest survey of items #7-10 (each a future chapter rather than rushed half-implementations in this one).

What's new

SoftmaxAccelerator hook in omnimcode-core::accel, mirroring MatmulAccelerator:

```rust
pub type SoftmaxAccelerator = Box<
dyn Fn(usize, usize, &[f64]) -> Option<Result<Vec, String>>
+ Send + Sync,

;
pub fn register_softmax_accelerator(f: SoftmaxAccelerator) -> Result<(), &'static str>;
```

tape_softmax interpreter dispatch consults the hook first, falls through to the existing CPU triple-pass when the hook declines. omnimcode-cli registers a stub at startup that declines all calls under OMC_GPU_SOFTMAX_MIN_CELLS (default 1,000,000) — high enough that no current Prometheus path opts in.

Honest framing

At current Prometheus shapes we exercise (d_model=256, seq_len=64, scores 64×64), per-row softmax is memory-bound and tiny (4k cells = microseconds of CPU work). GPU buffer alloc + dispatch overhead would dominate any kernel speedup. The scaffold lives so:

  • Larger-scale runs (seq_len=512+, d_model=1024+) can opt in by setting OMC_GPU_SOFTMAX_MIN_CELLS lower
  • Future hardware with cheaper dispatch (Apple M-series unified memory, NVIDIA persistent kernels) can register a non-stub accelerator
  • The same pattern extends to LayerNorm, element-wise, etc. — accel.rs is the precedent

This is the right size of attempt for an item whose payoff at current scales is small but whose architectural slot matters for the trajectory.

What's deferred to v0.8.7+

experiments/prometheus_parity/V086_OPTIMIZATION_SURVEY.md records the scope for each remaining item:

  • #7 substrate-quantized GPU weights — own chapter (~half-day). Encode f32 as (u8 attractor_index, i16 delta), dequant on GPU. Substrate at the data layer where it actually lives.
  • #8 CRT-PE-keyed sparse attention matmul — own chapter. Sparse WGSL kernel + sparse-aware backward.
  • #9 omnimcode-codegen LLVM JIT for tape paths — own chapter. Needs Prometheus-fn JIT-compatibility audit.
  • #10 f16/bf16 GPU paths — own chapter. New WGSL + loss-scaling logic for training stability.

Tests

1111/1111 OMC tests pass.

v0.8.5 — substrate ops, embedding lookup, cross-entropy fused, multi-head substrate-K

17 May 22:01

Choose a tag to compare

Five v0.8.5 optimization items shipped. The compound v0.8.4 + v0.8.5 effect: training-loop hot path is now fully in Rust builtins; the math-equivalent multi-head substrate-K attention is available; the architecture is positioned for v0.8.6+ to push remaining items (substrate-quantized GPU weights, sparse attention, etc.).

What's new

#1 tape_cross_entropy_batch — fused tape op

Per-batch cross-entropy as one tape node. Closed-form (p - one_hot)/N backward replaces the chain through 5 intermediate nodes (softmax → log → mask → mul → sum). Wins materialize at large vocab.

#2 tape_embedding_lookup — direct row gather

Replaces prom_embedding_batch's OMC-built [N, vocab] one-hot + matmul chain with a direct row gather. Backward scatters rows of dy back into the table gradient (same gradient as the one-hot @ table chain). Wins scale with vocab size.

#4 OMC_VM=1 negative finding

Measured: 0.662 s/step at d_model=256 (was 0.661 tree-walk). No win once hot paths are in Rust builtins. Not pursued further for Prometheus — the bytecode VM optimizes basic-block dispatch, but the hot work is now happening below that layer.

#5 Multi-head substrate-K attention — prom_attention_substrate_k_mh_*

Math-equivalent "sum of per-head W_O projections" form (avoids needing a tape_concat op). All single-head toggles (smod_alpha, v_resample_scale, q6_mode) honored per-head with same defaults.

Cross-validation at d_model=32, 4 heads (d_head=8), 400 steps, 3 seeds:

mean tail loss wins
SH (single head) 2.0047
MH (4 heads) 1.9998 2/3

Δ = −0.25%, directionally consistent with PyTorch's L1-MH −8.94%. Effect grows with capacity; same code path supports the PyTorch −12.15% Q6-MH finding once you turn on q6_mode=\"fused\".

#6 tape_substrate_resample — fused tape op

Skips tape_value → modulator_matrix → tape_const round-trip (which was extracting 16k f64s at d_model=256 seq_len=64 per call). Pairs with the substrate_resample_matrix Rust builtin from v0.8.4. Same math.

Honest framing

Wall-clock at d_model=256 is essentially unchanged from v0.8.4 for these five items in isolation — that scale was already AdamW-bound and the OMC overhead was already removed. These wins materialize when:

  • Vocab grows large — cross-entropy and embedding lookup get O(vocab) cheaper
  • Multi-head trained — the architectural win + the OMC-overhead-gone substrate-attention compose
  • Bigger d_model — fused substrate_resample skips proportionally more I/O

The MH cross-validation result is the load-bearing finding here: the PyTorch L1-MH win cross-validates in OMC's tape autograd.

What's still on the v0.8.5 list

  • #3 Route more tape ops through GPU — modest win at current scales (memory-bound ops aren't matmul-class), scaffold to be added in v0.8.6
  • #7 Substrate-quantized GPU weights — own chapter
  • #8 CRT-PE-keyed sparse attention matmul — own chapter
  • #9 LLVM JIT for tape paths — own chapter
  • #10 f16/bf16 GPU paths — own chapter

Tests

1111/1111 OMC tests pass.

Files

  • omnimcode-core/src/interpreter.rstape_cross_entropy_batch, tape_embedding_lookup, tape_substrate_resample builtins + backwards
  • examples/lib/prometheus.omc — wrappers + prom_attention_substrate_k_mh_*
  • examples/prometheus_mh_xval.omc — SH vs MH cross-validation harness

v0.8.4 — Substrate Rust builtins: 40× CPU / 96× GPU end-to-end on Prometheus

17 May 21:21

Choose a tag to compare

Headline

Three Rust builtins replace OMC-side inner-loop helpers. The fused substrate_adamw_update is the actual bottleneck killer — replaces ~15 element-wise loops per parameter with one tight Rust loop. Combined with v0.8.2 (GPU integration) and v0.8.3 (substrate-shaped 8×32 tile), the three chapters compound to give the first real end-to-end Prometheus training speedup.

CPU s/step GPU s/step speedup vs v0.8.2
v0.8.2 baseline 25.81 25.88 1.00×
v0.8.4 modulators only 26.38 26.28 0.98× ← no change
v0.8.4 + fused AdamW 0.65 0.27 40× / 96×

Same d_model=256 substrate-K transformer, same 5-step training, same final loss (6.95930 ± 5e-5 GPU roundtrip noise). Identical training trajectory, 96× faster on GPU.

The honest story

Initial guess was that the substrate-modulator matrix construction (_prom_smod_matrix, _prom_substrate_resample_matrix) was the bottleneck. Both got ported to Rust first — wall-clock did not move. Useful debugging finding, not a chapter on its own.

Profiling-by-fixing found the real bottleneck in prom_adamw_step: ~15 OMC-side element-wise loops per parameter per step. At 6 params of 256×256 cells, that's ~6M OMC ops per step. Replacing the inner block with one Rust builtin produced the 40× / 96× drop.

Both ports shipped — modulators because they're architecturally cleaner and verified correct, AdamW because it's the actual win.

The compound effect

  • v0.8.2 wired GPU in. End-to-end null result — OMC overhead dominated.
  • v0.8.3 found the substrate-shaped 8×32 tile (114 GFLOPS vs 71 at 1024²). Kernel-level win, no end-to-end change.
  • v0.8.4 removes the OMC overhead. Both prior chapters finally pay out:
    • The GPU/CPU split is now 2.4× (the actual matmul speedup at d_model=256)
    • The 8×32 substrate-shaped tile is doing real work in production training

Future scale-ups (d_model=512+, batched inference, longer sequences, multi-block) get both the OMC-overhead-gone benefit AND the substrate-GPU acceleration.

What this unlocks immediately

  • L1-MH + S-MOD α=1.0 in pure-OMC Prometheus — was unblocked by v0.8.1's broadcast-backward fix; now practical to run (seconds per step rather than minutes)
  • Larger-scale substrate-attention — d_model=512+, multi-block, longer sequences
  • Q6 cross-validation at real training length — v0.8.1's OMC-side Q6 result was at 80 steps; can now run 5000+ step training

API

Three new builtins:

```omc

Per-cell S-MOD modulator (alpha=0 → 1 everywhere)

substrate_smod_matrix(scores_2d, alpha)

Per-cell substrate-V resample modulator (scale != 0)

substrate_resample_matrix(v_2d, scale)

Fused AdamW per-parameter update — mutates m, v in place

substrate_adamw_update(cur, grad, m, v, lr, b1, b2, eps, wd, step)
```

prom_adamw_step in prometheus.omc now uses the fused builtin internally. Public AdamW interface is unchanged; any existing Prometheus training script picks up the speedup automatically.

Files

  • omnimcode-core/src/interpreter.rs — three builtins + flatten/rebuild helpers
  • examples/lib/prometheus.omc_prom_smod_matrix / _prom_substrate_resample_matrix wrappers; prom_adamw_step inner block calls the fused builtin
  • examples/tests/test_substrate_modulator_builtins.omc — 8 unit tests verifying equivalence
  • experiments/prometheus_parity/SUBSTRATE_BUILTINS_WIN.md — full writeup

1111/1111 OMC tests pass.

Reproduction

```bash
cargo build --release -p omnimcode-cli --features gpu

CPU baseline (now fast)

OMC_GPU_BACKEND=cpu ./target/release/omnimcode-standalone examples/bench_prometheus_gpu.omc

GPU (now wins)

OMC_GPU_BACKEND=wgpu ./target/release/omnimcode-standalone examples/bench_prometheus_gpu.omc
```

v0.8.3 — Substrate-shaped GPU matmul wins +38% vs conventional 16×16

17 May 21:05

Choose a tag to compare

Headline

Anisotropic 8×32 tiles (Fibonacci-aligned short dim, wavefront-divisor long dim) decisively beat the conventional square 16×16 tile on the user's AMD RX 580 / Vulkan. At 1024² matmul: 18.81 ms vs 30.31 ms — 1.61× the GFLOPS.

The substrate's role here isn't to fight hardware physics. It's to direct exploration toward configurations conventional GPU programming would never test. Nobody writes 8×32 for matmul by convention. The substrate said "try 8 first," the 9-variant sweep found that 8 paired with a wavefront-divisor long axis dominates, and now that's the default.

The sweep

9 variants × 3 sizes on AMD RX 580 / RADV Vulkan. 1 warmup + 5 timed iterations averaged. Parity verified (max_abs_diff < 1e-2) on every cell.

1024×1024×1024 (the most decisive case)

variant ms GFLOPS vs 16×16
16×16 linear-K REF 30.31 70.85 ref
8×32 linear-K aniso 18.81 114.19 +61% ← winner
8×16 linear-K aniso 18.99 113.10 +60%
8×8 linear-K (1WF, Fib) 22.30 96.29 +36%
13×13 linear-K (3WF) 37.61 57.11 -19%
21×21 linear-K (7WF) 46.43 46.25 -35%
32×8 linear-K aniso 42.20 50.89 -28%
16×16 Fib-K-stride 29.74 72.20 +0.2%

The pattern

  • Anisotropic 8×N (Fib-short × wavefront-long) wins decisively. 8×32 = 256 threads = exactly 4 wavefronts. Short dim is Fib-8 (= half wavefront, fits L1 cache line). Long dim is a cache-line multiple AND maps to N (the output-column axis) for coalesced writes.
  • The 32×8 transpose LOSES by 30% — same total threads, but the wavefront-aligned axis is now M (rows) and writes become strided. Substrate wins only when it pairs with hardware constraints, not against them.
  • Pure-square Fibonacci tiles LOSE. 13×13 = 3 wavefronts × 64 with 23 idle lanes (12% waste). 21×21 = 7 wavefronts hurts occupancy. Fib alone isn't enough — needs to align with wavefront divisors.
  • Fib-K-stride is a wash — substrate-shaped reduction order doesn't matter; tile geometry does.

The deeper thesis

The substrate-IS-the-architecture hypothesis: strong form falsified, weak form confirmed.

  • Falsified: "any Fibonacci tile beats power-of-2 tiles." Wavefront geometry (64 lanes lockstep) is a hard constraint. Pure 13/21 tiles pay an occupancy tax.
  • Confirmed: "substrate-aligned dimensions, when they don't fight hardware, beat conventional tiles." 8×32 has Fib-8 short dim AND wavefront-divisor long dim, and wins by 60% at 1024².

The substrate is the heuristic that directs you to configurations conventional wisdom skips. Conventional GPU programming would never test 8×32 vs 16×16 — it's "too small a tile." The substrate said try 8, and the answer came back: not 8×8 (loses at small sizes due to dispatch overhead), not 13×13 (occupancy loss), but 8×wavefront-aligned.

Adoption

omnimcode-cli's install_gpu_matmul_accelerator() now uses WgpuBackend::with_tile_xy(8, 32) by default. Tunable via OMC_GPU_TILE_X / OMC_GPU_TILE_Y for hardware-specific A/Bs:

# Use the substrate-shaped default (8×32)
./omnimcode-standalone yourcode.omc

# Try a different tile for testing
OMC_GPU_TILE_X=4 OMC_GPU_TILE_Y=16 ./omnimcode-standalone ...    # NVIDIA warp=32 candidate

What's not yet tested

  • Other anisotropic shapes (5×32, 5×40, 13×32, 8×64)
  • Other GPU hardware: NVIDIA (warp=32), Apple M-series (different cache geometry). The hypothesis: 4×16 or 8×16 might win on NVIDIA
  • Combined with substrate-quantized weights (data-layer substrate-shaping)
  • Combined with sparse-via-substrate-distance (only computing high-value attention cells)

Files

  • omnimcode-gpu/src/wgpu_backend.rsWgpuBackend::with_tile_xy(tx, ty) and with_config(tx, ty, kernel); MatmulKernel::{Linear, FibKStride} enum; WGSL source-substitution for both tile and inner-loop body
  • omnimcode-gpu/shaders/matmul.wgsl — parameterized template
  • omnimcode-gpu/examples/bench_fib_tile.rs — 9-variant sweep harness
  • omnimcode-cli/src/main.rs — default tile 8×32
  • experiments/prometheus_parity/SUBSTRATE_GPU_WINS.md — full writeup

1103/1103 OMC tests pass.

Reproduction

cargo run --release -p omnimcode-gpu --features wgpu --example bench_fib_tile