
LAWA-EMA frontier fork (pr198 base, SWA -> LAWA val_bpb=1.1551)#201

Open
machdragon wants to merge 1 commit into openai:main from machdragon:submission/lawa-frontier-int6-mlp3x

Conversation


@machdragon machdragon commented Mar 20, 2026

Summary

  • val_bpb = 1.1551 (int6 sliding window, stride=64) | 12.7 MB artifact | 8xH100
    SXM, 600s
  • Based on PR #198, "11-Layer Int6 + WD=0.04 + SWA + FA3" (val_bpb: 1.1318): 11L, int6, MLP3x, relu², FA3, SmearGate, BigramHash, OrthoInit, U-Net skips, WD=0.04
  • SWA replaced by LAWA-EMA (exponential moving average, decay=0.995, float32
    shadow, every-step update)
  • Overtone init added (SVD power-law embedding spectrum for smoother int6
    quantization)
  • Two bug fixes: bigram proj zero-init override, sliding window partial-window overlap
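
The SWA → LAWA-EMA swap above amounts to keeping a float32 shadow copy of the weights and blending it every step. A minimal sketch of the update rule (hypothetical helper names; the PR's actual code may differ):

```python
# Minimal LAWA-EMA sketch (hypothetical helper, not the PR's actual code).
# A float32 "shadow" copy of every parameter is updated after each
# optimizer step: shadow = decay * shadow + (1 - decay) * param.

def ema_update(shadow, params, decay=0.995):
    """Blend current params into the shadow copy, in place."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow

# Toy usage: two "training steps" pulling a weight from 0.0 toward 1.0.
shadow = {"w": 0.0}
for step in range(2):
    params = {"w": 1.0}   # pretend the optimizer produced this value
    ema_update(shadow, params)

# After two updates: 0.995 * (0.995*0 + 0.005*1) + 0.005*1 = 0.009975
print(round(shadow["w"], 6))
```

With decay=0.995 the shadow averages over roughly the last 1/(1-0.995) = 200 steps, which is why an every-step EMA can stand in for a windowed SWA.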
┌────────────────────────────┬─────────┬─────────┐
│ Metric                     │ PR #198 │ This PR │
├────────────────────────────┼─────────┼─────────┤
│ Int6 sliding val_bpb (s64) │ 1.1318  │ 1.1551  │
│ Int6 roundtrip val_bpb     │ 1.1543  │ 1.1779  │
│ Artifact size              │ 15.7 MB │ 12.7 MB │
│ Steps (600s)               │ 7,412   │ 6,715   │
└────────────────────────────┴─────────┴─────────┘

Single seed run (seed=1337). Additional seed runs pending for statistical validation.
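
The "Overtone init" is only described above in one line; one plausible reading (the function name, exponent, and scale convention here are assumptions, not the PR's code) is to reshape the embedding matrix's singular-value spectrum into a power law before training:

```python
import numpy as np

# Hypothetical sketch of an SVD power-law embedding init ("Overtone init").
# The exponent and scale are assumptions; the PR only says the embedding
# spectrum is reshaped to a power law so it quantizes more smoothly to int6.

def overtone_init(emb, alpha=1.0):
    """Replace emb's singular values with a power-law decaying spectrum."""
    u, s, vt = np.linalg.svd(emb, full_matrices=False)
    k = np.arange(1, s.size + 1, dtype=emb.dtype)
    target = s[0] * k ** (-alpha)   # sigma_i ∝ i^(-alpha), keep the top value
    return (u * target) @ vt        # same as u @ diag(target) @ vt

rng = np.random.default_rng(1337)
emb = rng.standard_normal((64, 16)).astype(np.float32)
out = overtone_init(emb)
s = np.linalg.svd(out, compute_uv=False)
# With alpha = 1 the new spectrum follows sigma_i = sigma_1 / i.
```

A heavier-tailed spectrum concentrates the weight mass, which plausibly reduces the range the int6 grid has to cover; whether that is the PR's motivation is inferred from the "smoother int6 quantization" note above.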

Test plan

  • Maintainer re-evaluation of train_gpt.py on 8xH100 with official harness
  • Multi-seed validation (seeds 1337, 42, 2025)

Our numbers:

┌────────────────────────────┬────────┐
│ Metric                     │ Value  │
├────────────────────────────┼────────┤
│ Pre-quant val_bpb          │ 1.1622 │
│ Int6 roundtrip val_bpb     │ 1.1779 │
│ Int6 sliding val_bpb (s64) │ 1.1551 │
└────────────────────────────┴────────┘
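
The "sliding val_bpb (s64)" rows refer to a sliding-window evaluation with stride 64. A minimal sketch of the window/stride bookkeeping (window size and helper are assumptions; the "partial-window overlap" bug fix in the summary suggests the final-window accounting is the delicate part):

```python
# Sketch of sliding-window eval bookkeeping (assumed scheme).
# Each window gives the model up to `window` tokens of context, but only
# the last `stride` tokens are scored, so every token is scored exactly
# once. The final partial window must not re-score covered tokens (the
# overlap bug this PR says it fixes).

def windows(n_tokens, window=2048, stride=64):
    spans = []  # (context_start, score_from, score_to) per forward pass
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)   # context start
        score_to = min(pos + stride, n_tokens)  # exclusive scoring end
        spans.append((start, pos, score_to))
        pos = score_to
    return spans

spans = windows(200, window=128, stride=64)
scored = sum(b - a for _, a, b in spans)
# Every token is scored exactly once: scored == 200.
```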

Note: our 1.1551 is worse than PR #198's 1.1318. The artifact is 3 MB smaller (12.7 vs
15.7 MB), but the BPB regressed.
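
For context on the "Int6 roundtrip val_bpb" row: a roundtrip eval quantizes the weights to int6 and immediately dequantizes them before scoring. A sketch of one common symmetric per-tensor scheme (an assumption; the repo's actual quantizer may differ):

```python
import numpy as np

# Sketch of a symmetric per-tensor int6 roundtrip (assumed scheme, not
# necessarily the repo's). We clip to [-31, 31], dropping -32 to keep the
# 6-bit grid symmetric around zero.

def int6_roundtrip(w):
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q * scale  # dequantized weights used for the roundtrip eval

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
w_hat = int6_roundtrip(w)
err = np.abs(w - w_hat).max()
# Roundtrip error is bounded by half a quantization step (scale / 2),
# which is what shows up as the pre-quant -> roundtrip val_bpb gap.
```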

11-Layer Int6 + LAWA-EMA (decay=0.995) + Overtone Init, based on PR openai#198.
Replaces SWA with every-step EMA averaging. Fixes bigram proj zero-init
override and sliding window partial-window overlap. 12.7 MB artifact.

8xH100 SXM, 600s, seed=1337, 6715 steps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@machdragon machdragon force-pushed the submission/lawa-frontier-int6-mlp3x branch from 3693d95 to 31c16a6 on March 20, 2026 19:14
@machdragon machdragon changed the title Staging: LAWA-EMA frontier fork (pr162 base, SWA→LAWA) LAWA-EMA frontier fork (pr198 base, LAWA-EMA + Overtone Init (val_bpb=1.1551) Mar 20, 2026
@machdragon machdragon changed the title LAWA-EMA frontier fork (pr198 base, LAWA-EMA + Overtone Init (val_bpb=1.1551) LAWA-EMA frontier fork (pr198 base, SWA -> LAWA + Overtone Init (val_bpb=1.1551) Mar 20, 2026
@machdragon machdragon changed the title LAWA-EMA frontier fork (pr198 base, SWA -> LAWA + Overtone Init (val_bpb=1.1551) LAWA-EMA frontier fork (pr198 base, SWA -> LAWA (val_bpb=1.1551) Mar 20, 2026
@machdragon machdragon changed the title LAWA-EMA frontier fork (pr198 base, SWA -> LAWA (val_bpb=1.1551) LAWA-EMA frontier fork (pr198 base, SWA -> LAWA val_bpb=1.1551) Mar 20, 2026
@machdragon machdragon marked this pull request as ready for review March 20, 2026 19:22
machdragon added a commit to machdragon/parameter-golf that referenced this pull request Mar 20, 2026
Built on PR openai#201 (LAWA-EMA + Int6 + Overtone + MLP3x, val_bpb=1.1551).
Adds four improvements targeting quantization fidelity and eval-time adaptation:

- KURE kurtosis regularization + R2 outlier penalty for int6-friendly weights
- Tanh weight reparameterization bounding effective weights to [-1,1]
- Parallel EMA tracks (0.995/0.999/0.9995) with proxy-eval selection
- Causal LoRA TTT (rank 8) ported from PR openai#77 for eval-time adaptation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>