
Record: Int6 STE + SmearGate + Seq2048 + OrthoInit + RoPE50K + SWA/100 (mean val_bpb=1.1507)#206

Open
dexhunter wants to merge 2 commits into openai:main from dexhunter:weco-step31-swa100-1.1494

Conversation

@dexhunter
Contributor

Summary

Mean val_bpb = 1.1507 (3-seed verified, p < 0.001), beating the merged SOTA (1.1748) by 0.0241.

Evolved over 31 AIDE2 optimization steps from baseline 1.1607 on 8xH100.

| Seed | val_bpb | Steps | ms/step | Artifact (bytes) |
|------|---------|-------|---------|------------------|
| 1337 | 1.15022 | 10613 | 56.53 | 14,555,057 |
| 42 | 1.15095 | 10610 | 56.53 | 14,791,593 |
| 7 | 1.15099 | 10610 | 56.53 | 14,562,412 |
| **Mean** | **1.15072** | | | |

Technique Stack

  1. Int6 STE — Fake int6 quantization every forward pass with STE gradient bypass
  2. NorMuon + WD=0.02 — Row-normalized Newton-Schulz with decoupled weight decay
  3. 3x MLP (1536 hidden) — Wider MLP enabled by int6 compression
  4. SmearGate — Learned gate blending token embeddings with predecessors (~512 params)
  5. Orthogonal Init — OrthoInit on all non-zero-init linear layers
  6. Seq2048 + RoPE Base 50K — 2x training context with adjusted RoPE
  7. SWA every 100 steps — More frequent checkpoint averaging during warmdown
  8. FP16 tied embedding — Embedding never quantized
  9. Sliding window eval (stride=64) — Every token scored with ~1984 tokens context
  10. Zstd-22 — Better compression than zlib
  11. U-Net skip connections — Encoder-decoder with learnable skip weights
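The core of item 1 is fake quantization: weights are snapped to int6 levels on every forward pass, while the straight-through estimator (STE) lets gradients flow as if the rounding were the identity. A minimal NumPy sketch of the forward-pass quantizer (symmetric per-tensor scaling is an assumption here; the PR's actual scaling granularity may differ, and the STE itself lives in the autograd framework, not shown):

```python
import numpy as np

def fake_int6_quantize(w: np.ndarray) -> np.ndarray:
    """Fake-quantize weights to signed 6-bit levels (-31..31).

    In quantization-aware training the forward pass uses these snapped
    weights; the straight-through estimator passes gradients through
    unchanged, as if this function were the identity.
    """
    scale = np.abs(w).max() / 31.0  # symmetric per-tensor scale (assumption)
    if scale == 0.0:
        return w
    q = np.clip(np.round(w / scale), -31, 31)
    return q * scale

w = np.random.randn(4, 4).astype(np.float32)
wq = fake_int6_quantize(w)
# wq takes at most 63 distinct levels; per-weight error is at most scale / 2
```

Because every forward pass already sees int6 weights, the final artifact can store the 6-bit codes with near-zero additional quality loss, which is what frees the budget for the wider 3x MLP.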

Architecture

  • 9 layers, 512 dim, 8 heads, 4 KV heads (GQA)
  • Vocab 1024 (SentencePiece BPE), seq len 2048, tied embeddings
  • relu² activation, RoPE, logit softcapping (30.0)

Submission checklist

  • 3-seed verification (mean val_bpb=1.1507)
  • All artifacts < 16MB (max 14.79MB, 1.2MB headroom)
  • Wallclock < 600s on 8xH100
  • Train logs included (3 seeds)
  • Reproducible train_gpt.py included
  • submission.json with metadata

Commit message: …SWA/100 (val_bpb=1.1507)

3-seed verified mean val_bpb=1.1507 (sliding window, stride=64).
Seeds: 1337=1.1502, 42=1.1509, 7=1.1510. All artifacts under 16MB.
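The stride-64 sliding-window evaluation can be sketched as a window scheduler: the first window scores all of its tokens, and each subsequent window slides by 64 and scores only its last 64 tokens, so every later token is scored with at least 2048 - 64 = 1984 tokens of left context. An illustrative sketch (the PR's actual eval loop is not reproduced here):

```python
def sliding_windows(n_tokens: int, window: int = 2048, stride: int = 64):
    """Yield (start, end, score_from) spans for sliding-window eval.

    Each span feeds tokens [start, end) to the model but scores only
    tokens [score_from, end). After the first window, every scored
    token sees at least window - stride tokens of left context.
    """
    spans = []
    scored = 0
    while scored < n_tokens:
        end = min(scored + (window if scored == 0 else stride), n_tokens)
        start = max(0, end - window)
        spans.append((start, end, scored))
        scored = end
    return spans
```

The trade-off is compute: with stride 64 each token is re-processed roughly 2048 / 64 = 32 times, which only pays off because eval has its own 600s budget.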

Technique stack evolved over 31 AIDE2 optimization steps:
- Int6 STE quantization-aware training (near-zero quant penalty)
- NorMuon optimizer with decoupled weight decay (0.02)
- 3x MLP width (1536 hidden)
- SmearGate: learned embedding-level context blending
- Orthogonal initialization for all linear layers
- Sequence length 2048 with RoPE base 50K
- SWA every 100 steps during warmdown
- FP16 tied embedding passthrough
- Sliding window eval (stride=64)
- Zstd-22 compression
- U-Net skip connections
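The SWA-every-100-steps item amounts to keeping a running mean of checkpoints during warmdown. A minimal sketch, assuming a plain running average over dict-of-array weights (the class name and interface here are illustrative, not from the PR's code):

```python
import numpy as np

class SWAAverager:
    """Running average of model weights, updated every `every` steps."""

    def __init__(self, every: int = 100):
        self.every = every
        self.avg = None    # dict[str, np.ndarray] once first update lands
        self.count = 0

    def maybe_update(self, step: int, weights: dict) -> None:
        if step % self.every != 0:
            return
        self.count += 1
        if self.avg is None:
            self.avg = {k: v.copy() for k, v in weights.items()}
        else:
            # incremental running mean: avg += (x - avg) / n
            for k, v in weights.items():
                self.avg[k] += (v - self.avg[k]) / self.count
```

Averaging more frequently (every 100 steps rather than a coarser cadence) folds more late-training checkpoints into the mean, which is the claimed source of the small bpb gain.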
The leaderboard expects val_bpb, val_loss, bytes_total, and bytes_code at top level; our submission used mean_val_bpb, artifact_bytes, etc.
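The key mismatch above is a mechanical rename. A hypothetical fixup, mapping only the key names the note actually mentions (the full submission.json schema is not shown, so the map is deliberately partial):

```python
# Hypothetical rename of submission.json keys to the leaderboard's
# expected top-level names. Only the pairs named in the note are mapped;
# unknown keys pass through unchanged.
KEY_MAP = {
    "mean_val_bpb": "val_bpb",
    "artifact_bytes": "bytes_total",
}

def remap_submission(sub: dict) -> dict:
    """Return a copy of the submission dict with leaderboard key names."""
    return {KEY_MAP.get(k, k): v for k, v in sub.items()}
```

Applied to the fields reported in this PR, `remap_submission({"mean_val_bpb": 1.1507, "artifact_bytes": 14791593})` yields a dict keyed by `val_bpb` and `bytes_total`.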
abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 20, 2026
Downloaded train_gpt.py and README from the top open PRs on openai/parameter-golf:
- PR openai#198 (1.1318): 11L Int6 + WD + SWA + FA3 + SmearGate + BigramHash
- PR openai#194 (1.1480): 11L Int6 QAT + SmearGate + SWA
- PR openai#206 (1.1507): 9L Int6 STE + SmearGate + OrthoInit + U-Net skips

Updated program.md to point agent at PR openai#198 as the new starting base,
with detailed technique breakdown and strategy to beat 1.1318.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Record: Int6 STE + SmearGate + Seq2048 + OrthoInit + RoPE50K + SWA/100 (mean val_bpb=1.1507)

BPB: 1.1507 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA fb1b0345b4dd, file records/track_10min_16mb/2026-03-20_Int6STE_SmearGate_Seq2048_OrthoInit_RoPE50K_SWA100/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.23s, dim=512, layers=9, vocab=1024, code=55785 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via the deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate on the classifier.
