Staging: Int6 MLP3x 11L + SmearGate + BigramHash4096x128 + MuonWD038 + SWA50 + DocSliding (single-run val_bpb=1.1568)#208

Closed
ajkpersonal wants to merge 1 commit into openai:main from ajkpersonal:ajk-11L-seq2048-lexical

Conversation


@ajkpersonal ajkpersonal commented Mar 20, 2026

Summary

  • add records/track_10min_16mb/2026-03-20_Int6MLP3x_11L_SmearGate_Bigram4096x128_MuonWD038_SWA50_DocSliding
  • dense-lexical 11x512 KV4 candidate: seq2048, MLP_MULT=3, SmearGate, BigramHash(4096x128), MUON_WEIGHT_DECAY=0.038, ADAM_WEIGHT_DECAY=0.01, SWA_EVERY=50, SWA_START_FRAC=0.50
  • chosen legal export/eval path is int6_zstd_core with doc_sliding 2048/256
  • single recorded 8xH100 run reaches step 6038/20000 in 597185 ms (~9.95 minutes)
  • chosen legal eval from the included sweep: val_loss=1.95474571, val_bpb=1.15677715, artifact_bytes=15704854
  • versus the current merged leaderboard leader on 2026-03-20, this is numerically better by 0.02877578 nats and 0.01797600 BPB
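The two margins in the last bullet can be re-derived from the leader figures quoted in the Notes; a quick arithmetic sanity check:

```python
# Sanity-check the claimed margins over the merged leaderboard leader.
# All numbers are copied from this PR description.
leader_val_loss, leader_val_bpb = 1.98352149, 1.17475315
run_val_loss, run_val_bpb = 1.95474571, 1.15677715

delta_loss = leader_val_loss - run_val_loss
delta_bpb = leader_val_bpb - run_val_bpb

assert round(delta_loss, 8) == 0.02877578  # nats, as claimed
assert round(delta_bpb, 8) == 0.01797600   # BPB, as claimed
```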

Notes

  • this is a single-run staging submission, not yet a leaderboard record claim
  • the current merged leaderboard leader on 2026-03-20 is mean_val_loss=1.98352149, mean_val_bpb=1.17475315; this run is numerically better, but more runs are still needed for the required significance test before making a SOTA claim
  • the built-in integrated export printed in train.log was slightly over the 16MB cap (artifact_bytes=16032236), so the promoted score in this folder comes from the included legal re-export path
  • the checked-in train_gpt.py is a whitespace-trimmed copy of the logged trainer source so it stays under the repo's 1500-line cap; behavior is unchanged
  • checked-in code/artifact sizes are 70147 and 15704854 bytes respectively
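The byte figures above are consistent with the track's size cap; a minimal sketch, assuming the "16MB" in track_10min_16mb means 16,000,000 decimal bytes (which matches 16032236 being over the cap and 15704854 under it):

```python
# Assumption: "16MB" in track_10min_16mb = 16_000_000 decimal bytes.
CAP_BYTES = 16_000_000
integrated_export = 16_032_236  # built-in export printed in train.log (over cap)
legal_reexport = 15_704_854     # promoted int6_zstd_core artifact

assert integrated_export > CAP_BYTES  # why the integrated score is not promoted
assert legal_reexport <= CAP_BYTES    # the re-exported artifact is legal
```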

Test plan

  • PR diff only adds the new record folder relative to upstream/main
  • submission.json validates as JSON
  • train_gpt.py and checkpoint_frontier_sweep.py parse cleanly
  • run.sh and eval_doc2048_256.sh pass bash -n
  • train_gpt.py is under the line cap (1436 lines)
  • still needed: additional 8xH100 runs to establish p < 0.01 significance for a formal SOTA claim
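The mechanical checks above can be scripted; a hedged sketch as a reusable helper, where the file names are the ones listed in this PR and the 1500-line cap is the repo rule mentioned in the Notes:

```python
# Sketch of the test plan's mechanical checks against a record folder.
import ast
import json
import subprocess
from pathlib import Path

LINE_CAP = 1500  # repo's line cap for the trainer source

def check_record(rec: Path) -> None:
    """Raise if any mechanical check from the test plan fails."""
    json.loads((rec / "submission.json").read_text())             # valid JSON
    for py in ("train_gpt.py", "checkpoint_frontier_sweep.py"):
        ast.parse((rec / py).read_text())                         # parses cleanly
    for sh in ("run.sh", "eval_doc2048_256.sh"):
        subprocess.run(["bash", "-n", str(rec / sh)], check=True) # syntax-only bash check
    n_lines = len((rec / "train_gpt.py").read_text().splitlines())
    assert n_lines <= LINE_CAP, f"train_gpt.py has {n_lines} lines"
```

The PR-diff-only check is omitted since it depends on the local git remotes.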

@ajkpersonal ajkpersonal changed the title Staging: 11L dense-lexical doc-sliding candidate (single-run val_bpb=1.1568) Staging: Int6 MLP3x 11L + SmearGate + BigramHash4096x128 + MuonWD038 + SWA50 + DocSliding (single-run val_bpb=1.1568) Mar 20, 2026