
Submission: 10L + Sliding Window eval (mean val_bpb=1.1899) #221

Open

shajalahamedcse wants to merge 4 commits into openai:main from shajalahamedcse:submission/seq4096-10l-sliding-window

Conversation


@shajalahamedcse shajalahamedcse commented Mar 20, 2026

Key idea:
The model was like a student who studied short paragraphs but was being tested on long chapters — so we asked: what if it practiced on long text too? We changed one line (train_seq_len = 4096) so it trained on 4096-token passages instead of 1024, teaching it real long-range patterns, then evaluated it with overlapping windows (stride=64) so every word gets maximum context during scoring. We ran it on Modal 8×H100 GPUs, got a consistent mean of 1.1899 across 3 random seeds (1337, 42, 7)

Combine train_seq_len=4096 with 10 layers and sliding window evaluation (stride=64)
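The overlapping-window scoring described above can be sketched as a window schedule. This is a minimal, framework-free sketch under assumptions: the real eval also runs the model and accumulates bits-per-byte, which is omitted here, and the function and variable names are illustrative, not taken from the submission's `train_gpt.py`:

```python
def sliding_windows(n_tokens, window=4096, stride=64):
    """Yield (start, end, score_from) triples covering n_tokens.

    Each window spans [start, end); only tokens in [score_from, end)
    contribute to the loss, so after the first window every scored
    token is predicted with (close to) the full `window` of left
    context, while no token is scored twice.
    """
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        yield start, end, prev_end
        prev_end = end
        if end == n_tokens:
            break

# Small illustration with a toy window/stride:
for win in sliding_windows(10, window=6, stride=2):
    print(win)
```

With window=4096 and stride=64, each step advances by 64 tokens and scores only the newly uncovered slice, which is why the eval is more expensive but gives each token maximal context.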

Seed results

| Seed | val_bpb | Artifact size |
|------|---------|---------------|
| 1337 | 1.1900  | 15,115,793 B  |
| 42   | 1.1908  | 15,128,724 B  |
| 7    | 1.1888  | 15,154,068 B  |
| mean | 1.1899  | ≤ 16 MB ✓     |
| std  | 0.0008  |               |
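The summary statistics follow directly from the three per-seed numbers; the reported std of 0.0008 matches the population standard deviation over the three seeds. A quick check (this snippet is illustrative, not code from the submission):

```python
from statistics import mean, pstdev

# Per-seed val_bpb values from the table above.
seed_bpb = {1337: 1.1900, 42: 1.1908, 7: 1.1888}
vals = list(seed_bpb.values())

# pstdev (population std) reproduces the reported 0.0008;
# the sample std (statistics.stdev) would be ~0.0010.
print(f"mean={mean(vals):.4f} std={pstdev(vals):.4f}")  # mean=1.1899 std=0.0008
```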

Config

num_layers     = 10
train_seq_len  = 4096
eval_stride    = 64  (sliding window)
warmdown_iters = 3600
matrix_lr      = 0.04
muon_momentum  = 0.95

Hardware

Modal 8×H100 SXM, torchrun --standalone --nproc_per_node=8, MAX_WALLCLOCK_SECONDS=600
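The PR mentions a MAX_WALLCLOCK_SECONDS=600 cap but does not show how it is enforced. One common pattern is a deadline check inside the training loop; the sketch below is an assumption about the mechanism, and `run_with_wallclock_cap`/`step_fn` are hypothetical names, not functions from the submission:

```python
import os
import time

def run_with_wallclock_cap(step_fn, max_steps):
    """Run training steps until max_steps or the wall-clock budget expires.

    MAX_WALLCLOCK_SECONDS mirrors the env var named in the PR; the
    stopping logic here is illustrative, not the submission's code.
    """
    budget = float(os.environ.get("MAX_WALLCLOCK_SECONDS", "600"))
    deadline = time.monotonic() + budget
    steps_done = 0
    for _ in range(max_steps):
        # Check the deadline before each step so a long step never
        # starts after the budget is spent.
        if time.monotonic() >= deadline:
            break
        step_fn()
        steps_done += 1
    return steps_done
```

In a distributed run the rank-0 decision to stop would additionally need to be broadcast to all workers so the 8 processes exit together.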

Track: 10min_16mb
Author: shajalahamedcse

Key change: train_seq_len=4096 with 10 layers and sliding window
eval (stride=64). Training on longer sequences improves predictions
while keeping the same model architecture and evaluation method.

Seed results:
  seed=1337: val_bpb=1.1900, artifact=15,115,793B
  seed=42:   val_bpb=1.1908, artifact=15,128,724B
  seed=7:    val_bpb=1.1888, artifact=15,154,068B
  mean:      val_bpb=1.1899, std=0.0008

Hardware: Modal 8xH100 SXM, torchrun --nproc_per_node=8
Training capped at MAX_WALLCLOCK_SECONDS=600
Removed error traceback and submission results from log.
@shajalahamedcse shajalahamedcse changed the title Seq4096 + 10L + Sliding Window eval (mean val_bpb=1.1899) 10L + Sliding Window eval (mean val_bpb=1.1899) Mar 20, 2026
@shajalahamedcse shajalahamedcse changed the title 10L + Sliding Window eval (mean val_bpb=1.1899) Submission: 10L + Sliding Window eval (mean val_bpb=1.1899) Mar 20, 2026
@MatoTeziTanka

Community Review — Submission: 10L + Sliding Window eval (mean val_bpb=1.1899)

BPB: 1.1899 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA f8aaba857564, file records/track_10min_16mb/2026-03-20_Seq4096_10L_SlidingWindow/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.07s, dim=512, layers=10, vocab=1024, code=55483 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.
