[WIP] wallclock-aware context curriculum #203
Community Review — [WIP] wallclock-aware context curriculum

**BPB:** (not parsed — see PR title) | **Compliance:** LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

**What I found in the code (head SHA):** Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

**CPU smoke test (CT2038 proteus-engine, 2026-04-11):** import OK in 0.14s, dim=512, layers=10, vocab=1024, code=59286 B, SMOKE_TEST_PASS

**Verdict:** LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

**Auto-classification caveat:** this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based classifier.
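The "standard sliding-window stride-64" eval pattern the review checks for can be sketched as below. This is an illustrative reconstruction, not the repo's eval code: the function name and the span bookkeeping are assumptions; only the window/stride idea comes from the review.

```python
def sliding_window_positions(n_tokens, window=1024, stride=64):
    """Yield (ctx_start, tgt_start, tgt_end) spans for stride-based eval.

    The first window scores all of its tokens; each later step advances
    the window by `stride` and scores only the newly exposed tokens, so
    every token is scored exactly once with near-maximal left context.
    """
    spans = []
    tgt_start = 0
    for end in range(min(window, n_tokens), n_tokens + stride, stride):
        end = min(end, n_tokens)
        ctx_start = max(0, end - window)  # context never exceeds the window
        spans.append((ctx_start, tgt_start, end))
        tgt_start = end
        if end == n_tokens:
            break
    return spans
```

A caller would sum per-span NLL over these positions, dividing by `n_tokens` at the end; the point of the stride is to trade eval wallclock against how much context each scored token sees.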
Summary
Train at sequence length `1024` early, then switch to `2048` later, while keeping total batch tokens fixed. Unchanged from the baseline: `int5` MLP / `int6` attention export, `zstd-22`, `Muon`, `SWA`, `SmearGate`, `BigramHash`, `OrthoInit`, grad clip, sliding-window eval.

Hypothesis
The previous negative result looked like early wallclock underfitting, not an export bottleneck. This trainer is meant to test whether cheaper early sequence lengths buy more useful optimizer steps under the same 600s cap, while still recovering some long-context training later in the run.
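As a concrete illustration of the step-budget argument, here is a minimal sketch of what such a schedule could look like. The function names, default values, and the fixed per-batch token budget are assumptions that merely mirror the PR's `SEQ_WARMUP_*` flags:

```python
def seq_len_at(frac_elapsed, enabled=True, short_len=1024,
               final_len=2048, warmup_frac=0.30):
    """Sequence length as a function of elapsed wallclock fraction in [0, 1].

    Mirrors the PR's flags: train at short_len for the first warmup_frac
    of the run, then switch to final_len for the remainder.
    """
    if enabled and frac_elapsed < warmup_frac:
        return short_len
    return final_len


def rows_at(frac_elapsed, total_batch_tokens=2**19, **kw):
    """Hold total batch tokens fixed: shorter sequences mean more rows per
    batch, hence cheaper attention and more optimizer steps early on."""
    return total_batch_tokens // seq_len_at(frac_elapsed, **kw)
```

Under this sketch the early phase processes the same number of tokens per step but at half the sequence length, which is where the hypothesized wallclock savings would come from.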
The idea is intentionally clean to iterate on:
- `SEQ_WARMUP_ENABLED=0` gives the no-curriculum baseline
- `SHORT_SEQ_LEN`, `FINAL_SEQ_LEN`, and `SEQ_WARMUP_FRAC` control the curriculum directly
- `EARLY_ABORT_ENABLED=1` adds cost guardrails for off-pace runs

Current status
This is a staged WIP PR.
Validation so far: `python3 -m py_compile`.

Ablation plan
- Baseline: `SEQ_WARMUP_ENABLED=0`
- Curriculum: `SEQ_WARMUP_ENABLED=1 SHORT_SEQ_LEN=1024 FINAL_SEQ_LEN=2048 SEQ_WARMUP_FRAC=0.30`
- Sweep: `SEQ_WARMUP_FRAC in {0.20, 0.30, 0.40}`
- Sweep: `SHORT_SEQ_LEN in {768, 1024, 1536}`
- Seeds: `SEED={1337, 42, 7}`
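The ablation plan above expands into a concrete run list as sketched below. This driver snippet is illustrative only: the config-dict representation is an assumption, not the PR's actual runner, and note that the `SHORT_SEQ_LEN=1024`/`SEQ_WARMUP_FRAC=0.30` point appears in both sweeps as written.

```python
from itertools import product

# Hypothetical expansion of the ablation grid into per-run configs.
baseline = [{"SEQ_WARMUP_ENABLED": 0}]
frac_sweep = [
    {"SEQ_WARMUP_ENABLED": 1, "SHORT_SEQ_LEN": 1024,
     "FINAL_SEQ_LEN": 2048, "SEQ_WARMUP_FRAC": f}
    for f in (0.20, 0.30, 0.40)
]
len_sweep = [
    {"SEQ_WARMUP_ENABLED": 1, "SHORT_SEQ_LEN": s,
     "FINAL_SEQ_LEN": 2048, "SEQ_WARMUP_FRAC": 0.30}
    for s in (768, 1024, 1536)
]
seeds = (1337, 42, 7)

# Cross every config with the 3-seed requirement from the record track.
runs = [dict(cfg, SEED=seed)
        for cfg, seed in product(baseline + frac_sweep + len_sweep, seeds)]
```

Each entry in `runs` is one training invocation; 7 configs crossed with 3 seeds gives 21 runs before any dedup of the overlapping sweep point.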