Non-record: SWA and doc-isolated eval ablation — two negative findings at stride=64 by mrdavtan · Pull Request #199 · openai/parameter-golf

mrdavtan · 2026-03-20T09:49:42Z

Summary

Three controlled runs on fast 8×H100 hardware (~44 ms/step, 13,600+ steps each) test SWA (stochastic weight averaging) and doc-isolated evaluation independently. Both techniques are being adopted by top entries without isolated ablation data.

Finding 1: SWA does not improve int8 quantization (default warmdown)

Metric	No SWA	SWA (73 snapshots)	Delta
Post-quant val_bpb	1.1929	1.1933	+0.0004 (no improvement)

SWA may still help with aggressive warmdown (WD=20000), where the LR stays higher for longer and produces more diverse snapshots. Under default warmdown (WD=1200), the model is already in a narrow basin, so averaging nearby snapshots does not find a flatter minimum.

Finding 2: Doc-isolated eval HURTS by 0.009 BPB at stride=64

Metric	Flat-stream	Doc-isolated	Delta
Post-quant val_bpb	1.1933	1.2015	+0.0086 (worse)

This contradicts the LoRA TTT entry (#77), which reported a 0.011 BPB gain from doc-isolation at stride=256. The likely difference: at stride=64, scored tokens have 960+ tokens of context. Removing cross-doc context at document boundaries means start-of-document tokens lose ALL context (the window starts fresh), which hurts more than the cleaner context helps.

The crossover: At stride=64 context quantity dominates; at stride=256+ context quality dominates. If you use stride=64, do not use doc-isolated eval.
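To make the quantity argument concrete, here is a minimal sketch that counts how much left context each scored token sees under the two eval modes. It assumes a 1024-token window (inferred from 960 = 1024 − 64, not stated in the PR), that every window after the first scores only its last `stride` tokens, and that doc-isolated eval restarts the window at each document start. The function names and the 300-token document length are hypothetical, and the sketch counts context only; it says nothing about loss.

```python
def flat_context(pos, window=1024, stride=64):
    """Left-context length for the token at `pos` in a flat token stream."""
    if pos < window:
        # First window: each token is scored with whatever precedes it.
        return pos
    # Later windows score only their last `stride` tokens.
    return window - stride + (pos - window) % stride


def mean_context(doc_lengths, isolated, window=1024, stride=64):
    """Average left context over a stream made of documents of the given lengths."""
    total, count, doc_start = 0, 0, 0
    for length in doc_lengths:
        for pos in range(doc_start, doc_start + length):
            # Doc-isolated (assumed): positions are counted from the document start.
            rel = pos - doc_start if isolated else pos
            total += flat_context(rel, window, stride)
        count += length
        doc_start += length
    return total / count


docs = [300] * 20  # hypothetical: twenty 300-token documents
flat_mean = mean_context(docs, isolated=False)
iso_mean = mean_context(docs, isolated=True)
print(f"flat-stream mean context:  {flat_mean:.1f}")
print(f"doc-isolated mean context: {iso_mean:.1f}")
```

In the flat stream, much of that context belongs to earlier, unrelated documents, which is the "quality" side of the trade-off. At stride=256 the guaranteed minimum under the same assumptions drops from 960 to 768 tokens.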

Bonus: SWA bf16 accumulation bug

An initial implementation accumulated SWA weights in bf16 over thousands of steps, producing val_bpb=2.62 (catastrophic). Fix: accumulate in float32, sample every 50 steps. Documented for anyone implementing SWA in bf16 pipelines.
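The PR does not show the buggy code, so the following is only an illustration of one plausible mechanism: a running sum kept in bf16 stops growing once it is large relative to each addend, so the final average shrinks toward zero. It simulates bf16 and float32 rounding with the standard library instead of a real tensor library. The helper names, the single scalar weight, and the 5000-snapshot count are all hypothetical.

```python
import struct


def to_f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round a Python float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add the rounding bias, then drop the low 16 bits of the float32 pattern.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def average(snapshots, round_fn):
    """Sum the snapshots, rounding every partial sum with `round_fn`, then divide."""
    total = 0.0
    for w in snapshots:
        total = round_fn(total + w)
    return total / len(snapshots)


weight = to_bf16(0.01)       # hypothetical weight value, identical in every snapshot
snapshots = [weight] * 5000  # assumed: one snapshot per step over 5000 steps

bf16_mean = average(snapshots, to_bf16)
fp32_mean = average(snapshots, to_f32)
print(f"true value:        {weight}")
print(f"bf16 accumulator:  {bf16_mean}")
print(f"fp32 accumulator:  {fp32_mean}")
```

Sampling every 50 steps also reduces the number of additions, which keeps the running sum smaller relative to each snapshot.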

Reproduction

Three runs, same seed (1337), same architecture (9L×512d baseline), one variable per comparison. Full commands and analysis in README.

Test plan

Three runs on 8×H100 SXM, ~44ms/step, 13,600+ steps each
All artifacts under 16MB
Training logs available (on request — can add to logs/ folder)
Reproduces SlidingWindowEval baseline (1.1929 ≈ 1.1925)

Built with Claude Code


notapplica mentioned this pull request Mar 20, 2026

Parameter Golf Live AI Commentary + Analysis / Ideas | every 10 minutes #140

mrdavtan wants to merge 1 commit into openai:main from mrdavtan:swa-dociso-ablation-pr
