
[train] Fix NUMA and checkpoint serialization for optimizer CPU offload#1251

Closed
tyler-griggs wants to merge 1 commit into main from
tgriggs/glm47-runtime-patches

Conversation

@tyler-griggs
Member

@tyler-griggs tyler-griggs commented Mar 2, 2026

Summary

Three runtime patches discovered during GPU validation of GLM-4.7-Flash GRPO training on 8x A100-80GB with optimizer CPU offload.

Depends on: #1241 (transformers 5.x compatibility). GLM-4.7-Flash requires transformers>=5.0.0, and #1241 adds the necessary return_dict=False fixes to all apply_chat_template call sites.

Changes

  • NUMA memory interleave: use numa_set_interleave_mask instead of numa_set_membind. With optimizer CPU offload enabled, binding all workers to a single NUMA node causes system OOM during checkpoint serialization. Interleaving spreads memory pressure across all nodes. Falls back to the original membind behavior if interleaving is unavailable.

  • Skip optimizer state in checkpoint save/load: when optimizer_cpu_offload is enabled on the Megatron optimizer config, set _skip_optimizer_checkpoint = True. This causes PolicyWorkerBase and CriticWorkerBase to pass None for optimizer/scheduler during checkpoint save, avoiding serialization OOM and mcore 0.16.0 flattened_range errors. On load, optimizer/scheduler states are skipped and re-initialized fresh. Reduces checkpoint size from about 120 GiB to 56 GiB (model-only).
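The NUMA change above follows the standard libnuma interleave-with-fallback pattern. A minimal sketch of that pattern via ctypes is below; the actual implementation in worker.py may differ (function name set_numa_interleave and the exact fallback are illustrative).

```python
import ctypes
import ctypes.util

def set_numa_interleave():
    """Interleave this process's memory allocations across all NUMA
    nodes, falling back gracefully when libnuma is unavailable.
    Illustrative sketch only; not the actual worker.py code."""
    path = ctypes.util.find_library("numa")
    if path is None:
        return False  # libnuma not installed; leave policy unchanged
    libnuma = ctypes.CDLL(path)
    if libnuma.numa_available() < 0:
        return False  # kernel has no NUMA support
    # numa_all_nodes_ptr is a `struct bitmask *` exported by libnuma v2
    all_nodes = ctypes.c_void_p.in_dll(libnuma, "numa_all_nodes_ptr")
    if hasattr(libnuma, "numa_set_interleave_mask"):
        # Spread pages round-robin across every node (the fix in this PR)
        libnuma.numa_set_interleave_mask(all_nodes)
    else:
        # Fallback: bind to all nodes rather than a single one
        libnuma.numa_set_membind(all_nodes)
    return True
```

The key difference from the old behavior is the mask: interleaving over all nodes spreads the offloaded optimizer state and checkpoint serialization buffers, instead of concentrating them on one node until it OOMs.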

Files changed (2)

  • skyrl/backends/skyrl_train/workers/worker.py -- NUMA fix + save/load checkpoint patches
  • skyrl/backends/skyrl_train/workers/megatron/megatron_worker.py -- set skip flag when CPU offload enabled

Validated on

  • 8x A100-SXM4-80GB, CUDA 12.9, mcore 0.16.0, vLLM 0.16.0
  • GLM-4.7-Flash (30B MoE, TP=1/EP=8) with optimizer CPU offload (optimizer_offload_fraction=1.0)
  • Checkpoint save (56 GiB, 292s) + resume from checkpoint verified end-to-end
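The skip-optimizer behavior described in the Changes section can be sketched roughly as follows. All names here (WorkerBase, the dict-shaped checkpoint) are illustrative stand-ins, not the actual PolicyWorkerBase/CriticWorkerBase API:

```python
class WorkerBase:
    """Illustrative sketch of the _skip_optimizer_checkpoint flag;
    the real worker classes and checkpoint format differ."""

    def __init__(self, optimizer_cpu_offload: bool):
        # With the optimizer resident in host memory, serializing its
        # state alongside the model can OOM the node, so skip it.
        self._skip_optimizer_checkpoint = optimizer_cpu_offload

    def save_checkpoint(self, model, optimizer, scheduler, path):
        if self._skip_optimizer_checkpoint:
            # Pass None downstream: a model-only checkpoint is written
            optimizer, scheduler = None, None
        return {"model": model, "optimizer": optimizer,
                "scheduler": scheduler, "path": path}

    def load_checkpoint(self, state):
        model = state["model"]
        if self._skip_optimizer_checkpoint:
            # Optimizer/scheduler are re-initialized fresh on resume
            return model, None, None
        return model, state["optimizer"], state["scheduler"]
```

The trade-off is that resumed runs restart optimizer moments from scratch, in exchange for avoiding the serialization OOM and the mcore 0.16.0 flattened_range errors, and for the smaller model-only checkpoint.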

@tyler-griggs tyler-griggs force-pushed the tgriggs/glm47-runtime-patches branch from ff39d8b to 7138b78 Compare March 2, 2026 23:50
@tyler-griggs tyler-griggs force-pushed the tgriggs/glm47-runtime-patches branch from 7138b78 to 6cef9b3 Compare March 3, 2026 05:37
@tyler-griggs tyler-griggs force-pushed the tgriggs/glm47-runtime-patches branch from 6cef9b3 to 192deb7 Compare March 3, 2026 17:44
@tyler-griggs tyler-griggs changed the title WIP: Runtime patches for GLM-4.7-Flash MoE training WIP: Runtime patches and algorithm enhancements for GLM-4.7-Flash MoE training Mar 3, 2026
… CPU offload

Three fixes discovered during GPU validation on 8x A100-80GB:

1. NUMA memory interleave: use numa_set_interleave_mask instead of
   numa_set_membind. With optimizer CPU offload, binding all workers
   to one NUMA node causes system OOM during checkpoint serialization.

2. Skip optimizer state in checkpoint save/load when CPU offload is
   enabled (_skip_optimizer_checkpoint flag). Avoids flattened_range
   errors in mcore 0.16.0 and reduces checkpoint size. Training
   resumes by re-initializing the optimizer from model-only checkpoints.

3. Apply skip-optimizer logic to both PolicyWorkerBase and
   CriticWorkerBase save_checkpoint/load_checkpoint methods.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tyler-griggs tyler-griggs changed the title WIP: Runtime patches and algorithm enhancements for GLM-4.7-Flash MoE training [train] Runtime patches for GLM-4.7-Flash MoE training with optimizer CPU offload Mar 3, 2026
@tyler-griggs tyler-griggs changed the title [train] Runtime patches for GLM-4.7-Flash MoE training with optimizer CPU offload [train] Fix NUMA and checkpoint serialization for optimizer CPU offload Mar 3, 2026
@tyler-griggs tyler-griggs force-pushed the tgriggs/glm47-runtime-patches branch from 192deb7 to 838dfe6 Compare March 3, 2026 19:03
@erictang000
Collaborator

btw #1268 should fix the flattened_range related optimizer checkpoint issues

