[train] Fix NUMA and checkpoint serialization for optimizer CPU offload#1251
Closed
tyler-griggs wants to merge 1 commit into main from
Conversation
Force-pushed from ff39d8b to 7138b78
Force-pushed from 7138b78 to 6cef9b3
Force-pushed from 6cef9b3 to 192deb7
… CPU offload

Three fixes discovered during GPU validation on 8x A100-80GB:

1. NUMA memory interleave: use numa_set_interleave_mask instead of numa_set_membind. With optimizer CPU offload, binding all workers to one NUMA node causes system OOM during checkpoint serialization.
2. Skip optimizer state in checkpoint save/load when CPU offload is enabled (_skip_optimizer_checkpoint flag). Avoids flattened_range errors in mcore 0.16.0 and reduces checkpoint size. Training resumes by re-initializing the optimizer from model-only checkpoints.
3. Apply the skip-optimizer logic to both PolicyWorkerBase and CriticWorkerBase save_checkpoint/load_checkpoint methods.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 192deb7 to 838dfe6
Collaborator
btw #1268 should fix the flattened_range related optimizer checkpoint issues
Summary
Three runtime patches discovered during GPU validation of GLM-4.7-Flash GRPO training on 8x A100-80GB with optimizer CPU offload.
Depends on: #1241 (transformers 5.x compatibility). GLM-4.7-Flash requires transformers>=5.0.0, and #1241 adds the necessary return_dict=False fixes to all apply_chat_template call sites.
Changes
NUMA memory interleave: use numa_set_interleave_mask instead of numa_set_membind. With optimizer CPU offload enabled, binding all workers to a single NUMA node causes system OOM during checkpoint serialization. Interleaving spreads memory pressure across all nodes. Falls back to the original membind if unavailable.
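The interleave-with-fallback behavior can be sketched in a few lines of ctypes against libnuma. This is a minimal illustration, not the PR's actual code: the function name `bind_numa_memory` and its string return values are invented for the example, and the fallback here uses `numa_set_localalloc` as a stand-in for the original membind path.

```python
import ctypes
import ctypes.util

def bind_numa_memory() -> str:
    """Interleave allocations across all NUMA nodes, falling back
    gracefully when libnuma or the interleave API is unavailable."""
    libname = ctypes.util.find_library("numa")
    if libname is None:
        return "no-numa"          # libnuma not installed
    numa = ctypes.CDLL(libname)
    if numa.numa_available() < 0:
        return "no-numa"          # kernel has no NUMA support
    if hasattr(numa, "numa_set_interleave_mask"):
        # numa_all_nodes_ptr is a struct bitmask* exported by libnuma
        # covering every node the process may allocate on; interleaving
        # over it spreads optimizer-offload pages across all nodes.
        all_nodes = ctypes.c_void_p.in_dll(numa, "numa_all_nodes_ptr")
        numa.numa_set_interleave_mask(all_nodes)
        return "interleave"
    # Fallback when the interleave symbol is missing (stand-in for the
    # original membind behavior): allocate on the local node.
    numa.numa_set_localalloc()
    return "membind-fallback"
```

Unlike `numa_set_membind` to a single node, the interleave mask round-robins page allocations, so a large CPU-resident optimizer state cannot exhaust one node while others sit idle.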
Skip optimizer state in checkpoint save/load: when optimizer_cpu_offload is enabled on the Megatron optimizer config, set _skip_optimizer_checkpoint = True. This causes PolicyWorkerBase and CriticWorkerBase to pass None for optimizer/scheduler during checkpoint save, avoiding serialization OOM and mcore 0.16.0 flattened_range errors. On load, optimizer/scheduler states are skipped and re-initialized fresh. Reduces checkpoint size from about 120 GiB to 56 GiB (model-only).
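The shape of the skip-optimizer logic, shared by both worker classes, can be sketched as follows. This is an illustrative stand-in, assuming dict placeholders for the real Megatron model/optimizer/scheduler objects; the class name `CheckpointWorker` and its checkpoint format are hypothetical.

```python
class CheckpointWorker:
    """Sketch of the skip-optimizer checkpoint behavior."""

    def __init__(self, optimizer_cpu_offload: bool):
        self.model = {"weights": "..."}
        self.optimizer = {"state": "..."}
        self.scheduler = {"step": 0}
        # Mirrors the flag derived from the Megatron optimizer config.
        self._skip_optimizer_checkpoint = optimizer_cpu_offload

    def save_checkpoint(self) -> dict:
        # Pass None for optimizer/scheduler when offload is active so the
        # checkpoint is model-only: no serialization OOM and no mcore
        # 0.16.0 flattened_range error.
        skip = self._skip_optimizer_checkpoint
        return {
            "model": self.model,
            "optimizer": None if skip else self.optimizer,
            "scheduler": None if skip else self.scheduler,
        }

    def load_checkpoint(self, ckpt: dict) -> None:
        self.model = ckpt["model"]
        if self._skip_optimizer_checkpoint or ckpt["optimizer"] is None:
            # Optimizer/scheduler states were not saved; training resumes
            # by re-initializing them fresh against the loaded weights.
            return
        self.optimizer = ckpt["optimizer"]
        self.scheduler = ckpt["scheduler"]
```

The trade-off of model-only checkpoints is that optimizer moments are lost on restart, in exchange for roughly halving checkpoint size and avoiding the serialization failure entirely.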
Files changed (2)
Validated on