[train] Fix NUMA and checkpoint serialization for optimizer CPU offload#1251
Closed
tyler-griggs wants to merge 1 commit into main from
Conversation
Force-pushed from ff39d8b to 7138b78
Force-pushed from 7138b78 to 6cef9b3
Force-pushed from 6cef9b3 to 192deb7
… CPU offload

Three fixes discovered during GPU validation on 8x A100-80GB:

1. NUMA memory interleave: use numa_set_interleave_mask instead of numa_set_membind. With optimizer CPU offload, binding all workers to one NUMA node causes system OOM during checkpoint serialization.
2. Skip optimizer state in checkpoint save/load when CPU offload is enabled (_skip_optimizer_checkpoint flag). Avoids flattened_range errors in mcore 0.16.0 and reduces checkpoint size. Training resumes by re-initializing the optimizer from model-only checkpoints.
3. Apply the skip-optimizer logic to both PolicyWorkerBase and CriticWorkerBase save_checkpoint/load_checkpoint methods.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 192deb7 to 838dfe6
Collaborator
btw #1268 should fix the flattened_range related optimizer checkpoint issues
Summary
Three runtime patches discovered during GPU validation of GLM-4.7-Flash GRPO training on 8x A100-80GB with optimizer CPU offload.
Depends on: #1241 (transformers 5.x compatibility). GLM-4.7-Flash requires transformers>=5.0.0, and #1241 adds the necessary return_dict=False fixes to all apply_chat_template call sites.
Changes
NUMA memory interleave: use numa_set_interleave_mask instead of numa_set_membind. With optimizer CPU offload enabled, binding all workers to a single NUMA node causes system OOM during checkpoint serialization. Interleaving spreads memory pressure across all nodes. Falls back to the original membind if unavailable.
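The interleave-with-fallback behavior can be sketched in a few lines of ctypes against libnuma. This is a minimal illustration, not the PR's actual code: the function name `bind_numa_memory` and its string return values are invented for the example, and the fallback here uses `numa_set_localalloc` as a stand-in for the original membind path.

```python
import ctypes
import ctypes.util

def bind_numa_memory() -> str:
    """Interleave allocations across all NUMA nodes, falling back
    gracefully when libnuma or the interleave API is unavailable."""
    libname = ctypes.util.find_library("numa")
    if libname is None:
        return "no-numa"          # libnuma not installed
    numa = ctypes.CDLL(libname)
    if numa.numa_available() < 0:
        return "no-numa"          # kernel has no NUMA support
    if hasattr(numa, "numa_set_interleave_mask"):
        # numa_all_nodes_ptr is a struct bitmask* exported by libnuma
        # covering every node the process may allocate on; interleaving
        # over it spreads optimizer-offload pages across all nodes.
        all_nodes = ctypes.c_void_p.in_dll(numa, "numa_all_nodes_ptr")
        numa.numa_set_interleave_mask(all_nodes)
        return "interleave"
    # Fallback when the interleave symbol is missing (stand-in for the
    # original membind behavior): allocate on the local node.
    numa.numa_set_localalloc()
    return "membind-fallback"
```

Unlike `numa_set_membind` to a single node, the interleave mask round-robins page allocations, so a large CPU-resident optimizer state cannot exhaust one node while others sit idle.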
Skip optimizer state in checkpoint save/load: when optimizer_cpu_offload is enabled on the Megatron optimizer config, set _skip_optimizer_checkpoint = True. This causes PolicyWorkerBase and CriticWorkerBase to pass None for optimizer/scheduler during checkpoint save, avoiding serialization OOM and mcore 0.16.0 flattened_range errors. On load, optimizer/scheduler states are skipped and re-initialized fresh. Reduces checkpoint size from about 120 GiB to 56 GiB (model-only).
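The shape of the skip-optimizer logic, shared by both worker classes, can be sketched as follows. This is an illustrative stand-in, assuming dict placeholders for the real Megatron model/optimizer/scheduler objects; the class name `CheckpointWorker` and its checkpoint format are hypothetical.

```python
class CheckpointWorker:
    """Sketch of the skip-optimizer checkpoint behavior."""

    def __init__(self, optimizer_cpu_offload: bool):
        self.model = {"weights": "..."}
        self.optimizer = {"state": "..."}
        self.scheduler = {"step": 0}
        # Mirrors the flag derived from the Megatron optimizer config.
        self._skip_optimizer_checkpoint = optimizer_cpu_offload

    def save_checkpoint(self) -> dict:
        # Pass None for optimizer/scheduler when offload is active so the
        # checkpoint is model-only: no serialization OOM and no mcore
        # 0.16.0 flattened_range error.
        skip = self._skip_optimizer_checkpoint
        return {
            "model": self.model,
            "optimizer": None if skip else self.optimizer,
            "scheduler": None if skip else self.scheduler,
        }

    def load_checkpoint(self, ckpt: dict) -> None:
        self.model = ckpt["model"]
        if self._skip_optimizer_checkpoint or ckpt["optimizer"] is None:
            # Optimizer/scheduler states were not saved; training resumes
            # by re-initializing them fresh against the loaded weights.
            return
        self.optimizer = ckpt["optimizer"]
        self.scheduler = ckpt["scheduler"]
```

The trade-off of model-only checkpoints is that optimizer moments are lost on restart, in exchange for roughly halving checkpoint size and avoiding the serialization failure entirely.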
Files changed (2)
Validated on