
[Bug] Model saving hangs #1908

@yszhli

Bug Description

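When I run asynchronous training on two machines, the head (master) node submits the training job and uses 8 GPUs for training, and one worker node uses 4 GPUs for rollout. The model is Qwen3.5-35B-A3B, the training data is multimodal, and the model is sharded with the following settings:

--megatron-to-hf-mode bridge
--train-backend megatron
--tensor-model-parallel-size 2
--sequence-parallel
--pipeline-model-parallel-size 1
--context-parallel-size 1
--expert-model-parallel-size 8
--expert-tensor-parallel-size 1

Running this setup directly fails in update_weight, so I changed the distributed weight update to work the same way as in colocate mode (it now uses the weight-gathering logic wrapped by the bridge megatron-to-hf mode to update the model). During the model saving stage, however, the program hangs with the call stack below. All 8 training processes are stuck at the same place:

Thread 20870 (idle): "MainThread"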
gather (torch/distributed/distributed_c10d.py:4264)
wrapper (torch/distributed/c10d_logger.py:81)
gather_object (torch/distributed/distributed_c10d.py:3312)
wrapper (torch/distributed/c10d_logger.py:81)
gather_object (torch/distributed/checkpoint/utils.py:135)
save_state_dict_async_plan (core/dist_checkpointing/strategies/state_dict_saver.py:141)
async_save (core/dist_checkpointing/strategies/torch.py:797)
save (core/dist_checkpointing/strategies/base.py:223)
save (core/dist_checkpointing/strategies/fully_parallel.py:98)
save (core/dist_checkpointing/serialization.py:425)
save_checkpoint (training/checkpointing.py:573)
save (slime/backends/megatron_utils/model.py:705)
save_model (slime/backends/megatron_utils/actor.py:530)
wrapper (slime/utils/timer.py:78)
_resume_span (ray/util/tracing/tracing_helper.py:461)
actor_method_executor (ray/_private/function_manager.py:704)
main_loop (ray/_private/worker.py:1028)
<module> (ray/_private/workers/default_worker.py:323)
Thread 21284 (idle): "PythonGCThread"
wait (threading.py:355)
wait (threading.py:655)
run (ray/_private/gc_collect_manager.py:29)
_bootstrap_inner (threading.py:1073)
_bootstrap (threading.py:1030)
Thread 32260 (idle): "InductorSubproc"
_recv_msg (torch/_inductor/compile_worker/subproc_pool.py:73)
_read_thread (torch/_inductor/compile_worker/subproc_pool.py:228)
run (threading.py:1010)
_bootstrap_inner (threading.py:1073)
_bootstrap (threading.py:1030)
Thread 32261 (idle): "Thread-1"
wait (threading.py:359)
wait (threading.py:655)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1073)
_bootstrap (threading.py:1030)

Saving works normally in single-machine colocate mode. Saving also succeeds if the checkpoint format is switched from the default torch_dist to torch.
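For readers following along, the parallel layout implied by the flags above can be worked out with a little arithmetic. The sketch below is only an illustration, not code from slime or Megatron-LM: the class name `ParallelLayout` and its fields are made up here, and it assumes the usual Megatron convention that the 8 training ranks factor as TP x PP x CP x DP for dense parameters and as ETP x EP x PP x expert-DP for expert parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical helper that derives data-parallel sizes from the flags."""

    world_size: int  # number of training ranks (8 GPUs on the master node)
    tp: int          # --tensor-model-parallel-size
    pp: int          # --pipeline-model-parallel-size
    cp: int          # --context-parallel-size
    ep: int          # --expert-model-parallel-size
    etp: int         # --expert-tensor-parallel-size

    def _divide(self, denom: int, what: str) -> int:
        # The world size must be an exact multiple of the model-parallel product.
        if self.world_size % denom != 0:
            raise ValueError(f"world_size {self.world_size} not divisible by {what}={denom}")
        return self.world_size // denom

    @property
    def data_parallel_size(self) -> int:
        # Dense (non-expert) parameters: world = TP * PP * CP * DP (assumed convention).
        return self._divide(self.tp * self.pp * self.cp, "tp*pp*cp")

    @property
    def expert_data_parallel_size(self) -> int:
        # Expert parameters: world = ETP * EP * PP * expert-DP (assumed convention).
        return self._divide(self.etp * self.ep * self.pp, "etp*ep*pp")


layout = ParallelLayout(world_size=8, tp=2, pp=1, cp=1, ep=8, etp=1)
print("dense data-parallel size:", layout.data_parallel_size)
print("expert data-parallel size:", layout.expert_data_parallel_size)
```

Under these assumptions the dense parameters are replicated across 4 data-parallel ranks, while each expert shard lives on exactly one rank. This only describes the layout; it does not by itself explain the hang.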

Steps to Reproduce

Run the program as described above. Communication between the machines uses a high-speed network.

Expected Behavior

The model should be saved normally.

Actual Behavior

Model saving hangs.
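A possible way to capture per-rank stacks for a hang like this from inside each process is Python's standard `faulthandler` module. The sketch below is a suggestion, not something taken from slime: the name `dump_stacks_if_stuck` is made up here, and the call site shown in the trailing comment (`actor.save_model(...)`) is a hypothetical placeholder for wherever the save is triggered.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def dump_stacks_if_stuck(timeout_s: float, file=sys.stderr, repeat: bool = True):
    """Dump the tracebacks of all threads if the wrapped block runs too long.

    `file` must be backed by a real file descriptor (it needs fileno()).
    With repeat=True the dump is written again every `timeout_s` seconds,
    which helps tell a true hang from slow but steady progress.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=repeat, file=file)
    try:
        yield
    finally:
        # Always disarm the watchdog, whether the block finished or raised.
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around the save step on every training rank:
#
#     with dump_stacks_if_stuck(600):
#         actor.save_model(...)  # placeholder for the real call site
```

If some ranks report a different frame than the `gather_object` call shown above, or never reach the save at all, that would suggest the ranks are not all entering the same collective.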

Environment

  • slime version:
  • Python version:
  • PyTorch version:
  • CUDA/ROCm version:
  • GPU type and count:
  • OS:
  • SGLang version (if relevant):
  • Megatron-LM version (if relevant):
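The version fields above were left empty. A small script along these lines could fill in most of them. It is a sketch under assumptions: the distribution names in `PACKAGES` (`slime`, `torch`, `sglang`, `megatron-core`) are guesses at how the packages are registered in the environment and may need adjusting, and CUDA and GPU details are not covered.

```python
import platform
from importlib import metadata

# Assumed distribution names; adjust them to match the actual installation.
PACKAGES = ("slime", "torch", "sglang", "megatron-core")


def collect_versions(packages=PACKAGES) -> dict:
    """Return interpreter, OS, and package versions for the issue template."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Source checkouts or differently named wheels end up here.
            info[name] = "not installed"
    return info


for key, value in collect_versions().items():
    print(f"{key}: {value}")
```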

Logs

Additional Context

No response

Pre-submission Checklist

  • I have read the CONTRIBUTING.md and understand the collaboration scope.
  • I have read the documentation and my issue is not addressed there.
  • I have searched for existing issues and this is not a duplicate.
  • I have provided a minimal, reproducible example.

Metadata

Labels

bug (Something isn't working)
