Bug Description
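When I run asynchronous training on two machines, the head node submits the training job and uses 8 GPUs for training (master node), while one worker node uses 4 GPUs for rollout. The model is Qwen3.5-35B-A3B, the training data is multimodal, and the model is sharded with the following settings:
--megatron-to-hf-mode bridge --train-backend megatron
--tensor-model-parallel-size 2
--sequence-parallel
--pipeline-model-parallel-size 1
--context-parallel-size 1
--expert-model-parallel-size 8
--expert-tensor-parallel-size 1
Running this setup directly fails in update_weight. I then changed the distributed weight update so that it matches what is done when saving in colocate mode (that is, the model is updated using the weight-gathering method wrapped by the bridge megatron-to-hf mode). However, at the model saving stage the program hangs at the call stack below. All 8 GPU processes are currently stuck at the same place:
Thread 20870 (idle): "MainThread"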
gather (torch/distributed/distributed_c10d.py:4264)
wrapper (torch/distributed/c10d_logger.py:81)
gather_object (torch/distributed/distributed_c10d.py:3312)
wrapper (torch/distributed/c10d_logger.py:81)
gather_object (torch/distributed/checkpoint/utils.py:135)
save_state_dict_async_plan (core/dist_checkpointing/strategies/state_dict_saver.py:141)
async_save (core/dist_checkpointing/strategies/torch.py:797)
save (core/dist_checkpointing/strategies/base.py:223)
save (core/dist_checkpointing/strategies/fully_parallel.py:98)
save (core/dist_checkpointing/serialization.py:425)
save_checkpoint (training/checkpointing.py:573)
save (slime/backends/megatron_utils/model.py:705)
save_model (slime/backends/megatron_utils/actor.py:530)
wrapper (slime/utils/timer.py:78)
_resume_span (ray/util/tracing/tracing_helper.py:461)
actor_method_executor (ray/_private/function_manager.py:704)
main_loop (ray/_private/worker.py:1028)
<module> (ray/_private/workers/default_worker.py:323)
Thread 21284 (idle): "PythonGCThread"
wait (threading.py:355)
wait (threading.py:655)
run (ray/_private/gc_collect_manager.py:29)
_bootstrap_inner (threading.py:1073)
_bootstrap (threading.py:1030)
Thread 32260 (idle): "InductorSubproc"
_recv_msg (torch/_inductor/compile_worker/subproc_pool.py:73)
_read_thread (torch/_inductor/compile_worker/subproc_pool.py:228)
run (threading.py:1010)
_bootstrap_inner (threading.py:1073)
_bootstrap (threading.py:1030)
Thread 32261 (idle): "Thread-1"
wait (threading.py:359)
wait (threading.py:655)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1073)
_bootstrap (threading.py:1030)
In single-machine colocate mode, saving works normally. Saving also succeeds if the checkpoint format is changed from the default torch_dist to torch.
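For readers trying to reason about which ranks take part in the checkpoint collectives, it may help to spell out the group sizes implied by the parallelism flags in the description. The sketch below assumes the usual Megatron-style convention (data-parallel size is the world size divided by TP x PP x CP, and expert data-parallel size is the world size divided by ETP x EP x PP). `ParallelConfig` and `group_sizes` are hypothetical helper names for illustration, not part of slime or Megatron-LM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelConfig:
    world_size: int  # number of training GPUs (8 in this report)
    tp: int          # --tensor-model-parallel-size
    pp: int          # --pipeline-model-parallel-size
    cp: int          # --context-parallel-size
    ep: int          # --expert-model-parallel-size
    etp: int         # --expert-tensor-parallel-size


def group_sizes(cfg: ParallelConfig) -> dict[str, int]:
    """Return the data-parallel sizes for dense and expert layers.

    Assumes the Megatron-style convention described in the lead-in.
    """
    dense = cfg.tp * cfg.pp * cfg.cp
    expert = cfg.etp * cfg.ep * cfg.pp
    if cfg.world_size % dense != 0 or cfg.world_size % expert != 0:
        raise ValueError("world_size must be divisible by both parallel products")
    return {
        "data_parallel": cfg.world_size // dense,
        "expert_data_parallel": cfg.world_size // expert,
    }


# Values taken from the flags in this report.
REPORTED = ParallelConfig(world_size=8, tp=2, pp=1, cp=1, ep=8, etp=1)
print(group_sizes(REPORTED))
```

Under these assumptions the 8 training GPUs form a dense data-parallel group of size 4 and an expert data-parallel group of size 1.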
Steps to Reproduce
Run the program as described above. Communication between the machines goes over a high-speed network.
Expected Behavior
The model should be saved normally.
Actual Behavior
Model saving hangs.
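One generic way to capture where each rank is stuck, without attaching an external tool, is Python's standard `faulthandler` module. The following is a minimal sketch under the assumption that the save call can be wrapped in a context manager. `hang_watchdog` is a hypothetical helper, not part of slime or Megatron-LM.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s: float, file=None):
    """Dump the stacks of all threads if the wrapped block exceeds timeout_s.

    The process is not killed; the dump is written once to `file`
    (stderr by default), which must expose a real file descriptor.
    """
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=target)
    try:
        yield
    finally:
        # Disarm the timer so a fast block produces no output.
        faulthandler.cancel_dump_traceback_later()
```

For example, wrapping the checkpoint save call in `with hang_watchdog(600):` would write every thread's stack to stderr if the save takes longer than ten minutes, on each rank where the wrapper is active.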
Environment
- slime version:
- Python version:
- PyTorch version:
- CUDA/ROCm version:
- GPU type and count:
- OS:
- SGLang version (if relevant):
- Megatron-LM version (if relevant):
Logs
Additional Context
No response
Pre-submission Checklist