[Bug] Model saving hangs #1908

### Bug Description

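I am running asynchronous training on two machines. The head node submits the training job and uses its 8 GPUs for training (the master node), and one worker node provides 4 GPUs for rollout. The model is Qwen3.5-35B-A3B and the training data is multimodal. The model is sharded with the following arguments: `--megatron-to-hf-mode bridge --train-backend megatron --tensor-model-parallel-size 2 --sequence-parallel --pipeline-model-parallel-size 1 --context-parallel-size 1 --expert-model-parallel-size 8 --expert-tensor-parallel-size 1`.

Running this configuration directly fails in update_weight, so I changed the distributed weight update path to match the one used in colocate mode (that is, the model is updated through the weight-gathering logic wrapped by the bridge mode of megatron-to-hf-mode). After that change, the program hangs at the model saving stage. All 8 training processes are currently stuck at the same place, with the following call stack:

Thread 20870 (idle): "MainThread"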
    gather (torch/distributed/distributed_c10d.py:4264)
    wrapper (torch/distributed/c10d_logger.py:81)
    gather_object (torch/distributed/distributed_c10d.py:3312)
    wrapper (torch/distributed/c10d_logger.py:81)
    gather_object (torch/distributed/checkpoint/utils.py:135)
    save_state_dict_async_plan (core/dist_checkpointing/strategies/state_dict_saver.py:141)
    async_save (core/dist_checkpointing/strategies/torch.py:797)
    save (core/dist_checkpointing/strategies/base.py:223)
    save (core/dist_checkpointing/strategies/fully_parallel.py:98)
    save (core/dist_checkpointing/serialization.py:425)
    save_checkpoint (training/checkpointing.py:573)
    save (slime/backends/megatron_utils/model.py:705)
    save_model (slime/backends/megatron_utils/actor.py:530)
    wrapper (slime/utils/timer.py:78)
    _resume_span (ray/util/tracing/tracing_helper.py:461)
    actor_method_executor (ray/_private/function_manager.py:704)
    main_loop (ray/_private/worker.py:1028)
    <module> (ray/_private/workers/default_worker.py:323)
Thread 21284 (idle): "PythonGCThread"
    wait (threading.py:355)
    wait (threading.py:655)
    run (ray/_private/gc_collect_manager.py:29)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
Thread 32260 (idle): "InductorSubproc"
    _recv_msg (torch/_inductor/compile_worker/subproc_pool.py:73)
    _read_thread (torch/_inductor/compile_worker/subproc_pool.py:228)
    run (threading.py:1010)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)
Thread 32261 (idle): "Thread-1"
    wait (threading.py:359)
    wait (threading.py:655)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1073)
    _bootstrap (threading.py:1030)

Saving works normally in single-node colocate mode. Saving also succeeds if the checkpoint format is switched from the default torch_dist to torch.
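The observation that every training process is stuck at the same frame can be checked mechanically. Below is a minimal sketch, assuming one stack dump per rank in the same text format as the dump above; the helper names `top_frame` and `summarize` are hypothetical and are not part of slime, Megatron-LM, or any profiler.

```python
import re
from collections import Counter

# Matches a thread header such as: Thread 20870 (idle): "MainThread"
THREAD_RE = re.compile(r'^Thread\s+(\d+)\s+\((\w+)\):\s+"([^"]+)"')
# Matches a frame such as:     gather (torch/distributed/distributed_c10d.py:4264)
FRAME_RE = re.compile(r'^\s+(\S+)\s+\((.+):(\d+)\)\s*$')


def top_frame(dump, thread_name="MainThread"):
    """Return (function, file, line) of the innermost frame of the named thread, or None."""
    in_thread = False
    for line in dump.splitlines():
        header = THREAD_RE.match(line)
        if header:
            # Track whether the frames that follow belong to the thread we want.
            in_thread = header.group(3) == thread_name
            continue
        if in_thread:
            frame = FRAME_RE.match(line)
            if frame:
                # The first frame listed under a thread is the innermost one.
                return frame.group(1), frame.group(2), int(frame.group(3))
    return None


def summarize(dumps):
    """Count how many dumps share each innermost MainThread frame."""
    return Counter(top_frame(d) for d in dumps)


# Shortened copy of the dump above, reused for all 8 ranks for illustration only.
sample = '''Thread 20870 (idle): "MainThread"
    gather (torch/distributed/distributed_c10d.py:4264)
    wrapper (torch/distributed/c10d_logger.py:81)
Thread 21284 (idle): "PythonGCThread"
    wait (threading.py:355)
'''

print(summarize([sample] * 8))
```

If the resulting counter has a single key whose count equals the number of training ranks, all ranks are waiting in the same call. If some ranks show a different frame, those ranks are the ones worth inspecting first.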


### Steps to Reproduce

Run the program as described above. Communication between the machines goes over a high-speed network.

### Expected Behavior

The model should be saved normally.

### Actual Behavior

Model saving hangs.

### Environment

- slime version:
- Python version:
- PyTorch version:
- CUDA/ROCm version:
- GPU type and count:
- OS:
- SGLang version (if relevant):
- Megatron-LM version (if relevant):


### Logs

```shell

```

### Additional Context

_No response_

### Pre-submission Checklist

- [x] I have read the [CONTRIBUTING.md](https://github.com/THUDM/slime/blob/main/CONTRIBUTING.md) and understand the collaboration scope.
- [x] I have read the [documentation](https://thudm.github.io/slime/) and my issue is not addressed there.
- [x] I have searched for [existing issues](https://github.com/THUDM/slime/issues) and this is not a duplicate.
- [x] I have provided a minimal, reproducible example.
