Skip to content

fix: drop incorrect critic GPU add to rollout_num_gpus in colocate mode#1950

Open
aoshen02 wants to merge 1 commit into
THUDM:mainfrom
aoshen02:aoshen/fix-colocate-critic-rollout-num-gpus
Open

fix: drop incorrect critic GPU add to rollout_num_gpus in colocate mode#1950
aoshen02 wants to merge 1 commit into
THUDM:mainfrom
aoshen02:aoshen/fix-colocate-critic-rollout-num-gpus

Conversation

@aoshen02
Copy link
Copy Markdown

Summary

Closes the root cause behind #1896 (which PR #1934 only worked around with a defensive boundary check).

In colocate mode slime/ray/placement_group.py:_create_placement_group allocates only actor_num_gpus_per_node * actor_num_nodes bundles in the placement group (the elif args.colocate: branch). The critic shares the actor's PG (result[\"critic\"] = result[\"actor\"] if args.use_critic else None) and so contributes no additional slots — actor / critic / rollout time-share the same physical GPUs via offload/onload.

But slime/utils/arguments.py still adds critic_num_gpus_per_node * critic_num_nodes onto args.rollout_num_gpus inside the colocate branch:

```python
if args.rollout_num_gpus != args.actor_num_gpus_per_node * args.actor_num_nodes:
logger.info(...)
args.rollout_num_gpus = args.actor_num_gpus_per_node * args.actor_num_nodes
if args.use_critic:
args.rollout_num_gpus += args.critic_num_gpus_per_node * args.critic_num_nodes # <-- buggy
```

This over-allocates the regular ServerGroupConfig.num_gpus past the PG size. start_engines (slime/ray/rollout.py) then iterates `len(all_engines) = rollout_num_gpus // gpus_per_engine` and indexes `reordered_gpu_ids[gpu_index]` beyond its length, raising the IndexError tracked in #1896:

```
IndexError: list index out of range
at base_gpu_id = int(reordered_gpu_ids[gpu_index])
in ServerGroup.start_engines (slime/ray/rollout.py)
```

Reproducer

`tests/test_qwen2.5_0.5B_ppo_critic_only_short.py` — `--advantage-estimator ppo` (so `use_critic=True`) + `--colocate` + `--rollout-num-gpus 2 --actor-num-gpus-per-node 4`. With the addition still in place, `rollout_num_gpus` is overridden to `4 + 4 = 8` but the colocate PG only has 4 bundles. Without it the rollout sizes correctly to the 4 colocated GPUs.

Verified locally on a 4-GPU H200 setup: the test crashed with the IndexError before this patch, and completes the full 2-step CI loop (`perf 1:` / `perf 2:` / `Job succeeded`) after.

Relationship to #1896 / #1934

This PR fixes the resource math itself.

How the bug got in

`371c030e` ([Fix] ppo rollout engins for distribute, #394) added the `if args.use_critic: args.rollout_num_gpus += ...` lines but only touched `arguments.py`. It did not extend `placement_group.py` to give the critic its own bundles in colocate mode. A downstream fork (Miles, `radixark/miles`) does carry the matching `placement_group.py` change — Miles's colocate branch in `create_placement_groups` adds `if args.use_critic: num_gpus += critic_num_nodes * critic_num_gpus_per_node` and slices a separate `critic_pg_reordered_*`. In that world the `arguments.py` addition is consistent (PG grew, so rollout grows with it). In slime, since #394 only landed half of the change, the two files have been inconsistent and this code path was effectively broken whenever both `use_critic` and `colocate` were set.

Test plan

  • Verified `tests/test_qwen2.5_0.5B_ppo_critic_only_short.py` reproduces the original IndexError on `main` (HEAD `41dc3b6d`)
  • Verified the same test passes 2 RL steps end-to-end with this patch applied
  • No regression in the colocate non-critic path: `tests/test_qwen2.5_0.5B_short.py` still passes
  • Run the full PPO/colocate CI matrix

In colocate mode `slime/ray/placement_group.py:_create_placement_group`
allocates only `actor_num_gpus_per_node * actor_num_nodes` bundles in the
placement group (colocate branch). The critic shares the actor's PG
(`result["critic"] = result["actor"]`) and so contributes no additional
slots — actor, critic, and rollout time-share the same physical GPUs via
offload/onload.

The two lines being removed add `critic_num_gpus_per_node *
critic_num_nodes` onto `args.rollout_num_gpus`, which over-allocates the
regular `ServerGroupConfig.num_gpus` past the PG size. `start_engines`
(`slime/ray/rollout.py`) then iterates `len(all_engines) = rollout_num_gpus
// gpus_per_engine` and indexes `reordered_gpu_ids[gpu_index]` beyond its
length, raising:

  IndexError: list index out of range
    at base_gpu_id = int(reordered_gpu_ids[gpu_index])
    in ServerGroup.start_engines (slime/ray/rollout.py)

Reproducer: `tests/test_qwen2.5_0.5B_ppo_critic_only_short.py`
(`--advantage-estimator ppo` + `--colocate` +
`--rollout-num-gpus 2` against actor 4 GPU). With the addition still in
place, `rollout_num_gpus` is overridden to `4 + 4 = 8` but the PG only has
4 bundles; without it the rollout sizes correctly to the 4 colocated GPUs.

This is the root cause of THUDM#1896 — PR THUDM#1934 added a defensive boundary
check that surfaces the mismatch as a clearer `ValueError`, but the
underlying resource math is still wrong. This patch removes the source of
the mismatch instead of just catching its symptom.

The blamed `if args.use_critic: args.rollout_num_gpus += ...` lines were
added in 371c030 ([Fix] ppo rollout engins for distribute, THUDM#394), which
only touched `arguments.py` and did not extend `placement_group.py` to
match. The Miles RL framework (a downstream fork) does carry the matching
`placement_group.py` change (critic gets its own bundles in colocate
mode), where the addition would make sense; slime does not, so this code
path was inconsistent in slime since THUDM#394.

Signed-off-by: aoshen02 <aoshen@inferact.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant