fix: drop incorrect critic GPU add to rollout_num_gpus in colocate mode#1950
Open
aoshen02 wants to merge 1 commit into
Open
fix: drop incorrect critic GPU add to rollout_num_gpus in colocate mode#1950aoshen02 wants to merge 1 commit into
aoshen02 wants to merge 1 commit into
Conversation
In colocate mode `slime/ray/placement_group.py:_create_placement_group`
allocates only `actor_num_gpus_per_node * actor_num_nodes` bundles in the
placement group (colocate branch). The critic shares the actor's PG
(`result["critic"] = result["actor"]`) and so contributes no additional
slots — actor, critic, and rollout time-share the same physical GPUs via
offload/onload.
The two lines being removed add `critic_num_gpus_per_node *
critic_num_nodes` onto `args.rollout_num_gpus`, which over-allocates the
regular `ServerGroupConfig.num_gpus` past the PG size. `start_engines`
(`slime/ray/rollout.py`) then iterates `len(all_engines) = rollout_num_gpus
// gpus_per_engine` and indexes `reordered_gpu_ids[gpu_index]` beyond its
length, raising:
IndexError: list index out of range
at base_gpu_id = int(reordered_gpu_ids[gpu_index])
in ServerGroup.start_engines (slime/ray/rollout.py)
Reproducer: `tests/test_qwen2.5_0.5B_ppo_critic_only_short.py`
(`--advantage-estimator ppo` + `--colocate` +
`--rollout-num-gpus 2` against actor 4 GPU). With the addition still in
place, `rollout_num_gpus` is overridden to `4 + 4 = 8` but the PG only has
4 bundles; without it the rollout sizes correctly to the 4 colocated GPUs.
This is the root cause of THUDM#1896 — PR THUDM#1934 added a defensive boundary
check that surfaces the mismatch as a clearer `ValueError`, but the
underlying resource math is still wrong. This patch removes the source of
the mismatch instead of just catching its symptom.
The blamed `if args.use_critic: args.rollout_num_gpus += ...` lines were
added in 371c030 ([Fix] ppo rollout engins for distribute, THUDM#394), which
only touched `arguments.py` and did not extend `placement_group.py` to
match. The Miles RL framework (a downstream fork) does carry the matching
`placement_group.py` change (critic gets its own bundles in colocate
mode), where the addition would make sense; slime does not, so this code
path was inconsistent in slime since THUDM#394.
Signed-off-by: aoshen02 <aoshen@inferact.ai>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the root cause behind #1896 (which PR #1934 only worked around with a defensive boundary check).
In colocate mode
slime/ray/placement_group.py:_create_placement_groupallocates onlyactor_num_gpus_per_node * actor_num_nodesbundles in the placement group (theelif args.colocate:branch). The critic shares the actor's PG (result[\"critic\"] = result[\"actor\"] if args.use_critic else None) and so contributes no additional slots — actor / critic / rollout time-share the same physical GPUs via offload/onload.But
slime/utils/arguments.pystill addscritic_num_gpus_per_node * critic_num_nodesontoargs.rollout_num_gpusinside the colocate branch:```python
if args.rollout_num_gpus != args.actor_num_gpus_per_node * args.actor_num_nodes:
logger.info(...)
args.rollout_num_gpus = args.actor_num_gpus_per_node * args.actor_num_nodes
if args.use_critic:
args.rollout_num_gpus += args.critic_num_gpus_per_node * args.critic_num_nodes # <-- buggy
```
This over-allocates the regular
ServerGroupConfig.num_gpuspast the PG size.start_engines(slime/ray/rollout.py) then iterates `len(all_engines) = rollout_num_gpus // gpus_per_engine` and indexes `reordered_gpu_ids[gpu_index]` beyond its length, raising the IndexError tracked in #1896:```
IndexError: list index out of range
at base_gpu_id = int(reordered_gpu_ids[gpu_index])
in ServerGroup.start_engines (slime/ray/rollout.py)
```
Reproducer
`tests/test_qwen2.5_0.5B_ppo_critic_only_short.py` — `--advantage-estimator ppo` (so `use_critic=True`) + `--colocate` + `--rollout-num-gpus 2 --actor-num-gpus-per-node 4`. With the addition still in place, `rollout_num_gpus` is overridden to `4 + 4 = 8` but the colocate PG only has 4 bundles. Without it the rollout sizes correctly to the 4 colocated GPUs.
Verified locally on a 4-GPU H200 setup: the test crashed with the IndexError before this patch, and completes the full 2-step CI loop (`perf 1:` / `perf 2:` / `Job succeeded`) after.
Relationship to #1896 / #1934
This PR fixes the resource math itself.
How the bug got in
`371c030e` ([Fix] ppo rollout engins for distribute, #394) added the `if args.use_critic: args.rollout_num_gpus += ...` lines but only touched `arguments.py`. It did not extend `placement_group.py` to give the critic its own bundles in colocate mode. A downstream fork (Miles, `radixark/miles`) does carry the matching `placement_group.py` change — Miles's colocate branch in `create_placement_groups` adds `if args.use_critic: num_gpus += critic_num_nodes * critic_num_gpus_per_node` and slices a separate `critic_pg_reordered_*`. In that world the `arguments.py` addition is consistent (PG grew, so rollout grows with it). In slime, since #394 only landed half of the change, the two files have been inconsistent and this code path was effectively broken whenever both `use_critic` and `colocate` were set.
Test plan