
[WIP] Fast-LLM trainer integration with vLLM v1 weight broadcast — handover #140

Draft
bigximik wants to merge 84 commits into main from fast-llm

Conversation

bigximik (Collaborator) commented May 6, 2026

Status: WIP — handover from Denis (2026-05-06)

This branch is not ready to merge. It's the in-progress integration of Fast-LLM as an alternative trainer to DeepSpeed, with weight broadcast to vLLM v1 over a persistent NCCL group instead of HTTP. I'm leaving the integration project — this PR captures everything needed to pick it up.

Read this first: docs/FAST_LLM_INTEGRATION.md — canonical handover (architecture, per-file changes, glossary, all known issues with file:line citations, testing guide, operations notes, open questions).

Stats: 79 commits ahead of main, ~8,400 insertions / 195 deletions across 35 files (mostly new tests + integration plumbing + handover docs).

What works today

  • Fast-LLM (gspo branch) trainer launches under torchrun, joins a persistent NCCL broadcast group, and pushes weights to vLLM v1 workers in place. No HTTP weight upload.
  • Coordinated NCCL teardown (training_finished event over Redis → vLLM destroys process group → both sides hit the collective barrier together) — dist.destroy_process_group() no longer hangs; the handshake is sketched after this list.
  • 4-node multi-node smoke verified end-to-end on both fast-llm GSPO and DeepSpeed PPO (see "Smoke result" below).
  • GSPO loss math matches DeepSpeed exactly: grad_norm parity, grpo_new_logprobs matches step-by-step over a 400-step run (see chart below).
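
A minimal sketch of that teardown handshake, assuming a redis stream carries the event. Identifiers, the stream key, and the split of responsibilities below are illustrative only; the real logic lives in pipelinerl/vllm1.py and the Fast-LLM streaming callbacks.

```python
# Hedged sketch of the coordinated teardown; names are hypothetical.
import redis
import torch.distributed as dist


def trainer_announce_finished(r: redis.Redis, broadcast_group) -> None:
    # Trainer publishes training_finished, then leaves the broadcast group.
    r.xadd("weight_events", {"event": "training_finished"})
    dist.destroy_process_group(broadcast_group)


def vllm_listener(r: redis.Redis, broadcast_group) -> None:
    last_id = "0-0"
    while True:
        # Block until the trainer publishes something on the stream.
        for _stream, messages in r.xread({"weight_events": last_id}, block=0):
            for msg_id, fields in messages:
                last_id = msg_id
                if fields.get(b"event") == b"training_finished":
                    # Both sides now call destroy together, so neither one
                    # hangs waiting for the other at the collective barrier.
                    dist.destroy_process_group(broadcast_group)
                    return
```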

Companion Fast-LLM PR

This PipelineRL branch pins to the gspo branch in Fast-LLM (PR #502). The Fast-LLM PR contains:

  • GSPO loss kernel (sequence-level geometric-mean IS-ratio clipping; recapped in standard form below this list)
  • Decoupled loss/gradient divisors (loss / num_documents, grad / num_documents²) + SDP loss correction — exact match to DeepSpeed's 1/batch_size dual-factor math
  • fp32_lm_head flag matching vLLM's bf16_last_layer_fp32 precision (otherwise IS ratios drift)
  • metrics: GRPOMetricsLevel enum (none/basic/with_entropy) — merged from PR #494 (Joel's metrics refactor)
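
For reviewer orientation, the sequence-level objective from the first bullet, written in the standard published GSPO form (a recap of the formulation, not a transcription of the Fast-LLM kernel):

$$
s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{1/|y_i|},
\qquad
\mathcal{L}_{\mathrm{GSPO}}(\theta) = -\,\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\Big(s_i(\theta)\,\hat{A}_i,\ \mathrm{clip}\big(s_i(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\Big)\right]
$$

with $\hat{A}_i$ the group-relative advantage of sequence $y_i$, $|y_i|$ its token length, and $G$ the group size. The actual kernel may use asymmetric clip bounds (cf. epsilon_low in the run config below).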

Once that PR merges to Fast-LLM main, the README install step here should be revved from `git checkout gspo` to `git checkout main` and this PR rebased onto a fresh main.

What's NOT done yet

  • Fix actor _prefetch_to_doc_target overshoot (pipelinerl/actor.py:613). Causes premature run end on long runs (50+ steps). Workaround: bump max_train_steps ~20%. Real fix: trainer signals "done" instead of actor inferring.
  • Address rollout retry exhaustion under bursts (pipelinerl/async_llm.py:137-146). Two consecutive aborts can drop a rollout permanently. Allow more retries or evict stuck rollouts; one possible shape is sketched after this list.
  • Investigate reward lag vs DS (~2-point gap at step 400 in actor/reward_mean — see chart below). Root cause unknown; newlp parity is confirmed so the gap is upstream of the trainer.
  • Resolve commented-out pyproject.toml overrides (pyproject.toml:81-87). The [tool.uv] block force-overrides transformers>=4.51.0 / accelerate>=1.7.0 because tapeagents==0.1.16 pins them lower; [tapeagents] extra is broken at runtime. Either bump tapeagents or drop the extra on this branch.
  • Close fast-llm finetune metric gaps, e.g. rl/ess (effective sample size — diagnostic for data/policy drift).
  • Bump base image + vLLM version. Currently pinned to interactive-toolkit:25.12-py3-vllm014rc1redis (PyTorch 25.12, vLLM 0.14.0rc1). Move to the latest base PyTorch + vLLM that both Fast-LLM and PipelineRL support; re-run smoke after.
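
One possible shape for the retry-exhaustion fix in the second item above: retry with a larger budget and back-off, and explicitly evict the rollout on exhaustion. Function and variable names here are hypothetical and do not match pipelinerl/async_llm.py.

```python
# Hypothetical retry/eviction loop, not the repo's actual code.
import asyncio


async def rollout_with_retries(generate, rollout_id, in_progress: set,
                               max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return await generate(rollout_id)
        except asyncio.TimeoutError:
            # Aborted by a weight-update pause; back off and replay.
            if attempt < max_attempts:
                await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    # Budget exhausted: evict instead of leaving the rollout stuck in progress.
    in_progress.discard(rollout_id)
    raise RuntimeError(f"rollout {rollout_id} dropped after {max_attempts} attempts")
```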

Known issues (with code references)

| Issue | Symptom | Site | Memory ref |
| --- | --- | --- | --- |
| Actor overshoot ends runs early | TimeoutError: No document received after 600s near final step | pipelinerl/actor.py:158, 613-614 | project_actor_samples_target_overshoot_bug.md |
| Rollout retry exhaustion | Rollout stuck in actor's in_progress after attempt=2/2 abort | pipelinerl/async_llm.py:137-146 | project_stall_investigation.md |
| Reward lag vs DS | actor/reward_mean ~2 points below DS at step 400 | unknown (upstream of trainer) | project_fastllm_reward_lag_after_gspo_fix.md |

Current limitation (not a bug): streams=files is not implemented for use_fast_llm=true — Fast-LLM only ships RedisStreamingDataset. Use streams=redis. See project_streams_files_not_supported_fast_llm.md.

Training curves (400-step run): fast-llm GSPO vs DeepSpeed GSPO

Compared runs:

  • fast-llm: math_7b_4node_fastllm_gspo_20260505_122944 (divisor² + SDP fix)
  • DS: math_7b_ds_fastllm_4node_20260428_135427 (matching GSPO config: policy_loss=gspo, epsilon_low=3e-3, 400 steps)

new_logprobs — fast-llm matches DS step-by-step (the GSPO loss math fix is correct):

[chart: new_logprobs]

actor/reward_mean — fast-llm lags DS by ~2 points at step 400 (open issue):

[chart: reward_mean]

How to verify locally

See examples/interactive/fast_llm_4node.sh and examples/interactive/ds_4node.sh — both follow the README install.

# inside an interactive 4-node EAI session, after the README install:
bash examples/interactive/fast_llm_4node.sh   # fast-llm + vLLM v1 + GSPO
bash examples/interactive/ds_4node.sh         # DeepSpeed + vLLM v1 + PPO (reference)

Both run a 2-step smoke and finish in ~10 minutes. Override MAX_TRAIN_STEPS=N for longer runs.

Smoke result (last verified 2026-05-06)

| Smoke | EAI Job | Step 1 grad_norm | Step 2 grad_norm | Step 1 newlp | Step 2 newlp | NaN |
| --- | --- | --- | --- | --- | --- | --- |
| fast-llm GSPO | 59f3b62f | 0.166 | 0.173 | -0.171 | -0.162 | 0 |
| DeepSpeed PPO | 084ef7d8 | 0.201 | 0.247 | -0.162 | -0.146 | 0 |

Per-step wall time ~80–120 s for both — fast-llm and DS run at comparable speed at this scale.

Code change summary

See docs/FAST_LLM_INTEGRATION.md §5 "Per-file changes" for the file-by-file table. Highlights:

  • pipelinerl/launch.py: TCPStore pre-creation for broadcast rendezvous (workaround for torchrun client-only TORCHELASTIC_USE_AGENT_STORE=True); fast_llm.callbacks.streaming.broadcast.* injection.
  • pipelinerl/state.py: fast-llm event-stream listener thread; samples_processed=0 initialization to avoid startup deadlock.
  • pipelinerl/vllm1.py: init_actor_update_group/destroy_actor_update_group with WEIGHTS_BROADCAST_PG_NAME; training_finished handler for coordinated NCCL teardown. The broadcast path is sketched after this list.
  • pipelinerl/async_llm.py: rollout retry on vLLM aborted request (weight-update collision).
  • tests/: weight-broadcast tests (test_vllm1_fast_llm_broadcast.py), full vLLM v1 integration (test_vllm1_integration.py), multi-node topology (test_world_multinode.py), actor error handling.
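
The broadcast path itself, reduced to its essentials. This is a sketch assuming the trainer is rank 0 of the persistent group and that the worker model exposes vLLM's usual load_weights(name, tensor) interface; the actual code is in pipelinerl/vllm1.py and the fast_llm.callbacks.streaming.broadcast.* modules.

```python
# Sketch only: parameter order must be identical on both sides, and `group`
# is the persistent NCCL broadcast group created at launch.
import torch
import torch.distributed as dist


def trainer_push_weights(model: torch.nn.Module, group) -> None:
    # Trainer (rank 0 of the broadcast group) pushes every parameter.
    for _name, param in model.named_parameters():
        dist.broadcast(param.data, src=0, group=group)


def vllm_worker_receive_weights(model_runner, group) -> None:
    # Each vLLM worker receives into scratch buffers and loads them into
    # the running model in place, without restarting the engine.
    updated = []
    for name, param in model_runner.model.named_parameters():
        buf = torch.empty_like(param.data)
        dist.broadcast(buf, src=0, group=group)
        updated.append((name, buf))
    model_runner.model.load_weights(updated)
```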

Reviewer checklist

This is a draft PR for handover, not for merge. Reviewer should:

  1. Read docs/FAST_LLM_INTEGRATION.md end-to-end.
  2. Skim README §"Install FastLLM+PipelineRL" — it should reproduce on a fresh interactive job.
  3. Run bash examples/interactive/fast_llm_4node.sh and confirm step 1-2 metrics in finetune/stdout_node0.log.
  4. Pick up the TODO list above; create separate issues/PRs for each item.

rafapi and others added 30 commits December 12, 2025 14:09
… better abab pattern detection in generation results to test weight broadcast correctness, some refactoring
[WIP] Adding tests to vllm actor for Fast-LLM integration
bigximik added 30 commits April 23, 2026 11:33
Re-adds EngineManager.receive_weight_update_fast_llm() with the same
pause/resume wrap as the HTTP path, fixing the logprob-drift regression
that was introduced when the wrap was dropped in b753ece.

Root cause of the previous deadlock: the first weights_ready event arrives
before the actor has started generating (it is blocked in
wait_for_model_version), so vLLM has zero in-flight requests at that point.
pause_generation(wait_for_inflight_requests=True) then blocks forever
waiting for requests that never come, hanging the NCCL collective.

Fix: first_weights_ready_seen boolean in monitor_redis_stream. The first
weights_ready event (step can be 0 on a fresh start or k on resume) takes
the raw collective_rpc_async path with no pause wrap — matching the prior
behaviour that worked. Every subsequent event calls
receive_weight_update_fast_llm() which does pause → RPC → resume.

Also fixes both pause_generation call sites: the old mode="keep" kwarg was
removed in vllm 0.14.0rc1+; replaced with wait_for_inflight_requests=True,
clear_cache=False (equivalent semantics: drain existing requests without
aborting them, preserve KV cache).

Verified: 10/10 iterations on counting task, ~2s per weight update
(pause → NCCL → resume), no deadlocks or NCCL timeouts.
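
Roughly, the first-event special case described above, condensed and with simplified names; the real handler is monitor_redis_stream in pipelinerl/vllm1.py, and the RPC target name below is a placeholder.

```python
# Condensed sketch; the RPC target name is a placeholder, not the repo's.
class WeightUpdateHandler:
    def __init__(self, engine_manager):
        self.engine_manager = engine_manager
        self.first_weights_ready_seen = False

    async def on_weights_ready(self, step: int) -> None:
        if not self.first_weights_ready_seen:
            # No requests are in flight yet (the actor is still blocked in
            # wait_for_model_version), so a pause that waits for in-flight
            # requests would hang the NCCL collective. Take the raw path.
            self.first_weights_ready_seen = True
            await self.engine_manager.collective_rpc_async("receive_broadcast_weights")
            return
        # Every later event drains in-flight requests around the update:
        # pause -> collective RPC -> resume.
        await self.engine_manager.receive_weight_update_fast_llm()
```
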
pause_generation(mode="keep") freezes requests mid-generation and they
get aborted during the NCCL flush, producing empty logprobs.  The fix
is mode="wait" which drains all in-flight decodes before the weight
update and resumes cleanly after.

Add _pause_generation() that detects the installed vLLM API at runtime
via inspect.signature and calls mode="wait" on newer builds or the
equivalent wait_for_inflight_requests=True on older ones.

Also remove the guessing.py band-aid that silently dropped aborted
rollouts — with proper draining the band-aid is unnecessary and hides
real failures.
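
A compact version of that signature probe (illustrative; the real helper is _pause_generation in pipelinerl/vllm1.py):

```python
# Pick whichever pause kwarg the installed vLLM build understands.
import inspect


def pause_generation_compat(engine):
    params = inspect.signature(engine.pause_generation).parameters
    if "mode" in params:
        # Newer builds: "wait" drains in-flight decodes before pausing.
        return engine.pause_generation(mode="wait")
    # Older builds expose the equivalent behaviour via this flag.
    return engine.pause_generation(wait_for_inflight_requests=True)
```
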
- launch.py: fix _run_finetune_deepspeed to use absolute path for
  run_finetune.py (relative path resolved against /home/toolkit in EAI pods)
- launch.py: add stdout/stderr log capture to _run_finetune_deepspeed
- launch.py: fix dns_address_map saved after address_map overwrite (captured
  pod IPs instead of DNS names); save before the pod IP exchange loop
- launch.py: fix assertion crashes in main() and _run_finetune_deepspeed
  after pod IP exchange by using dns_address_map fallback for hostfile/filter
- launch.py: set streams.host to rank-0 pod IP in multi-node so DeepSpeed
  workers on other nodes can reach Redis
- world.py: expose dns_address_map attribute for post-exchange DNS lookups
- tests/test_world_multinode.py: 33 tests covering pod IP exchange, DeepSpeed
  DNS name invariants, hostfile, Redis host, and absolute entrypoint path
- README.md: add Multi-Node Requirements section (ports, config params,
  per-role connection topology, assumptions)
- remove submit_eai_math_7b_8gpu.sh (local-only convenience script)

Both fast-llm and DeepSpeed 2-node jobs validated to complete training step 1.
…S races

Both fast-llm and DeepSpeed finetune paths now append _node{rank} to all
per-node output files (config, start.sh, stdout.log, stderr.log) when running
across multiple finetune nodes. Single-node runs keep the original names.
Add logprobs-mode=processed_logprobs to the V1 defaults in _get_vllm_kwargs
alongside the existing prefix-caching and async-scheduling fixes.

processed_logprobs returns log-probs from the forward pass rather than
recomputing them, preventing stale values when weights change between
generation and scoring.
Align with the approach in PR #137: pause accepts new requests but lets
in-flight ones finish naturally, rather than draining the engine fully
before the weight update begins.
When vLLM aborts an in-flight request during a weight update pause it
returns finish_reason='abort' with empty logprobs.  Previously this
propagated to make_training_text which raised ValueError and crashed the
entire actor.  Raise asyncio.TimeoutError instead so the actor's
existing retry logic replays the rollout cleanly.
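
The conversion is essentially a guard like the following (simplified; the real check sits in pipelinerl/async_llm.py):

```python
# Aborts from a weight-update pause surface as retries, not crashes.
import asyncio


def raise_if_aborted(finish_reason: str) -> None:
    if finish_reason == "abort":
        # Re-raise as the exception the actor's retry loop already handles,
        # so the rollout is replayed instead of crashing later in
        # make_training_text on the empty logprobs.
        raise asyncio.TimeoutError("request aborted during weight update")
```
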
…mit script

Stale .pod_ips files from a previous run caused the pod IP exchange to
return immediately with old IPs — slow-starting ranks were never waited
for, breaking torchrun rendezvous on resume.  clean_up() now removes the
directory so every run waits for all live ranks to write fresh IPs.

Submit script appends a timestamp to resume job names so EAI does not
reject them as duplicates.
Stale .pod_ips files from a previous job caused rank 0 to complete the
exchange with wrong IPs before other ranks had even started. Then
clean_up() deleted rank_0.txt, leaving ranks 1-N waiting forever.

Rank 0 now atomically wipes the old directory and writes a UUID session
token before any rank writes its IP. Non-zero ranks block on the session
token, so they only write after rank 0 has cleared stale data.

Remove the incorrect pod_ips deletion from clean_up() (it was too late:
exchange already complete, and it wiped rank_0.txt other ranks needed).
The UUID approach was broken: a non-zero rank arriving before rank 0
would see the stale session UUID from the previous job, skip waiting,
write its IP — then rank 0 would wipe the dir (deleting the fresh IP)
and write a new UUID. Rank 0 then waits forever for that rank's file.

Use the rank-0 DNS name from MASTER_ADDR as the token instead. It is
unique per EAI job (contains a job UUID), so non-zero ranks reject a
stale session by comparing token content to their own MASTER_ADDR.
Replace session-token logic with a per-job subdirectory under .pod_ips/.
The subdir name is MASTER_ADDR (a standard distributed-launcher env var,
unique per job), so stale files from previous runs are simply never seen
— no wiping, no barriers, no coordination needed.

This removes the EAI-specific dependency on the dns_address_map naming
convention and works with any launcher that sets MASTER_ADDR.
Add world.run_id config field (default null). The call site resolves it
as: cfg.world.run_id if set, else $MASTER_ADDR, else "default".
On EAI/torchrun MASTER_ADDR is unique per job so the default works
out-of-the-box; other systems can set world.run_id explicitly.

Remove the MASTER_ADDR hardcoding from _exchange_pod_ips itself.
world.run_id must now be set explicitly for multi-node jobs — no silent
fallback to MASTER_ADDR. Raises ValueError if unset, RuntimeError if
the run_id dir already exists (duplicate or stale run detected early).

Rank 0 exclusively creates the dir; non-zero ranks wait for it, so the
existence check is unambiguous: if the dir is there when rank 0 arrives,
it is from a previous job.

Submit script passes world.run_id=${MASTER_ADDR} so EAI jobs are unique
per replica-group without any manual intervention.
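
The exchange boils down to the following pattern. Names, layout, and return value here are illustrative; the real implementation is _exchange_pod_ips.

```python
# Hypothetical sketch of the run_id-scoped pod IP exchange.
import os
import time


def exchange_pod_ips(base_dir: str, run_id: str, rank: int, my_ip: str) -> str:
    run_dir = os.path.join(base_dir, run_id)
    if rank == 0:
        try:
            # Rank 0 exclusively creates the per-job directory; if it already
            # exists, the run_id is stale or duplicated and we fail early.
            os.makedirs(run_dir)
        except FileExistsError:
            raise RuntimeError(f"stale or duplicate run_id directory: {run_dir}")
    else:
        # Non-zero ranks only write after rank 0 has created the directory,
        # so files from a previous job can never leak into this exchange.
        while not os.path.isdir(run_dir):
            time.sleep(1.0)
    with open(os.path.join(run_dir, f"rank_{rank}.txt"), "w") as f:
        f.write(my_ip)
    return run_dir
```
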
- Drop use_v1 toggle: vLLM V1 is now always used (remove use_v1 config
  field, V0 legacy flags, and conditional entrypoint selection)
- launch.py: _get_vllm_kwargs no longer takes use_v1 param; always drops
  V0 legacy flags; num-scheduler-steps dropped unconditionally for V1
- vllm1.py HTTP path: add timing/version logging for pause/update/resume
  (from vllm_v1); keep _pause_generation helper (drains in-flight requests)
  and self.engine.engine_core (not engine_client which isn't in HEAD init)
- vllm1.py fast-llm path: propagate same timing/version logging to
  receive_weight_update_fast_llm for parity with HTTP path
…nt loop

Blocking put on a full queue stalled the asyncio event loop (test_actor_stall_fixed).
Delete from group_rollouts before the await to prevent double-processing.
ServerDisconnectedError is a transient failure (vLLM event loop briefly
blocked during synchronized post-weight-update response burst) — add it
to retryable_rollout_exceptions so the actor backs off and retries instead
of crashing the whole job.

conf/math.yaml: remove use_v1: true left over from before the always-v1
switch; was missed in the 13a42bf merge cleanup.
- Remove single quotes around world.run_id=\${MASTER_ADDR} so bash expands
  MASTER_ADDR in the container (pod IP exchange was hanging because OmegaConf
  tried to resolve the literal string '${MASTER_ADDR}' as a config key)
- Add + prefix to fast_llm.schedule.docs_per_step (new field not in base.yaml
  struct, requires append syntax)
- Add DS submit script for fast-llm branch (submit_eai_math_7b_multinode_ds_vllm_v1.sh)
- Set max_ready_samples_per_lead: 64 (was 512) to match reference branch
- Add monitor_jobs.sh for polling EAI job status
Top-level `fp32_lm_head=true` is rejected after main merge (launch.py warns and exits).
Fast-LLM-side override `+fast_llm.model.base_model.head.fp32_lm_head=true` still works
and is kept. Also replaces removed `compute_extra_metrics=true` with new PR #494 enum
`metrics=with_entropy`.
Adds canonical handover documentation for the fast-llm trainer integration,
since this branch is WIP and being handed off:

- docs/FAST_LLM_INTEGRATION.md: architecture, per-file changes, configuration
  knobs, glossary, known issues with file:line citations, testing guide,
  operations notes, and open questions for the successor.
- examples/interactive/fast_llm_4node.sh, ds_4node.sh: 2-step smoke runs that
  mirror the EAI submit scripts but execute in the current shell. Default to
  MAX_TRAIN_STEPS=2 for verification; bump for real runs.
- README.md: refresh stale install steps (gspo branch in Fast-LLM, not
  jlp_pipeline_rl), call out pyproject.toml tapeagents caveat, add a
  "Fast-LLM trainer path (preview)" subsection under §5 Trainer pointing to
  the canonical doc.

No code changes. Functional behavior unchanged.
Drop sections that were nice-to-have ideas, not real code TODOs:
- streams=files / +finetune.max_lag (speculation about reward-lag fix)
- Step progress heartbeat (no actual TODO in Fast-LLM runner.py)
- xreadgroup count=1 perf (perf speculation, no measurement)
- Data logging stash (debug tool, not handover-critical)

Tighten reward-lag entry: drop the unverified streams-staleness theory and
"investigations to try" list. Reframe streams=files as a current limitation,
not a fix-needed item.

Real measured issues (actor overshoot, rollout retry exhaustion, reward lag
investigation needed) stay.
- Embed reward_mean and new_logprobs charts (fast-llm GSPO vs DeepSpeed GSPO,
  400-step run, eps=3e-3): newlp matches step-by-step; reward lags ~2 points
  at step 400.
- Compared runs: fast-llm math_7b_4node_fastllm_gspo_20260505_122944
  (divisor² + SDP fix) vs DS math_7b_ds_fastllm_4node_20260428_135427.
- Add open questions for the successor:
  * Resolve commented-out pyproject.toml [tool.uv] tapeagents overrides
    (transformers/accelerate pins; [tapeagents] extra broken at runtime).
  * Close metric coverage gap on fast-llm finetune side (start with rl/ess).
Note that the interactive-toolkit:25.12-py3-vllm014rc1redis image is built
from the fml/pytorch_vllm014rc1 branch of ServiceNow/research-interactive-
toolkit (SN-internal). Base layer nvcr.io/nvidia/pytorch:25.12-py3, branch
adds vLLM 0.14.0rc1, redis, and EAI helpers.
.research-interactive-env values are for *building* the image (in the
research-interactive-toolkit repo on branch fml/pytorch_vllm014rc1), not
for using the prebuilt one. Reword both README and handover doc so that
"use" just means referencing the image URI, and "build" is a separate
flow with the env config.