VLM Training in SkyRL
This document outlines the changes required for SkyRL/skyrl to support vision inputs, similar to the official tinker implementation. The proposed changes apply only to the PyTorch / skyrl-train backend; the JAX backend is out of scope.
Test Case: The supervised vision-language training (VLM SFT) example (here) should run out of the box and converge. The task is to classify Caltech101, a standard image object recognition benchmark (expected accuracy >90%, especially post-SFT).
An example implementation is on this branch.
Control Flow Overview
The largest change is passing the list of `ModelInputChunk`s to the SkyRL-train backend. This reconciles the fact that the PyTorch / Hugging Face path expects sequences with image pad tokens already inserted, while vLLM inserts image padding within the engine. The proposed workaround is to pass the `ImageChunk`s to SkyRL-train, which decides, based on the method called, whether or not to add image pad tokens.
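To make the branching concrete, here is a minimal sketch of the proposed dispatch. `prepare_model_input`, `render_for_vllm`, and `chunks_to_input_ids` are hypothetical names standing in for the actual entry points; the two helpers are sketched in the sections below.

```python
from typing import Any, Sequence

def prepare_model_input(chunks: Sequence[Any], method: str) -> Any:
    """Decide per method call whether image pad tokens get inserted."""
    if method == "sample":
        # vLLM path: pass chunks through untouched; vLLM expands image
        # placeholders inside the engine.
        return render_for_vllm(chunks)
    if method == "forward_backward":
        # HF/FSDP2 path: the model expects image pad tokens to already be
        # present in input_ids, so insert them here.
        return chunks_to_input_ids(chunks)
    raise ValueError(f"unsupported method: {method}")
```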
Tokenization Responsibilities
The client handles tokenization of text and encoding images into bytes. The server then has two code paths:
- Sampling through vLLM, and
- Training with Hugging Face / FSDP2.
VLLM: skyrl-train will be responsible for converting the base64-encoded image bytes into a Pillow image. This will then be passed through a vLLM renderer.
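A minimal sketch of that conversion (the helper name is illustrative, not an existing skyrl-train function):

```python
import base64
import io

from PIL import Image

def image_bytes_to_pil(image_base64: str) -> Image.Image:
    """Decode a client-supplied base64 image payload into a Pillow image."""
    raw = base64.b64decode(image_base64)
    return Image.open(io.BytesIO(raw)).convert("RGB")

# The decoded image can then be attached to the vLLM request, e.g.:
# llm.generate({"prompt": prompt,
#               "multi_modal_data": {"image": image_bytes_to_pil(b64)}})
```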
HF / FSDP2 Training: Because the datum is a list of encoded chunks that must be concatenated, the skyrl-train backend should (see the sketch after this list):
- Patchify the image
- Compute the expected number of image tokens (validating against the client-provided count, if any)
- Concatenate input chunks to create the model input
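A minimal sketch of these three steps, assuming illustrative chunk shapes; the token count is model-specific in reality, so `num_image_tokens` below is a placeholder formula (ViT-style patch grid with 2x2 spatial merging):

```python
import math
from dataclasses import dataclass

@dataclass
class EncodedTextChunk:
    tokens: list[int]

@dataclass
class ImageChunk:
    width: int
    height: int
    expected_num_tokens: int | None = None  # optionally supplied by the client

def num_image_tokens(chunk: ImageChunk, patch_size: int = 14, merge: int = 2) -> int:
    # Patchify: how many placeholder tokens the vision tower emits.
    grid_w = math.ceil(chunk.width / patch_size)
    grid_h = math.ceil(chunk.height / patch_size)
    return (grid_w // merge) * (grid_h // merge)

def chunks_to_input_ids(chunks, image_pad_token_id: int) -> list[int]:
    input_ids: list[int] = []
    for chunk in chunks:
        if isinstance(chunk, EncodedTextChunk):
            input_ids.extend(chunk.tokens)
        else:
            n = num_image_tokens(chunk)
            # Validate against the client-provided count if one was sent.
            if chunk.expected_num_tokens is not None and chunk.expected_num_tokens != n:
                raise ValueError(
                    f"client expected {chunk.expected_num_tokens} image tokens, computed {n}"
                )
            input_ids.extend([image_pad_token_id] * n)
    return input_ids
```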
Tasks Breakdown
There are a lot of moving pieces on the way to a full VLM SFT implementation. Instead of one large change, here is a set of incremental changes that should preserve backwards compatibility and interoperability with the JAX backend.
1. Inference Engine Piping
Changes are localized around the `sample` workflow. We will only support the new inference stack (`inference_engine_client_http_endpoint.py`).
- `/v1/chat/completion/render` + inclusion of multi-modal data.
- `/inference/v1/generate`: the renderer refactor is already completed here, so this is primarily a piping issue.
- Multi-modal data in `GenerateRequest`, which is expected by the `v1/generate` API (sketched below).
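For illustration, the piping could look like the following; the field names here are assumptions, not the actual `GenerateRequest` schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GenerateRequest:
    prompt_token_ids: list[int]
    sampling_params: dict
    # New, optional field: existing text-only callers never set it, so the
    # request shape stays backwards compatible.
    multi_modal_data: Optional[dict] = None  # e.g. {"image": ["<base64>", ...]}
```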
Compatibility: Existing `sample` calls should not be affected. Multi-modal data piping defaults to `None`.
Tests: (a) Integration test with a VL model: pass a red square and ask the model what color it is (see fixture sketch below). (b) CI test verifying our pre-processing yields the same result as vLLM, along with a log-probabilities comparison test.
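A sketch of the red-square fixture for test (a); the client call at the end is pseudocode for whatever the test harness exposes:

```python
import base64
import io

from PIL import Image

def red_square_b64(size: int = 64) -> str:
    """Build the red-square fixture as a base64-encoded PNG."""
    img = Image.new("RGB", (size, size), color=(255, 0, 0))
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

# Hypothetical harness call:
# response = client.sample(prompt="What color is this square?",
#                          images=[red_square_b64()])
# assert "red" in response.text.lower()
```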
2. Training Integration (FSDP2 only)
- Update `_forward_backward_micro` to pull out `multi_modal_inputs` and pass them to `model_wrapper.py`.
- Update the `model_wrapper.py` forward call signature to accept multi-modal inputs and pass them to the FSDP2-wrapped Hugging Face model (see the sketch below).
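A minimal sketch of the extended forward path, assuming the wrapper holds the FSDP2-wrapped HF model as `self.model`; the kwarg names (`pixel_values`, etc.) follow HF VLM conventions but are illustrative:

```python
from typing import Optional

import torch

class ModelWrapper:
    def __init__(self, model: torch.nn.Module):
        self.model = model  # FSDP2-wrapped Hugging Face model

    def forward(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor,
        multi_modal_inputs: Optional[dict] = None,
    ):
        # Text-only callers pass nothing extra; VLM callers pass e.g.
        # {"pixel_values": ..., "image_grid_thw": ...} straight through.
        extra = multi_modal_inputs or {}
        return self.model(input_ids=input_ids, attention_mask=attention_mask, **extra)
```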
Compatibility: The new prompt-ID construction is only used when image chunks are present; otherwise, multi-modal data defaults to `None`.
Tests: Unit tests for `image_chunk_to_image_tensor` and `chunks_to_input_ids`. Update `Experience` and `TrainingInput` tests.
3. SkyRL-train chunk processing
- `ModelInputChunk` aggregation and pass-through for `prepare_sample_batch`.
- `ModelInputChunk` aggregation and pass-through for `prepare_model_pass_batch`.
Compatibility: Existing code paths expecting a `list[int]` for prompt IDs should still work.
Tests: Multiple encoded text chunks should yield the same result as a single encoded text chunk. Similarly, passing encoded text chunks should yield the same result as not passing `ModelInputChunk`s to the backend at all (plain token IDs).
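A sketch of the first equivalence check, using a self-contained, text-only reduction of the `chunks_to_input_ids` helper sketched earlier:

```python
from dataclasses import dataclass

@dataclass
class EncodedTextChunk:
    tokens: list[int]

def chunks_to_input_ids(chunks) -> list[int]:
    # Text-only reduction of the earlier sketch: just concatenate tokens.
    out: list[int] = []
    for chunk in chunks:
        out.extend(chunk.tokens)
    return out

def test_split_text_chunks_match_single_chunk():
    tokens = [101, 7, 8, 9, 102]
    single = chunks_to_input_ids([EncodedTextChunk(tokens)])
    split = chunks_to_input_ids([EncodedTextChunk(tokens[:2]), EncodedTextChunk(tokens[2:])])
    # Splitting pre-tokenized text across chunks must not change the result,
    # and must match passing the raw token list (the no-chunks code path).
    assert single == split == tokens
```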
4. Core tinker types expansion
- Extend `ModelInputChunk` beyond text (generalized `InputChunk`); take the `ImageChunk` definition from the Tinker SDK (see the sketch below).
- [`renderer.py`] Create a renderer which uses vLLM to process vision inputs and produce appropriate `token_ids` and `pixel_values` for the training backend.
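A sketch of what the generalized union could look like; the field names loosely follow the tinker SDK shapes and are assumptions, not the final definitions:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class EncodedTextChunk:
    tokens: list[int]

@dataclass
class ImageChunk:
    data: bytes                            # decoded image payload
    format: str                            # e.g. "png" or "jpeg"
    expected_num_tokens: int | None = None  # optional client-side count

# Generalized union replacing the text-only chunk type.
ModelInputChunk = Union[EncodedTextChunk, ImageChunk]
```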
Compatibility: No new control paths for text-only training; the new paths apply only to vision-language training.
Tests: Test initialization and chunk validation logic. Correctness test via the Caltech101 tinker cookbook example.