
[train][agent] Add Anthropic Messages API endpoint, fix LoRA weight swap, and improve transition-to-training-data logic #1222

@ashutoshuiuc

Description

Overview

Tracking a set of changes across skyrl-train and skyrl-agent that will be submitted as a PR. Summarizing here for visibility and early feedback.


1. Anthropic Messages API endpoint (/v1/messages) — skyrl-train

Adds a /v1/messages endpoint compatible with the Anthropic Messages API across the full inference engine stack, allowing agents using the Claude SDK (or any Anthropic-compatible client) to talk directly to a
vLLM/SGLang backend served by SkyRL's inference engine server.

  • InferenceEngineInterface: new anthropic_messages() abstract method
  • InferenceEngineClient: routes requests using session-based sticky routing (same logic as chat_completion)
  • HTTP endpoint (inference_engine_client_http_endpoint.py): FastAPI POST /v1/messages with input validation and proper HTTP status mapping for Anthropic error types
  • RemoteInferenceEngine / RayWrappedInferenceEngine: forwarding implementations
  • AsyncVLLMInferenceEngine: full implementation — converts Anthropic → OpenAI chat format, calls chat_completion, converts response back (including stop_reason, usage, content blocks)
  • SGLang and sync vLLM: raise NotImplementedError with TODOs
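The core of the AsyncVLLM implementation is the format conversion in both directions. A minimal sketch of what that translation layer might look like is below; the helper names (`anthropic_to_openai_messages`, `openai_finish_to_stop_reason`) and the exact field handling are illustrative assumptions, not the PR's actual API:

```python
def anthropic_to_openai_messages(body: dict) -> list[dict]:
    """Flatten an Anthropic Messages payload into OpenAI chat messages (sketch)."""
    messages = []
    # Anthropic carries the system prompt as a top-level field, while
    # OpenAI chat expects it as the first message in the list.
    if body.get("system"):
        messages.append({"role": "system", "content": body["system"]})
    for msg in body.get("messages", []):
        content = msg["content"]
        # Anthropic content may be a list of content blocks; join the text blocks.
        if isinstance(content, list):
            content = "".join(
                block["text"] for block in content if block.get("type") == "text"
            )
        messages.append({"role": msg["role"], "content": content})
    return messages


def openai_finish_to_stop_reason(finish_reason: str) -> str:
    """Map OpenAI finish_reason values onto Anthropic stop_reason values (sketch)."""
    return {"stop": "end_turn", "length": "max_tokens"}.get(finish_reason, "end_turn")
```

The reverse direction (wrapping the chat completion back into Anthropic-style content blocks with `usage` and `stop_reason`) follows the same pattern.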

2. Improved LoRA weight swap in AsyncVLLMInferenceEngine — skyrl-train

Replaces naive LoRA loading with a proper swap: aborts in-flight requests, removes old adapter, loads new one, resets prefix cache, and tracks _active_lora_id. This is primarily needed when LoRA fine-tuning is used
with the HTTP inference endpoint enabled — without this fix, requests going through OpenAIServingChat bypass the LoRA adapter entirely and generate from the base model, causing training to proceed on incorrect
rollouts. A monkey patch on _maybe_get_adapters ensures both the direct generate() path and the OpenAI HTTP-endpoint path use the same _active_lora_id, keeping them consistent across weight updates.


3. Improved transitions_to_training_data — skyrl-agent

Rewrote the accumulator logic for robustness:

  • Validation for None/empty observations, actions, and token lists
  • Proper handling of missing or mismatched logprobs: marks the whole datum as response_logprobs=None rather than silently padding with zeros. In agentic training, logprobs may legitimately be unavailable for some
    transitions (e.g. externally-generated actions). Padding those with 0.0 and treating them as valid would produce incorrect importance sampling ratios during off-policy correction, leading to silent training errors.
    Setting response_logprobs=None instead signals downstream that no correction should be applied for that datum.
  • Explicit length-mismatch sanity checks with detailed error messages
  • Cleaner naming and inline invariant documentation
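The logprob policy in the second bullet can be sketched as a small accumulator; the function name and the list-of-lists shape are illustrative, not the actual skyrl-agent signature:

```python
def accumulate_response_logprobs(token_lists, logprob_lists):
    """Concatenate per-transition logprobs, or return None if any are unusable (sketch)."""
    out = []
    for tokens, logprobs in zip(token_lists, logprob_lists):
        if logprobs is None or len(logprobs) != len(tokens):
            # Padding with 0.0 would fake a probability of 1.0 for these
            # tokens and corrupt importance-sampling ratios; returning None
            # signals downstream to skip off-policy correction for this datum.
            return None
        out.extend(logprobs)
    return out
```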

4. Fix TrainingInputBatch construction — skyrl-train

Only adds rollout_logprobs and is_last_step to the batch dict when they are non-None, avoiding TensorDict wrapping None as NonTensorData. Uses an isinstance(..., torch.Tensor) guard when reading rollout_logprobs back out.
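A dependency-free sketch of both halves of this fix is below. To keep it self-contained, `TensorLike` stands in for `torch.Tensor` and a plain dict stands in for the TensorDict-backed batch; in the real code the guard is `isinstance(value, torch.Tensor)`:

```python
class TensorLike:
    """Placeholder for torch.Tensor in this sketch."""
    def __init__(self, data):
        self.data = data


def build_batch_dict(rollout_logprobs=None, is_last_step=None):
    batch = {}
    # Only include keys whose values are non-None, so the TensorDict
    # constructor never wraps a None as NonTensorData.
    if rollout_logprobs is not None:
        batch["rollout_logprobs"] = rollout_logprobs
    if is_last_step is not None:
        batch["is_last_step"] = is_last_step
    return batch


def read_rollout_logprobs(batch):
    value = batch.get("rollout_logprobs")
    # isinstance guard on readback: a missing key or a non-tensor value
    # (e.g. leaked NonTensorData) reads back as None.
    return value if isinstance(value, TensorLike) else None
```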


5. Fix load_checkpoints() indexing in agent trainer — skyrl-agent

load_checkpoints() returns Tuple[int, str]; added [0] to extract just global_step.
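For clarity, a minimal illustration of the bug; the stub body is hypothetical:

```python
from typing import Tuple


def load_checkpoints() -> Tuple[int, str]:
    # Stand-in for the trainer's real method, which returns
    # (global_step, checkpoint_path).
    return 42, "/ckpts/step_42"


# Before the fix, the whole tuple was assigned to global_step,
# breaking any subsequent step arithmetic. The fix indexes [0]:
global_step = load_checkpoints()[0]
```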
