
[train][agent] Add Anthropic Messages API endpoint, fix LoRA weight swap, and improve transition-to-training-data logic #1222

@ashutoshuiuc

Description

Overview

Tracking a set of changes across skyrl-train and skyrl-agent that will be submitted as a PR. Summarizing here for visibility and early feedback.


1. Anthropic Messages API endpoint (/v1/messages) — skyrl-train

Adds a /v1/messages endpoint compatible with the Anthropic Messages API across the full inference engine stack, allowing agents using the Claude SDK (or any Anthropic-compatible client) to talk directly to a
vLLM/SGLang backend served by SkyRL's inference engine server.

  • InferenceEngineInterface: new anthropic_messages() abstract method
  • InferenceEngineClient: routes requests using session-based sticky routing (same logic as chat_completion)
  • HTTP endpoint (inference_engine_client_http_endpoint.py): FastAPI POST /v1/messages with input validation and proper HTTP status mapping for Anthropic error types
  • RemoteInferenceEngine / RayWrappedInferenceEngine: forwarding implementations
  • AsyncVLLMInferenceEngine: full implementation — converts Anthropic → OpenAI chat format, calls chat_completion, converts response back (including stop_reason, usage, content blocks)
  • SGLang and sync vLLM: raise NotImplementedError with TODOs
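The core of the AsyncVLLM implementation is the format conversion in both directions. A minimal sketch of what that translation layer might look like is below; the helper names (`anthropic_to_openai_messages`, `openai_finish_to_stop_reason`) and the exact field handling are illustrative assumptions, not the PR's actual API:

```python
def anthropic_to_openai_messages(body: dict) -> list[dict]:
    """Flatten an Anthropic Messages payload into OpenAI chat messages (sketch)."""
    messages = []
    # Anthropic carries the system prompt as a top-level field, while
    # OpenAI chat expects it as the first message in the list.
    if body.get("system"):
        messages.append({"role": "system", "content": body["system"]})
    for msg in body.get("messages", []):
        content = msg["content"]
        # Anthropic content may be a list of content blocks; join the text blocks.
        if isinstance(content, list):
            content = "".join(
                block["text"] for block in content if block.get("type") == "text"
            )
        messages.append({"role": msg["role"], "content": content})
    return messages


def openai_finish_to_stop_reason(finish_reason: str) -> str:
    """Map OpenAI finish_reason values onto Anthropic stop_reason values (sketch)."""
    return {"stop": "end_turn", "length": "max_tokens"}.get(finish_reason, "end_turn")
```

The reverse direction (wrapping the chat completion back into Anthropic-style content blocks with `usage` and `stop_reason`) follows the same pattern.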

2. Improved LoRA weight swap in AsyncVLLMInferenceEngine — skyrl-train

Replaces naive LoRA loading with a proper swap: aborts in-flight requests, removes old adapter, loads new one, resets prefix cache, and tracks _active_lora_id. This is primarily needed when LoRA fine-tuning is used
with the HTTP inference endpoint enabled — without this fix, requests going through OpenAIServingChat bypass the LoRA adapter entirely and generate from the base model, causing training to proceed on incorrect
rollouts. A monkey patch on _maybe_get_adapters ensures both the direct generate() path and the OpenAI HTTP-endpoint path use the same _active_lora_id, keeping them consistent across weight updates.


3. Improved transitions_to_training_data — skyrl-agent

Rewrote the accumulator logic for robustness:

  • Validation for None/empty observations, actions, and token lists
  • Proper handling of missing or mismatched logprobs: marks the whole datum as response_logprobs=None rather than silently padding with zeros. In agentic training, logprobs may legitimately be unavailable for some
    transitions (e.g. externally-generated actions). Padding those with 0.0 and treating them as valid would produce incorrect importance sampling ratios during off-policy correction, leading to silent training errors.
    Setting response_logprobs=None instead signals downstream that no correction should be applied for that datum.
  • Explicit length-mismatch sanity checks with detailed error messages
  • Cleaner naming and inline invariant documentation
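The logprob policy in the second bullet can be sketched as a small accumulator; the function name and the list-of-lists shape are illustrative, not the actual skyrl-agent signature:

```python
def accumulate_response_logprobs(token_lists, logprob_lists):
    """Concatenate per-transition logprobs, or return None if any are unusable (sketch)."""
    out = []
    for tokens, logprobs in zip(token_lists, logprob_lists):
        if logprobs is None or len(logprobs) != len(tokens):
            # Padding with 0.0 would fake a probability of 1.0 for these
            # tokens and corrupt importance-sampling ratios; returning None
            # signals downstream to skip off-policy correction for this datum.
            return None
        out.extend(logprobs)
    return out
```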

4. Fix TrainingInputBatch construction — skyrl-train

Only adds rollout_logprobs and is_last_step to the batch dict when they are non-None, avoiding TensorDict wrapping None as NonTensorData. Uses an isinstance(..., torch.Tensor) guard when reading rollout_logprobs back out.
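A dependency-free sketch of both halves of this fix is below. To keep it self-contained, `TensorLike` stands in for `torch.Tensor` and a plain dict stands in for the TensorDict-backed batch; in the real code the guard is `isinstance(value, torch.Tensor)`:

```python
class TensorLike:
    """Placeholder for torch.Tensor in this sketch."""
    def __init__(self, data):
        self.data = data


def build_batch_dict(rollout_logprobs=None, is_last_step=None):
    batch = {}
    # Only include keys whose values are non-None, so the TensorDict
    # constructor never wraps a None as NonTensorData.
    if rollout_logprobs is not None:
        batch["rollout_logprobs"] = rollout_logprobs
    if is_last_step is not None:
        batch["is_last_step"] = is_last_step
    return batch


def read_rollout_logprobs(batch):
    value = batch.get("rollout_logprobs")
    # isinstance guard on readback: a missing key or a non-tensor value
    # (e.g. leaked NonTensorData) reads back as None.
    return value if isinstance(value, TensorLike) else None
```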


5. Fix load_checkpoints() indexing in agent trainer — skyrl-agent

load_checkpoints() returns Tuple[int, str]; added [0] to extract just global_step.
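For clarity, a minimal illustration of the bug; the stub body is hypothetical:

```python
from typing import Tuple


def load_checkpoints() -> Tuple[int, str]:
    # Stand-in for the trainer's real method, which returns
    # (global_step, checkpoint_path).
    return 42, "/ckpts/step_42"


# Before the fix, the whole tuple was assigned to global_step,
# breaking any subsequent step arithmetic. The fix indexes [0]:
global_step = load_checkpoints()[0]
```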
