Overview
Tracking a set of changes across `skyrl-train` and `skyrl-agent` that will be submitted as a PR. Summarizing here for visibility and early feedback.
1. Anthropic Messages API endpoint (`/v1/messages`) — skyrl-train
Adds a `/v1/messages` endpoint compatible with the Anthropic Messages API across the full inference engine stack, allowing agents using the Claude SDK (or any Anthropic-compatible client) to talk directly to a vLLM/SGLang backend served by SkyRL's inference engine server.
- `InferenceEngineInterface`: new `anthropic_messages()` abstract method
- `InferenceEngineClient`: routes requests using session-based sticky routing (same logic as `chat_completion`)
- HTTP endpoint (`inference_engine_client_http_endpoint.py`): FastAPI `POST /v1/messages` with input validation and proper HTTP status mapping for Anthropic error types
- `RemoteInferenceEngine` / `RayWrappedInferenceEngine`: forwarding implementations
- `AsyncVLLMInferenceEngine`: full implementation — converts Anthropic → OpenAI chat format, calls `chat_completion`, converts the response back (including `stop_reason`, `usage`, content blocks)
- SGLang and sync vLLM: raise `NotImplementedError` with TODOs
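The format conversion can be sketched roughly as follows. This is illustrative only — the helper names and exact field handling are assumptions, not the actual `AsyncVLLMInferenceEngine` implementation; it shows the two asymmetries the adapter must bridge (Anthropic's top-level `system` field and block-structured `content`, and the `finish_reason` → `stop_reason` mapping):

```python
# Hypothetical sketch of the Anthropic -> OpenAI translation; real code differs.

def anthropic_to_openai_messages(body: dict) -> list[dict]:
    """Map an Anthropic Messages request body to OpenAI chat messages."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first message.
    if body.get("system"):
        messages.append({"role": "system", "content": body["system"]})
    for msg in body["messages"]:
        content = msg["content"]
        # Anthropic content may be a list of typed blocks; flatten text blocks.
        if isinstance(content, list):
            content = "".join(
                block["text"] for block in content if block.get("type") == "text"
            )
        messages.append({"role": msg["role"], "content": content})
    return messages


def openai_stop_to_anthropic(finish_reason: str) -> str:
    """Map OpenAI finish_reason values to Anthropic stop_reason values."""
    return {"stop": "end_turn", "length": "max_tokens"}.get(
        finish_reason, finish_reason
    )
```

Tool-use blocks and streaming need additional handling, but the text path reduces to these two mappings.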
2. Improved LoRA weight swap in `AsyncVLLMInferenceEngine` — skyrl-train
Replaces naive LoRA loading with a proper swap: aborts in-flight requests, removes the old adapter, loads the new one, resets the prefix cache, and tracks `_active_lora_id`. This is primarily needed when LoRA fine-tuning is used with the HTTP inference endpoint enabled — without this fix, requests going through `OpenAIServingChat` bypass the LoRA adapter entirely and generate from the base model, causing training to proceed on incorrect rollouts. A monkey patch on `_maybe_get_adapters` ensures both the direct `generate()` path and the OpenAI HTTP-endpoint path use the same `_active_lora_id`, keeping them consistent across weight updates.
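The swap ordering described above can be sketched as follows. The engine methods (`abort_all_requests`, `remove_lora`, `add_lora`, `reset_prefix_cache`) are stand-ins for the corresponding vLLM operations, not exact signatures — the point is the sequence, and why each step must happen before the next:

```python
# Illustrative sketch of the adapter-swap sequence, assuming a minimal
# engine interface; not the actual skyrl-train implementation.

class LoraSwapper:
    def __init__(self, engine):
        self.engine = engine
        self._active_lora_id = None  # id of the currently loaded adapter

    def swap_lora(self, new_lora_id: int, lora_path: str) -> None:
        # 1. Abort in-flight requests so no generation straddles the swap.
        self.engine.abort_all_requests()
        # 2. Drop the stale adapter before loading the new weights.
        if self._active_lora_id is not None:
            self.engine.remove_lora(self._active_lora_id)
        # 3. Load the freshly trained adapter.
        self.engine.add_lora(new_lora_id, lora_path)
        # 4. Cached prefixes were computed with old weights; invalidate them.
        self.engine.reset_prefix_cache()
        # 5. Track the id so every decode path resolves the same adapter.
        self._active_lora_id = new_lora_id
```

Tracking `_active_lora_id` in one place is what lets the `_maybe_get_adapters` patch keep the direct and HTTP paths consistent: both read the same id rather than resolving adapters independently.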
3. Improved `transitions_to_training_data` — skyrl-agent
Rewrote the accumulator logic for robustness:
- Validation for `None`/empty observations, actions, and token lists
- Proper handling of missing or mismatched logprobs: marks the whole datum as `response_logprobs=None` rather than silently padding with zeros. In agentic training, logprobs may legitimately be unavailable for some transitions (e.g. externally generated actions). Padding those with `0.0` and treating them as valid would produce incorrect importance sampling ratios during off-policy correction, leading to silent training errors. Setting `response_logprobs=None` instead signals downstream that no correction should be applied for that datum.
- Explicit length-mismatch sanity checks with detailed error messages
- Cleaner naming and inline invariant documentation
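The logprob-accumulation rule can be sketched as below. Names are hypothetical and this folds the missing and mismatched cases into one "return `None`" path for brevity (the actual code distinguishes them with detailed error messages, per the checks above):

```python
from typing import Optional

# Illustrative accumulator for per-transition logprobs; not the real
# transitions_to_training_data helper.

def accumulate_logprobs(
    token_chunks: list,
    logprob_chunks: list,
) -> Optional[list]:
    """Concatenate per-transition logprobs into one response-level list,
    or return None if any chunk is missing or mismatched.

    Never pads with zeros: a padded 0.0 would be treated as a valid
    logprob and corrupt importance sampling ratios downstream.
    """
    out = []
    for tokens, logprobs in zip(token_chunks, logprob_chunks):
        if logprobs is None or len(logprobs) != len(tokens):
            # Whole datum opts out of off-policy correction.
            return None
        out.extend(logprobs)
    return out
```

A single bad transition thus poisons only the correction term for its own datum, not the datum itself.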
4. Fix `TrainingInputBatch` construction — skyrl-train
Only adds `rollout_logprobs` and `is_last_step` to the batch dict when non-None, avoiding TensorDict wrapping `None` as `NonTensorData`. Uses an `isinstance(..., torch.Tensor)` guard when reading `rollout_logprobs` back out.
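The pattern is simple enough to show in miniature. To keep the sketch dependency-free, plain lists stand in for tensors and `isinstance(val, list)` stands in for the real `isinstance(val, torch.Tensor)` guard; function names are hypothetical:

```python
# Sketch of conditional key insertion plus a type-guarded readback.

def build_batch(response_ids, rollout_logprobs=None, is_last_step=None) -> dict:
    batch = {"response_ids": response_ids}
    # Only insert optional keys when present, so a dict-wrapping container
    # (TensorDict in skyrl-train) never sees a literal None value.
    if rollout_logprobs is not None:
        batch["rollout_logprobs"] = rollout_logprobs
    if is_last_step is not None:
        batch["is_last_step"] = is_last_step
    return batch


def read_rollout_logprobs(batch: dict):
    val = batch.get("rollout_logprobs")
    # Type guard on readback: anything that is not the expected type
    # (here list; torch.Tensor in the real code) is treated as absent.
    return val if isinstance(val, list) else None
```

The guard on readback means a batch built by older code (where the key could hold wrapped `None`) degrades to "no logprobs" instead of crashing downstream.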
5. Fix `load_checkpoints()` indexing in agent trainer — skyrl-agent
`load_checkpoints()` returns `Tuple[int, str]`; added `[0]` to extract just `global_step`.