
[Feature] get lora capacity info#201

Open
kevssim wants to merge 2 commits into modelscope:main from kevssim:get_num_lora_slot

Collaborator

@kevssim kevssim commented May 22, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Summary

Adds a capacity_info endpoint for querying current LoRA capacity across registered model replicas.

API

Tinker-Compatible Endpoint

GET /api/v1/capacity_info

Twinkle Endpoint

GET /api/v1/twinkle/capacity_info

Response format:

{
  "max_loras": 5,
  "used_loras": 0,
  "free_loras": 5
}

Fields:

  • max_loras: Total LoRA capacity across registered replicas.
  • used_loras: Number of currently loaded LoRA adapters.
  • free_loras: Remaining available LoRA slots.
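The relationship between the three fields can be sketched as follows. This is a minimal illustration, not the PR's implementation: it assumes `free_loras` is simply `max_loras - used_loras`, and the names `CapacityInfo`, `aggregate_capacity`, `replica_max`, and `replica_models` are hypothetical stand-ins for whatever the server state tracks internally.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CapacityInfo:
    """Hypothetical stand-in for the capacity_info response body."""
    max_loras: int
    used_loras: int
    free_loras: int


def aggregate_capacity(replica_max: dict, replica_models: dict) -> CapacityInfo:
    """Sum capacity over registered replicas.

    replica_max maps replica_id -> max LoRA slots on that replica.
    replica_models maps replica_id -> set of adapter names loaded there.
    Both structures are assumptions made for this sketch.
    """
    max_loras = sum(replica_max.values())
    # Only count adapters on replicas that have registered their capacity.
    used_loras = sum(len(replica_models.get(rid, ())) for rid in replica_max)
    # Clamp at zero so an over-subscribed replica never reports negative slots.
    free_loras = max(max_loras - used_loras, 0)
    return CapacityInfo(max_loras, used_loras, free_loras)


# Two replicas with 3 and 2 slots and nothing loaded yet.
info = aggregate_capacity({"replica-0": 3, "replica-1": 2}, {})
print(asdict(info))
```

With these inputs the totals match the sample response above: five slots in total, none used, five free.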


@kevssim kevssim requested a review from Yunnglin May 22, 2026 06:40
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a new /capacity_info endpoint to monitor global LoRA capacity and updates the model registration logic. The review identifies a critical omission of the session_id during model registration, which is required for automatic session cleanup. It also suggests removing redundant synchronous registration logic that uses blocking Ray calls to prevent potential initialization hangs, and recommends returning Pydantic models in the client for improved type safety.

Comment on lines +508 to +513
await self.state.register_model(
    run_config.model_dump(),
    token=token,
    model_id=adapter_name,
    replica_id=self.replica_id,
)

high

The session_id is missing from the payload passed to register_model. This will prevent the global ServerState from correctly associating the model with its owning session, which breaks the cascade cleanup logic (where models are automatically unloaded when a session expires).

            payload = run_config.model_dump()
            payload['session_id'] = session_id
            await self.state.register_model(
                payload,
                token=token,
                model_id=adapter_name,
                replica_id=self.replica_id,
            )

# Initialize mixins
self._init_task_queue(TaskQueueConfig.from_dict(queue_config), deployment_name='Model')
self._init_adapter_manager(**(adapter_config or {}))
self._register_replica_at_startup()

medium

Calling _register_replica_at_startup() in __init__ is redundant and potentially problematic. It performs a blocking ray.get() call which can hang the deployment initialization if the state actor is busy or not yet ready. Since the lifespan handler (added in this PR) and the lazy registration in _on_request_start already handle replica registration asynchronously, this synchronous call should be removed.

Suggested change
self._register_replica_at_startup()
# Note: countdown task is started lazily in _ensure_sticky()

Comment on lines +91 to +96
def _register_replica_at_startup(self) -> None:
    try:
        self.state.register_replica_blocking(self.replica_id, self.max_loras)
        self._replica_registered = True
    except Exception as e:
        logger.warning(f'Failed to register replica at startup: {e}')

medium

This method is no longer needed if the call in __init__ is removed. Registration is already handled asynchronously via the lifespan event and lazy-loading logic.

Comment on lines +414 to +415
def register_replica_blocking(self, replica_id: str, max_loras: int) -> None:
    ray.get(self._actor.register_replica.remote(replica_id, max_loras))

medium

This blocking method was added to support registration during __init__. If the synchronous registration in ModelManagement.__init__ is removed as suggested, this method should also be removed to avoid encouraging blocking Ray calls.

Comment on lines +79 to +92
def get_capacity_info(self) -> dict:
    """
    Get the server's global LoRA capacity information.

    Returns:
        dict: Containing 'max_loras', 'used_loras', and 'free_loras'.

    Raises:
        TwinkleClientError: If the request fails.
    """
    from twinkle_client.types.server import CapacityInfoResponse
    response = http_get(self._get_url('/capacity_info'))
    data = self._handle_response(response)
    return CapacityInfoResponse(**data).model_dump()

medium

For consistency with other methods in the TwinkleClient, get_capacity_info should return the CapacityInfoResponse Pydantic model directly instead of a dictionary. This provides better type safety and IDE support for the caller.

Suggested change
- def get_capacity_info(self) -> dict:
-     """
-     Get the server's global LoRA capacity information.
-     Returns:
-         dict: Containing 'max_loras', 'used_loras', and 'free_loras'.
-     Raises:
-         TwinkleClientError: If the request fails.
-     """
-     from twinkle_client.types.server import CapacityInfoResponse
-     response = http_get(self._get_url('/capacity_info'))
-     data = self._handle_response(response)
-     return CapacityInfoResponse(**data).model_dump()
+ def get_capacity_info(self) -> CapacityInfoResponse:
+     """
+     Get the server's global LoRA capacity information.
+     Returns:
+         CapacityInfoResponse: Containing 'max_loras', 'used_loras', and 'free_loras'.
+     Raises:
+         TwinkleClientError: If the request fails.
+     """
+     from twinkle_client.types.server import CapacityInfoResponse
+     response = http_get(self._get_url('/capacity_info'))
+     data = self._handle_response(response)
+     return CapacityInfoResponse(**data)
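
To illustrate what the typed return buys the caller, here is a small sketch. It uses a stdlib dataclass as a stand-in for the real Pydantic `CapacityInfoResponse`, and `parse_capacity_info` and `has_free_slot` are hypothetical helpers, not part of `twinkle_client`. The type is defined at module level, so the return annotation resolves without relying on a function-local import.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityInfoResponse:
    """Stand-in for the Pydantic model in twinkle_client.types.server."""
    max_loras: int
    used_loras: int
    free_loras: int


def parse_capacity_info(data: dict) -> CapacityInfoResponse:
    """Hypothetical helper: turn the decoded JSON body into a typed object."""
    return CapacityInfoResponse(**data)


def has_free_slot(info: CapacityInfoResponse) -> bool:
    """Attribute access is checked by IDEs and type checkers, unlike info['free_loras']."""
    return info.free_loras > 0


info = parse_capacity_info({"max_loras": 5, "used_loras": 0, "free_loras": 5})
print(has_free_slot(info))
```

A caller can then branch on `info.free_loras` before attempting to add an adapter, and a misspelled field name is caught statically rather than surfacing as a `KeyError` at runtime.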

Copilot AI left a comment

Pull request overview

This PR adds a server-side capacity_info API for querying global LoRA capacity (max/used/free) aggregated across registered model replicas, and wires it through both Twinkle-native and Tinker-compatible gateways as well as the Python client.

Changes:

  • Add get_capacity_info() to server state/model manager and expose it via new gateway routes (/twinkle/capacity_info and /capacity_info).
  • Register model replicas with the shared ServerState on startup to make capacity tracking meaningful across replicas.
  • Add CapacityInfoResponse to the client types and a TwinkleClient.get_capacity_info() convenience method.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/twinkle/server/utils/state/server_state.py | Exposes capacity info via ServerState and ServerStateProxy, plus a blocking replica registration helper. |
| src/twinkle/server/utils/state/model_manager.py | Implements global capacity aggregation across registered replicas. |
| src/twinkle/server/model/twinkle_handlers.py | Updates Twinkle add-adapter flow to register/unregister models in server state for capacity tracking. |
| src/twinkle/server/model/app.py | Ensures replicas register capacity on startup (and during lifespan) to populate global capacity stats. |
| src/twinkle/server/gateway/twinkle_gateway_handlers.py | Adds Twinkle route /twinkle/capacity_info with a typed response model. |
| src/twinkle/server/gateway/tinker_gateway_handlers.py | Adds Tinker-compatible route /capacity_info returning a raw dict. |
| src/twinkle_client/types/server.py | Adds CapacityInfoResponse schema. |
| src/twinkle_client/types/__init__.py | Re-exports CapacityInfoResponse. |
| src/twinkle_client/manager.py | Adds a get_capacity_info() client helper method. |

Comment on lines +508 to +520
await self.state.register_model(
    run_config.model_dump(),
    token=token,
    model_id=adapter_name,
    replica_id=self.replica_id,
)
try:
    self.register_resource(adapter_name, token, session_id)
    self.model.add_adapter_to_model(adapter_name, config, **extra_kwargs)
except Exception:
    self.unregister_resource(adapter_name)
    await self.state.unload_model(adapter_name)
    raise

Comment on lines +79 to +93
def get_capacity_info(self) -> dict:
    """
    Get the server's global LoRA capacity information.

    Returns:
        dict: Containing 'max_loras', 'used_loras', and 'free_loras'.

    Raises:
        TwinkleClientError: If the request fails.
    """
    from twinkle_client.types.server import CapacityInfoResponse
    response = http_get(self._get_url('/capacity_info'))
    data = self._handle_response(response)
    return CapacityInfoResponse(**data).model_dump()

Collaborator

@Yunnglin Yunnglin left a comment

Thanks for the PR! The capacity_info feature looks solid overall. A few comments:


Must Fix

1. Remove the Tinker gateway /capacity_info endpoint (tinker_gateway_handlers.py)

The Tinker SDK has no client method for capacity_info, and there is no plan to add one. This endpoint should be exposed only via the Twinkle gateway (/twinkle/capacity_info). Please remove the Tinker-compatible route.

2. Missing session_id in register_model payload (twinkle_handlers.py)

run_config.model_dump() does not include session_id. As a result, ModelRecord.session_id will be None, and when the session expires, cleanup_expired won't cascade-remove the model from _replica_models. This means capacity_info.used_loras will only increase and never decrease on session expiry.

Suggested fix:

payload = run_config.model_dump()
payload['session_id'] = session_id
await self.state.register_model(
    payload,
    token=token,
    model_id=adapter_name,
    replica_id=self.replica_id,
)
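
The failure mode can be reproduced with a toy model of the state. This is only a sketch under stated assumptions: `ToyState` and its simplified `ModelRecord` are hypothetical and merely mimic the behavior described above (records keyed by model id, expiry removing only records whose `session_id` matches). The real `ServerState` is not shown in this PR conversation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ModelRecord:
    """Simplified, hypothetical record; the real one has more fields."""
    model_id: str
    replica_id: str
    session_id: Optional[str] = None


@dataclass
class ToyState:
    models: Dict[str, ModelRecord] = field(default_factory=dict)

    def register_model(self, payload: dict, model_id: str, replica_id: str) -> None:
        # session_id comes from the payload; if the caller omits it, it stays None.
        self.models[model_id] = ModelRecord(model_id, replica_id, payload.get("session_id"))

    def cleanup_expired(self, expired_session_id: str) -> None:
        # Cascade removal only touches records tied to the expired session.
        self.models = {
            mid: rec for mid, rec in self.models.items()
            if rec.session_id != expired_session_id
        }

    @property
    def used_loras(self) -> int:
        return len(self.models)


state = ToyState()
state.register_model({}, model_id="adapter-a", replica_id="r0")  # session_id missing
state.register_model({"session_id": "s1"}, model_id="adapter-b", replica_id="r0")
state.cleanup_expired("s1")
print(state.used_loras, sorted(state.models))
```

After session `s1` expires, only `adapter-b` is removed. `adapter-a` was registered without a `session_id`, so it keeps occupying a slot, which is the leak described in this comment.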

Discussion

3. Is _register_replica_at_startup (blocking) necessary? (app.py)

The lifespan handler already calls _ensure_replica_registered() asynchronously, and _on_request_start also has lazy registration. Adding a blocking ray.get() in __init__ introduces a potential hang risk if the state actor isn't ready. Could we just rely on the lifespan + lazy path and remove _register_replica_at_startup along with register_replica_blocking? Would like to hear your thoughts on the trade-off here.
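
One way the lifespan + lazy path could look, as a sketch only: `ReplicaRegistrar`, `ensure_registered`, and the `register_fn` callback are hypothetical names, and the actual `_ensure_replica_registered()` in app.py may be structured differently. The point is that registration is awaited rather than blocked on, runs at most once on success, and is retried on a later request if the state actor was not ready.

```python
import asyncio


class ReplicaRegistrar:
    """Hypothetical helper: idempotent, non-blocking replica registration."""

    def __init__(self, register_fn, replica_id: str, max_loras: int) -> None:
        self._register_fn = register_fn  # async callable (replica_id, max_loras)
        self._replica_id = replica_id
        self._max_loras = max_loras
        self._registered = False
        self._lock = asyncio.Lock()

    async def ensure_registered(self) -> bool:
        if self._registered:  # fast path once registration has succeeded
            return True
        async with self._lock:  # serialize concurrent first requests
            if self._registered:
                return True
            try:
                await self._register_fn(self._replica_id, self._max_loras)
            except Exception:
                return False  # leave the flag unset so the next call retries
            self._registered = True
            return True


async def demo():
    calls = []

    async def fake_register(replica_id, max_loras):
        await asyncio.sleep(0)  # yield control, like a remote call would
        calls.append((replica_id, max_loras))

    registrar = ReplicaRegistrar(fake_register, "replica-0", 5)
    # Three concurrent requests arrive before registration has completed.
    results = await asyncio.gather(*(registrar.ensure_registered() for _ in range(3)))
    return calls, results


calls, results = asyncio.run(demo())
print(calls, results)
```

Calling `ensure_registered()` from both the lifespan handler and the request-start hook is safe in this design, because the lock plus the flag collapses concurrent callers into a single registration.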


Suggestions

4. Client get_capacity_info return type (manager.py)

For consistency with other client methods, consider returning CapacityInfoResponse directly instead of dict. Also move the import to the top of the file.

5. Differentiate log messages for replica registration failures (app.py)

Both _register_replica_at_startup and the lifespan handler log the same message. Consider making them distinct for easier debugging.
