[Feature] get lora capacity info#201
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements a new /capacity_info endpoint to monitor global LoRA capacity and updates the model registration logic. The review identifies a critical omission of the session_id during model registration, which is required for automatic session cleanup. It also suggests removing redundant synchronous registration logic that uses blocking Ray calls to prevent potential initialization hangs, and recommends returning Pydantic models in the client for improved type safety.
| await self.state.register_model( | ||
| run_config.model_dump(), | ||
| token=token, | ||
| model_id=adapter_name, | ||
| replica_id=self.replica_id, | ||
| ) |
There was a problem hiding this comment.
The session_id is missing from the payload passed to register_model. This will prevent the global ServerState from correctly associating the model with its owning session, which breaks the cascade cleanup logic (where models are automatically unloaded when a session expires).
payload = run_config.model_dump()
payload['session_id'] = session_id
await self.state.register_model(
payload,
token=token,
model_id=adapter_name,
replica_id=self.replica_id,
)| # Initialize mixins | ||
| self._init_task_queue(TaskQueueConfig.from_dict(queue_config), deployment_name='Model') | ||
| self._init_adapter_manager(**(adapter_config or {})) | ||
| self._register_replica_at_startup() |
There was a problem hiding this comment.
Calling _register_replica_at_startup() in __init__ is redundant and potentially problematic. It performs a blocking ray.get() call which can hang the deployment initialization if the state actor is busy or not yet ready. Since the lifespan handler (added in this PR) and the lazy registration in _on_request_start already handle replica registration asynchronously, this synchronous call should be removed.
| self._register_replica_at_startup() | |
| # Note: countdown task is started lazily in _ensure_sticky() |
| def _register_replica_at_startup(self) -> None: | ||
| try: | ||
| self.state.register_replica_blocking(self.replica_id, self.max_loras) | ||
| self._replica_registered = True | ||
| except Exception as e: | ||
| logger.warning(f'Failed to register replica at startup: {e}') |
| def register_replica_blocking(self, replica_id: str, max_loras: int) -> None: | ||
| ray.get(self._actor.register_replica.remote(replica_id, max_loras)) |
| def get_capacity_info(self) -> dict: | ||
| """ | ||
| Get the server's global LoRA capacity information. | ||
|
|
||
| Returns: | ||
| dict: Containing 'max_loras', 'used_loras', and 'free_loras'. | ||
|
|
||
| Raises: | ||
| TwinkleClientError: If the request fails. | ||
| """ | ||
| from twinkle_client.types.server import CapacityInfoResponse | ||
| response = http_get(self._get_url('/capacity_info')) | ||
| data = self._handle_response(response) | ||
| return CapacityInfoResponse(**data).model_dump() |
There was a problem hiding this comment.
For consistency with other methods in the TwinkleClient, get_capacity_info should return the CapacityInfoResponse Pydantic model directly instead of a dictionary. This provides better type safety and IDE support for the caller.
| def get_capacity_info(self) -> dict: | |
| """ | |
| Get the server's global LoRA capacity information. | |
| Returns: | |
| dict: Containing 'max_loras', 'used_loras', and 'free_loras'. | |
| Raises: | |
| TwinkleClientError: If the request fails. | |
| """ | |
| from twinkle_client.types.server import CapacityInfoResponse | |
| response = http_get(self._get_url('/capacity_info')) | |
| data = self._handle_response(response) | |
| return CapacityInfoResponse(**data).model_dump() | |
| def get_capacity_info(self) -> CapacityInfoResponse: | |
| """ | |
| Get the server's global LoRA capacity information. | |
| Returns: | |
| CapacityInfoResponse: Containing 'max_loras', 'used_loras', and 'free_loras'. | |
| Raises: | |
| TwinkleClientError: If the request fails. | |
| """ | |
| from twinkle_client.types.server import CapacityInfoResponse | |
| response = http_get(self._get_url('/capacity_info')) | |
| data = self._handle_response(response) | |
| return CapacityInfoResponse(**data) |
There was a problem hiding this comment.
Pull request overview
This PR adds a server-side capacity_info API for querying global LoRA capacity (max/used/free) aggregated across registered model replicas, and wires it through both Twinkle-native and Tinker-compatible gateways as well as the Python client.
Changes:
- Add
get_capacity_info()to server state/model manager and expose it via new gateway routes (/twinkle/capacity_infoand/capacity_info). - Register model replicas with the shared
ServerStateon startup to make capacity tracking meaningful across replicas. - Add
CapacityInfoResponseto the client types and aTwinkleClient.get_capacity_info()convenience method.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/twinkle/server/utils/state/server_state.py |
Exposes capacity info via ServerState and ServerStateProxy, plus a blocking replica registration helper. |
src/twinkle/server/utils/state/model_manager.py |
Implements global capacity aggregation across registered replicas. |
src/twinkle/server/model/twinkle_handlers.py |
Updates Twinkle add-adapter flow to register/unregister models in server state for capacity tracking. |
src/twinkle/server/model/app.py |
Ensures replicas register capacity on startup (and during lifespan) to populate global capacity stats. |
src/twinkle/server/gateway/twinkle_gateway_handlers.py |
Adds Twinkle route /twinkle/capacity_info with a typed response model. |
src/twinkle/server/gateway/tinker_gateway_handlers.py |
Adds Tinker-compatible route /capacity_info returning a raw dict. |
src/twinkle_client/types/server.py |
Adds CapacityInfoResponse schema. |
src/twinkle_client/types/__init__.py |
Re-exports CapacityInfoResponse. |
src/twinkle_client/manager.py |
Adds a get_capacity_info() client helper method. |
| await self.state.register_model( | ||
| run_config.model_dump(), | ||
| token=token, | ||
| model_id=adapter_name, | ||
| replica_id=self.replica_id, | ||
| ) | ||
| try: | ||
| self.register_resource(adapter_name, token, session_id) | ||
| self.model.add_adapter_to_model(adapter_name, config, **extra_kwargs) | ||
| except Exception: | ||
| self.unregister_resource(adapter_name) | ||
| await self.state.unload_model(adapter_name) | ||
| raise |
| def get_capacity_info(self) -> dict: | ||
| """ | ||
| Get the server's global LoRA capacity information. | ||
|
|
||
| Returns: | ||
| dict: Containing 'max_loras', 'used_loras', and 'free_loras'. | ||
|
|
||
| Raises: | ||
| TwinkleClientError: If the request fails. | ||
| """ | ||
| from twinkle_client.types.server import CapacityInfoResponse | ||
| response = http_get(self._get_url('/capacity_info')) | ||
| data = self._handle_response(response) | ||
| return CapacityInfoResponse(**data).model_dump() | ||
|
|
Yunnglin
left a comment
There was a problem hiding this comment.
Thanks for the PR! The capacity_info feature looks solid overall. A few comments:
Must Fix
1. Remove the Tinker gateway /capacity_info endpoint (tinker_gateway_handlers.py)
Tinker SDK does not have a corresponding client method for capacity_info, and there's no plan to add one. This endpoint should only be exposed via the Twinkle gateway (/twinkle/capacity_info). Please remove the Tinker-compatible route.
2. Missing session_id in register_model payload (twinkle_handlers.py)
run_config.model_dump() does not include session_id. As a result, ModelRecord.session_id will be None, and when the session expires, cleanup_expired won't cascade-remove the model from _replica_models. This means capacity_info.used_loras will only increase and never decrease on session expiry.
Suggested fix:
payload = run_config.model_dump()
payload['session_id'] = session_id
await self.state.register_model(
payload,
token=token,
model_id=adapter_name,
replica_id=self.replica_id,
)Discussion
3. Is _register_replica_at_startup (blocking) necessary? (app.py)
The lifespan handler already calls _ensure_replica_registered() asynchronously, and _on_request_start also has lazy registration. Adding a blocking ray.get() in __init__ introduces a potential hang risk if the state actor isn't ready. Could we just rely on the lifespan + lazy path and remove _register_replica_at_startup along with register_replica_blocking? Would like to hear your thoughts on the trade-off here.
Suggestions
4. Client get_capacity_info return type (manager.py)
For consistency with other client methods, consider returning CapacityInfoResponse directly instead of dict. Also move the import to the top of the file.
5. Differentiate log messages for replica registration failures (app.py)
Both _register_replica_at_startup and the lifespan handler log the same message. Consider making them distinct for easier debugging.
PR type
PR information
Summary
Adds a
capacity_infoendpoint for querying current LoRA capacity across registered model replicas.API
Tinker-Compatible Endpoint
Twinkle Endpoint
Response format:
{ "max_loras": 5, "used_loras": 0, "free_loras": 5 }Fields:
max_loras: Total LoRA capacity across registered replicas.used_loras: Number of currently loaded LoRA adapters.free_loras: Remaining available LoRA slots.Write the detail information belongs to this PR.