[Feature] get lora capacity info by kevssim · Pull Request #201 · modelscope/twinkle

kevssim · 2026-05-22T06:40:17Z

PR type

Bug Fix
New Feature
Document Updates
More Models or Datasets Support

PR information

Summary

Adds a capacity_info endpoint for querying current LoRA capacity across registered model replicas.

API

Tinker-Compatible Endpoint

GET /api/v1/capacity_info

Twinkle Endpoint

GET /api/v1/twinkle/capacity_info

Response format:

{
  "max_loras": 5,
  "used_loras": 0,
  "free_loras": 5
}

Fields:

max_loras: Total LoRA capacity across registered replicas.
used_loras: Number of currently loaded LoRA adapters.
free_loras: Remaining available LoRA slots.
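
One way to read these fields is as an aggregation over per-replica state. The sketch below is illustrative only: `aggregate_capacity` and its two input mappings are hypothetical names, not the PR's code, and it assumes `free_loras` is simply `max_loras - used_loras`, which matches the sample response above but is not spelled out in this description.

```python
from typing import Dict, List


def aggregate_capacity(
    replica_max_loras: Dict[str, int],
    replica_models: Dict[str, List[str]],
) -> Dict[str, int]:
    """Sum LoRA capacity over registered replicas (hypothetical helper)."""
    # Total slots: sum of each registered replica's max_loras.
    max_loras = sum(replica_max_loras.values())
    # Used slots: adapters loaded on replicas that are still registered.
    used_loras = sum(
        len(models)
        for replica_id, models in replica_models.items()
        if replica_id in replica_max_loras
    )
    # Assumption: free slots are whatever remains, never below zero.
    free_loras = max(max_loras - used_loras, 0)
    return {
        "max_loras": max_loras,
        "used_loras": used_loras,
        "free_loras": free_loras,
    }


# Two replicas with capacities 2 and 3, one adapter loaded on the first.
info = aggregate_capacity({"r0": 2, "r1": 3}, {"r0": ["adapter-a"], "r1": []})
print(info)
```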

gemini-code-assist

Code Review

This pull request implements a new /capacity_info endpoint to monitor global LoRA capacity and updates the model registration logic. The review identifies a critical omission of the session_id during model registration, which is required for automatic session cleanup. It also suggests removing redundant synchronous registration logic that uses blocking Ray calls to prevent potential initialization hangs, and recommends returning Pydantic models in the client for improved type safety.

gemini-code-assist · 2026-05-22T06:42:30Z

+            await self.state.register_model(
+                run_config.model_dump(),
+                token=token,
+                model_id=adapter_name,
+                replica_id=self.replica_id,
+            )


The session_id is missing from the payload passed to register_model. This will prevent the global ServerState from correctly associating the model with its owning session, which breaks the cascade cleanup logic (where models are automatically unloaded when a session expires).

payload = run_config.model_dump()
payload['session_id'] = session_id
await self.state.register_model(
    payload,
    token=token,
    model_id=adapter_name,
    replica_id=self.replica_id,
)

gemini-code-assist · 2026-05-22T06:42:31Z

        # Initialize mixins
        self._init_task_queue(TaskQueueConfig.from_dict(queue_config), deployment_name='Model')
        self._init_adapter_manager(**(adapter_config or {}))
+        self._register_replica_at_startup()


Calling _register_replica_at_startup() in __init__ is redundant and potentially problematic. It performs a blocking ray.get() call which can hang the deployment initialization if the state actor is busy or not yet ready. Since the lifespan handler (added in this PR) and the lazy registration in _on_request_start already handle replica registration asynchronously, this synchronous call should be removed.

Suggested change

-        self._register_replica_at_startup()
+        # Note: countdown task is started lazily in _ensure_sticky()

gemini-code-assist · 2026-05-22T06:42:31Z

+    def _register_replica_at_startup(self) -> None:
+        try:
+            self.state.register_replica_blocking(self.replica_id, self.max_loras)
+            self._replica_registered = True
+        except Exception as e:
+            logger.warning(f'Failed to register replica at startup: {e}')


This method is no longer needed if the call in __init__ is removed. Registration is already handled asynchronously via the lifespan event and lazy-loading logic.

gemini-code-assist · 2026-05-22T06:42:31Z

+    def register_replica_blocking(self, replica_id: str, max_loras: int) -> None:
+        ray.get(self._actor.register_replica.remote(replica_id, max_loras))


This blocking method was added to support registration during __init__. If the synchronous registration in ModelManagement.__init__ is removed as suggested, this method should also be removed to avoid encouraging blocking Ray calls.

gemini-code-assist · 2026-05-22T06:42:31Z

+    def get_capacity_info(self) -> dict:
+        """
+        Get the server's global LoRA capacity information.
+
+        Returns:
+            dict: Containing 'max_loras', 'used_loras', and 'free_loras'.
+
+        Raises:
+            TwinkleClientError: If the request fails.
+        """
+        from twinkle_client.types.server import CapacityInfoResponse
+        response = http_get(self._get_url('/capacity_info'))
+        data = self._handle_response(response)
+        return CapacityInfoResponse(**data).model_dump()


For consistency with other methods in the TwinkleClient, get_capacity_info should return the CapacityInfoResponse Pydantic model directly instead of a dictionary. This provides better type safety and IDE support for the caller.

Suggested change

-    def get_capacity_info(self) -> dict:
+    def get_capacity_info(self) -> CapacityInfoResponse:
         """
         Get the server's global LoRA capacity information.

         Returns:
-            dict: Containing 'max_loras', 'used_loras', and 'free_loras'.
+            CapacityInfoResponse: Containing 'max_loras', 'used_loras', and 'free_loras'.

         Raises:
             TwinkleClientError: If the request fails.
         """
         from twinkle_client.types.server import CapacityInfoResponse
         response = http_get(self._get_url('/capacity_info'))
         data = self._handle_response(response)
-        return CapacityInfoResponse(**data).model_dump()
+        return CapacityInfoResponse(**data)
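
From the caller's side, the difference might look like the sketch below. It uses a stdlib dataclass as a stand-in for the Pydantic `CapacityInfoResponse` so that it runs without extra dependencies; `asdict` plays roughly the role of `.model_dump()` here, and the sample values are taken from the PR description.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CapacityInfoResponse:
    """Dataclass stand-in for the Pydantic model of the same name."""
    max_loras: int
    used_loras: int
    free_loras: int


data = {"max_loras": 5, "used_loras": 0, "free_loras": 5}

typed = CapacityInfoResponse(**data)  # shape of the suggested return value
as_dict = asdict(typed)               # shape of the current return value

# Attribute access can be checked by IDEs and static type checkers.
can_load = typed.free_loras > 0

# A misspelled dict key is only discovered at runtime.
try:
    as_dict["free_lora"]
    typo_caught = False
except KeyError:
    typo_caught = True

print(can_load, typo_caught)
```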

Copilot

Pull request overview

This PR adds a server-side capacity_info API for querying global LoRA capacity (max/used/free) aggregated across registered model replicas, and wires it through both Twinkle-native and Tinker-compatible gateways as well as the Python client.

Changes:

Add get_capacity_info() to server state/model manager and expose it via new gateway routes (/twinkle/capacity_info and /capacity_info).
Register model replicas with the shared ServerState on startup to make capacity tracking meaningful across replicas.
Add CapacityInfoResponse to the client types and a TwinkleClient.get_capacity_info() convenience method.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

File	Description
`src/twinkle/server/utils/state/server_state.py`	Exposes capacity info via `ServerState` and `ServerStateProxy`, plus a blocking replica registration helper.
`src/twinkle/server/utils/state/model_manager.py`	Implements global capacity aggregation across registered replicas.
`src/twinkle/server/model/twinkle_handlers.py`	Updates Twinkle add-adapter flow to register/unregister models in server state for capacity tracking.
`src/twinkle/server/model/app.py`	Ensures replicas register capacity on startup (and during lifespan) to populate global capacity stats.
`src/twinkle/server/gateway/twinkle_gateway_handlers.py`	Adds Twinkle route `/twinkle/capacity_info` with a typed response model.
`src/twinkle/server/gateway/tinker_gateway_handlers.py`	Adds Tinker-compatible route `/capacity_info` returning a raw dict.
`src/twinkle_client/types/server.py`	Adds `CapacityInfoResponse` schema.
`src/twinkle_client/types/__init__.py`	Re-exports `CapacityInfoResponse`.
`src/twinkle_client/manager.py`	Adds a `get_capacity_info()` client helper method.

+            await self.state.register_model(
+                run_config.model_dump(),
+                token=token,
+                model_id=adapter_name,
+                replica_id=self.replica_id,
+            )
+            try:
+                self.register_resource(adapter_name, token, session_id)
+                self.model.add_adapter_to_model(adapter_name, config, **extra_kwargs)
+            except Exception:
+                self.unregister_resource(adapter_name)
+                await self.state.unload_model(adapter_name)
+                raise


+    def get_capacity_info(self) -> dict:
+        """
+        Get the server's global LoRA capacity information.
+
+        Returns:
+            dict: Containing 'max_loras', 'used_loras', and 'free_loras'.
+
+        Raises:
+            TwinkleClientError: If the request fails.
+        """
+        from twinkle_client.types.server import CapacityInfoResponse
+        response = http_get(self._get_url('/capacity_info'))
+        data = self._handle_response(response)
+        return CapacityInfoResponse(**data).model_dump()
+


Yunnglin

Thanks for the PR! The capacity_info feature looks solid overall. A few comments:

Must Fix

1. Remove the Tinker gateway /capacity_info endpoint (tinker_gateway_handlers.py)

Tinker SDK does not have a corresponding client method for capacity_info, and there's no plan to add one. This endpoint should only be exposed via the Twinkle gateway (/twinkle/capacity_info). Please remove the Tinker-compatible route.

2. Missing session_id in register_model payload (twinkle_handlers.py)

run_config.model_dump() does not include session_id. As a result, ModelRecord.session_id will be None, and when the session expires, cleanup_expired won't cascade-remove the model from _replica_models. This means capacity_info.used_loras will only increase and never decrease on session expiry.

Suggested fix:

payload = run_config.model_dump()
payload['session_id'] = session_id
await self.state.register_model(
    payload,
    token=token,
    model_id=adapter_name,
    replica_id=self.replica_id,
)
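
To make the failure mode concrete, here is a toy in-memory sketch. It is not the real ServerState: `ToyState` and its methods are simplified stand-ins, the `ModelRecord` here only mirrors the fields discussed above, and the `rank` entry in the payload is made up.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class ModelRecord:
    model_id: str
    replica_id: str
    session_id: Optional[str]  # None when the payload omits session_id


class ToyState:
    """In-memory stand-in for the server state (not the real ServerState)."""

    def __init__(self) -> None:
        self.models: Dict[str, ModelRecord] = {}

    def register_model(self, payload: dict, model_id: str, replica_id: str) -> None:
        record = ModelRecord(model_id, replica_id, payload.get("session_id"))
        self.models[model_id] = record

    def cleanup_expired(self, expired_sessions: Set[str]) -> None:
        # Cascade: drop every model owned by an expired session.
        self.models = {
            model_id: record
            for model_id, record in self.models.items()
            if record.session_id not in expired_sessions
        }

    @property
    def used_loras(self) -> int:
        return len(self.models)


state = ToyState()
# Payload without session_id: the record ends up with session_id=None.
state.register_model({"rank": 8}, "lora-a", "r0")
# Payload with session_id attached, as in the suggested fix.
state.register_model({"rank": 8, "session_id": "s1"}, "lora-b", "r0")

state.cleanup_expired({"s1"})
print(sorted(state.models))  # only the record without a session survives
```

In this toy, the adapter registered without a session_id is never reclaimed, so the used count stays inflated after the session expires.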

Discussion

3. Is _register_replica_at_startup (blocking) necessary? (app.py)

The lifespan handler already calls _ensure_replica_registered() asynchronously, and _on_request_start also has lazy registration. Adding a blocking ray.get() in __init__ introduces a potential hang risk if the state actor isn't ready. Could we just rely on the lifespan + lazy path and remove _register_replica_at_startup along with register_replica_blocking? Would like to hear your thoughts on the trade-off here.
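
For reference, the lifespan + lazy path can be made idempotent with a flag and a lock, roughly as sketched below. This is an assumption about the shape of the logic, not the code in app.py: `ToyReplica`, `register_remote`, and `register_calls` are hypothetical, and `asyncio.sleep(0)` stands in for awaiting the state actor.

```python
import asyncio


class ToyReplica:
    """Stand-in for a model replica; register_remote is a hypothetical async call."""

    def __init__(self, replica_id: str, max_loras: int) -> None:
        self.replica_id = replica_id
        self.max_loras = max_loras
        self.register_calls = 0
        self._replica_registered = False
        self._register_lock = asyncio.Lock()

    async def register_remote(self) -> None:
        # Placeholder for awaiting the state actor; yields instead of blocking.
        await asyncio.sleep(0)
        self.register_calls += 1

    async def ensure_replica_registered(self) -> None:
        if self._replica_registered:
            return
        async with self._register_lock:
            # Re-check inside the lock so concurrent callers register only once.
            if not self._replica_registered:
                await self.register_remote()
                self._replica_registered = True


async def main() -> ToyReplica:
    replica = ToyReplica("r0", max_loras=5)
    # The lifespan hook and several early requests race to register.
    await asyncio.gather(*(replica.ensure_replica_registered() for _ in range(4)))
    return replica


replica = asyncio.run(main())
print(replica.register_calls)
```

With this shape, whichever caller arrives first performs the registration and the rest return early, so nothing has to block in `__init__`.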

Suggestions

4. Client get_capacity_info return type (manager.py)

For consistency with other client methods, consider returning CapacityInfoResponse directly instead of dict. Also move the import to the top of the file.

5. Differentiate log messages for replica registration failures (app.py)

Both _register_replica_at_startup and the lifespan handler log the same message. Consider making them distinct for easier debugging.

kevssim added 2 commits May 22, 2026 11:08

wip

7164772

Fix capacity info cold start registration

cb6b88f

kevssim requested a review from Yunnglin May 22, 2026 06:40

gemini-code-assist Bot reviewed May 22, 2026

Yunnglin requested a review from Copilot May 26, 2026 03:30

Copilot started reviewing on behalf of Yunnglin May 26, 2026 03:30

Copilot AI reviewed May 26, 2026

Yunnglin reviewed May 26, 2026

kevssim wants to merge 2 commits into modelscope:main from kevssim:get_num_lora_slot


3 participants
