@ppraneth ppraneth commented Dec 6, 2025

Motivation

This PR addresses a critical stability issue in the CacheAwarePolicy routing logic where the router could unintentionally cause a "death spiral" by prioritizing memory optimization over compute availability.

The Issue:
Currently, when a request is a cache miss (it does not match any existing prefix), the router defaults to tree.get_smallest_tenant(), which selects the worker with the smallest cache usage to balance memory pressure. However, this logic completely ignores the current compute load of that worker.

The "Death Spiral" Scenario:
In a production environment, if a worker (e.g., Worker A) gets stuck processing a batch of unique, heavy requests:

  1. It has a "Small Cache Tree" (because the requests are unique and don't build up a shared tree).
  2. It has High Load (compute queue is full).
  3. Another worker (Worker B) might be Idle but has a "Large Cache Tree" (holding a system prompt).

The current router sees "Worker A has a smaller tree" and routes 100% of new traffic to the already overloaded Worker A, causing it to crash or time out, while Worker B remains idle.

Modifications

I modified sgl-workspace/src/policies/cache_aware.rs to implement a Load-Aware Fallback mechanism.

  • Updated select_worker:

    • When a request is a cache miss, we now peek at the load of the "Smallest Cache" candidate.
    • Added a check: if candidate.load() > min_cluster_load + 5.
    • If the candidate is significantly more loaded than the cluster minimum, we override the cache-based decision and fall back to selecting the least-loaded worker.
    • This acts as a micro-circuit breaker, preventing the router from assigning work to overloaded nodes solely for memory optimization reasons.
  • Added Tests:

    • Added test_prove_cache_aware_overload_flaw: A deterministic unit test proving the router now prefers an IDLE worker with a large cache over a BUSY worker with a small cache.
    • Added test_simulation_production_traffic: A simulation of 100 requests that reproduces the "hot spot" scenario and verifies the traffic is correctly rebalanced to the idle node.
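The fallback logic above can be sketched as follows. This is a minimal, self-contained illustration rather than the actual sgl-workspace code: the Worker struct, the LOAD_GAP_THRESHOLD constant, and select_on_cache_miss are hypothetical names, and the real CacheAwarePolicy consults its radix tree rather than a plain cache_size field.

```rust
/// Hypothetical stand-in for a routable worker; the real policy reads
/// these values from its radix tree and live load tracking.
struct Worker {
    url: &'static str,
    load: usize,
    cache_size: usize, // proxy for "tenant tree size"
}

/// Allowed load gap before the cache-based choice is overridden
/// (the PR uses a margin of 5 over the cluster minimum).
const LOAD_GAP_THRESHOLD: usize = 5;

/// Cache-miss selection with the load-aware fallback: prefer the
/// smallest-cache worker, unless it is significantly more loaded
/// than the least-loaded worker in the cluster.
fn select_on_cache_miss(workers: &[Worker]) -> &Worker {
    let smallest_cache = workers
        .iter()
        .min_by_key(|w| w.cache_size)
        .expect("at least one worker");
    let min_load = workers.iter().map(|w| w.load).min().unwrap();

    if smallest_cache.load > min_load + LOAD_GAP_THRESHOLD {
        // Micro-circuit breaker: the memory-optimal candidate is
        // overloaded, so route by load instead.
        workers.iter().min_by_key(|w| w.load).unwrap()
    } else {
        smallest_cache
    }
}

fn main() {
    // Mirrors the Test 1 setup below: one busy worker with a small
    // cache, one idle worker with a large cache.
    let workers = [
        Worker { url: "http://overloaded:8000", load: 50, cache_size: 5 },
        Worker { url: "http://idle:8000", load: 0, cache_size: 1000 },
    ];
    let chosen = select_on_cache_miss(&workers);
    println!("Selected Worker: {}", chosen.url);
    assert_eq!(chosen.url, "http://idle:8000");
}
```

With the gap check removed, the same inputs would select the overloaded worker (smallest cache), which is exactly the death-spiral behavior described above.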

Test Results

Test 1: test_prove_cache_aware_overload_flaw

This test validates that the router correctly prioritizes load over cache size when the "smallest cache" worker is overloaded.

Setup:

  • Worker 1 (Overloaded): Load = 50, Small cache (5 chars)
  • Worker 2 (Idle): Load = 0, Large cache (1000 chars)

Result:

Sending Probe Request (Cache Miss)...
Selected Worker: http://idle:8000
Overloaded Load: 50
Idle Load:       0
test policies::cache_aware::tests::test_prove_cache_aware_overload_flaw ... ok

Success: Router correctly selected the idle worker despite its larger cache usage.

Test 2: test_simulation_production_traffic

This test simulates a realistic production scenario with 100 consecutive requests to verify traffic distribution.

Setup:

  • Worker 0: Load = 20, Cache size = 1 char (The "Trap" - high load, small cache)
  • Worker 1: Load = 5, Cache size = 10 chars (Medium)
  • Worker 2: Load = 0, Cache size = 100 chars (Idle with large cache)

Result:

--- STARTING SIMULATION (100 Requests) ---
Initial Loads -> W0: 20, W1: 5, W2: 0

--- RESULTS ---
Final Loads   -> W0: 20, W1: 5, W2: 100
Requests Sent -> W0: 0, W1: 0, W2: 100

[SUCCESS] ROUTER IS BALANCED: It correctly prioritized the Idle node.
test policies::cache_aware::tests::test_simulation_production_traffic ... ok

Success: All 100 requests were correctly routed to the idle worker (W2), avoiding the overloaded worker (W0) despite its smaller cache.
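The burst above can be approximated with a small standalone simulation. This is a simplified sketch under one explicit assumption: worker loads are read from a fixed snapshot taken at the start of the burst and are not updated per request (the actual test also tracks the selected worker's load). The route helper and LOAD_GAP_THRESHOLD are hypothetical names, not the PR's real API.

```rust
struct Worker {
    name: &'static str,
    load: usize,       // assumed fixed for the whole burst
    cache_size: usize,
}

const LOAD_GAP_THRESHOLD: usize = 5;

/// Returns the index of the worker a cache-miss request is routed to.
fn route(workers: &[Worker]) -> usize {
    let (smallest_idx, smallest) = workers
        .iter()
        .enumerate()
        .min_by_key(|(_, w)| w.cache_size)
        .expect("at least one worker");
    let min_load = workers.iter().map(|w| w.load).min().unwrap();
    if smallest.load > min_load + LOAD_GAP_THRESHOLD {
        // Fallback: the smallest-cache worker is overloaded.
        workers.iter().enumerate().min_by_key(|(_, w)| w.load).unwrap().0
    } else {
        smallest_idx
    }
}

fn main() {
    // Mirrors the Test 2 setup: W0 is the "trap" (high load, tiny cache).
    let workers = [
        Worker { name: "W0", load: 20, cache_size: 1 },
        Worker { name: "W1", load: 5,  cache_size: 10 },
        Worker { name: "W2", load: 0,  cache_size: 100 },
    ];
    let mut sent = [0usize; 3];
    for _ in 0..100 {
        sent[route(&workers)] += 1; // loads held fixed per the assumption
    }
    println!(
        "Requests Sent -> W0: {}, W1: {}, W2: {}",
        sent[0], sent[1], sent[2]
    );
    assert_eq!(sent, [0, 0, 100]);
}
```

Under the fixed-snapshot assumption, every request sees W0 as the smallest-cache candidate, trips the circuit breaker (20 > 0 + 5), and falls back to the least-loaded worker W2.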

Benchmarking and Profiling

Checklist
