@ppraneth ppraneth commented Dec 6, 2025

Motivation

This PR addresses a critical stability issue in the CacheAwarePolicy routing logic where the router could unintentionally cause a "death spiral" by prioritizing memory optimization over compute availability.

The Issue:
Currently, when a request is a cache miss (it does not match any existing prefix), the router defaults to tree.get_smallest_tenant(), which selects the worker with the smallest cache usage to balance memory pressure. However, this logic completely ignores the current compute load of that worker.

The "Death Spiral" Scenario:
In a production environment, if a worker (e.g., Worker A) gets stuck processing a batch of unique, heavy requests:

  1. It has a "Small Cache Tree" (because the requests are unique and don't build up a shared tree).
  2. It has High Load (compute queue is full).
  3. Another worker (Worker B) might be Idle but has a "Large Cache Tree" (holding a system prompt).

The current router sees "Worker A has a smaller tree" and routes 100% of new traffic to the already overloaded Worker A, causing it to crash or time out, while Worker B remains idle.

Modifications

I modified sgl-workspace/src/policies/cache_aware.rs to implement a Load-Aware Fallback mechanism.

  • Updated select_worker:

    • When a request is a cache miss, we now peek at the load of the "Smallest Cache" candidate.
    • Added a check: if candidate.load() > min_cluster_load + 5.
    • If the candidate is significantly more loaded than the cluster minimum, we override the cache-based decision and fall back to selecting the least-loaded worker.
    • This acts as a micro-circuit breaker, preventing the router from assigning work to overloaded nodes solely for memory optimization reasons.
  • Added Tests:

    • Added test_prove_cache_aware_overload_flaw: A deterministic unit test proving the router now prefers an IDLE worker with a large cache over a BUSY worker with a small cache.
    • Added test_simulation_production_traffic: A simulation of 100 requests that reproduces the "hot spot" scenario and verifies the traffic is correctly rebalanced to the idle node.
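The fallback logic above can be sketched as follows. This is a minimal, self-contained illustration rather than the actual sgl-workspace code: the Worker struct, the LOAD_GAP_THRESHOLD constant, and select_on_cache_miss are hypothetical names, and the real CacheAwarePolicy consults its radix tree rather than a plain cache_size field.

```rust
/// Hypothetical stand-in for a routable worker; the real policy reads
/// these values from its radix tree and live load tracking.
struct Worker {
    url: &'static str,
    load: usize,
    cache_size: usize, // proxy for "tenant tree size"
}

/// Allowed load gap before the cache-based choice is overridden
/// (the PR uses a margin of 5 over the cluster minimum).
const LOAD_GAP_THRESHOLD: usize = 5;

/// Cache-miss selection with the load-aware fallback: prefer the
/// smallest-cache worker, unless it is significantly more loaded
/// than the least-loaded worker in the cluster.
fn select_on_cache_miss(workers: &[Worker]) -> &Worker {
    let smallest_cache = workers
        .iter()
        .min_by_key(|w| w.cache_size)
        .expect("at least one worker");
    let min_load = workers.iter().map(|w| w.load).min().unwrap();

    if smallest_cache.load > min_load + LOAD_GAP_THRESHOLD {
        // Micro-circuit breaker: the memory-optimal candidate is
        // overloaded, so route by load instead.
        workers.iter().min_by_key(|w| w.load).unwrap()
    } else {
        smallest_cache
    }
}

fn main() {
    // Mirrors the Test 1 setup below: one busy worker with a small
    // cache, one idle worker with a large cache.
    let workers = [
        Worker { url: "http://overloaded:8000", load: 50, cache_size: 5 },
        Worker { url: "http://idle:8000", load: 0, cache_size: 1000 },
    ];
    let chosen = select_on_cache_miss(&workers);
    println!("Selected Worker: {}", chosen.url);
    assert_eq!(chosen.url, "http://idle:8000");
}
```

With the gap check removed, the same inputs would select the overloaded worker (smallest cache), which is exactly the death-spiral behavior described above.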

Test Results

Test 1: test_prove_cache_aware_overload_flaw

This test validates that the router correctly prioritizes load over cache size when the "smallest cache" worker is overloaded.

Setup:

  • Worker 1 (Overloaded): Load = 50, Small cache (5 chars)
  • Worker 2 (Idle): Load = 0, Large cache (1000 chars)

Result:

Sending Probe Request (Cache Miss)...
Selected Worker: http://idle:8000
Overloaded Load: 50
Idle Load:       0
test policies::cache_aware::tests::test_prove_cache_aware_overload_flaw ... ok

Success: Router correctly selected the idle worker despite its larger cache usage.

Test 2: test_simulation_production_traffic

This test simulates a realistic production scenario with 100 consecutive requests to verify traffic distribution.

Setup:

  • Worker 0: Load = 20, Cache size = 1 char (The "Trap" - high load, small cache)
  • Worker 1: Load = 5, Cache size = 10 chars (Medium)
  • Worker 2: Load = 0, Cache size = 100 chars (Idle with large cache)

Result:

--- STARTING SIMULATION (100 Requests) ---
Initial Loads -> W0: 20, W1: 5, W2: 0

--- RESULTS ---
Final Loads   -> W0: 20, W1: 5, W2: 100
Requests Sent -> W0: 0, W1: 0, W2: 100

[SUCCESS] ROUTER IS BALANCED: It correctly prioritized the Idle node.
test policies::cache_aware::tests::test_simulation_production_traffic ... ok

Success: All 100 requests were correctly routed to the idle worker (W2), avoiding the overloaded worker (W0) despite its smaller cache.
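The burst above can be approximated with a small standalone simulation. This is a simplified sketch under one explicit assumption: worker loads are read from a fixed snapshot taken at the start of the burst and are not updated per request (the actual test also tracks the selected worker's load). The route helper and LOAD_GAP_THRESHOLD are hypothetical names, not the PR's real API.

```rust
struct Worker {
    name: &'static str,
    load: usize,       // assumed fixed for the whole burst
    cache_size: usize,
}

const LOAD_GAP_THRESHOLD: usize = 5;

/// Returns the index of the worker a cache-miss request is routed to.
fn route(workers: &[Worker]) -> usize {
    let (smallest_idx, smallest) = workers
        .iter()
        .enumerate()
        .min_by_key(|(_, w)| w.cache_size)
        .expect("at least one worker");
    let min_load = workers.iter().map(|w| w.load).min().unwrap();
    if smallest.load > min_load + LOAD_GAP_THRESHOLD {
        // Fallback: the smallest-cache worker is overloaded.
        workers.iter().enumerate().min_by_key(|(_, w)| w.load).unwrap().0
    } else {
        smallest_idx
    }
}

fn main() {
    // Mirrors the Test 2 setup: W0 is the "trap" (high load, tiny cache).
    let workers = [
        Worker { name: "W0", load: 20, cache_size: 1 },
        Worker { name: "W1", load: 5,  cache_size: 10 },
        Worker { name: "W2", load: 0,  cache_size: 100 },
    ];
    let mut sent = [0usize; 3];
    for _ in 0..100 {
        sent[route(&workers)] += 1; // loads held fixed per the assumption
    }
    println!(
        "Requests Sent -> W0: {}, W1: {}, W2: {}",
        sent[0], sent[1], sent[2]
    );
    assert_eq!(sent, [0, 0, 100]);
}
```

Under the fixed-snapshot assumption, every request sees W0 as the smallest-cache candidate, trips the circuit breaker (20 > 0 + 5), and falls back to the least-loaded worker W2.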

Benchmarking and Profiling

Checklist
