feat(router): Add load-aware fallback to cache-aware policy #14532
+202
−1
Motivation
This PR addresses a critical stability issue in the `CacheAwarePolicy` routing logic, where the router could unintentionally cause a "death spiral" by prioritizing memory optimization over compute availability.

The Issue:
Currently, when a request is a cache miss (it does not match any existing prefix), the router defaults to `tree.get_smallest_tenant()`. This selects the worker with the smallest cache usage to balance memory pressure. However, this logic completely ignores the current compute load of that worker.

The "Death Spiral" Scenario:
In a production environment, if a worker (e.g., Worker A) gets stuck processing a batch of unique, heavy requests, the current router sees "Worker A has a smaller tree" and routes 100% of new traffic to the already overloaded Worker A, causing it to crash or time out while Worker B remains idle.
Modifications
I modified `sgl-workspace/src/policies/cache_aware.rs` to implement a load-aware fallback mechanism.

Updated `select_worker`: on a cache miss, a candidate is now rejected if `candidate.load() > min_cluster_load + 5`.

Added Tests:
- `test_prove_cache_aware_overload_flaw`: a deterministic unit test proving the router now prefers an IDLE worker with a large cache over a BUSY worker with a small cache.
- `test_simulation_production_traffic`: a simulation of 100 requests that reproduces the "hot spot" scenario and verifies the traffic is correctly rebalanced to the idle node.

Test Results
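The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `Worker` struct, its field names, `select_worker_on_miss`, and the fixed slack of 5 are all assumptions for the sketch.

```rust
// Hypothetical sketch of a load-aware fallback for cache-miss routing.
// All names and the slack constant are illustrative assumptions.
struct Worker {
    name: &'static str,
    cache_size: usize, // approximate prefix-tree size for this worker
    load: usize,       // in-flight requests
}

/// Assumed slack mirroring the PR's `min_cluster_load + 5` guard.
const LOAD_SLACK: usize = 5;

/// On a cache miss, prefer the smallest-cache worker, but skip any
/// candidate whose load exceeds the cluster minimum by more than LOAD_SLACK.
fn select_worker_on_miss(workers: &[Worker]) -> Option<&Worker> {
    let min_load = workers.iter().map(|w| w.load).min()?;
    workers
        .iter()
        .filter(|w| w.load <= min_load + LOAD_SLACK)
        .min_by_key(|w| w.cache_size)
}

fn main() {
    // Worker A: tiny cache but overloaded; Worker B: big cache but idle.
    let workers = vec![
        Worker { name: "A", cache_size: 10, load: 50 },
        Worker { name: "B", cache_size: 1000, load: 0 },
    ];
    let chosen = select_worker_on_miss(&workers).unwrap();
    println!("routed to {}", chosen.name); // prints "routed to B"
}
```

The key design point is that cache size only breaks ties among workers that are already near the cluster's minimum load, so a small-cache hot spot can never monopolize new traffic.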
Test 1: `test_prove_cache_aware_overload_flaw`
This test validates that the router correctly prioritizes load over cache size when the "smallest cache" worker is overloaded.
Setup:
Result:
✅ Success: Router correctly selected the idle worker despite its larger cache usage.
Test 2: `test_simulation_production_traffic`
This test simulates a realistic production scenario with 100 consecutive requests to verify traffic distribution.
Setup:
Result:
✅ Success: All 100 requests were correctly routed to the idle worker (W2), avoiding the overloaded worker (W0) despite its smaller cache.
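A minimal, deterministic version of this 100-request scenario can be sketched as below, assuming the same illustrative selection function as above; the worker loads, cache sizes, and slack constant are hypothetical, and loads are held fixed during the burst.

```rust
// Hypothetical reproduction of the 100-request "hot spot" simulation.
// Loads, cache sizes, and the slack constant are illustrative assumptions.
struct Worker {
    cache_size: usize, // approximate prefix-tree size
    load: usize,       // in-flight requests
}

const LOAD_SLACK: usize = 5; // assumed slack mirroring `min_cluster_load + 5`

// Cache-miss fallback: smallest cache among workers whose load is within
// LOAD_SLACK of the cluster minimum (returns the winning worker's index).
fn select_worker_on_miss(workers: &[Worker]) -> usize {
    let min_load = workers.iter().map(|w| w.load).min().unwrap();
    workers
        .iter()
        .enumerate()
        .filter(|(_, w)| w.load <= min_load + LOAD_SLACK)
        .min_by_key(|(_, w)| w.cache_size)
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    // W0: the hot spot (small cache, heavy load); W2: idle but large cache.
    let workers = [
        Worker { cache_size: 5, load: 40 },   // W0
        Worker { cache_size: 900, load: 30 }, // W1
        Worker { cache_size: 800, load: 0 },  // W2
    ];
    let mut counts = [0usize; 3];
    for _ in 0..100 {
        counts[select_worker_on_miss(&workers)] += 1;
    }
    // All 100 requests land on the idle worker W2, despite its larger cache.
    println!("W0={} W1={} W2={}", counts[0], counts[1], counts[2]);
}
```

Under the old smallest-cache rule, every one of these requests would have gone to W0; with the load guard, W0 and W1 are filtered out before cache size is ever compared.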
Benchmarking and Profiling
Checklist