fix(scorer): exclude models whose context window can't hold the turn #227
steventohme wants to merge 2 commits into
Observed in prod: an 80k-token post-tool follow-up was routed to qwen/qwen3.5-flash-02-23. The flash model returned a fully-formed but empty assistant message: stop_reason=end_turn with zero content blocks after 4.5s. The router faithfully relayed exactly that, so CC saw an empty turn, considered it complete, and went silent (manifesting to the user as the conversation 'just stopping' after a Read tool call).

Adds a per-model context-window declaration and a fail-open filter in the cluster scorer:
- catalog.Model gains MaxInputTokens (0 = unknown).
- catalog.FitsContext(modelID, tokens) returns true unless the model's explicit window can't hold the input. When MaxInputTokens is 0, falls back to a conservative per-tier default (Low=64k, Mid=128k, High=200k) so flash-tier rows are caught even when we don't yet have a verified per-model number.
- Scorer.Route drops eligibleModels that don't fit before argmax. If the filter would empty the pool, the unfiltered set is kept (we'd rather route to a too-small model than 503; the empty-response failure mode is recoverable, a 503 isn't). The drop is logged.

Populated MaxInputTokens for verified models (Claude 4.x 200k, Gemini 2.x 1M, GPT-4.1 1M, GPT-4o 128k, Qwen3-family + Kimi K2 256k). Models with no public spec stay at 0 so the tier fallback applies.
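The FitsContext rule described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the Tier constants, the pared-down Model struct, the example/... model IDs, and the exact fallback numbers (64_000 rather than, say, 65_536) are assumptions made for the example. Treating an unknown model, zero tokens, and TierUnknown as "fits" is inferred from the fail-open wording and the test plan.

```go
package main

import "fmt"

// Tier is a coarse capability class. The names follow the PR text; the
// numeric values are an assumption for this sketch.
type Tier int

const (
	TierUnknown Tier = iota
	TierLow
	TierMid
	TierHigh
)

// Model is a pared-down stand-in for catalog.Model.
type Model struct {
	ID             string
	Tier           Tier
	MaxInputTokens int // 0 = unknown
}

// tierDefaultWindow holds the conservative per-tier fallbacks quoted in the
// commit message (Low=64k, Mid=128k, High=200k). Exact token counts are assumed.
var tierDefaultWindow = map[Tier]int{
	TierLow:  64_000,
	TierMid:  128_000,
	TierHigh: 200_000,
}

// catalog is hypothetical illustration data, not the real model table.
var catalog = map[string]Model{
	"example/flash":    {ID: "example/flash", Tier: TierLow},
	"example/big":      {ID: "example/big", Tier: TierHigh, MaxInputTokens: 1_000_000},
	"example/untiered": {ID: "example/untiered", Tier: TierUnknown},
}

// FitsContext reports whether tokens fit the model's window. It fails open:
// anything it cannot judge (unknown model, no token count, no window and no
// tier default) is treated as fitting.
func FitsContext(modelID string, tokens int) bool {
	m, ok := catalog[modelID]
	if !ok || tokens <= 0 {
		return true
	}
	window := m.MaxInputTokens
	if window == 0 {
		window = tierDefaultWindow[m.Tier] // stays 0 for TierUnknown
	}
	if window == 0 {
		return true
	}
	return tokens <= window
}

func main() {
	fmt.Println(FitsContext("example/flash", 80_000))    // false: 80k > 64k fallback
	fmt.Println(FitsContext("example/big", 80_000))      // true: explicit 1M window
	fmt.Println(FitsContext("example/untiered", 80_000)) // true: nothing to compare against
}
```

Under these assumptions, a TierLow row with no declared window rejects an 80k-token input, which is the case the commit message targets.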
Subagent-sourced verified context windows for the previously-unset rows.
All values cited against vendor docs / OpenRouter model pages on 2026-05-21:
- GPT-5 family: 400k (developers.openai.com/api/docs/models/gpt-5*)
- gpt-5-chat: 128k (openrouter.ai/openai/gpt-5-chat)
- GPT-5.4: gpt-5.4/-pro 1.05M; mini/-nano 400k
- GPT-5.5: gpt-5.5/-pro 1.05M; mini/-nano slugs not published →
left at 0 so TierMid fallback (128k) applies
- Gemini 3.x: full family 1M (ai.google.dev/gemini-api/docs/gemini-3)
- DeepSeek V4: 1M (openrouter.ai/deepseek/deepseek-v4-*)
- Kimi K2.6: 262k (was 256k — bumped to match openrouter listing)
- Qwen3.6-35b: 262k base
- qwen3-coder-next: 262k
- qwen3.5-flash-02-23: 1M (this means the original empty-response was
NOT a context-fit issue — probably a model-quality / provider problem.
Filter still ships as defense-in-depth for genuinely small-window rows.)
- xiaomi/mimo-v2.5 + pro: 1M
- minimax/minimax-m2.7: 205k
- z-ai/glm-5: 203k
- mistral-small-2603: 262k
Test updates: dropped the qwen3.5-flash assertion (the model's real
window is 1M, so 80k tokens does fit); replaced with gpt-5.5-mini/-nano
to exercise the TierMid fallback path, which is the remaining zero-window
case after this update.
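To make the effect of this data update concrete, here is a small sketch of how a row's effective window might resolve once explicit values are filled in. The row struct, effectiveWindow, and fits are hypothetical helpers; the two rows use only values stated above (qwen3.5-flash at 1M with TierLow, and a gpt-5.5-mini placeholder left at 0 under TierMid), and the 128_000 fallback is an assumed exact figure for "128k".

```go
package main

import "fmt"

type Tier int

const (
	TierUnknown Tier = iota
	TierLow
	TierMid
	TierHigh
)

// row is a hypothetical, reduced catalog entry.
type row struct {
	Tier           Tier
	MaxInputTokens int // 0 = unknown, use the tier fallback
}

// tierFallback mirrors the per-tier defaults named in the PR (assumed exact values).
var tierFallback = map[Tier]int{TierLow: 64_000, TierMid: 128_000, TierHigh: 200_000}

// rows uses the IDs as written in the commit message; "gpt-5.5-mini" is a
// placeholder because the real slug is described as unpublished.
var rows = map[string]row{
	"qwen/qwen3.5-flash-02-23": {Tier: TierLow, MaxInputTokens: 1_000_000},
	"gpt-5.5-mini":             {Tier: TierMid, MaxInputTokens: 0},
}

// effectiveWindow prefers the explicit window and otherwise uses the tier default.
func effectiveWindow(r row) int {
	if r.MaxInputTokens > 0 {
		return r.MaxInputTokens
	}
	return tierFallback[r.Tier]
}

// fits fails open for unknown IDs and for rows with no usable window.
func fits(id string, tokens int) bool {
	r, ok := rows[id]
	if !ok {
		return true
	}
	w := effectiveWindow(r)
	return w == 0 || tokens <= w
}

func main() {
	for _, id := range []string{"qwen/qwen3.5-flash-02-23", "gpt-5.5-mini"} {
		fmt.Println(id, effectiveWindow(rows[id]), fits(id, 80_000), fits(id, 150_000))
	}
}
```

With these numbers an 80k-token input fits both rows, and only the zero-window TierMid row rejects a 150k-token input, which matches the reason given for switching the test assertion.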
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit ec6b441.
  {Provider: providers.ProviderOpenRouter, Price: Pricing{InputUSDPer1M: 1.000, OutputUSDPer1M: 5.000, CacheReadMultiplier: 0.10}},
  }},
- {ID: "qwen/qwen3.5-flash-02-23", Tier: TierLow, Providers: []ProviderBinding{
+ {ID: "qwen/qwen3.5-flash-02-23", Tier: TierLow, MaxInputTokens: 1_000_000, Providers: []ProviderBinding{
Regression model given 1M window, defeating the fix
High Severity
qwen/qwen3.5-flash-02-23 is set to MaxInputTokens: 1_000_000, but the PR description explicitly states this model is "left at MaxInputTokens: 0" so the TierLow fallback (64k) catches the exact 80k-token regression this PR was created to fix. With 1M explicitly set, FitsContext returns true for 80k input (80_000 <= 1_000_000), so the model remains eligible and the production failure mode (empty end_turn response) is not prevented.


Summary
Observed in prod: an 80k-token post-tool follow-up was routed to qwen/qwen3.5-flash-02-23. The flash model returned a fully-formed but empty assistant message: stop_reason=end_turn with zero content blocks after 4.5s. The router faithfully relayed exactly that, so CC saw an empty turn, considered it complete, and went silent (manifesting to the user as the conversation "just stopping" after a Read tool call).

Adds a per-model context-window declaration and a fail-open filter in the cluster scorer.

Changes
- catalog.Model gains MaxInputTokens int (0 = unknown).
- catalog.FitsContext(modelID, tokens) returns true unless the model's explicit window can't hold the input. When MaxInputTokens is 0, falls back to a conservative per-tier default (Low=64k, Mid=128k, High=200k) so flash-tier rows are caught even when we don't yet have a verified per-model number.
- Scorer.Route drops eligibleModels that don't fit before argmax. If the filter would empty the pool, the unfiltered set is kept (we'd rather route to a too-small model than 503; the empty-response failure mode is recoverable, a 503 isn't). The drop is logged at Debug; pool-empty is Warn.

Data populated
MaxInputTokens set for verified models against vendor docs on 2026-05-21:
[1m] suffix)

GPT-5/5.4/5.5, Gemini 3.x, DeepSeek V4, GLM-5, MiMo, Mistral Small 2603, and qwen/qwen3.5-flash-02-23 are left at MaxInputTokens: 0; when a verified number lands, fill in. Until then the per-tier fallback applies (which is exactly what catches the regression: qwen/qwen3.5-flash-02-23 is TierLow → 64k fallback → 80k follow-up excluded).

Test plan
- catalog_test.go: TestFitsContext covers zero tokens, unknown model, explicit window, per-tier fallback (with the qwen3.5-flash regression at 80k), and TierUnknown passthrough.
- cluster/scorer_test.go: TestScorer_ExcludesModelsThatCannotFitContext exercises both fail-open (filter empties pool → unfiltered set kept) and normal (filter narrows pool but argmax still decisive).
- go test -tags=no_onnx ./...
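The fail-open filter that Scorer.Route applies before argmax could look something like the sketch below. It is an illustration under assumptions, not the PR's implementation: filterByContext, exampleFits, and the "small"/"large" model names are hypothetical, and the standard library log/slog package stands in for whatever logger the router actually uses.

```go
package main

import (
	"fmt"
	"log/slog"
)

// filterByContext keeps the candidates for which fits reports true. If that
// would leave nothing, it returns the original slice unchanged (fail open),
// on the reasoning that a possibly too-small model beats returning a 503.
func filterByContext(eligible []string, tokens int, fits func(id string, tokens int) bool) []string {
	kept := make([]string, 0, len(eligible))
	for _, id := range eligible {
		if fits(id, tokens) {
			kept = append(kept, id)
			continue
		}
		// Per-model drops are logged at Debug level.
		slog.Debug("dropping model: input exceeds context window", "model", id, "tokens", tokens)
	}
	if len(kept) == 0 && len(eligible) > 0 {
		// An emptied pool is logged at Warn level and the filter is bypassed.
		slog.Warn("context filter emptied the pool; keeping unfiltered set", "tokens", tokens)
		return eligible
	}
	return kept
}

// exampleWindows and exampleFits are hypothetical stand-ins for the catalog lookup.
var exampleWindows = map[string]int{"small": 64_000, "large": 1_000_000}

func exampleFits(id string, tokens int) bool {
	return tokens <= exampleWindows[id]
}

func main() {
	// Normal case: the filter narrows the pool and scoring proceeds on the rest.
	fmt.Println(filterByContext([]string{"small", "large"}, 80_000, exampleFits)) // [large]

	// Fail-open case: every candidate is too small, so the pool is left as-is.
	fmt.Println(filterByContext([]string{"small"}, 80_000, exampleFits)) // [small]
}
```

Returning the original slice in the empty case keeps the caller's argmax step unchanged: it always receives a non-empty pool whenever it was given one.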