fix(scorer): exclude models whose context window can't hold the turn by steventohme · Pull Request #227 · workweave/router

steventohme · 2026-05-21T23:55:57Z

Summary

Observed in prod: an 80k-token post-tool follow-up was routed to qwen/qwen3.5-flash-02-23. The flash model returned a fully-formed but empty assistant message — stop_reason=end_turn with zero content blocks after 4.5s. The router faithfully relayed exactly that, so CC saw an empty turn, considered it complete, and went silent (manifesting to the user as the conversation "just stopping" after a Read tool call).

Adds a per-model context-window declaration and a fail-open filter in the cluster scorer.

Changes

- catalog.Model gains MaxInputTokens int (0 = unknown).
- catalog.FitsContext(modelID, tokens) returns true unless the model's explicit window can't hold the input. When MaxInputTokens is 0, it falls back to a conservative per-tier default (Low=64k, Mid=128k, High=200k) so flash-tier rows are caught even when we don't yet have a verified per-model number.
- Scorer.Route drops eligibleModels that don't fit before argmax. If the filter would empty the pool, the unfiltered set is kept (we'd rather route to a too-small model than 503: the empty-response failure mode is recoverable; a 503 isn't). The drop is logged at Debug; pool-empty is Warn.
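
The fail-open filter described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `filterByContext`, the `fits` predicate parameter, and the use of `log/slog` are assumptions made for the example, and the real `Scorer.Route` may be structured differently.

```go
package main

import (
	"fmt"
	"log/slog"
)

// filterByContext is a hypothetical helper: it keeps the models for which
// fits(id, tokens) is true. If that would leave nothing, it returns the
// original slice unchanged (fail-open) instead of an empty pool.
func filterByContext(eligible []string, tokens int, fits func(modelID string, tokens int) bool) []string {
	kept := make([]string, 0, len(eligible))
	for _, id := range eligible {
		if fits(id, tokens) {
			kept = append(kept, id)
			continue
		}
		// Per-model drops are low-noise diagnostics.
		slog.Debug("model excluded: input exceeds context window", "model", id, "tokens", tokens)
	}
	if len(kept) == 0 && len(eligible) > 0 {
		// Every candidate was filtered out: keep the unfiltered set rather than fail.
		slog.Warn("context filter emptied pool; keeping unfiltered set", "tokens", tokens, "candidates", len(eligible))
		return eligible
	}
	return kept
}

func main() {
	// Hypothetical windows, for illustration only.
	windows := map[string]int{"small": 64_000, "large": 200_000}
	fits := func(id string, tokens int) bool { return tokens <= windows[id] }

	fmt.Println(filterByContext([]string{"small", "large"}, 80_000, fits)) // normal path: pool narrowed
	fmt.Println(filterByContext([]string{"small"}, 80_000, fits))          // fail-open path: pool kept
}
```

Passing the fit check in as a function keeps the sketch independent of the catalog; argmax would then run over whatever slice is returned.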

Data populated

MaxInputTokens is set for models verified against vendor docs on 2026-05-21:

| Model family | Window (input tokens) |
| --- | --- |
| Claude 4.x (haiku/sonnet/opus) | 200k (base; 1M variant negotiated via `[1m]` suffix) |
| GPT-4.1 family (gpt-4.1, mini, nano) | 1,047,576 |
| GPT-4o family | 128,000 |
| Gemini 2.x family (flash/flash-lite/pro) | 1,048,576 |
| Qwen3 family (235b-a22b-2507, 30b-a3b, coder, next-80b) | 262,144 |
| Kimi K2.5 / K2.6 | 256,000 |

GPT-5/5.4/5.5, Gemini 3.x, DeepSeek V4, GLM-5, MiMo, Mistral Small 2603, and qwen/qwen3.5-flash-02-23 are left at MaxInputTokens: 0, to be filled in when a verified number lands. Until then the per-tier fallback applies, which is exactly what catches the regression: qwen/qwen3.5-flash-02-23 is TierLow → 64k fallback → 80k follow-up excluded.
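
To make the fallback arithmetic concrete, here is a minimal sketch of how a `FitsContext`-style lookup could behave, assuming the qwen row is still at `MaxInputTokens: 0` as this description states. The `Tier` constants, the `models` map, the `example/...` model ID, and the exact fallback values (64_000 rather than 65_536, and so on) are assumptions for illustration; only the function name, the field name, and the Low/Mid/High defaults come from the description.

```go
package main

import "fmt"

// Tier is a hypothetical stand-in for the catalog's tier enum.
type Tier int

const (
	TierUnknown Tier = iota
	TierLow
	TierMid
	TierHigh
)

// Model mirrors only the fields this sketch needs.
type Model struct {
	ID             string
	Tier           Tier
	MaxInputTokens int // 0 = unknown
}

// tierDefaultWindow holds the conservative per-tier fallbacks.
// TierUnknown is deliberately absent, so it resolves to 0 (no limit known).
var tierDefaultWindow = map[Tier]int{
	TierLow:  64_000,
	TierMid:  128_000,
	TierHigh: 200_000,
}

// models is a hypothetical in-memory catalog.
var models = map[string]Model{
	"qwen/qwen3.5-flash-02-23": {ID: "qwen/qwen3.5-flash-02-23", Tier: TierLow},
	"example/explicit-128k":    {ID: "example/explicit-128k", Tier: TierLow, MaxInputTokens: 128_000},
}

// FitsContext returns false only when a known window (explicit or tier
// fallback) is smaller than the input. Everything else fails open.
func FitsContext(modelID string, tokens int) bool {
	if tokens <= 0 {
		return true
	}
	m, ok := models[modelID]
	if !ok {
		return true // unknown model: no basis for exclusion
	}
	window := m.MaxInputTokens
	if window == 0 {
		window = tierDefaultWindow[m.Tier] // explicit value wins; fallback only when unset
	}
	if window == 0 {
		return true // TierUnknown passthrough
	}
	return tokens <= window
}

func main() {
	fmt.Println(FitsContext("qwen/qwen3.5-flash-02-23", 80_000)) // false: 80k > 64k TierLow fallback
	fmt.Println(FitsContext("example/explicit-128k", 80_000))    // true: explicit 128k overrides the tier default
}
```

Note that an explicit `MaxInputTokens` always takes precedence over the tier default in this sketch, so setting a large verified window on a TierLow row makes an 80k input eligible again.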

Test plan

- catalog_test.go: TestFitsContext covers zero tokens, unknown model, explicit window, per-tier fallback (with the qwen3.5-flash regression at 80k), and TierUnknown passthrough.
- cluster/scorer_test.go: TestScorer_ExcludesModelsThatCannotFitContext exercises both fail-open (filter empties pool → unfiltered set kept) and normal (filter narrows pool but argmax still decisive).
- Full test suite green: go test -tags=no_onnx ./...

Observed in prod: an 80k-token post-tool follow-up was routed to qwen/qwen3.5-flash-02-23. The flash model returned a fully-formed but empty assistant message (stop_reason=end_turn with zero content blocks after 4.5s). The router faithfully relayed exactly that, so CC saw an empty turn, considered it complete, and went silent (manifesting to the user as the conversation 'just stopping' after a Read tool call).

Adds a per-model context-window declaration and a fail-open filter in the cluster scorer:

- catalog.Model gains MaxInputTokens (0 = unknown).
- catalog.FitsContext(modelID, tokens) returns true unless the model's explicit window can't hold the input. When MaxInputTokens is 0, falls back to a conservative per-tier default (Low=64k, Mid=128k, High=200k) so flash-tier rows are caught even when we don't yet have a verified per-model number.
- Scorer.Route drops eligibleModels that don't fit before argmax. If the filter would empty the pool, the unfiltered set is kept (we'd rather route to a too-small model than 503: the empty-response failure mode is recoverable; a 503 isn't). The drop is logged.

Populated MaxInputTokens for verified models (Claude 4.x 200k, Gemini 2.x 1M, GPT-4.1 1M, GPT-4o 128k, Qwen3-family + Kimi K2 256k). Models with no public spec stay at 0 so the tier fallback applies.

Subagent-sourced verified context windows for the previously-unset rows. All values cited against vendor docs / OpenRouter model pages on 2026-05-21:

- GPT-5 family: 400k (developers.openai.com/api/docs/models/gpt-5*)
- gpt-5-chat: 128k (openrouter.ai/openai/gpt-5-chat)
- GPT-5.4: gpt-5.4/-pro 1.05M; mini/-nano 400k
- GPT-5.5: gpt-5.5/-pro 1.05M; mini/-nano slugs not published → left at 0 so TierMid fallback (128k) applies
- Gemini 3.x: full family 1M (ai.google.dev/gemini-api/docs/gemini-3)
- DeepSeek V4: 1M (openrouter.ai/deepseek/deepseek-v4-*)
- Kimi K2.6: 262k (was 256k; bumped to match openrouter listing)
- Qwen3.6-35b: 262k base
- qwen3-coder-next: 262k
- qwen3.5-flash-02-23: 1M (this means the original empty-response was NOT a context-fit issue; probably a model-quality / provider problem. Filter still ships as defense-in-depth for genuinely small-window rows.)
- xiaomi/mimo-v2.5 + pro: 1M
- minimax/minimax-m2.7: 205k
- z-ai/glm-5: 203k
- mistral-small-2603: 262k

Test updates: dropped the qwen3.5-flash assertion (the model's real window is 1M, so 80k tokens does fit); replaced with gpt-5.5-mini/-nano to exercise the TierMid fallback path, which is the remaining zero-window case after this update.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit ec6b441.

cursor · 2026-05-22T00:49:46Z

 		{Provider: providers.ProviderOpenRouter, Price: Pricing{InputUSDPer1M: 1.000, OutputUSDPer1M: 5.000, CacheReadMultiplier: 0.10}},
 	}},
-	{ID: "qwen/qwen3.5-flash-02-23", Tier: TierLow, Providers: []ProviderBinding{
+	{ID: "qwen/qwen3.5-flash-02-23", Tier: TierLow, MaxInputTokens: 1_000_000, Providers: []ProviderBinding{


Regression model given 1M window, defeating the fix

High Severity

qwen/qwen3.5-flash-02-23 is set to MaxInputTokens: 1_000_000, but the PR description explicitly states this model is "left at MaxInputTokens: 0" so the TierLow fallback (64k) catches the exact 80k-token regression this PR was created to fix. With 1M explicitly set, FitsContext returns true for 80k input (80_000 <= 1_000_000), so the model remains eligible and the production failure mode (empty end_turn response) is not prevented.

steventohme added 2 commits May 21, 2026 16:55

cursor Bot reviewed May 22, 2026

steventohme mentioned this pull request May 22, 2026

feat(diagnostics): detect empty-turn upstream responses + repro harness #228

Open

4 tasks

steventohme wants to merge 2 commits into main from steven/scorer-context-fit-filter
