fix(scorer): exclude models whose context window can't hold the turn #227

Open
steventohme wants to merge 2 commits into main from steven/scorer-context-fit-filter

@steventohme
Collaborator

Summary

Observed in prod: an 80k-token post-tool follow-up was routed to qwen/qwen3.5-flash-02-23. The flash model returned a fully-formed but empty assistant message — stop_reason=end_turn with zero content blocks after 4.5s. The router faithfully relayed exactly that, so CC saw an empty turn, considered it complete, and went silent (manifesting to the user as the conversation "just stopping" after a Read tool call).

Adds a per-model context-window declaration and a fail-open filter in the cluster scorer.

Changes

  • catalog.Model gains MaxInputTokens int (0 = unknown).
  • catalog.FitsContext(modelID, tokens) returns true unless the model's explicit window can't hold the input. When MaxInputTokens is 0, falls back to a conservative per-tier default (Low=64k, Mid=128k, High=200k) so flash-tier rows are caught even when we don't yet have a verified per-model number.
  • Scorer.Route drops eligibleModels that don't fit before argmax. If the filter would empty the pool, the unfiltered set is kept (we'd rather route to a too-small model than 503 — the empty-response failure mode is recoverable; a 503 isn't). The drop is logged at Debug; pool-empty is Warn.
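
The fit check described above can be sketched roughly as follows. This is not the PR's code: the catalog rows are hypothetical, "64k" is assumed to mean 64,000, and returning true for zero tokens, unknown models, and TierUnknown is inferred from the test plan's wording rather than confirmed by the diff.

```go
package main

import "fmt"

// Tier is a coarse model class; the names mirror the PR description.
type Tier int

const (
	TierUnknown Tier = iota
	TierLow
	TierMid
	TierHigh
)

// Model carries only the fields this sketch needs.
type Model struct {
	ID             string
	Tier           Tier
	MaxInputTokens int // 0 = unknown
}

// tierDefaultWindow holds the conservative per-tier fallbacks.
// Assumption: the PR's "64k/128k/200k" are decimal thousands.
var tierDefaultWindow = map[Tier]int{
	TierLow:  64_000,
	TierMid:  128_000,
	TierHigh: 200_000,
}

// catalog is a hypothetical stand-in for the real model table.
var catalog = map[string]Model{
	"example/explicit-128k":  {ID: "example/explicit-128k", Tier: TierMid, MaxInputTokens: 128_000},
	"example/low-unverified": {ID: "example/low-unverified", Tier: TierLow},
	"example/no-tier":        {ID: "example/no-tier", Tier: TierUnknown},
}

// FitsContext reports whether tokens fit the model's input window.
// It fails open: anything it cannot judge is treated as fitting.
func FitsContext(modelID string, tokens int) bool {
	if tokens <= 0 {
		return true // nothing to fit
	}
	m, ok := catalog[modelID]
	if !ok {
		return true // unknown model: do not exclude
	}
	window := m.MaxInputTokens
	if window == 0 {
		// No verified number yet: use the tier default if one exists.
		window, ok = tierDefaultWindow[m.Tier]
		if !ok {
			return true // TierUnknown passthrough
		}
	}
	return tokens <= window
}

func main() {
	fmt.Println(FitsContext("example/explicit-128k", 80_000))  // true
	fmt.Println(FitsContext("example/low-unverified", 80_000)) // false
	fmt.Println(FitsContext("example/no-tier", 80_000))        // true
}
```

Note that an explicit MaxInputTokens always takes precedence over the tier default, so filling in a large verified window for a TierLow row removes it from the fallback's reach.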

Data populated

MaxInputTokens is set for models whose windows were verified against vendor docs on 2026-05-21:

| Model family | Window |
| --- | --- |
| Claude 4.x (haiku/sonnet/opus) | 200k (base; 1M variant negotiated via [1m] suffix) |
| GPT-4.1 family (gpt-4.1, mini, nano) | 1,047,576 |
| GPT-4o family | 128,000 |
| Gemini 2.x family (flash/flash-lite/pro) | 1,048,576 |
| Qwen3 family (235b-a22b-2507, 30b-a3b, coder, next-80b) | 262,144 |
| Kimi K2.5 / K2.6 | 256,000 |

GPT-5/5.4/5.5, Gemini 3.x, DeepSeek V4, GLM-5, MiMo, Mistral Small 2603, and qwen/qwen3.5-flash-02-23 are left at MaxInputTokens: 0; fill them in when verified numbers land. Until then the per-tier fallback applies, which is exactly what catches the regression: qwen/qwen3.5-flash-02-23 is TierLow, so the 64k fallback excludes the 80k follow-up.

Test plan

  • catalog_test.go: TestFitsContext covers zero tokens, unknown model, explicit window, per-tier fallback (with the qwen3.5-flash regression at 80k), and TierUnknown passthrough.
  • cluster/scorer_test.go: TestScorer_ExcludesModelsThatCannotFitContext exercises both fail-open (filter empties pool → unfiltered set kept) and normal (filter narrows pool but argmax still decisive).
  • Full test suite green: go test -tags=no_onnx ./...
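
The two scorer cases in the test plan (the filter narrows the pool, and the filter would empty it) can be illustrated with a small sketch. The function name filterByContext, the string-ID pool, and the use of log/slog are assumptions made for illustration; the PR only says the drop is logged at Debug and pool-empty at Warn.

```go
package main

import (
	"fmt"
	"log/slog"
)

// filterByContext keeps the models for which fits reports true.
// If that would leave nothing, it fails open and returns the
// original pool unchanged.
func filterByContext(eligible []string, tokens int, fits func(string, int) bool) []string {
	kept := make([]string, 0, len(eligible))
	for _, id := range eligible {
		if fits(id, tokens) {
			kept = append(kept, id)
			continue
		}
		slog.Debug("dropping model: input exceeds context window",
			"model", id, "tokens", tokens)
	}
	if len(kept) == 0 && len(eligible) > 0 {
		slog.Warn("context filter emptied the pool; keeping unfiltered set",
			"tokens", tokens)
		return eligible
	}
	return kept
}

func main() {
	// Hypothetical windows for two made-up model IDs.
	windows := map[string]int{"example/small": 64_000, "example/large": 200_000}
	fits := func(id string, tokens int) bool { return tokens <= windows[id] }

	// Normal case: the pool is narrowed and argmax runs on what remains.
	fmt.Println(filterByContext([]string{"example/small", "example/large"}, 80_000, fits))

	// Fail-open case: nothing fits, so the unfiltered pool is kept.
	fmt.Println(filterByContext([]string{"example/small"}, 80_000, fits))
}
```

Passing the fit check in as a function keeps the sketch independent of the catalog, which also makes both branches easy to exercise in a table-driven test.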

Observed in prod: an 80k-token post-tool follow-up was routed to
qwen/qwen3.5-flash-02-23. The flash model returned a fully-formed but
empty assistant message — stop_reason=end_turn with zero content blocks
after 4.5s. The router faithfully relayed exactly that, so CC saw an
empty turn, considered it complete, and went silent (manifesting to the
user as the conversation 'just stopping' after a Read tool call).

Adds a per-model context-window declaration and a fail-open filter in
the cluster scorer:

- catalog.Model gains MaxInputTokens (0 = unknown).
- catalog.FitsContext(modelID, tokens) returns true unless the model's
  explicit window can't hold the input. When MaxInputTokens is 0, falls
  back to a conservative per-tier default (Low=64k, Mid=128k, High=200k)
  so flash-tier rows are caught even when we don't yet have a verified
  per-model number.
- Scorer.Route drops eligibleModels that don't fit before argmax. If the
  filter would empty the pool, the unfiltered set is kept (we'd rather
  route to a too-small model than 503 — the empty-response failure mode
  is recoverable; a 503 isn't). The drop is logged.

Populated MaxInputTokens for verified models (Claude 4.x 200k, Gemini
2.x 1M, GPT-4.1 1M, GPT-4o 128k, Qwen3-family + Kimi K2 256k). Models
with no public spec stay at 0 so the tier fallback applies.

Subagent-sourced verified context windows for the previously-unset rows.
All values cited against vendor docs / OpenRouter model pages on 2026-05-21:

- GPT-5 family:   400k (developers.openai.com/api/docs/models/gpt-5*)
- gpt-5-chat:     128k (openrouter.ai/openai/gpt-5-chat)
- GPT-5.4:        gpt-5.4/-pro 1.05M; mini/-nano 400k
- GPT-5.5:        gpt-5.5/-pro 1.05M; mini/-nano slugs not published →
                  left at 0 so TierMid fallback (128k) applies
- Gemini 3.x:     full family 1M (ai.google.dev/gemini-api/docs/gemini-3)
- DeepSeek V4:    1M (openrouter.ai/deepseek/deepseek-v4-*)
- Kimi K2.6:      262k (was 256k — bumped to match openrouter listing)
- Qwen3.6-35b:    262k base
- qwen3-coder-next: 262k
- qwen3.5-flash-02-23: 1M (this means the original empty-response was
  NOT a context-fit issue — probably a model-quality / provider problem.
  Filter still ships as defense-in-depth for genuinely small-window rows.)
- xiaomi/mimo-v2.5 + pro: 1M
- minimax/minimax-m2.7: 205k
- z-ai/glm-5: 203k
- mistral-small-2603: 262k

Test updates: dropped the qwen3.5-flash assertion (the model's real
window is 1M, so 80k tokens does fit); replaced with gpt-5.5-mini/-nano
to exercise the TierMid fallback path, which is the remaining zero-window
case after this update.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit ec6b441. Configure here.

  {Provider: providers.ProviderOpenRouter, Price: Pricing{InputUSDPer1M: 1.000, OutputUSDPer1M: 5.000, CacheReadMultiplier: 0.10}},
  }},
- {ID: "qwen/qwen3.5-flash-02-23", Tier: TierLow, Providers: []ProviderBinding{
+ {ID: "qwen/qwen3.5-flash-02-23", Tier: TierLow, MaxInputTokens: 1_000_000, Providers: []ProviderBinding{
Regression model given 1M window, defeating the fix

High Severity

qwen/qwen3.5-flash-02-23 is set to MaxInputTokens: 1_000_000, but the PR description explicitly states this model is "left at MaxInputTokens: 0" so the TierLow fallback (64k) catches the exact 80k-token regression this PR was created to fix. With 1M explicitly set, FitsContext returns true for 80k input (80_000 <= 1_000_000), so the model remains eligible and the production failure mode (empty end_turn response) is not prevented.

