Skip to content

feat(proxy): per-request provider failover for multi-binding models#197

Open
steventohme wants to merge 1 commit into
mainfrom
feat/provider-failover-v2
Open

feat(proxy): per-request provider failover for multi-binding models#197
steventohme wants to merge 1 commit into
mainfrom
feat/provider-failover-v2

Conversation

@steventohme
Copy link
Copy Markdown
Collaborator

Summary

When a model has an ordered Providers list in catalog (e.g. deepseek/deepseek-v4-pro on Fireworks primary / OpenRouter fallback), the proxy now retries on the next binding when the first attempt fails before any bytes reach the client.

The catalog data shape has supported this since #189 (unified model catalog); this PR adds the runtime dispatch.

What changes

  • providers — new UpstreamErrorResponse type for buffered error bodies, plus IsRetryableStatus (408/429/5xx) and IsRetryable(err) classifier.
  • providers/openaicompat — on >=400, buffer the response (capped at MaxBufferedErrorBytes = 64KB) and return *UpstreamErrorResponse without touching the client writer. Success path unchanged.
  • router/catalog — new AvailableBindings(id, available) []ProviderBinding returning the ordered eligible list.
  • proxy/fallback.go (new) — dispatchWithFallback walks the binding list, retries on IsRetryable errors when no bytes have flushed, and on exhaustion writes the final buffered upstream error verbatim to the client. firstByteGuard tracks whether anything has reached the client (the retry-safety invariant).
  • proxy/service.goProxyMessages and ProxyOpenAIChatCompletion dispatch through the helper. Per-attempt prep + translator construction lives in a closure so retries get fresh translator state. ProxyGemini left untouched (Gemini bindings are all single-provider).

Safety carve-outs

  • BYOK / inbound client credentials — failover skipped. Those keys bind the request to a specific upstream and would 401 elsewhere.
  • Single-binding models (Anthropic / OpenAI / Google) — len(bindings) == 1, loop runs once, identical to today.
  • Legacy "all registered" mode (deploymentKeyedProviders == nil) — also returns single-attempt to avoid retrying on providers whose keys aren't actually wired.

Retry envelope

Per scoping:

  • Transport errors from http.Client.Do
  • Buffered upstream status: 408, 429, 5xx

4xx ≠ 408/429 are the client's fault — never retried; the buffered body flushes to the client immediately so they see the real upstream error.

Observability

router.upstream OTel span gains:

  • dispatch.primary_provider
  • dispatch.final_provider
  • dispatch.fallback_attempts (winning binding index)
  • dispatch.failover_used (bool, easy alert filter)

Same fields land in the ProxyMessages / ProxyOpenAIChatCompletion completion log lines.

Response headers on successful failover: x-router-fallback-from: <primary> and x-router-fallback-attempt: <n>.

What's load-bearing

  • openaicompat no longer flushes error bodies through to the writer. This is what makes retry possible — once WriteHeader is called on the client writer, the response is committed and any retry would result in two responses on the wire. The post-loop flushBufferedIfPresent writes the final upstream's envelope when retries are exhausted, so the customer-visible error envelope on exhaustion is unchanged in shape (just sourced from the last attempt instead of the first).
  • firstByteGuard.written is the hard gate on retry. Even a retryable error type means nothing if bytes already shipped — partial SSE is on the wire and committing to that attempt is the only correct move.
  • Translator state is fresh per attempt. The dispatch closure constructs the translator inside its body so a retry doesn't reuse a stateful object that may have seen partial bytes.

Test plan

  • go build -tags=no_onnx ./... clean
  • go test -tags=no_onnx ./... — all 21 packages green
  • New tests in internal/proxy/fallback_test.go (12 cases via scripted fakeClient): transport error → retry, retryable buffered error → retry, non-retryable status → no retry, bytes-flushed → no retry, exhaustion → final buffered body flushed, single-binding passthrough, BYOK skip, catalog resolution, IsRetryable classifier table
  • New tests in internal/providers/openaicompat/client_test.go (3 cases): error body buffered + writer pristine, 64KB cap on body, 4xx still buffered (classifier decides retry, not adapter)
  • Manual smoke against a staging deploy that has both FIREWORKS_API_KEY and OPENROUTER_API_KEY wired — issue a deepseek/deepseek-v4-pro request, observe dispatch.failover_used=false baseline; then kill the Fireworks edge (or point at a 503-serving stub) and observe x-router-fallback-from: fireworks on the response

Follow-ups not in this PR

  • Tests for ProxyMessages / ProxyOpenAIChatCompletion end-to-end failover behavior (current tests target the helper + adapter; the service-level wiring is exercised but not asserted directly)
  • Provider-specific max-attempts cap if/when models grow beyond 2 bindings
  • firstByteGuard doesn't yet support http.ResponseController flush hints — fine today since the openaicompat success path uses httputil.StreamBody and the failure path doesn't write at all

When a model has an ordered Providers list in catalog (e.g. deepseek/deepseek-v4-pro
on Fireworks primary / OpenRouter fallback), the proxy now retries on the next
binding when the first attempt fails before any bytes reach the client.

- providers: new UpstreamErrorResponse type for buffered error bodies, plus
  IsRetryableStatus (408/429/5xx) and IsRetryable classifier.
- openaicompat: on >=400, buffer the response (capped at 64KB) and return
  *UpstreamErrorResponse without touching the client writer. Success path
  unchanged.
- catalog: new AvailableBindings(id, available) — ordered list of the model's
  bindings filtered to wired providers.
- proxy/fallback.go: dispatchWithFallback walks the binding list, retries on
  IsRetryable errors when no bytes have flushed, and on exhaustion writes
  the final buffered upstream error verbatim to the client.
- proxy/service.go: ProxyMessages and ProxyOpenAIChatCompletion dispatch
  through the helper. Per-attempt prep + translator construction lives in a
  closure so retries get fresh translator state.

Failover is skipped when BYOK or inbound client credentials are present —
those keys bind the request to a specific upstream and would 401 elsewhere.

OTel attrs on router.upstream span: dispatch.primary_provider,
dispatch.final_provider, dispatch.fallback_attempts, dispatch.failover_used.
Same fields land in the ProxyMessages/ProxyOpenAIChatCompletion completion
log lines.

Response headers on successful failover: x-router-fallback-from and
x-router-fallback-attempt expose the route the request actually took.

Tests: 12 new dispatchWithFallback / classifier cases via a scripted
fakeClient (transport error, retryable buffered error, non-retryable status,
bytes-flushed lockout, exhaustion flush, single-binding passthrough, BYOK
skip, catalog resolution). 3 new openaicompat cases for error buffering,
64KB cap, and 4xx buffering. Full router suite green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 1c64562. Configure here.

Comment thread internal/proxy/service.go
WithRoutingMarker(routingMarkerFor(routeRes)).
WithEstimatedInputTokens(feats.Tokens)
err := p.Proxy(actx, d, prep, translator, r)
return finalizeAfterProxy(err, translator.Finalize)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prep body baked for primary provider, reused on fallback

High Severity

PrepareOpenAI is called once with opts.TargetProvider set to the primary provider, and the resulting prep body is captured by the attempt closure and reused on every fallback attempt. Provider-specific fields (provider hint, reasoning object, system reminder, tool-turn temperature override) are gated on targetIsOpenRouter(opts) at prep time. On fallback to a different provider, the body carries the wrong provider-specific fields. For current catalog (Fireworks primary → OpenRouter fallback on deepseek/), the fallback request to OpenRouter is missing its required provider/reasoning hints. If ordering ever reverses, the OpenRouter-specific fields would cause a 400 on Fireworks.

Additional Locations (1)
Fix in Cursor Fix in Web

Triggered by learned rule: Emit-layer provider-specific fields must gate on TargetProvider, not model slug

Reviewed by Cursor Bugbot for commit 1c64562. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant