feat(proxy): per-request provider failover for multi-binding models #197
steventohme wants to merge 1 commit into
When a model has an ordered Providers list in catalog (e.g. deepseek/deepseek-v4-pro on Fireworks primary / OpenRouter fallback), the proxy now retries on the next binding when the first attempt fails before any bytes reach the client.

- providers: new UpstreamErrorResponse type for buffered error bodies, plus IsRetryableStatus (408/429/5xx) and IsRetryable classifier.
- openaicompat: on >=400, buffer the response (capped at 64KB) and return *UpstreamErrorResponse without touching the client writer. Success path unchanged.
- catalog: new AvailableBindings(id, available), an ordered list of the model's bindings filtered to wired providers.
- proxy/fallback.go: dispatchWithFallback walks the binding list, retries on IsRetryable errors when no bytes have flushed, and on exhaustion writes the final buffered upstream error verbatim to the client.
- proxy/service.go: ProxyMessages and ProxyOpenAIChatCompletion dispatch through the helper. Per-attempt prep + translator construction lives in a closure so retries get fresh translator state.

Failover is skipped when BYOK or inbound client credentials are present; those keys bind the request to a specific upstream and would 401 elsewhere.

OTel attrs on router.upstream span: dispatch.primary_provider, dispatch.final_provider, dispatch.fallback_attempts, dispatch.failover_used. Same fields land in the ProxyMessages/ProxyOpenAIChatCompletion completion log lines.

Response headers on successful failover: x-router-fallback-from and x-router-fallback-attempt expose the route the request actually took.

Tests: 12 new dispatchWithFallback / classifier cases via a scripted fakeClient (transport error, retryable buffered error, non-retryable status, bytes-flushed lockout, exhaustion flush, single-binding passthrough, BYOK skip, catalog resolution). 3 new openaicompat cases for error buffering, 64KB cap, and 4xx buffering. Full router suite green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
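For readers skimming the commit message, the retry classifier it describes could look roughly like the sketch below. The field names on UpstreamErrorResponse, the errDialFailed stand-in, and the choice to treat every non-upstream error as a retryable transport failure are assumptions made for illustration, not the code in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// UpstreamErrorResponse holds an upstream error that was buffered instead of
// being written to the client. The field names are assumed for this sketch.
type UpstreamErrorResponse struct {
	Status int
	Body   []byte
}

func (e *UpstreamErrorResponse) Error() string {
	return fmt.Sprintf("upstream returned status %d", e.Status)
}

// errDialFailed is a hypothetical stand-in for a transport-level failure.
var errDialFailed = errors.New("dial tcp: connection refused")

// IsRetryableStatus reports whether the status is 408, 429, or any 5xx.
func IsRetryableStatus(status int) bool {
	return status == http.StatusRequestTimeout ||
		status == http.StatusTooManyRequests ||
		(status >= 500 && status <= 599)
}

// IsRetryable classifies an attempt error. Buffered upstream errors are judged
// by status; any other non-nil error is treated as a transport failure and
// retried (an assumption; real code may want to exclude context cancellation).
func IsRetryable(err error) bool {
	if err == nil {
		return false
	}
	var up *UpstreamErrorResponse
	if errors.As(err, &up) {
		return IsRetryableStatus(up.Status)
	}
	return true
}

func main() {
	for _, s := range []int{400, 408, 429, 503} {
		fmt.Println(s, IsRetryable(&UpstreamErrorResponse{Status: s}))
	}
	fmt.Println("transport:", IsRetryable(errDialFailed))
}
```

Keeping the status check separate from the error check means the adapter can buffer every >=400 response the same way and leave the retry decision to the classifier.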
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 1c64562.
    WithRoutingMarker(routingMarkerFor(routeRes)).
    WithEstimatedInputTokens(feats.Tokens)
err := p.Proxy(actx, d, prep, translator, r)
return finalizeAfterProxy(err, translator.Finalize)
Prep body baked for primary provider, reused on fallback
High Severity
PrepareOpenAI is called once with opts.TargetProvider set to the primary provider, and the resulting prep body is captured by the attempt closure and reused on every fallback attempt. Provider-specific fields (provider hint, reasoning object, system reminder, tool-turn temperature override) are gated on targetIsOpenRouter(opts) at prep time. On fallback to a different provider, the body therefore carries the wrong provider-specific fields. For the current catalog (Fireworks primary → OpenRouter fallback on deepseek/), the fallback request to OpenRouter is missing its required provider/reasoning hints. If the ordering ever reverses, the OpenRouter-specific fields would cause a 400 on Fireworks.
Triggered by learned rule: Emit-layer provider-specific fields must gate on TargetProvider, not model slug
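One way to address the finding above is to rebuild the request body inside each attempt, keyed on that attempt's target provider. The sketch below only illustrates that shape: prepOptions, prepareBody, the "openrouter" identifier, and the placeholder "provider" field are hypothetical stand-ins, not the project's PrepareOpenAI.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// prepOptions is a hypothetical stand-in for the options passed to prep.
type prepOptions struct {
	TargetProvider string
	Model          string
}

// targetIsOpenRouter mirrors the gate named in the review; this body is assumed.
func targetIsOpenRouter(opts prepOptions) bool {
	return opts.TargetProvider == "openrouter"
}

// prepareBody builds the request body for ONE attempt. Provider-specific
// fields are decided from the attempt's own target, never from the primary.
func prepareBody(opts prepOptions) map[string]any {
	body := map[string]any{"model": opts.Model}
	if targetIsOpenRouter(opts) {
		// Placeholder for the OpenRouter-only fields (provider hint, reasoning
		// object, and so on); the real field shapes are not shown in the source.
		body["provider"] = map[string]any{"note": "placeholder for OpenRouter-only fields"}
	}
	return body
}

func main() {
	// Prep runs once per binding, so the fallback gets its own body.
	for _, provider := range []string{"fireworks", "openrouter"} {
		body := prepareBody(prepOptions{TargetProvider: provider, Model: "deepseek/deepseek-v4-pro"})
		raw, err := json.Marshal(body)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %s\n", provider, raw)
	}
}
```

The cost is one extra prep call per fallback attempt, which only runs on the failure path.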


Summary
When a model has an ordered Providers list in catalog (e.g. deepseek/deepseek-v4-pro on Fireworks primary / OpenRouter fallback), the proxy now retries on the next binding when the first attempt fails before any bytes reach the client.

The catalog data shape has supported this since #189 (unified model catalog); this PR adds the runtime dispatch.
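As a rough illustration of the lookup the runtime dispatch depends on, the sketch below filters a model's ordered bindings down to wired providers. The Catalog and ProviderBinding shapes, the map[string]bool type for available, and demoCatalog are assumptions for this example; only the AvailableBindings name and its ([]ProviderBinding) result come from the PR.

```go
package main

import "fmt"

// ProviderBinding is reduced to the one field this sketch needs.
type ProviderBinding struct {
	Provider string
}

// Catalog maps a model id to its ordered bindings (assumed shape).
type Catalog struct {
	models map[string][]ProviderBinding
}

// AvailableBindings returns the model's bindings in catalog order, keeping
// only providers marked as wired in available. Unknown ids yield an empty list.
func (c *Catalog) AvailableBindings(id string, available map[string]bool) []ProviderBinding {
	var out []ProviderBinding
	for _, b := range c.models[id] {
		if available[b.Provider] {
			out = append(out, b)
		}
	}
	return out
}

// demoCatalog builds a hypothetical catalog with one two-binding model.
func demoCatalog() *Catalog {
	return &Catalog{models: map[string][]ProviderBinding{
		"deepseek/deepseek-v4-pro": {{Provider: "fireworks"}, {Provider: "openrouter"}},
	}}
}

func main() {
	c := demoCatalog()
	wired := map[string]bool{"fireworks": true, "openrouter": true}
	fmt.Println(c.AvailableBindings("deepseek/deepseek-v4-pro", wired))
	// With only one provider wired, the list collapses to a single attempt.
	fmt.Println(c.AvailableBindings("deepseek/deepseek-v4-pro", map[string]bool{"openrouter": true}))
}
```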
What changes
- providers: new UpstreamErrorResponse type for buffered error bodies, plus IsRetryableStatus (408/429/5xx) and IsRetryable(err) classifier.
- providers/openaicompat: on >=400, buffer the response (capped at MaxBufferedErrorBytes = 64KB) and return *UpstreamErrorResponse without touching the client writer. Success path unchanged.
- router/catalog: new AvailableBindings(id, available) []ProviderBinding returning the ordered eligible list.
- proxy/fallback.go (new): dispatchWithFallback walks the binding list, retries on IsRetryable errors when no bytes have flushed, and on exhaustion writes the final buffered upstream error verbatim to the client. firstByteGuard tracks whether anything has reached the client (the retry-safety invariant).
- proxy/service.go: ProxyMessages and ProxyOpenAIChatCompletion dispatch through the helper. Per-attempt prep + translator construction lives in a closure so retries get fresh translator state. ProxyGemini left untouched (Gemini bindings are all single-provider).

Safety carve-outs
- len(bindings) == 1: the loop runs once, identical to today.
- deploymentKeyedProviders == nil: also returns single-attempt to avoid retrying on providers whose keys aren't actually wired.

Retry envelope
Per scoping:
- Transport errors from http.Client.Do: retried.
- 408, 429, 5xx: retried.
- 4xx ≠ 408/429 are the client's fault: never retried; the buffered body flushes to the client immediately so they see the real upstream error.

Observability
router.upstream OTel span gains:
- dispatch.primary_provider
- dispatch.final_provider
- dispatch.fallback_attempts (winning binding index)
- dispatch.failover_used (bool, easy alert filter)

Same fields land in the ProxyMessages/ProxyOpenAIChatCompletion completion log lines.

Response headers on successful failover: x-router-fallback-from: <primary> and x-router-fallback-attempt: <n>.

What's load-bearing
- openaicompat no longer flushes error bodies through to the writer. This is what makes retry possible: once WriteHeader is called on the client writer, the response is committed and any retry would result in two responses on the wire. The post-loop flushBufferedIfPresent writes the final upstream's envelope when retries are exhausted, so the customer-visible error envelope on exhaustion is unchanged in shape (just sourced from the last attempt instead of the first).
- firstByteGuard.written is the hard gate on retry. Even a retryable error type means nothing if bytes already shipped: partial SSE is on the wire and committing to that attempt is the only correct move.

Test plan
- go build -tags=no_onnx ./... clean
- go test -tags=no_onnx ./...: all 21 packages green
- internal/proxy/fallback_test.go (12 cases via scripted fakeClient): transport error → retry, retryable buffered error → retry, non-retryable status → no retry, bytes-flushed → no retry, exhaustion → final buffered body flushed, single-binding passthrough, BYOK skip, catalog resolution, IsRetryable classifier table
- internal/providers/openaicompat/client_test.go (3 cases): error body buffered + writer pristine, 64KB cap on body, 4xx still buffered (classifier decides retry, not adapter)
- With FIREWORKS_API_KEY and OPENROUTER_API_KEY wired: issue a deepseek/deepseek-v4-pro request, observe dispatch.failover_used=false baseline; then kill the Fireworks edge (or point at a 503-serving stub) and observe x-router-fallback-from: fireworks on the response

Follow-ups not in this PR
- ProxyMessages/ProxyOpenAIChatCompletion end-to-end failover behavior (current tests target the helper + adapter; the service-level wiring is exercised but not asserted directly)
- firstByteGuard doesn't yet support http.ResponseController flush hints; fine today since the openaicompat success path uses httputil.StreamBody and the failure path doesn't write at all