fix(oci/openai-compat): auto-refresh instance-principal token (closes silent 401 after ~15min)#205
Closed
fede-kamel wants to merge 1 commit into
Closed
fix(oci/openai-compat): auto-refresh instance-principal token (closes silent 401 after ~15min)#205fede-kamel wants to merge 1 commit into
fede-kamel wants to merge 1 commit into
Conversation
… silent 401 after ~15min)
OCIOpenAIModel constructed its OCIRequestSigner without a refresh_signer
callback. OCIRequestSigner.auth_flow has both a refresh-on-401 branch
and a periodic-refresh branch — both early-return when refresh_signer
is None, which meant the federation token captured at process start
was used forever. On OKE, instance-principal tokens expire on the
order of 15–30 minutes, so any agent pod older than that would 401 on
every GenAI call until restarted.
Production symptom seen in almariel: chats silently fall through to
reason=error after ~15 minutes of pod uptime, with the underlying
httpx logs showing "HTTP/1.1 401 Unauthorized" on every chat.completions
call. Pod restart was the only known workaround.
The fix wires the signer's own refresh_security_token method (present
on InstancePrincipalsSecurityTokenSigner, get_resource_principals_signer()
returns a signer with the same contract, and any DelegationTokenSigner
variant that follows the OCI SDK convention) into the wrapper via a
new _refresh_callable_for(signer) helper. Static signers (user-principal
API key) have no refresh_security_token attribute and the helper
returns None, so the refresh path stays dormant for them.
refresh_interval is also tightened from the upstream 3600s default to
600s — short enough that proactive refresh beats the typical 15–30
minute federation-token TTL even if a 401 doesn't fire first.
No public-API change; the wiring is internal. Bumps version to
0.2.0b12 + changelog entry.
Test coverage:
* TestRefreshCallableFor — 3 cases (refresh attr exists, missing,
non-callable defensive path)
* TestClientWiresRefreshSigner — 2 cases (token signer gets the
callback, static signer gets None)
All 49 OCI-area unit tests pass; ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Author
|
Superseded by the v2 branch with DCO sign-off + ruff format applied + clean commit metadata. New PR: #206 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
OCIOpenAIModelconstructed itsOCIRequestSignerwithout arefresh_signercallback.OCIRequestSigner.auth_flowhas both a refresh-on-401 branch and a periodic-refresh branch — both early-return whenrefresh_signer is None, which meant the federation token captured at process start was used forever. On OKE, instance-principal tokens expire on the order of 15–30 minutes, so any agent pod older than that would 401 on every GenAI call until restarted.Production symptom observed in almariel: chats silently fall through to
reason=errorafter ~15 minutes of pod uptime; httpx logs showHTTP/1.1 401 Unauthorizedon everychat.completionscall. Pod restart was the only known workaround.Fix
Wires the signer's own
refresh_security_tokenmethod into the wrapper via a new_refresh_callable_for(signer)helper:InstancePrincipalsSecurityTokenSigner— hasrefresh_security_tokenget_resource_principals_signer()— returns a signer with the same contractDelegationTokenSigner(and other OCI-SDK-convention variants) — same contract, picked up viahasattrrather than isinstance checksSigner— norefresh_security_token, so the helper returnsNoneand the refresh path stays dormantrefresh_intervalis also tightened from the upstream3600.0default to600.0— short enough that proactive refresh beats the typical 15–30 minute federation-token TTL even if a 401 doesn't fire first.No public-API change; the wiring is internal.
Test plan
TestRefreshCallableFor— 3 cases (refresh attr present / missing / non-callable defensive path)TestClientWiresRefreshSigner— 2 cases (token signer gets the callback, static signer gets None)🤖 Generated with Claude Code