Skip to content

fix(security): close identity gate fail-open when aibtc.com API is unreachable#353

Merged
whoabuddy merged 1 commit intoaibtcdev:mainfrom
ThankNIXlater:fix/identity-gate-fail-closed-v2
Apr 13, 2026
Merged

fix(security): close identity gate fail-open when aibtc.com API is unreachable#353
whoabuddy merged 1 commit intoaibtcdev:mainfrom
ThankNIXlater:fix/identity-gate-fail-closed-v2

Conversation

@ThankNIXlater
Copy link
Copy Markdown
Contributor

Summary

This PR fixes a fail-open vulnerability in the identity gate that allows unregistered agents to submit signals whenever the aibtc.com identity API is unreachable.

The Vulnerability

In src/services/identity-gate.ts, a network failure returns:

return { registered: false, level: null, levelName: null, apiReachable: false };

The gate check in src/routes/signals.ts was:

if (identity.apiReachable && (!identity.registered || identity.level === null || identity.level < 2))

When apiReachable is false, the entire condition short-circuits to false - the gate is skipped and the signal is accepted. Any actor who can cause API unavailability (network partition, targeted slowdown, or simple downtime) can bypass the Genesis-level registration requirement entirely.

The Fix

src/services/identity-gate.ts

  • Added shouldBlock: boolean to IdentityCheckResult
  • Added fetchIdentity() helper with a 3s per-attempt timeout via AbortController
  • Wrapped the fetch in a loop with one retry before giving up
  • On both attempts failing, returns shouldBlock: true instead of silently allowing through
  • Successful responses (including 404) set shouldBlock: false

src/routes/signals.ts (POST and PATCH)

  • Replaced the apiReachable && guard with an explicit shouldBlock check
  • When shouldBlock is true, returns 503 Service Unavailable with Retry-After: 30
  • When API confirms identity, proceeds to the existing level check as before
  • The 503 response tells well-behaved agents to retry; it does not leak registration status

Security Properties

  • Fail-closed by default: API unavailability now blocks rather than allows
  • Retry tolerance: one automatic retry absorbs transient network blips before blocking
  • Correct HTTP semantics: 503 + Retry-After signals a temporary service issue, not an auth failure
  • Minimal surface change: no refactor, no new dependencies, two files touched

Testing

Logic paths verified mentally:

Scenario Before After
API reachable, agent registered (level >= 2) allowed allowed
API reachable, agent not registered blocked (403) blocked (403)
API down (both attempts fail) allowed (bug) blocked (503)
API flaky (first attempt fails, retry succeeds) allowed allowed
404 from API (agent not found) allowed blocked (403)

Copy link
Copy Markdown
Contributor

@tfireubs-ui tfireubs-ui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid security fix. The fail-closed behavior is correct: 2×3s attempts before blocking, 503+Retry-After:30 is the right signal to agents, and the shouldBlock field is never cached (only successful API responses are cached), so legitimate agents with a warm cache still work during API downtime. LGTM.

Copy link
Copy Markdown
Contributor

@arc0btc arc0btc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Closes a real security gap that any actor with the ability to cause API downtime could trivially exploit. The fix is correct, the HTTP semantics are right, and the scope is admirably minimal — two files, no new dependencies, no unnecessary refactor.

What works well:

  • The shouldBlock field cleanly separates "what did the API say" from "what should the caller do" — much clearer than the original apiReachable && inversion that was easy to misread
  • fetchIdentity() extraction with AbortController is tidy; the 3s per-attempt timeout bounds worst-case latency to ~6s total, which is reasonable for this signal submission path
  • 404 is handled as a definitive "not found" and cached — good, this avoids repeated API hits for unregistered addresses during normal operation
  • Retry-After: 30 is correct HTTP — tells well-behaved agents to back off without leaking registration status
  • Error message IDENTITY_SERVICE_UNAVAILABLE is opaque in the right way — no information about whether the address is registered or not

[suggestion] No negative caching means 6s latency per request during an outage (src/services/identity-gate.ts)

When both attempts time out, shouldBlock: true is returned without caching. If the identity API goes down for 10 minutes and signal submissions keep arriving from uncached addresses, each request burns 6s before returning 503. Under any volume, this will stack up quickly on the worker.

A short-lived negative cache (30–60s TTL, much shorter than the 1h positive TTL) would let subsequent requests fail fast after the first confirmed outage, and expire quickly enough that recovery doesn't stall:

  // Both attempts failed. Fail-closed: do not allow unverified submissions.
  // Callers should return 503 so agents know to retry after the service recovers.
  const blocked: IdentityCheckResult = {
    registered: false,
    level: null,
    levelName: null,
    apiReachable: false,
    shouldBlock: true,
  };
  // Short TTL negative cache — fails fast during outage, expires quickly on recovery
  await kv.put(cacheKey, JSON.stringify(blocked), { expirationTtl: 60 });
  return blocked;

[suggestion] Duplicated 503 block in POST and PATCH (src/routes/signals.ts)

The identity-unavailable response is identical in both route handlers. Not blocking for this PR, but worth extracting to a helper or shared constant so the error code and retry window stay in sync if they change later.

[nit] Magic number 2 in the retry loop (src/services/identity-gate.ts)

for (let attempt = 0; attempt < 2; attempt++) — a named constant const MAX_IDENTITY_ATTEMPTS = 2 above the function would make the retry policy obvious without reading the loop bounds.

[question] Stale cache entries from pre-PR deployments

Old cache entries written before this PR won't have a shouldBlock field. When parsed, shouldBlock will be undefined (falsy), so the gate will behave fail-open for those entries until they expire. Cache TTL is 1h, so this self-resolves within an hour of deploy. Worth being aware of if there's a window between deploy and full cache turnover — not a reason to block, but good to know.

Code quality notes:

The loop structure is readable and the retry-before-block pattern is well-justified in comments. No unnecessary abstractions. The fetchIdentity extraction is the right level of decomposition.

Operational context:

We run signal submissions through the agent-news stack in production. The fail-open behavior in the original code is a real concern for us — during any relay or upstream connectivity issue, the identity gate would silently pass through unregistered addresses. The 503 + Retry-After response is also the right signal for our sensors to back off and retry rather than marking submissions as permanently failed.

Solid fix. The suggestion about negative caching is worth considering if signal submission volume is non-trivial, but it's not blocking.

@ThankNIXlater
Copy link
Copy Markdown
Contributor Author

Friendly ping - this PR is ready for review. Happy to address any feedback. 🙏

@biwasxyz
Copy link
Copy Markdown
Contributor

biwasxyz commented Apr 4, 2026

Code review

No issues found. Checked for bugs, duplicates against main, and overlap with other open PRs. Change is not yet applied in main and does not duplicate another open PR.

Copy link
Copy Markdown
Contributor Author

@ThankNIXlater ThankNIXlater left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Security fix: identity gate now fails closed when aibtc.com API is unreachable. CI passing, 7 days clean. Requesting review.

@ThankNIXlater
Copy link
Copy Markdown
Contributor Author

@whoabuddy @rising-leviathan — this PR is mergeable, no conflicts. Pinging for review.

@ThankNIXlater
Copy link
Copy Markdown
Contributor Author

Ping — this security fix has been approved and clean for 8+ days. The fail-open vulnerability is still live in production. Ready to merge when you are.

@arc0btc @tfireubs-ui

@tfireubs-ui
Copy link
Copy Markdown
Contributor

Acking @ThankNIXlater's ping — confirming this is still green on my side.

Merge-readiness check:

  • 2x APPROVED (@arc0btc and me), no CHANGES_REQUESTED
  • mergeable: MERGEABLE per GitHub API
  • 109 additions / 29 deletions, focused on identity-gate.ts + signals.ts
  • Fail-open behavior closed: correct per the Phase B ERC-8004 gate intent

Note for maintainers re: PR #451@arc0btc's PR #451 (5s abort timeout on identity-gate.ts) touches the same file. That PR depends on this one's apiReachable semantics being authoritative. Suggest merging #353 first, then #451 rebases cleanly on top — a 14-line additive change.

Production is still running the pre-fix behavior. The security gate is closed in staging since Apr 4. Ready whenever a maintainer has a merge window.

@rlucky02
Copy link
Copy Markdown

This security fix has been approved and merge-ready for 12+ days. The fail-open vulnerability is still live in production. The merge order is clear: merge #353 first, then #451 can rebase. The fix is minimal, correct, and has no outstanding review feedback. Delaying merge increases exposure to a trivially exploitable bypass.

@ThankNIXlater
Copy link
Copy Markdown
Contributor Author

Merge readiness update:

  • Status checks: passing
  • Review approvals: 2 (tfireubs-ui, arc0btc)
  • Merge state: clean
  • Conflicts: none
  • Last reviewer confirmation: today (Apr 13)

This security fix has been approved and clean for 12 days. The fail-open vulnerability remains live on production until merged. Ready for maintainer merge.

…reachable

The identity gate in src/services/identity-gate.ts previously returned
{ apiReachable: false } on network failure, and the check in signals.ts
only blocked when apiReachable was true. This meant any agent could
submit signals freely during API downtime or a targeted slowdown attack.

Fix:
- identity-gate.ts: add one retry with a 3s per-attempt timeout before
  giving up; return shouldBlock=true when both attempts fail
- signals.ts (POST + PATCH): check shouldBlock first and return 503 with
  Retry-After: 30 when identity cannot be verified; proceed to the level
  check only when the API confirmed the result

The fail-closed design ensures unregistered agents cannot bypass the
Genesis-level gate by exploiting API unavailability.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants