
[api] AIN-300 W1: write-path linkage + 429 retry + 0031 CHECK + 0032 init-plan#87

Merged
hizrianraz merged 2 commits into main from hizrianraz/ain-300-w1-writepath
May 28, 2026

Conversation

@hizrianraz

@hizrianraz hizrianraz commented May 28, 2026

Summary (W1/9 — ship-now)

Kills the AIN-300 orphan write-path bug + adds the recurrence-guard CHECK constraint + 429 backoff/failover + closes the 16 perf WARNs the 0029 RLS rollout introduced.

What lands

Code

| File | Change |
| --- | --- |
| ainfera_api/services/routing.py | _chat_with_429_retry helper (3-try, 0.5/2/8s, 429-only); dispatch_inference accepts optional inference_id |
| ainfera_api/services/routing_brain.py | Pre-allocate candidate_inference_id per failover; track last_inference_id; link in 4xx/5xx-exhausted; 429 → failover like 5xx; decision_rule_override='failed_pre_dispatch' for Cap/Funds/Inactive |
| ainfera_api/services/routing_outcomes.py | complete_decision accepts decision_rule_override |
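To make the 429-only retry behaviour concrete, here is a minimal sketch of such a helper. It is not the code from routing.py: the `ProviderError` shape, the `call` and `sleep` parameters, and the function name without the leading underscore are assumptions for illustration. The sketch sleeps only between attempts, so three tries use two delays (0.5s and 2s); how the 8s value listed above is applied is not stated in this PR and is left out.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class ProviderError(Exception):
    """Hypothetical provider error carrying an HTTP status code."""

    def __init__(self, status_code: int, message: str = "") -> None:
        super().__init__(message or f"provider returned {status_code}")
        self.status_code = status_code


def chat_with_429_retry(
    call: Callable[[], T],
    delays: Sequence[float] = (0.5, 2.0),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying only on HTTP 429; len(delays) + 1 attempts in total."""
    for delay in delays:
        try:
            return call()
        except ProviderError as exc:
            if exc.status_code != 429:
                raise  # non-429 errors are never retried here
            sleep(delay)
    # Final attempt: any error, including another 429, propagates to the caller.
    return call()
```

Injecting `sleep` keeps unit tests fast, since a test can pass a list's `append` method and assert on the recorded delays instead of actually waiting.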

Migrations

| File | Change |
| --- | --- |
| alembic/versions/20260528_0031_outcome_requires_inference_check.py | CHECK (decision_rule <> 'cheapest_clearing_floor' OR inference_id IS NOT NULL) NOT VALID + VALIDATE |
| alembic/versions/20260528_0032_rls_initplan_optimization.py | Recreate every 0029 policy with (SELECT auth.jwt() ...) wrapping. ENABLE RLS on _repair_20260528_save_error (clears the prod ERROR; table not dropped) |
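The NOT VALID + VALIDATE pattern for 0031 can be sketched as two statements: the first adds the constraint without scanning existing rows, the second validates them afterwards. This is an illustrative sketch, not the migration file itself. It uses the predicate as listed in the table above, and the `run` helper and its `execute` parameter are assumptions; inside a real Alembic migration the callable would typically be `op.execute`.

```python
from typing import Callable, List

TABLE = "routing_outcomes"
CONSTRAINT = "outcome_requires_inference_when_model_chosen"
PREDICATE = "decision_rule <> 'cheapest_clearing_floor' OR inference_id IS NOT NULL"


def upgrade_statements() -> List[str]:
    """Add the CHECK without scanning old rows, then validate it."""
    return [
        f"ALTER TABLE {TABLE} ADD CONSTRAINT {CONSTRAINT} "
        f"CHECK ({PREDICATE}) NOT VALID",
        f"ALTER TABLE {TABLE} VALIDATE CONSTRAINT {CONSTRAINT}",
    ]


def downgrade_statements() -> List[str]:
    """Drop the constraint again."""
    return [f"ALTER TABLE {TABLE} DROP CONSTRAINT IF EXISTS {CONSTRAINT}"]


def run(execute: Callable[[str], None], statements: List[str]) -> None:
    """Apply statements in order; `execute` is a hypothetical SQL runner."""
    for statement in statements:
        execute(statement)
```

Splitting the two steps means new writes are checked as soon as the constraint is added, while validation of historical rows can fail loudly on its own if any orphan was missed.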

Tests

6 unit tests in tests/unit/test_routing_429_retry.py (all pass in 0.36s).

Disc #12 invariants

  • No change to routing scoring (q_prior/q_empirical/floor).
  • No change to candidate-set computation.
  • 429 retry/failover is OPERATIONAL resilience — same model retries (in adapter) → same candidate set failover (in brain). The set is already computed before the dispatch loop and is never recomputed.
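The failover half of that invariant can be sketched as a loop over a candidate list that is fixed before dispatch begins. This is a simplified illustration, not the routing_brain.py code: the function name, the `dispatch` callable, and the exception constructors are assumptions. The PR also links the inference id on the terminal 4xx branch, which this sketch leaves out.

```python
import uuid
from typing import Callable, Optional, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical provider error carrying an HTTP status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"provider returned {status_code}")
        self.status_code = status_code


class AllCandidatesFailedError(Exception):
    """Raised when every candidate failed with a 5xx or an exhausted 429."""

    def __init__(self, last_inference_id: Optional[str]) -> None:
        super().__init__("all candidates failed")
        self.last_inference_id = last_inference_id


def dispatch_with_failover(
    candidates: Sequence[str],
    dispatch: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try each candidate in order; return (inference_id, response)."""
    last_inference_id: Optional[str] = None
    for model in candidates:  # the set is never recomputed inside the loop
        candidate_inference_id = str(uuid.uuid4())  # pre-allocated per attempt
        try:
            return candidate_inference_id, dispatch(model, candidate_inference_id)
        except ProviderError as exc:
            last_inference_id = candidate_inference_id
            if exc.status_code == 429 or exc.status_code >= 500:
                continue  # an exhausted 429 fails over like a 5xx
            raise  # any other 4xx is terminal
    raise AllCandidatesFailedError(last_inference_id)
```

Because the id is allocated before the provider call, every failure branch has an id available to link, which is the property the orphan bug was missing.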

Validation

  • pre-commit (ruff + ruff format + mypy --strict + pytest -x): passed
  • offline upgrade 0030→0032: 10,868 bytes
  • offline downgrade 0032→0030: 9,833 bytes
  • 6 unit tests on 429 retry: passed

Deploy plan (after merge)

doppler run -p ainfera-os -c prd -- alembic upgrade head    # applies 0031 + 0032
railway up                                                   # deploys gateway code
# smoke: POST routed inference → confirm outcome.inference_id set + inference not stuck in 'routed'

Refs

AIN-300 · AIN-295 · AIN-298 · Disc #12

🤖 Generated with Claude Code


Note

High Risk
Touches core inference dispatch, ledger-adjacent routing outcomes, and production migrations that must deploy after the write-path fix; mis-ordering or constraint validation failure can block deploys or break routed inference completion.

Overview
AIN-300 W1 fixes §16 routing_outcomes ↔ inferences linkage and hardens provider dispatch resilience without changing routing scores or candidate-set logic (Disc #12).

Dispatch & failover: dispatch_inference can take a caller-supplied inference_id, and provider calls go through _chat_with_429_retry (up to 3 attempts on 429 only, with 0.5s / 2s backoff). dispatch_with_brain pre-allocates an inference_id per fallback candidate, links it on success, terminal 4xx (non-429), and when all candidates fail after 5xx or exhausted 429; exhausted 429 fails over like 5xx. Pre-dispatch terminal errors (caps, funds, inactive agent) set decision_rule_override='failed_pre_dispatch' via complete_decision.

Database: Migration 0031 adds outcome_requires_inference_when_model_chosen (allows in-flight rows via outcome_status IS NULL, then requires inference_id when decision_rule = 'cheapest_clearing_floor'). 0032 recreates Supabase RLS policies with init-plan (SELECT auth.jwt() …) and enables RLS on _repair_20260528_save_error.

Tests: Six unit tests cover the 429 retry helper.

Reviewed by Cursor Bugbot for commit d7b41b9. Bugbot is set up for automated code reviews on this repo. Configure here.

… constraint + RLS init-plan

W1/9 SHIP-NOW. Kills AIN-300 orphan bug + 429 backoff/failover + new
CHECK constraint guards future regressions + clears 16 perf WARNs from
0029.

routing.py:
- _chat_with_429_retry helper (3 attempts, 0.5/2/8s, 429-only)
- dispatch_inference accepts optional inference_id kwarg

routing_brain.py:
- Pre-allocate candidate_inference_id per failover attempt
- Track last_inference_id; link in 4xx/5xx-exhausted terminal branches
- 429 (after in-adapter retry exhaust) → failover like 5xx
- Cap/Funds/Inactive use decision_rule_override='failed_pre_dispatch'

routing_outcomes.py:
- complete_decision gains decision_rule_override kwarg

alembic 0031: outcome_requires_inference CHECK constraint
alembic 0032: init-plan optimization + ENABLE RLS on _repair_ table

tests/unit/test_routing_429_retry.py: 6 tests, all pass

Validation:
- pre-commit (ruff + ruff format + mypy --strict + pytest -x): passed
- offline upgrade 0030→0032: 10,868 bytes
- offline downgrade 0032→0030: 9,833 bytes

Refs: AIN-300 · AIN-295 · AIN-298 · Disc #12 preserved on scoring/candidate-set

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@linear-code

linear-code Bot commented May 28, 2026

AIN-300 🔴 [DB/Gateway] routing_outcomes orphaned (NULL inference_id) on provider error — write-path linkage + Mistral 429 failover

🔴 Live data-integrity bug — routing_outcomes written without inference_id linkage

Found 2026-05-28 during founder-directed DB save-error investigation on prod (dftfpwzqxoebwzepygzl).

Symptom (DATA repaired — root cause still open)

routing_outcomes rows were persisted with inference_id = NULL even though a model was chosen (decision_rule='cheapest_clearing_floor', chosen_model_slug present). Because the judge worker needs the linked inference to reconstruct/replay context, these rows can never be judged → they silently starve the labeled training corpus the moat depends on.

| Metric | At detection |
| --- | --- |
| Orphan outcomes (model chosen, NULL inference_id) | 27 |
| Of those, in last 24h | 27 (100% — actively recurring) |
| Inferences stuck in routed >1h (no completion) | 2 |
| Correlation | All 27 cluster in the same window as mistral error 429 failures (latest 07:21 WIB today) |

Root cause (hypothesis — confirm in code)

The gateway write path appears to be: (1) create/persist the routing_outcome row → (2) execute the provider inference → (3) backfill inference_id onto the outcome. When step 2 throws (e.g. Mistral 429), step 3 never runs → the outcome is left orphaned with NULL inference_id, and the inference row is left stuck in routed (never transitions to succeeded/failed).

Two defects, one cause:

  1. Non-atomic linkage — outcome and inference are written in separate steps; a provider error between them orphans the outcome.
  2. No failure-path completion — when the provider call errors, the inference is not transitioned to failed and the outcome inference_id is never set.

Required fix (api repo — ainfera-ai/api, inference/routing write path)

  • Make outcome↔inference linkage atomic: either write the inference row first and pass its id into the outcome insert in the same transaction, OR backfill inference_id in a finally/error-handling block that runs even when the provider call raises.
  • On provider error (429/4xx/5xx/timeout): transition the inference to status='failed' with error_message set + completed_at stamped, in the same path — never leave it stuck in routed.
  • Mistral 429 specifically: add exponential backoff + retry, and on exhausted retries, failover to the next floor-clearing candidate (the router already computes the candidate set — fall through to candidate #2 rather than hard-failing).
  • Add a regression test: simulate provider 429 mid-call → assert (a) inference ends failed, (b) any outcome row written has non-NULL inference_id, (c) no row left in routed.
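The first two requirements and the regression assertions can be sketched together against in-memory dictionaries. This is a minimal illustration under stated assumptions, not the gateway code: `routed_call`, the dict-based "tables", and the column subset are hypothetical stand-ins for the real write path and schema.

```python
import uuid
from datetime import datetime, timezone
from typing import Callable, Dict


def routed_call(
    inferences: Dict[str, dict],
    outcomes: Dict[str, dict],
    provider_call: Callable[[], str],
) -> str:
    """Write the inference first, link the outcome at insert time, and always
    move the inference out of 'routed', even when the provider raises."""
    inference_id = str(uuid.uuid4())
    inferences[inference_id] = {
        "status": "routed",
        "error_message": None,
        "completed_at": None,
    }
    # The outcome is created already linked, so no later backfill can be skipped.
    outcomes[str(uuid.uuid4())] = {
        "decision_rule": "cheapest_clearing_floor",
        "inference_id": inference_id,
    }
    row = inferences[inference_id]
    try:
        response = provider_call()
    except Exception as exc:
        row["status"] = "failed"
        row["error_message"] = str(exc)
        raise
    else:
        row["status"] = "succeeded"
        return response
    finally:
        # Runs on both paths, so completed_at is always stamped.
        row["completed_at"] = datetime.now(timezone.utc)
```

A regression test in this style would raise a simulated 429 from `provider_call` and then assert that the inference ended `failed`, that every outcome has a non-NULL `inference_id`, and that no row is left in `routed`.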

Recurrence guard (DB — defer until app fix lands)

Once the app guarantees linkage, add a partial integrity constraint so this fails loud instead of silent:

-- Apply AFTER the app fix is deployed (else it will reject live writes)
ALTER TABLE routing_outcomes
  ADD CONSTRAINT outcome_requires_inference_when_model_chosen
  CHECK (decision_rule <> 'cheapest_clearing_floor' OR inference_id IS NOT NULL) NOT VALID;
-- then VALIDATE CONSTRAINT after backfill confirmed clean

And the judge worker (AIN-298 daily cadence) query MUST filter WHERE inference_id IS NOT NULL so the 2 legitimate no_candidate_clears_floor rows (nothing executed) never enter the judge queue.

DATA already repaired this session (reversible)

  • ✅ 27 orphan outcomes re-linked to their correct inference rows via deterministic match (agent_id + model_id + 3s window + cost match → verified 1:1 bijection, 27→27 distinct). All 27 are now judgeable (judge_status='unlabeled', ready for the cadence).
  • ✅ 2 stuck routed inferences (11d + 23h old, no response/error) marked failed with explanatory error_message + completed_at.
  • ✅ Before-state backed up to table _repair_20260528_save_error (full row JSON) for rollback.
  • ℹ️ 2 remaining NULL-inference_id outcomes are legitimate no_candidate_clears_floor (nothing executed) — left as-is; judge worker must skip them.
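The deterministic match used for the re-link can be sketched as follows. This is an illustration of the matching rule described above, not the repair script that was run: the field names (`id`, `agent_id`, `model_id`, `cost`, `created_at`) and the function name are assumptions. Each orphan must match exactly one inference, and no inference may be claimed twice, which is what makes the result a bijection.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def match_orphans(
    orphans: List[dict],
    inferences: List[dict],
    window: timedelta = timedelta(seconds=3),
) -> Dict[str, str]:
    """Map each orphan outcome id to exactly one inference id, or raise."""
    links: Dict[str, str] = {}
    for outcome in orphans:
        hits = [
            inf["id"]
            for inf in inferences
            if inf["agent_id"] == outcome["agent_id"]
            and inf["model_id"] == outcome["model_id"]
            and inf["cost"] == outcome["cost"]
            and abs(inf["created_at"] - outcome["created_at"]) <= window
        ]
        if len(hits) != 1:
            raise ValueError(
                f"outcome {outcome['id']}: expected 1 match, got {len(hits)}"
            )
        links[outcome["id"]] = hits[0]
    # Bijection check: no inference may be linked to two outcomes.
    if len(set(links.values())) != len(links):
        raise ValueError("two outcomes matched the same inference")
    return links
```

Raising on zero or multiple matches, instead of picking one, keeps the repair from silently linking an outcome to the wrong inference.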

Acceptance

  • New routed calls with a chosen model ALWAYS write inference_id (verify: 24h with 0 new orphans)
  • 0 inferences stuck in routed >1h (monitor)
  • Mistral 429 → automatic failover to next candidate, inference does not hard-fail
  • Regression test green in CI
  • Partial CHECK constraint applied + VALIDATED after app fix

Priority

Urgent — actively corrupting the training corpus at ~27 rows/24h, and the corpus IS the moat (Spearpoint two-leg). Every orphan is a labeled-data row permanently lost.

Linked

  • AIN-295 (DB remediation — related but separate; that's RLS/view/index, this is write-path code)
  • AIN-298 (daily training cadence — judge worker must filter inference_id IS NOT NULL)
  • AIN-234 (fleet cost governance — Mistral 429 backoff overlaps)
  • AIN-290 (judge schema — the corpus this protects)


…edicate

PG CHECK constraints don't support DEFERRABLE/DEFERRED (only FK/UNIQUE
/PK/EXCLUDE do). The two-phase write (insert_decision creates the row
with decision_rule='cheapest_clearing_floor' + inference_id=NULL,
complete_decision links inference_id after dispatch) has a transient
moment that the per-statement check would reject.

Predicate now allows outcome_status IS NULL as the third escape clause:

  CHECK (
    outcome_status IS NULL
    OR decision_rule <> 'cheapest_clearing_floor'
    OR inference_id IS NOT NULL
  )

Once complete_decision sets outcome_status (always non-NULL on every
terminal branch — succeeded/failed_other/failed_provider_error/rejected*),
the constraint REQUIRES either decision_rule rewritten via
decision_rule_override OR inference_id linked. Which IS the AIN-300 W1
invariant.
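To see which rows the three-clause predicate admits, it can be mirrored in a few lines of Python. This is a sketch for reasoning about the constraint, not code from the repo; the function name is hypothetical. `None` stands in for SQL NULL. A CHECK constraint also passes when its expression evaluates to NULL, and for this predicate the Python version gives the same accept/reject result in those cases.

```python
from typing import Optional


def check_passes(
    outcome_status: Optional[str],
    decision_rule: Optional[str],
    inference_id: Optional[str],
) -> bool:
    """Mirror of the CHECK predicate: True means the row is accepted."""
    return (
        outcome_status is None  # in-flight row, between insert and completion
        or decision_rule != "cheapest_clearing_floor"  # rule rewritten or never chosen
        or inference_id is not None  # linked to its inference
    )


# In-flight row written by the first phase: accepted.
assert check_passes(None, "cheapest_clearing_floor", None)
# Completed and linked: accepted.
assert check_passes("succeeded", "cheapest_clearing_floor", "inf-1")
# Completed with the rule overridden: accepted.
assert check_passes("failed_other", "failed_pre_dispatch", None)
# Completed, model chosen, no link: rejected. This is the orphan case.
assert not check_passes("failed_provider_error", "cheapest_clearing_floor", None)
```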

Integration tests now pass (the failing tests were inserting via the
two-phase pattern and hitting the per-statement check).

Refs: AIN-300

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hizrianraz hizrianraz merged commit d062c7f into main May 28, 2026
4 checks passed

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issue.

Reviewed by Cursor Bugbot for commit d7b41b9. Configure here.

db,
outcome_id=outcome_id,
outcome_status="failed_provider_error",
inference_id=last_inference_id,

All-ModelUnavailable exhaustion violates new CHECK constraint

High Severity

When every candidate in the fallback loop fails with ModelUnavailableError, last_inference_id remains None (it's only set in the ProviderError handler). The all-exhausted path (section 7) then calls complete_decision with inference_id=None and no decision_rule_override. Since complete_decision skips setting inference_id when it's None, the outcome row ends up with outcome_status='failed_provider_error', decision_rule='cheapest_clearing_floor', and inference_id=NULL — violating the new outcome_requires_inference_when_model_chosen CHECK constraint from migration 0031. This causes a database IntegrityError instead of a clean AllCandidatesFailedError.
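One possible guard for the all-exhausted path is to choose the completion arguments based on whether any inference id was ever recorded. This is a hedged sketch only: it is not the fix that was applied, the helper name is hypothetical, and reusing the 'failed_pre_dispatch' override for the all-ModelUnavailableError case is an assumption made here for illustration.

```python
from typing import Any, Dict, Optional


def exhausted_completion_kwargs(last_inference_id: Optional[str]) -> Dict[str, Any]:
    """Build keyword arguments for complete_decision on the all-exhausted path."""
    kwargs: Dict[str, Any] = {"outcome_status": "failed_provider_error"}
    if last_inference_id is not None:
        # At least one provider call was attempted: link its inference.
        kwargs["inference_id"] = last_inference_id
    else:
        # ASSUMPTION: nothing was dispatched, so rewrite decision_rule to keep
        # the row consistent with the CHECK constraint from migration 0031.
        kwargs["decision_rule_override"] = "failed_pre_dispatch"
    return kwargs
```

Either branch leaves the completed row satisfying the constraint, so the caller can still raise AllCandidatesFailedError cleanly instead of hitting an IntegrityError.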
