[api] AIN-300 W1: write-path linkage + 429 retry + 0031 CHECK + 0032 init-plan by hizrianraz · Pull Request #87 · ainfera-ai/api

hizrianraz · 2026-05-28T04:23:22Z

Summary (W1/9 — ship-now)

Kills the AIN-300 orphan write-path bug + adds the recurrence-guard CHECK constraint + 429 backoff/failover + closes the 16 perf WARNs the 0029 RLS rollout introduced.

What lands

Code

File	Change
`ainfera_api/services/routing.py`	`_chat_with_429_retry` helper (3-try, 0.5/2/8s, 429-only); `dispatch_inference` accepts optional `inference_id`
`ainfera_api/services/routing_brain.py`	Pre-allocate `candidate_inference_id` per failover attempt; track `last_inference_id`; link in 4xx/5xx-exhausted; 429 → failover like 5xx; `decision_rule_override='failed_pre_dispatch'` for Cap/Funds/Inactive
`ainfera_api/services/routing_outcomes.py`	`complete_decision` accepts `decision_rule_override`
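The retry behaviour described for `_chat_with_429_retry` can be sketched roughly as follows. This is a minimal illustration, not the code in `routing.py`: the `ProviderError` class, its `status_code` attribute, the injectable `sleep` parameter and the function signature are all assumptions made for the example. It also assumes the helper sleeps only between attempts, so with three attempts only the first two delays are used.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class ProviderError(Exception):
    """Hypothetical provider error carrying an HTTP status code."""

    def __init__(self, status_code: int, message: str = "") -> None:
        super().__init__(message or f"provider error {status_code}")
        self.status_code = status_code


def chat_with_429_retry(
    call: Callable[[], T],
    *,
    max_attempts: int = 3,
    delays: Sequence[float] = (0.5, 2.0, 8.0),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on HTTP 429 only; any other error propagates at once."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except ProviderError as exc:
            is_last = attempt == max_attempts - 1
            if exc.status_code != 429 or is_last:
                # Non-429, or retries exhausted: the caller decides on failover.
                raise
            sleep(delays[min(attempt, len(delays) - 1)])
    raise AssertionError("unreachable: the loop always returns or raises")
```

Passing `sleep` as a parameter keeps a unit test fast, since the test can record the requested delays instead of actually waiting.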

Migrations

File	Change
`alembic/versions/20260528_0031_outcome_requires_inference_check.py`	CHECK (`decision_rule <> 'cheapest_clearing_floor' OR inference_id IS NOT NULL`) NOT VALID + VALIDATE
`alembic/versions/20260528_0032_rls_initplan_optimization.py`	Recreate every 0029 policy with `(SELECT auth.jwt() ...)` wrapping. ENABLE RLS on `_repair_20260528_save_error` (clears the prod ERROR; table not dropped)

Tests

6 unit tests in tests/unit/test_routing_429_retry.py (all pass in 0.36s).

Disc #12 invariants

No change to routing scoring (q_prior/q_empirical/floor).
No change to candidate-set computation.
429 retry/failover is OPERATIONAL resilience — same model retries (in adapter) → same candidate set failover (in brain). The set is already computed before the dispatch loop and is never recomputed.
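To make the invariant concrete, here is a rough sketch of a failover loop that walks a candidate list computed once up front. It is not the `routing_brain.py` implementation: `dispatch_with_failover`, `DispatchResult`, `ProviderError` and the choice of `failed_other` for a terminal 4xx are assumptions for illustration. The points it tries to show are that an inference id is allocated before each provider call, and that the candidate list is only iterated, never rebuilt.

```python
import uuid
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """Hypothetical provider error carrying an HTTP status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"provider error {status_code}")
        self.status_code = status_code


@dataclass
class DispatchResult:
    outcome_status: str
    inference_id: Optional[str]
    model: Optional[str]


def dispatch_with_failover(
    candidates: Sequence[str],
    dispatch: Callable[[str, str], None],
) -> DispatchResult:
    """Walk a fixed candidate list; the list itself is never recomputed."""
    last_inference_id: Optional[str] = None
    for model in candidates:
        inference_id = str(uuid.uuid4())  # pre-allocated before the call
        try:
            dispatch(model, inference_id)
        except ProviderError as exc:
            last_inference_id = inference_id
            if exc.status_code == 429 or exc.status_code >= 500:
                continue  # treated alike: fail over to the next candidate
            # Terminal 4xx: stop, but still link the attempted inference.
            return DispatchResult("failed_other", inference_id, model)
        return DispatchResult("succeeded", inference_id, model)
    # Every candidate failed with 429/5xx: link the last attempt.
    return DispatchResult("failed_provider_error", last_inference_id, None)
```

In this sketch an exhausted list still returns a non-`None` `inference_id` whenever at least one provider call was attempted, which is the linkage the PR is after.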

Validation

pre-commit (ruff + ruff format + mypy --strict + pytest -x): passed
offline upgrade 0030→0032: 10,868 bytes
offline downgrade 0032→0030: 9,833 bytes
6 unit tests on 429 retry: passed

Deploy plan (after merge)

doppler run -p ainfera-os -c prd -- alembic upgrade head    # applies 0031 + 0032
railway up                                                   # deploys gateway code
# smoke: POST routed inference → confirm outcome.inference_id set + inference not stuck in 'routed'

Refs

AIN-300 · AIN-295 · AIN-298 · Disc #12

🤖 Generated with Claude Code

Note

High Risk
Touches core inference dispatch, ledger-adjacent routing outcomes, and production migrations that must deploy after the write-path fix; mis-ordering or constraint validation failure can block deploys or break routed inference completion.

Overview
AIN-300 W1 fixes §16 routing_outcomes ↔ inferences linkage and hardens provider dispatch resilience without changing routing scores or candidate-set logic (Disc #12).

Dispatch & failover: dispatch_inference can take a caller-supplied inference_id, and provider calls go through _chat_with_429_retry (up to 3 attempts on 429 only, with 0.5s / 2s backoff). dispatch_with_brain pre-allocates an inference_id per fallback candidate, links it on success, terminal 4xx (non-429), and when all candidates fail after 5xx or exhausted 429; exhausted 429 fails over like 5xx. Pre-dispatch terminal errors (caps, funds, inactive agent) set decision_rule_override='failed_pre_dispatch' via complete_decision.

Database: Migration 0031 adds outcome_requires_inference_when_model_chosen (allows in-flight rows via outcome_status IS NULL, then requires inference_id when decision_rule = 'cheapest_clearing_floor'). 0032 recreates Supabase RLS policies with init-plan (SELECT auth.jwt() …) and enables RLS on _repair_20260528_save_error.

Tests: Six unit tests cover the 429 retry helper.

Reviewed by Cursor Bugbot for commit d7b41b9. Bugbot is set up for automated code reviews on this repo.

… constraint + RLS init-plan

W1/9 SHIP-NOW. Kills AIN-300 orphan bug + 429 backoff/failover + new CHECK constraint guards future regressions + clears 16 perf WARNs from 0029.

routing.py:
- _chat_with_429_retry helper (3 attempts, 0.5/2/8s, 429-only)
- dispatch_inference accepts optional inference_id kwarg

routing_brain.py:
- Pre-allocate candidate_inference_id per fallover attempt
- Track last_inference_id; link in 4xx/5xx-exhausted terminal branches
- 429 (after in-adapter retry exhaust) → failover like 5xx
- Cap/Funds/Inactive use decision_rule_override='failed_pre_dispatch'

routing_outcomes.py:
- complete_decision gains decision_rule_override kwarg

alembic 0031: outcome_requires_inference CHECK constraint
alembic 0032: init-plan optimization + ENABLE RLS on _repair_ table

tests/unit/test_routing_429_retry.py: 6 tests, all pass

Validation:
- pre-commit (ruff + ruff format + mypy --strict + pytest -x): passed
- offline upgrade 0030→0032: 10,868 bytes
- offline downgrade 0032→0030: 9,833 bytes

Refs: AIN-300 · AIN-295 · AIN-298 · Disc #12 preserved on scoring/candidate-set

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

linear-code · 2026-05-28T04:23:25Z

AIN-300 🔴 [DB/Gateway] routing_outcomes orphaned (NULL inference_id) on provider error — write-path linkage + Mistral 429 failover

🔴 Live data-integrity bug — routing_outcomes written without inference_id linkage

Found 2026-05-28 during founder-directed DB save-error investigation on prod (dftfpwzqxoebwzepygzl).

Symptom (DATA repaired — root cause still open)

routing_outcomes rows were persisted with inference_id = NULL even though a model was chosen (decision_rule='cheapest_clearing_floor', chosen_model_slug present). Because the judge worker needs the linked inference to reconstruct/replay context, these rows can never be judged → they silently starve the labeled training corpus the moat depends on.

Metric	At detection
Orphan outcomes (model chosen, NULL inference_id)	27
Of those, in last 24h	27 (100% — actively recurring)
Inferences stuck in `routed` >1h (no completion)	2
Correlation	All 27 cluster in the same window as `mistral error 429` failures (latest 07:21 WIB today)

Root cause (hypothesis — confirm in code)

The gateway write path appears to be: (1) create/persist the routing_outcome row → (2) execute the provider inference → (3) backfill inference_id onto the outcome. When step 2 throws (e.g. Mistral 429), step 3 never runs → the outcome is left orphaned with NULL inference_id, and the inference row is left stuck in routed (never transitions to succeeded/failed).

Two defects, one cause:

Non-atomic linkage — outcome and inference are written in separate steps; a provider error between them orphans the outcome.
No failure-path completion — when the provider call errors, the inference is not transitioned to failed and the outcome inference_id is never set.

Required fix (api repo — `ainfera-ai/api`, inference/routing write path)

Make outcome↔inference linkage atomic: either write the inference row first and pass its id into the outcome insert in the same transaction, OR backfill inference_id in a finally/error-handling block that runs even when the provider call raises.
On provider error (429/4xx/5xx/timeout): transition the inference to status='failed' with error_message set + completed_at stamped, in the same path — never leave it stuck in routed.
Mistral 429 specifically: add exponential backoff + retry, and on exhausted retries, failover to the next floor-clearing candidate (the router already computes the candidate set; fall through to candidate #2 rather than hard-failing).
Add a regression test: simulate provider 429 mid-call → assert (a) inference ends failed, (b) any outcome row written has non-NULL inference_id, (c) no row left in routed.
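The failure-path completion and the regression test asked for above could look roughly like the sketch below. It uses plain dictionaries in place of the real tables, and every name in it (`run_inference`, `Provider429`, the row fields) is hypothetical; the actual write path in `ainfera-ai/api` is not shown in this thread. The sketch links the outcome to the inference before the provider call and always moves the inference out of `routed`.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, Optional


class Provider429(Exception):
    """Stand-in for a provider rate-limit error."""


def run_inference(
    inferences: Dict[str, dict],
    outcomes: Dict[str, dict],
    inference_id: str,
    outcome_id: str,
    call_provider: Callable[[], str],
) -> Optional[str]:
    # Write the inference row and link it before calling the provider,
    # so no provider error can separate the two writes.
    inferences[inference_id] = {
        "status": "routed", "error_message": None, "completed_at": None,
    }
    outcomes[outcome_id] = {"inference_id": inference_id}
    try:
        response = call_provider()
    except Exception as exc:
        # Failure path: never leave the row stuck in 'routed'.
        inferences[inference_id].update(
            status="failed",
            error_message=str(exc),
            completed_at=datetime.now(timezone.utc),
        )
        return None
    inferences[inference_id].update(
        status="succeeded", completed_at=datetime.now(timezone.utc),
    )
    return response


def test_provider_429_mid_call() -> None:
    inferences: Dict[str, dict] = {}
    outcomes: Dict[str, dict] = {}

    def rate_limited() -> str:
        raise Provider429("mistral error 429")

    run_inference(inferences, outcomes, "inf-1", "out-1", rate_limited)
    assert inferences["inf-1"]["status"] == "failed"                      # (a)
    assert all(o["inference_id"] is not None for o in outcomes.values())  # (b)
    assert all(i["status"] != "routed" for i in inferences.values())      # (c)
```

The three assertions map one-to-one onto conditions (a), (b) and (c) of the requested regression test.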

Recurrence guard (DB — defer until app fix lands)

Once the app guarantees linkage, add a partial integrity constraint so this fails loud instead of silent:

-- Apply AFTER the app fix is deployed (else it will reject live writes)
ALTER TABLE routing_outcomes
  ADD CONSTRAINT outcome_requires_inference_when_model_chosen
  CHECK (decision_rule <> 'cheapest_clearing_floor' OR inference_id IS NOT NULL) NOT VALID;
-- then VALIDATE CONSTRAINT after backfill confirmed clean

And the judge worker (AIN-298 daily cadence) query MUST filter WHERE inference_id IS NOT NULL so the 2 legitimate no_candidate_clears_floor rows (nothing executed) never enter the judge queue.

DATA already repaired this session (reversible)

✅ 27 orphan outcomes re-linked to their correct inference rows via deterministic match (agent_id + model_id + 3s window + cost match → verified 1:1 bijection, 27→27 distinct). All 27 are now judgeable (judge_status='unlabeled', ready for the cadence).
✅ 2 stuck routed inferences (11d + 23h old, no response/error) marked failed with explanatory error_message + completed_at.
✅ Before-state backed up to table _repair_20260528_save_error (full row JSON) for rollback.
ℹ️ 2 remaining NULL-inference_id outcomes are legitimate no_candidate_clears_floor (nothing executed) — left as-is; judge worker must skip them.
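The deterministic re-link described above (agent_id + model_id + 3s window + cost match, verified as a 1:1 bijection) can be illustrated with a small sketch. The `Row` shape, the `created_at` field, the symmetric time window and exact cost equality are assumptions; the actual repair query run against prod is not shown here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass(frozen=True)
class Row:
    id: str
    agent_id: str
    model_id: str
    created_at: datetime
    cost: str  # kept as text so the equality check is exact


def match_orphans(
    outcomes: List[Row],
    inferences: List[Row],
    window: timedelta = timedelta(seconds=3),
) -> Dict[str, str]:
    """Return outcome_id -> inference_id, or raise if the match is not 1:1."""
    pairs: Dict[str, str] = {}
    for o in outcomes:
        hits = [
            i.id for i in inferences
            if i.agent_id == o.agent_id
            and i.model_id == o.model_id
            and i.cost == o.cost
            and abs(i.created_at - o.created_at) <= window
        ]
        if len(hits) != 1:
            raise ValueError(f"outcome {o.id}: {len(hits)} candidate inferences")
        pairs[o.id] = hits[0]
    # Bijection check: no inference may be claimed by two outcomes.
    if len(set(pairs.values())) != len(pairs):
        raise ValueError("two outcomes matched the same inference")
    return pairs
```

Raising on zero or multiple hits, and again on a shared inference, is what makes the result safe to apply: the "27→27 distinct" check corresponds to the final bijection test.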

Acceptance

New routed calls with a chosen model ALWAYS write inference_id (verify: 24h with 0 new orphans)
0 inferences stuck in routed >1h (monitor)
Mistral 429 → automatic failover to next candidate, inference does not hard-fail
Regression test green in CI
Partial CHECK constraint applied + VALIDATED after app fix

Priority

Urgent — actively corrupting the training corpus at ~27 rows/24h, and the corpus IS the moat (Spearpoint two-leg). Every orphan is a labeled-data row permanently lost.

Linked

AIN-295 (DB remediation — related but separate; that's RLS/view/index, this is write-path code)
AIN-298 (daily training cadence — judge worker must filter inference_id IS NOT NULL)
AIN-234 (fleet cost governance — Mistral 429 backoff overlaps)
AIN-290 (judge schema — the corpus this protects)


…edicate

PG CHECK constraints don't support DEFERRABLE/DEFERRED (only FK/UNIQUE/PK/EXCLUDE do). The two-phase write (insert_decision creates the row with decision_rule='cheapest_clearing_floor' + inference_id=NULL, complete_decision links inference_id after dispatch) has a transient moment that the per-statement check would reject.

Predicate now allows outcome_status IS NULL as the third escape clause:

    CHECK (
      outcome_status IS NULL
      OR decision_rule <> 'cheapest_clearing_floor'
      OR inference_id IS NOT NULL
    )

Once complete_decision sets outcome_status (always non-NULL on every terminal branch: succeeded/failed_other/failed_provider_error/rejected*), the constraint REQUIRES either decision_rule rewritten via decision_rule_override OR inference_id linked. Which IS the AIN-300 W1 invariant.

Integration tests now pass (the failing tests were inserting via the two-phase pattern and hitting the per-statement check).

Refs: AIN-300

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
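The effect of the three-clause predicate on the two-phase write can be checked with a small, self-contained sketch. It uses an in-memory SQLite table rather than PostgreSQL, so it only demonstrates the predicate's logic (SQLite has no `NOT VALID` / `VALIDATE CONSTRAINT`). The reduced column set, the `violates` helper and the status value used in the override case are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE routing_outcomes (
        id INTEGER PRIMARY KEY,
        decision_rule TEXT NOT NULL,
        outcome_status TEXT,
        inference_id TEXT,
        CONSTRAINT outcome_requires_inference_when_model_chosen CHECK (
            outcome_status IS NULL
            OR decision_rule <> 'cheapest_clearing_floor'
            OR inference_id IS NOT NULL
        )
    )
    """
)


def violates(sql: str, params: tuple = ()) -> bool:
    """Run one statement; report whether the CHECK rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return False
    except sqlite3.IntegrityError:
        return True


insert_sql = (
    "INSERT INTO routing_outcomes (id, decision_rule) "
    "VALUES (?, 'cheapest_clearing_floor')"
)

# Phase 1 (insert_decision): in-flight row, no inference yet.
in_flight_rejected = violates(insert_sql, (1,))

# Phase 2 without linkage: terminal status, inference_id still NULL.
orphan_rejected = violates(
    "UPDATE routing_outcomes SET outcome_status = 'failed_provider_error' "
    "WHERE id = 1"
)

# Phase 2 with linkage.
linked_rejected = violates(
    "UPDATE routing_outcomes SET outcome_status = 'succeeded', "
    "inference_id = ? WHERE id = 1",
    ("inf-1",),
)

# Pre-dispatch failure: decision_rule rewritten, nothing to link.
# The status value here is illustrative only.
violates(insert_sql, (2,))
override_rejected = violates(
    "UPDATE routing_outcomes SET outcome_status = 'failed_other', "
    "decision_rule = 'failed_pre_dispatch' WHERE id = 2"
)

print(in_flight_rejected, orphan_rejected, linked_rejected, override_rejected)
```

Only the second statement is rejected: a terminal status with the original `decision_rule` and no `inference_id` is exactly the orphan shape the constraint is meant to block.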

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON. A cloud agent has been kicked off to fix the reported issue.

cursor · 2026-05-28T04:35:31Z

        db,
        outcome_id=outcome_id,
        outcome_status="failed_provider_error",
+        inference_id=last_inference_id,


All-ModelUnavailable exhaustion violates new CHECK constraint

High Severity

When every candidate in the fallback loop fails with ModelUnavailableError, last_inference_id remains None (it's only set in the ProviderError handler). The all-exhausted path (section 7) then calls complete_decision with inference_id=None and no decision_rule_override. Since complete_decision skips setting inference_id when it's None, the outcome row ends up with outcome_status='failed_provider_error', decision_rule='cheapest_clearing_floor', and inference_id=NULL — violating the new outcome_requires_inference_when_model_chosen CHECK constraint from migration 0031. This causes a database IntegrityError instead of a clean AllCandidatesFailedError.

Additional Locations (1)

ainfera_api/services/routing_brain.py#L591-L608
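One possible guard for the exhausted path is sketched below. It is not necessarily what Bugbot Autofix or the author applied, and `exhausted_completion_kwargs` is a hypothetical helper. Reusing `'failed_pre_dispatch'` as the override value when no provider call was ever attempted is an assumption; an alternative would be to set `last_inference_id` in the `ModelUnavailableError` handler as well.

```python
from typing import Any, Dict, Optional


def exhausted_completion_kwargs(last_inference_id: Optional[str]) -> Dict[str, Any]:
    """Keyword arguments for the all-candidates-failed completion call."""
    kwargs: Dict[str, Any] = {"outcome_status": "failed_provider_error"}
    if last_inference_id is not None:
        # At least one provider call was attempted: link it.
        kwargs["inference_id"] = last_inference_id
    else:
        # No provider call was ever attempted, so there is nothing to link.
        # Rewriting decision_rule keeps the row inside the CHECK predicate.
        kwargs["decision_rule_override"] = "failed_pre_dispatch"
    return kwargs
```

Either branch leaves the outcome row satisfying the 0031 predicate, so the caller can still raise a clean `AllCandidatesFailedError` instead of hitting an `IntegrityError`.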


hizrianraz merged commit d062c7f into main May 28, 2026
4 checks passed

cursor Bot reviewed May 28, 2026

hizrianraz merged 2 commits into main from hizrianraz/ain-300-w1-writepath
