hotfix(0031): DROP IF EXISTS before ADD (constraint pre-exists on PROD) by hizrianraz · Pull Request #88 · ainfera-ai/api

hizrianraz · 2026-05-28T04:30:51Z

Constraint was pre-applied to PROD by the repair team earlier today with the OLDER predicate (no mid-flight escape). Migration now DROPs it before re-ADDing with the AIN-300 W1 predicate.

Note

Low Risk
Schema-only migration idempotency fix; no application logic changes in this diff.

Overview
Makes Alembic revision 20260528_0031 safe to run on production, where the same-named CHECK constraint was already applied earlier with an older predicate. That predicate lacks the outcome_status IS NULL mid-flight escape and can break the two-phase insert_decision → complete_decision write path.

The migration now runs DROP CONSTRAINT IF EXISTS outcome_requires_inference_when_model_chosen on routing_outcomes before re-adding the constraint with the AIN-300 W1 predicate (outcome_status IS NULL OR …), so deploys succeed whether or not the prior manual constraint is present. Comments in the migration document the PROD repair context and the idempotent behavior on fresh databases.
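The drop-then-add ordering described above can be sketched as follows. This is a minimal illustration, not the contents of revision 20260528_0031: `build_upgrade_statements` and `upgrade` are hypothetical helpers, and the predicate is passed in as a parameter because the PR text elides everything after `outcome_status IS NULL OR`. In an Alembic migration the `execute` callable would typically be `op.execute`.

```python
from typing import Callable, List

TABLE = "routing_outcomes"
CONSTRAINT = "outcome_requires_inference_when_model_chosen"


def build_upgrade_statements(predicate_sql: str) -> List[str]:
    """Return the DDL in the order the hotfix describes: DROP IF EXISTS, then ADD."""
    return [
        # No-op on a fresh database; removes the manually applied constraint on PROD.
        f"ALTER TABLE {TABLE} DROP CONSTRAINT IF EXISTS {CONSTRAINT}",
        # Re-add under the same name with the caller-supplied predicate.
        f"ALTER TABLE {TABLE} ADD CONSTRAINT {CONSTRAINT} CHECK ({predicate_sql})",
    ]


def upgrade(execute: Callable[[str], None], predicate_sql: str) -> None:
    """Run the statements in order through the supplied executor."""
    for statement in build_upgrade_statements(predicate_sql):
        execute(statement)
```

Because the DROP uses IF EXISTS, the same two statements apply cleanly whether or not the constraint is already present.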

Reviewed by Cursor Bugbot for commit d17a812. Bugbot is set up for automated code reviews on this repo.

…g constraint

PROD already has `outcome_requires_inference_when_model_chosen` with an OLDER predicate (decision_rule <> 'cheapest_clearing_floor' OR inference_id IS NOT NULL), added by the human+Claude repair session earlier today.

Two issues with the old predicate:
1. PG CHECK can't be DEFERRABLE → it fires at statement time, breaking the two-phase outcome write (insert_decision sets the row with NULL inference_id; complete_decision links it).
2. The W1 code now uses decision_rule_override on pre-dispatch fails, so the predicate's left side already evaluates true for those paths; the old predicate didn't account for that explicitly.

Fix: 0031 DROPs IF EXISTS first, then ADDs the new predicate with the `outcome_status IS NULL` escape clause for mid-flight rows. Idempotent for fresh databases (DROP IF EXISTS is a no-op there).

Refs: AIN-300 · live PROD 2026-05-28
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
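To see why the older predicate rejects mid-flight rows while the escape clause admits them, here is a small Python model of how PostgreSQL evaluates a CHECK constraint (a row is rejected only when the expression is FALSE; TRUE and NULL both pass). The old predicate is copied from the commit message. The new predicate is an assumption: only its `outcome_status IS NULL` escape is quoted, and the remainder is assumed here to be the old predicate unchanged. The `outcome_status` value 'failed' is illustrative.

```python
from typing import Optional

Row = dict  # column name -> value, with None standing in for SQL NULL


def sql_ne(a, b) -> Optional[bool]:
    """SQL `a <> b`: unknown (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a != b


def sql_or(*terms: Optional[bool]) -> Optional[bool]:
    """SQL three-valued OR."""
    if any(t is True for t in terms):
        return True
    if any(t is None for t in terms):
        return None
    return False


def check_passes(value: Optional[bool]) -> bool:
    """A CHECK constraint rejects a row only when the expression is FALSE."""
    return value is not False


def old_predicate(row: Row) -> Optional[bool]:
    return sql_or(
        sql_ne(row["decision_rule"], "cheapest_clearing_floor"),
        row["inference_id"] is not None,
    )


def new_predicate(row: Row) -> Optional[bool]:
    # ASSUMPTION: the escape clause is OR'ed onto the old predicate.
    return sql_or(row["outcome_status"] is None, old_predicate(row))


# Row as written by insert_decision, before complete_decision links it.
mid_flight = {
    "decision_rule": "cheapest_clearing_floor",
    "inference_id": None,
    "outcome_status": None,
}
# Same row after completion without a link: this is the orphan case.
completed_orphan = {**mid_flight, "outcome_status": "failed"}

old_accepts_mid_flight = check_passes(old_predicate(mid_flight))
new_accepts_mid_flight = check_passes(new_predicate(mid_flight))
new_accepts_orphan = check_passes(new_predicate(completed_orphan))
```

Under these assumptions the old predicate blocks the first phase of the write, while the new one lets it through and still rejects a completed row that has no inference linked.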

linear-code · 2026-05-28T04:30:55Z

AIN-300 🔴 [DB/Gateway] routing_outcomes orphaned (NULL inference_id) on provider error — write-path linkage + Mistral 429 failover

🔴 Live data-integrity bug — routing_outcomes written without inference_id linkage

Found 2026-05-28 during founder-directed DB save-error investigation on prod (dftfpwzqxoebwzepygzl).

Symptom (DATA repaired — root cause still open)

routing_outcomes rows were persisted with inference_id = NULL even though a model was chosen (decision_rule='cheapest_clearing_floor', chosen_model_slug present). Because the judge worker needs the linked inference to reconstruct/replay context, these rows can never be judged → they silently starve the labeled training corpus the moat depends on.

Metric	At detection
Orphan outcomes (model chosen, NULL inference_id)	27
Of those, in last 24h	27 (100% — actively recurring)
Inferences stuck in `routed` >1h (no completion)	2
Correlation	All 27 cluster in the same window as `mistral error 429` failures (latest 07:21 WIB today)

Root cause (hypothesis — confirm in code)

The gateway write path appears to be: (1) create/persist the routing_outcome row → (2) execute the provider inference → (3) backfill inference_id onto the outcome. When step 2 throws (e.g. Mistral 429), step 3 never runs → the outcome is left orphaned with NULL inference_id, and the inference row is left stuck in routed (never transitions to succeeded/failed).

Two defects, one cause:

Non-atomic linkage — outcome and inference are written in separate steps; a provider error between them orphans the outcome.
No failure-path completion — when the provider call errors, the inference is not transitioned to failed and the outcome inference_id is never set.
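The hypothesized sequence can be reproduced in a few lines. This is an illustrative model, not the gateway code: the in-memory `outcomes` and `inferences` lists, `ProviderError`, `handle_request_buggy`, and `always_429` are hypothetical names, and the step order follows the hypothesis above rather than confirmed code.

```python
class ProviderError(Exception):
    """Stand-in for a provider failure such as an HTTP 429."""


def handle_request_buggy(outcomes, inferences, call_provider):
    # Step 1: persist the routing outcome before the inference is linked.
    outcome = {"decision_rule": "cheapest_clearing_floor", "inference_id": None}
    outcomes.append(outcome)

    # Step 2: record the inference as routed, then call the provider.
    inference = {"id": len(inferences) + 1, "status": "routed"}
    inferences.append(inference)
    response = call_provider()  # raises on 429, so nothing below runs

    # Step 3: backfill the link and mark completion.
    outcome["inference_id"] = inference["id"]
    inference["status"] = "succeeded"
    return response


def always_429():
    raise ProviderError("mistral error 429")


outcomes, inferences = [], []
try:
    handle_request_buggy(outcomes, inferences, always_429)
except ProviderError:
    pass  # the caller sees the error, but both rows are already persisted
```

After the failed call, the outcome still has a NULL `inference_id` and the inference is still `routed`, which matches both defects listed above.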

Required fix (api repo — `ainfera-ai/api`, inference/routing write path)

Make outcome↔inference linkage atomic: either write the inference row first and pass its id into the outcome insert in the same transaction, OR backfill inference_id in a finally/error-handling block that runs even when the provider call raises.
On provider error (429/4xx/5xx/timeout): transition the inference to status='failed' with error_message set + completed_at stamped, in the same path — never leave it stuck in routed.
Mistral 429 specifically: add exponential backoff + retry, and on exhausted retries, failover to the next floor-clearing candidate (the router already computes the candidate set; fall through to candidate #2 rather than hard-failing).
Add a regression test: simulate provider 429 mid-call → assert (a) inference ends failed, (b) any outcome row written has non-NULL inference_id, (c) no row left in routed.
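The four requirements above can be sketched together. This is one possible shape under stated assumptions, not the gateway's implementation: rows are modeled as in-memory dicts (so the "same transaction" requirement is not shown), `handle_request`, `call_with_backoff`, and `ProviderError` are hypothetical names, the retry count and delays are arbitrary, and a single inference row is assumed to survive a failover to the next candidate.

```python
import time
from datetime import datetime, timezone


class ProviderError(Exception):
    """Stand-in for any provider failure (429/4xx/5xx/timeout)."""


def call_with_backoff(call, slug, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Try one candidate, sleeping base_delay * 2**attempt between failures."""
    for attempt in range(max_retries + 1):
        try:
            return call(slug)
        except ProviderError:
            if attempt == max_retries:
                raise  # retries exhausted for this candidate
            sleep(base_delay * 2 ** attempt)


def handle_request(candidates, call, outcomes, inferences, sleep=time.sleep):
    if not candidates:
        raise ValueError("no floor-clearing candidates")

    # Write the inference first so the outcome carries its id from the start.
    inference = {"id": len(inferences) + 1, "status": "routed",
                 "error_message": None, "completed_at": None}
    inferences.append(inference)
    outcome = {"decision_rule": "cheapest_clearing_floor",
               "chosen_model_slug": candidates[0],
               "inference_id": inference["id"]}
    outcomes.append(outcome)

    last_error = None
    try:
        for slug in candidates:  # fall through to the next candidate on failure
            try:
                response = call_with_backoff(call, slug, sleep=sleep)
            except ProviderError as exc:
                last_error = exc
                continue
            outcome["chosen_model_slug"] = slug
            inference["status"] = "succeeded"
            return response
        raise last_error  # every candidate was exhausted
    except Exception as exc:
        # Failure-path completion: never leave the row in 'routed'.
        inference["status"] = "failed"
        inference["error_message"] = str(exc)
        raise
    finally:
        # Stamped on both the success and the failure path.
        inference["completed_at"] = datetime.now(timezone.utc)
```

A regression test in the spirit of the last item would pass a `call` that always raises `ProviderError` and assert that the inference ends `failed`, the outcome has a non-NULL `inference_id`, and nothing remains `routed`. Injecting `sleep` keeps such a test from actually waiting.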

Recurrence guard (DB — defer until app fix lands)

Once the app guarantees linkage, add a partial integrity constraint so this fails loud instead of silent:

-- Apply AFTER the app fix is deployed (else it will reject live writes)
ALTER TABLE routing_outcomes
  ADD CONSTRAINT outcome_requires_inference_when_model_chosen
  CHECK (decision_rule <> 'cheapest_clearing_floor' OR inference_id IS NOT NULL) NOT VALID;
-- then VALIDATE CONSTRAINT after backfill confirmed clean

And the judge worker (AIN-298 daily cadence) query MUST filter WHERE inference_id IS NOT NULL so the 2 legitimate no_candidate_clears_floor rows (nothing executed) never enter the judge queue.
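A small sketch of both queries implied here: the judge-queue filter, and a count of rows that would violate the constraint (which should be zero before VALIDATE CONSTRAINT). It uses an in-memory SQLite database purely as a stand-in for the production Postgres instance. The table is reduced to a few columns, the `id` column and the sample rows are hypothetical, and the real judge worker query is not shown in the issue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE routing_outcomes ("
    " id INTEGER PRIMARY KEY,"
    " decision_rule TEXT,"
    " inference_id INTEGER,"
    " judge_status TEXT)"
)
conn.executemany(
    "INSERT INTO routing_outcomes (decision_rule, inference_id, judge_status)"
    " VALUES (?, ?, ?)",
    [
        ("cheapest_clearing_floor", 101, "unlabeled"),     # linked, judgeable
        ("no_candidate_clears_floor", None, "unlabeled"),  # nothing executed
        ("cheapest_clearing_floor", None, "unlabeled"),    # orphan
    ],
)

# Judge queue: skip every row without a linked inference.
JUDGE_QUEUE_SQL = (
    "SELECT id FROM routing_outcomes"
    " WHERE judge_status = 'unlabeled' AND inference_id IS NOT NULL"
    " ORDER BY id"
)

# Rows for which the CHECK predicate evaluates to FALSE.
VIOLATIONS_SQL = (
    "SELECT count(*) FROM routing_outcomes"
    " WHERE decision_rule = 'cheapest_clearing_floor' AND inference_id IS NULL"
)

judge_queue = [row[0] for row in conn.execute(JUDGE_QUEUE_SQL)]
violations = conn.execute(VIOLATIONS_SQL).fetchone()[0]
```

With the sample rows, only the linked outcome enters the queue, and the orphan is the single row that would block validation.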

DATA already repaired this session (reversible)

✅ 27 orphan outcomes re-linked to their correct inference rows via deterministic match (agent_id + model_id + 3s window + cost match → verified 1:1 bijection, 27→27 distinct). All 27 are now judgeable (judge_status='unlabeled', ready for the cadence).
✅ 2 stuck routed inferences (11d + 23h old, no response/error) marked failed with explanatory error_message + completed_at.
✅ Before-state backed up to table _repair_20260528_save_error (full row JSON) for rollback.
ℹ️ 2 remaining NULL-inference_id outcomes are legitimate no_candidate_clears_floor (nothing executed) — left as-is; judge worker must skip them.
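The deterministic match used for the re-link can be sketched as below. This is a reconstruction from the one-line description (agent_id + model_id + 3s window + cost match, then a 1:1 check), not the repair script itself. `relink_orphans` is a hypothetical name, the `created_at` and `cost` field names are assumptions, timestamps are taken as seconds, and costs are assumed to be stored in a form that compares exactly (for example integer micro-units).

```python
def relink_orphans(orphans, inferences, window_seconds=3.0):
    """Map outcome id -> inference id, or raise if the match is not 1:1."""
    mapping = {}
    for outcome in orphans:
        matches = [
            inf["id"]
            for inf in inferences
            if inf["agent_id"] == outcome["agent_id"]
            and inf["model_id"] == outcome["model_id"]
            and inf["cost"] == outcome["cost"]
            and abs(inf["created_at"] - outcome["created_at"]) <= window_seconds
        ]
        # Refuse to guess: each orphan must match exactly one inference.
        if len(matches) != 1:
            raise ValueError(
                f"outcome {outcome['id']} has {len(matches)} candidate inferences"
            )
        mapping[outcome["id"]] = matches[0]

    # Bijection check: no two outcomes may claim the same inference.
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("match is not one-to-one")
    return mapping
```

Raising on any ambiguous or missing match keeps the repair all-or-nothing, which fits the reported result of 27 outcomes mapped to 27 distinct inferences.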

Acceptance

New routed calls with a chosen model ALWAYS write inference_id (verify: 24h with 0 new orphans)
0 inferences stuck in routed >1h (monitor)
Mistral 429 → automatic failover to next candidate, inference does not hard-fail
Regression test green in CI
Partial CHECK constraint applied + VALIDATED after app fix

Priority

Urgent — actively corrupting the training corpus at ~27 rows/24h, and the corpus IS the moat (Spearpoint two-leg). Every orphan is a labeled-data row permanently lost.

Linked

AIN-295 (DB remediation — related but separate; that's RLS/view/index, this is write-path code)
AIN-298 (daily training cadence — judge worker must filter inference_id IS NOT NULL)
AIN-234 (fleet cost governance — Mistral 429 backoff overlaps)
AIN-290 (judge schema — the corpus this protects)


hizrianraz merged commit 4d78793 into main May 28, 2026
4 checks passed

hizrianraz merged 1 commit into main from hizrianraz/ain-300-w1-hotfix-0031-drop-readd
