hotfix(0031): DROP IF EXISTS before ADD (constraint pre-exists on PROD) #88

Merged
hizrianraz merged 1 commit into main from hizrianraz/ain-300-w1-hotfix-0031-drop-readd
May 28, 2026

@hizrianraz hizrianraz commented May 28, 2026

Constraint was pre-applied to PROD by the repair team earlier today with the OLDER predicate (no mid-flight escape). Migration now DROPs it before re-ADDing with the AIN-300 W1 predicate.


Note

Low Risk
Schema-only migration idempotency fix; no application logic changes in this diff.

Overview
Makes Alembic revision 20260528_0031 safe to run on production, where a CHECK constraint of the same name was already applied earlier with an older predicate. That older predicate lacks the outcome_status IS NULL mid-flight escape and can break the two-phase insert_decision → complete_decision write path.
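To make the failure mode concrete, here is a minimal sketch that uses an in-memory SQLite table as a stand-in for the PostgreSQL `routing_outcomes` table. The column types are assumptions. The PR elides the remainder of the W1 predicate, so the sketch reuses the older predicate after the escape clause purely as a placeholder; only the `outcome_status IS NULL` clause is confirmed by the PR text.

```python
import sqlite3

# Older predicate, as quoted in the commit message.
OLD_PREDICATE = "decision_rule <> 'cheapest_clearing_floor' OR inference_id IS NOT NULL"

# ASSUMPTION: the rest of the W1 predicate is elided in the PR, so the old
# predicate is reused after the escape clause purely as a stand-in.
NEW_PREDICATE = f"outcome_status IS NULL OR ({OLD_PREDICATE})"


def phase_one_insert_allowed(predicate: str) -> bool:
    """Return True if a mid-flight row (NULL inference_id, NULL outcome_status)
    can be inserted under the given CHECK predicate."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE routing_outcomes ("
        " id INTEGER PRIMARY KEY,"
        " decision_rule TEXT NOT NULL,"
        " inference_id INTEGER,"
        " outcome_status TEXT,"
        f" CHECK ({predicate}))"
    )
    try:
        # Phase one of the two-phase write: the inference link is not known yet.
        conn.execute(
            "INSERT INTO routing_outcomes (decision_rule, inference_id, outcome_status)"
            " VALUES ('cheapest_clearing_floor', NULL, NULL)"
        )
        return True
    except sqlite3.IntegrityError:
        return False
    finally:
        conn.close()


old_ok = phase_one_insert_allowed(OLD_PREDICATE)
new_ok = phase_one_insert_allowed(NEW_PREDICATE)
print(old_ok, new_ok)  # False True
```

The older predicate rejects the phase-one insert at statement time, while the escape clause lets the mid-flight row through.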

The migration now runs DROP CONSTRAINT IF EXISTS outcome_requires_inference_when_model_chosen on routing_outcomes before re-adding the constraint with the AIN-300 W1 predicate (outcome_status IS NULL OR …), so deploys succeed whether or not the prior manual constraint is present. Comments in the migration document the PROD repair context and the idempotent behavior on fresh databases.
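The drop-then-add ordering can be sketched as a small helper that builds the two statements. This is not the actual migration file: `upgrade_statements` and the placeholder predicate are hypothetical, and the full W1 predicate must be supplied by the caller because the PR text elides it. In a real Alembic revision each statement would typically be passed to `op.execute`.

```python
TABLE = "routing_outcomes"
CONSTRAINT = "outcome_requires_inference_when_model_chosen"

# Placeholder only: the remainder of the W1 predicate is elided in the PR text
# and must be filled in from the real migration before this SQL is executed.
W1_PREDICATE_PLACEHOLDER = "outcome_status IS NULL OR <remainder of W1 predicate>"


def upgrade_statements(new_predicate: str) -> list[str]:
    """Build the SQL for an idempotent constraint replacement.

    DROP ... IF EXISTS is a no-op on a fresh database and removes the
    manually applied constraint on PROD, so the ADD never hits a name clash.
    """
    return [
        f"ALTER TABLE {TABLE} DROP CONSTRAINT IF EXISTS {CONSTRAINT}",
        f"ALTER TABLE {TABLE} ADD CONSTRAINT {CONSTRAINT} CHECK ({new_predicate})",
    ]


# In an Alembic revision this would look roughly like:
#     from alembic import op
#     def upgrade():
#         for stmt in upgrade_statements(real_w1_predicate):
#             op.execute(stmt)
statements = upgrade_statements(W1_PREDICATE_PLACEHOLDER)
for stmt in statements:
    print(stmt)
```

Keeping the DROP first means the same revision behaves identically on PROD and on a database that never had the constraint.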

Reviewed by Cursor Bugbot for commit d17a812.

…g constraint

PROD already has `outcome_requires_inference_when_model_chosen` with an
OLDER predicate (decision_rule <> 'cheapest_clearing_floor' OR
inference_id IS NOT NULL) — added by the human+Claude repair session
earlier today.

Two issues with the old predicate:
1. PG CHECK can't be DEFERRABLE → it fires at statement time, breaking
   the two-phase outcome write (insert_decision sets the row with NULL
   inference_id; complete_decision links it).
2. The W1 code now uses decision_rule_override on pre-dispatch fails,
   so the predicate's left side already evaluates true for those paths;
   the old predicate didn't account for that explicitly.

Fix: 0031 DROPs IF EXISTS first, then ADDs the new predicate with the
`outcome_status IS NULL` escape clause for mid-flight rows.

Idempotent for fresh databases (DROP IF EXISTS is a no-op there).

Refs: AIN-300 · live PROD 2026-05-28

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

linear-code Bot commented May 28, 2026

AIN-300 🔴 [DB/Gateway] routing_outcomes orphaned (NULL inference_id) on provider error — write-path linkage + Mistral 429 failover

🔴 Live data-integrity bug — routing_outcomes written without inference_id linkage

Found 2026-05-28 during founder-directed DB save-error investigation on prod (dftfpwzqxoebwzepygzl).

Symptom (DATA repaired — root cause still open)

routing_outcomes rows were persisted with inference_id = NULL even though a model was chosen (decision_rule='cheapest_clearing_floor', chosen_model_slug present). Because the judge worker needs the linked inference to reconstruct/replay context, these rows can never be judged → they silently starve the labeled training corpus the moat depends on.

| Metric | At detection |
| --- | --- |
| Orphan outcomes (model chosen, NULL inference_id) | 27 |
| Of those, in last 24h | 27 (100%; actively recurring) |
| Inferences stuck in routed >1h (no completion) | 2 |
| Correlation | All 27 cluster in the same window as mistral error 429 failures (latest 07:21 WIB today) |

Root cause (hypothesis — confirm in code)

The gateway write path appears to be: (1) create/persist the routing_outcome row → (2) execute the provider inference → (3) backfill inference_id onto the outcome. When step 2 throws (e.g. Mistral 429), step 3 never runs → the outcome is left orphaned with NULL inference_id, and the inference row is left stuck in routed (never transitions to succeeded/failed).

Two defects, one cause:

  1. Non-atomic linkage — outcome and inference are written in separate steps; a provider error between them orphans the outcome.
  2. No failure-path completion — when the provider call errors, the inference is not transitioned to failed and the outcome inference_id is never set.
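The hypothesized sequence can be reproduced with a toy model. This is only a sketch of the suspected ordering, not the gateway code: `route_and_infer`, `failing_provider`, the in-memory `store`, and the model name are all hypothetical.

```python
class ProviderError(Exception):
    """Stand-in for a provider failure such as an HTTP 429."""


def failing_provider(model: str) -> str:
    # Simulates the provider rejecting the call mid-flight.
    raise ProviderError(f"{model}: 429 Too Many Requests")


def route_and_infer(store: dict, model: str, call_provider) -> None:
    # Step 1: persist the outcome (no link yet) and the inference in 'routed'.
    store["inference"] = {"id": 1, "status": "routed"}
    store["outcome"] = {
        "decision_rule": "cheapest_clearing_floor",
        "chosen_model_slug": model,
        "inference_id": None,
    }
    # Step 2: execute the provider call; an exception here skips step 3.
    call_provider(model)
    # Step 3: only reached on success.
    store["inference"]["status"] = "succeeded"
    store["outcome"]["inference_id"] = store["inference"]["id"]


store: dict = {}
try:
    route_and_infer(store, "hypothetical-model", failing_provider)
except ProviderError:
    pass

# Both defects are visible: the outcome is orphaned and the inference is stuck.
print(store["outcome"]["inference_id"], store["inference"]["status"])  # None routed
```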

Required fix (api repo — ainfera-ai/api, inference/routing write path)

  • Make outcome↔inference linkage atomic: either write the inference row first and pass its id into the outcome insert in the same transaction, OR backfill inference_id in a finally/error-handling block that runs even when the provider call raises.
  • On provider error (429/4xx/5xx/timeout): transition the inference to status='failed' with error_message set + completed_at stamped, in the same path — never leave it stuck in routed.
  • Mistral 429 specifically: add exponential backoff + retry, and on exhausted retries, fail over to the next floor-clearing candidate (the router already computes the candidate set; fall through to candidate #2 rather than hard-failing).
  • Add a regression test: simulate provider 429 mid-call → assert (a) inference ends failed, (b) any outcome row written has non-NULL inference_id, (c) no row left in routed.
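One possible shape for the fixed write path, combined with the regression checks listed above, is sketched here. Everything in it is an assumption made for illustration: the in-memory `store`, the `RateLimited` exception, the retry count and delays, the model names, and the choice to create one inference row per attempted candidate. The real gateway may structure this differently.

```python
import time


class RateLimited(Exception):
    """Stand-in for a provider 429."""


def call_with_backoff(call, model, retries=3, base_delay=0.5, sleep=time.sleep):
    """Retry on RateLimited with exponential backoff; re-raise when exhausted."""
    for attempt in range(retries):
        try:
            return call(model)
        except RateLimited:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def route_and_infer(candidates, call, store, sleep=time.sleep):
    """Try floor-clearing candidates in order, keeping every row linked."""
    last_error = None
    for model in candidates:
        # Write the inference first, then pass its id into the outcome insert,
        # so the outcome is never persisted without a link.
        inference_id = len(store["inferences"]) + 1
        store["inferences"][inference_id] = {
            "model": model, "status": "routed",
            "error_message": None, "completed_at": None,
        }
        store["outcomes"].append(
            {"chosen_model_slug": model, "inference_id": inference_id}
        )
        try:
            response = call_with_backoff(call, model, sleep=sleep)
        except Exception as exc:  # 429 after retries, 4xx/5xx, timeout, ...
            # Failure-path completion: never leave the row in 'routed'.
            store["inferences"][inference_id].update(
                status="failed", error_message=str(exc), completed_at=time.time()
            )
            last_error = exc
            continue  # fail over to the next candidate
        store["inferences"][inference_id].update(
            status="succeeded", completed_at=time.time()
        )
        return response
    raise RuntimeError("all candidates failed") from last_error


# Regression-style check: the first candidate always returns 429.
def fake_call(model):
    if model == "model-a":
        raise RateLimited("429 Too Many Requests")
    return f"ok from {model}"


delays = []
store = {"inferences": {}, "outcomes": []}
result = route_and_infer(["model-a", "model-b"], fake_call, store, sleep=delays.append)
statuses = [row["status"] for row in store["inferences"].values()]
print(result, statuses, delays)  # ok from model-b ['failed', 'succeeded'] [0.5, 1.0]
```

Injecting `sleep` keeps the backoff testable without real waiting, and the three assertions from the regression bullet map directly onto `statuses` and the `inference_id` of each outcome.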

Recurrence guard (DB — defer until app fix lands)

Once the app guarantees linkage, add a partial integrity constraint so this fails loud instead of silent:

-- Apply AFTER the app fix is deployed (else it will reject live writes)
ALTER TABLE routing_outcomes
  ADD CONSTRAINT outcome_requires_inference_when_model_chosen
  CHECK (decision_rule <> 'cheapest_clearing_floor' OR inference_id IS NOT NULL) NOT VALID;
-- then VALIDATE CONSTRAINT after backfill confirmed clean

And the judge worker (AIN-298 daily cadence) query MUST filter WHERE inference_id IS NOT NULL so the 2 legitimate no_candidate_clears_floor rows (nothing executed) never enter the judge queue.
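The filter can be illustrated with an in-memory SQLite table standing in for `routing_outcomes`. The query text, column types, and sample rows are assumptions made for the sketch; the actual judge worker query is not shown in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE routing_outcomes ("
    " id INTEGER PRIMARY KEY,"
    " decision_rule TEXT,"
    " inference_id INTEGER,"
    " judge_status TEXT)"
)
conn.executemany(
    "INSERT INTO routing_outcomes (decision_rule, inference_id, judge_status)"
    " VALUES (?, ?, ?)",
    [
        ("cheapest_clearing_floor", 101, "unlabeled"),
        ("no_candidate_clears_floor", None, "unlabeled"),  # nothing executed
        ("cheapest_clearing_floor", 102, "unlabeled"),
    ],
)

# Hypothetical judge-queue query: rows without a linked inference are skipped.
JUDGE_QUEUE_SQL = (
    "SELECT id FROM routing_outcomes"
    " WHERE judge_status = 'unlabeled' AND inference_id IS NOT NULL"
    " ORDER BY id"
)
queue = [row[0] for row in conn.execute(JUDGE_QUEUE_SQL)]
print(queue)  # [1, 3]
```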

DATA already repaired this session (reversible)

  • ✅ 27 orphan outcomes re-linked to their correct inference rows via deterministic match (agent_id + model_id + 3s window + cost match → verified 1:1 bijection, 27→27 distinct). All 27 are now judgeable (judge_status='unlabeled', ready for the cadence).
  • ✅ 2 stuck routed inferences (11d + 23h old, no response/error) marked failed with explanatory error_message + completed_at.
  • ✅ Before-state backed up to table _repair_20260528_save_error (full row JSON) for rollback.
  • ℹ️ 2 remaining NULL-inference_id outcomes are legitimate no_candidate_clears_floor (nothing executed) — left as-is; judge worker must skip them.
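The deterministic match described above (agent_id + model_id + 3s window + cost, verified as a 1:1 mapping) could look roughly like the following. The `Row` shape, field types, and `match_orphans` helper are hypothetical; the repair itself was presumably run as SQL against the database, which this thread does not show.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Row:
    """Minimal shared shape for outcome and inference rows (assumed fields)."""
    id: int
    agent_id: str
    model_id: str
    created_at: float  # seconds since epoch
    cost: Decimal


def match_orphans(orphans, inferences, window_s=3.0):
    """Map each orphan outcome id to exactly one inference id, or raise."""
    pairs = {}
    for o in orphans:
        candidates = [
            i.id for i in inferences
            if i.agent_id == o.agent_id
            and i.model_id == o.model_id
            and abs(i.created_at - o.created_at) <= window_s
            and i.cost == o.cost
        ]
        if len(candidates) != 1:
            raise ValueError(
                f"outcome {o.id}: expected exactly 1 candidate, got {len(candidates)}"
            )
        pairs[o.id] = candidates[0]
    # Bijection check: no inference may be claimed by two outcomes.
    if len(set(pairs.values())) != len(pairs):
        raise ValueError("mapping is not one-to-one")
    return pairs


orphans = [
    Row(1, "agent-x", "model-a", 1000.0, Decimal("0.0012")),
    Row(2, "agent-x", "model-a", 1010.0, Decimal("0.0030")),
]
inferences = [
    Row(10, "agent-x", "model-a", 1001.5, Decimal("0.0012")),
    Row(11, "agent-x", "model-a", 1009.0, Decimal("0.0030")),
    Row(12, "agent-y", "model-a", 1000.0, Decimal("0.0012")),  # other agent
]
pairs = match_orphans(orphans, inferences)
print(pairs)  # {1: 10, 2: 11}
```

Raising on zero or multiple candidates, rather than picking one, is what keeps the repair deterministic and safe to verify before any row is updated.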

Acceptance

  • New routed calls with a chosen model ALWAYS write inference_id (verify: 24h with 0 new orphans)
  • 0 inferences stuck in routed >1h (monitor)
  • Mistral 429 → automatic failover to next candidate, inference does not hard-fail
  • Regression test green in CI
  • Partial CHECK constraint applied + VALIDATED after app fix

Priority

Urgent — actively corrupting the training corpus at ~27 rows/24h, and the corpus IS the moat (Spearpoint two-leg). Every orphan is a labeled-data row permanently lost.

Linked

  • AIN-295 (DB remediation — related but separate; that's RLS/view/index, this is write-path code)
  • AIN-298 (daily training cadence — judge worker must filter inference_id IS NOT NULL)
  • AIN-234 (fleet cost governance — Mistral 429 backoff overlaps)
  • AIN-290 (judge schema — the corpus this protects)

@hizrianraz hizrianraz merged commit 4d78793 into main May 28, 2026
4 checks passed