Unify compliance state: every storyboard run writes to one canonical path (heartbeat + Addie + dashboard tests)

## The bug — concrete narrative

A member registered an agent at `https://www.harvingupta.xyz/api/mcp` through Addie. After registration:

- `/dashboard/agents` rendered the agent card with **Run** / **Preview** buttons next to every storyboard the registry catalog knows about (Capability discovery, Pagination cursor integrity, Determinism testing, etc.).
- `GET /api/registry/agents/<url>/compliance` returned `{ "status": "unknown", "last_checked_at": null, "tracks": {} }`.

The owner's reasonable assumption: *"the dashboard is showing me storyboards I'm running, why does the API say `unknown`?"*

Pod audit (`harvingupta.xyz/api/mcp`):

| Table | Row count |
|---|---|
| `agent_registry_metadata` | 0 |
| `agent_compliance_status` | 0 |
| `agent_compliance_runs` | 0 |
| `agent_storyboard_status` | 0 |
| `agent_capabilities_snapshot` | 1 (`inferred_type: sales`) |
| `agent_health_snapshot` | 1 (`online: true`, 147ms) |
| `agent_verification_badges` | 0 |
| `discovered_agents` | 0 |
| `agent_contexts` | 1 (`total_tests_run: 0`) |
| `agent_test_history` | 0 |

So in this concrete case nobody had actually run a storyboard yet — but the bug is structural regardless: **state is split across two write paths and one read path**, and the read path reflects only one of the writers.

## The split

| Write path | Tables | Triggered by | Visible to `/api/registry/agents/:url/compliance`? |
|---|---|---|---|
| **Heartbeat** (`compliance-heartbeat.ts`) | `agent_compliance_status`, `agent_compliance_runs`, `agent_storyboard_status`, `agent_registry_metadata` | The 12h cron | ✅ |
| **Addie on-demand** (`evaluate_agent_quality`, `run_storyboard`) | `agent_contexts`, `agent_test_history` (via `recordTest`) | Owner clicking Run / asking Addie / `evaluate_agent_quality` MCP call | ❌ |

Same activity (running a storyboard against an agent), different table, different read endpoint. The dashboard's compliance tile reads the first path; "your test history" reads the second; the public API reads the first only.

Three concrete consequences:

1. An owner who runs `evaluate_agent_quality` from Addie chat sees results in chat, but the public registry keeps saying `unknown` until the heartbeat runs ~12h later.
2. The dashboard's compliance tile and the dashboard's test history are reading disjoint state — if a row in one path drifts from the other, neither is canonical.
3. `agent_contexts.last_test_*` duplicates `agent_test_history` which duplicates what `agent_compliance_status` would say if the heartbeat had run. Three copies, three slightly different lifecycles.

## Goal

**One canonical compliance state.** Every storyboard run — heartbeat, owner-triggered, dashboard-Run-button, Addie `evaluate_agent_quality` — writes to the same canonical tables (`agent_compliance_status` + `agent_compliance_runs` + `agent_storyboard_status`), distinguished by `triggered_by` and `triggered_org_id`, never by which table got written. Dashboard, public registry, and Addie all read from that one place.

Drop `agent_test_history` once writes are unified. Keep `agent_contexts` for what it actually carries (auth credentials, saved per-org context); stop using its `last_test_*` columns as duplicate state — derive them from a join against `agent_compliance_runs` instead.

**Out of scope** for this initiative: `agent_capabilities_snapshot`, `agent_health_snapshot`, `discovered_agents`, `agent_verification_badges`. Those track genuinely different lifecycle events (capability probe, liveness, crawler discovery, earned badges) and should stay separate. The unification is specifically about *storyboard run results*.

## Quantify the work — fill in before opening PR 1

Before any PR opens, the implementer fills in:

| Metric | Value |
|---|---|
| `recordTest()` call sites | _TBD — `git grep -n "recordTest" server/src` and list_ |
| `agent_test_history` row count (owner-triggered) | _TBD — `SELECT COUNT(*) FROM agent_test_history WHERE triggered_by = 'user'`_ |
| `agent_test_history` row count (third-party) | _TBD — same, `triggered_by != 'user'`_ |
| `agent_test_history` total | _TBD_ |
| `agent_contexts.last_test_*` reader call sites | _TBD — `git grep -n "last_test_" server/src`_ |
| DB writes added per owner test (PR 1) | _TBD — count INSERT/UPDATE statements in `complianceResultToDbInput` path_ |
| Existing tests covering `agent_compliance_status` write path | _TBD — `grep -rn "agent_compliance_status" server/tests`_ |

Numbers anchor the conversation. Don't open PR 1 until this table is filled.

## Tradeoff settled — option 1 (owner-only writes to canonical state)

Non-owner runs return a session-scoped result in the response and never persist to `agent_compliance_status`. Owner-triggered runs (resolved via existing `resolveOwnerMembership`) write canonical. Fail-closed beats fail-open; respects the audit-grade-public-record invariant.

## Frozen reporting contract — semantic shift on `/api/registry/agents/:url/compliance`

This endpoint is a public registry surface. **What it reflects is changing**, even though the field names aren't:

- **Pre-fix:** "this agent's last *scheduled* verdict" (heartbeat-only).
- **Post-fix:** "this agent's last verdict from any source" (heartbeat + owner_test).

A consumer who scraped this endpoint daily learned heartbeat history; post-fix they see any verdict, possibly an owner running it on demand to test a fix. **Tell the caller, don't guess silently.** PR 1 MUST do one of:

1. **Add `verdict_source: 'heartbeat' | 'owner_test'`** to the response shape so consumers can filter. Recommended.
2. **Add a SHOULD-clause to the changeset** explicitly naming the semantic shift for downstream scrapers, with a docs page link.

PR 1's PR description picks one and locks it in. The `OperatorLookupResultSchema` analog for compliance gets the additive field if option 1 wins.

## Last-write-wins race rule — pin in PR 1

Heartbeat and owner_test can fire within minutes. Two concurrent writers, one row in `agent_compliance_status`. The unification's conflict-resolution rule is **last-write-wins on `(agent_url)`**: the latest verdict wins, regardless of source.

PR 1 includes a test that pins this:

```ts
// heartbeat at T, owner_test at T+1s, endpoint reflects T+1s
// then owner_test at T+2s, heartbeat at T+3s, endpoint reflects T+3s
```

A future refactor that switches to "first-write-wins" or "merge" silently changes the public contract. The test is the contract.

## Plan — four small PRs, stacked

### PR 1 — Owner-triggered storyboard runs unify with heartbeat writes

**What changes:** `evaluate_agent_quality` and `run_storyboard` (when triggered by an owner of the agent) call the same DB writes the heartbeat does:

- `complianceResultToDbInput()` shared between both paths.
- `agent_compliance_status` upserted with the new verdict.
- `agent_compliance_runs` row inserted with `triggered_by = 'owner_test'` + `triggered_org_id = <owner-org>`.
- `agent_storyboard_status` rows updated per storyboard.
- Ownership resolution via existing `resolveOwnerMembership` — non-owner runs short-circuit to session-scoped (current `agent_test_history` write path retained for non-owner runs in this PR; deprecated in PR 3).

**Schema:**
- Add `'owner_test'` to the `triggered_by` enum on `agent_compliance_runs` (CHECK constraint or enum type — match the existing pattern).
- Add `verdict_source: 'heartbeat' | 'owner_test'` to the compliance response shape (assuming option 1 above is taken).

**Tests:**
- Owner runs `evaluate_agent_quality` → `agent_compliance_status` reflects the result; public endpoint serves the new status without waiting for the heartbeat.
- Non-owner runs `evaluate_agent_quality` against someone else's agent → `agent_compliance_status` is unchanged; result flows through the existing session-scoped path.
- `triggered_by = 'owner_test'` and `triggered_org_id` correctly populated.
- **Last-write-wins race pinned** — heartbeat then owner_test, owner_test then heartbeat, both interleavings exercised.
- `verdict_source` correctly serialized (if option 1 taken).

**User-visible win:** owner runs a test and sees the result on `/dashboard/agents` and via `/api/registry/agents/:url/compliance` immediately. Closes the 12h gap.

### PR 2 — Dashboard reads from the single source

**What changes:** `/dashboard/agents` stops reading `agent_test_history` for the "compliance" tile. Reads `agent_compliance_status` + `agent_compliance_runs` directly.

The dashboard's "your test history" view becomes a `triggered_by IN ('heartbeat', 'owner_test')` filter on `agent_compliance_runs` — same table, different lens. Owner-triggered tests appear interleaved with heartbeat runs in chronological order, distinguished by the `triggered_by` badge.

**Cohort/timestamp drift during the PR 2 soak window:** Before PR 3 backfills history, existing pre-PR-1 `agent_test_history` rows are orphans — visible neither in the canonical compliance tile nor in the new test-history strand. PR 2 ships with a one-time banner on the test-history view: *"Tests run before [PR 1 deploy date] are archived under Test History (legacy) — see [link]."* The banner stays until PR 3 backfills + drops; then it disappears automatically.

**Schema:** none.

**Tests:**
- Dashboard renders the same compliance verdict whether the last write was a heartbeat or an owner test.
- "Test history" filter shows `owner_test` runs as a distinct strand from heartbeat runs.
- Pre-PR-1 `agent_test_history` rows render under the legacy view, not the new strand.

**User-visible win:** dashboard test-history and dashboard compliance tile finally agree. No drift possible.

### PR 3 — Deprecate `agent_test_history` (destructive — explicit soak gate)

**Pre-merge gate (load-bearing):**

- [ ] PR 1 has been live in prod for **≥ 14 days** with zero canonical-write incidents (no owner_test write that produced a malformed `agent_compliance_status` row, no flap reports).
- [ ] PR 2 has been live in prod for **≥ 7 days** with the dashboard rendering identical verdicts from old vs new path on a hand-audited sample.
- [ ] **Backfill row-count delta is ±0 on staging.** Run the migration on a staging clone of prod, count `agent_test_history` rows pre-migration vs `agent_compliance_runs` rows added, assert the owner-triggered subset matches exactly.
- [ ] Third-party-row deletion volume quantified and acknowledged (see below).

**What changes:** `recordTest()` callers migrated:

- Owner-triggered paths already write to `agent_compliance_runs` (PR 1).
- Non-owner paths drop the persistent write entirely and return session-scoped results only.

After all callers migrate:

- **Backfill** owner-triggered `agent_test_history` rows → `agent_compliance_runs` (with `triggered_by = 'owner_test'` retroactively).
- **Export third-party `agent_test_history` rows to S3 cold storage before drop.** This is data, not noise — even if the new policy refuses to write more, the historical audit trail of external testing has value. One-time export, JSONL format, cataloged in ops runbook. Do not silently lose history.
- Drop `agent_test_history` table only after backfill verification + export complete.

**Schema:** drop `agent_test_history`. The backfill migration is a separate file from the drop migration (two-phase) so a partial deploy can't run drop without backfill.

**Migration safety:** treat as destructive. Two-phase migration (backfill, drop). `release_command` runs phase 1 only; phase 2 ships in a follow-up release after verification. Row-count assertion in the runner.

**Tests:** post-drop, no remaining caller of `recordTest`. Backfill correctness verified with a row-count assertion in the migration runner. Export integrity verified with a checksum of the cold-storage JSONL.

### PR 4 — Collapse `agent_contexts.last_test_*` into a derived view

**Pre-merge gate:** **Reader audit complete.** Before opening PR 4, the implementer enumerates every reader of `agent_contexts.last_test_scenario`, `last_test_passed`, `last_test_summary`, `last_tested_at`, `total_tests_run` and lists in the PR description:

| Reader | File:line | Refactor target |
|---|---|---|
| _TBD_ | _TBD_ | _read view / read `agent_compliance_runs` directly_ |

PR 4 doesn't merge until every reader is migrated to the view OR refactored to query `agent_compliance_runs` directly.

**What changes:** `agent_contexts.last_test_scenario`, `last_test_passed`, `last_test_summary`, `last_tested_at`, `total_tests_run` are removed from the table. A view (`agent_context_with_latest_test`) joins `agent_contexts` to the latest `agent_compliance_runs` row per `(org, agent_url)` to derive these fields on read.

**Schema:** drop columns from `agent_contexts`. Create view.

**Tests:** existing readers of `last_test_*` continue to work via the view (or via the direct-query refactor).

**User-visible win:** none directly — pure schema-cleanup. Ships when PR 1-3 have soaked in prod (~2 weeks past PR 3).

## Phasing

| PR | Ships when | Risk | Reversibility | Soak before next |
|---|---|---|---|---|
| **PR 1** | Immediately (closes the user-visible gap) | Low — additive write to existing canonical tables | Easy — feature-flag the owner-test write | ≥ 14 days |
| **PR 2** | After PR 1's 14-day soak | Low — read-path refactor | Easy — swap reader back | ≥ 7 days |
| **PR 3** | After PR 2's 7-day soak + backfill verified on staging + export complete | **Medium** — destructive migration | **Hard** — table drop (cold-storage export is the recovery path) | ≥ 14 days |
| **PR 4** | After PR 3's 14-day soak + reader audit | Low — pure cleanup | Easy — restore columns from view | n/a |

PR 1 ships first because it closes the actual user-visible gap (owner runs reflect immediately on the dashboard). PR 2-4 are cleanup that doesn't change UX. The destructive migration in PR 3 is the only one that needs careful staging — gated on staging-first deploy + row-count verification + S3 export.

## Acceptance criteria for the initiative

- [ ] Owner runs `evaluate_agent_quality` → public `/api/registry/agents/:url/compliance` reflects the verdict within seconds (PR 1)
- [ ] `verdict_source: 'heartbeat' | 'owner_test'` field on the compliance response so consumers can filter (PR 1, assuming option 1)
- [ ] Last-write-wins race pinned by a test (PR 1)
- [ ] Dashboard compliance tile and dashboard test history read from the same table (PR 2)
- [ ] `agent_test_history` no longer exists; `agent_compliance_runs` is the single history table (PR 3)
- [ ] Third-party `agent_test_history` rows exported to S3 cold storage before drop (PR 3)
- [ ] Every reader of `agent_contexts.last_test_*` migrated to the view or to `agent_compliance_runs` directly (PR 4)
- [ ] One read endpoint (`/api/registry/agents/:url/compliance`) reflects every kind of storyboard run, distinguished by `triggered_by` (heartbeat / owner_test)

## Refs

- PR #4235 — type-required + brand-domain backfill (the prior visibility-miss fix; same shape of bug as this one)
- The 4-table compliance write path in `compliance-heartbeat.ts` and `compliance-db.ts`
- The 2-table Addie write path in `agent-context-db.ts:recordTest()` and the `evaluate_agent_quality` MCP handler

Table	Row count
`agent_registry_metadata`	0
`agent_compliance_status`	0
`agent_compliance_runs`	0
`agent_storyboard_status`	0
`agent_capabilities_snapshot`	1 (`inferred_type: sales`)
`agent_health_snapshot`	1 (`online: true`, 147ms)
`agent_verification_badges`	0
`discovered_agents`	0
`agent_contexts`	1 (`total_tests_run: 0`)
`agent_test_history`	0

Metric	Value
`recordTest()` call sites	TBD — `git grep -n "recordTest" server/src` and list
`agent_test_history` row count (owner-triggered)	TBD — `SELECT COUNT() FROM agent_test_history WHERE triggered_by = 'user'`*
`agent_test_history` row count (third-party)	TBD — same, `triggered_by != 'user'`
`agent_test_history` total	TBD
`agent_contexts.last_test_*` reader call sites	TBD — `git grep -n "last_test_" server/src`
DB writes added per owner test (PR 1)	TBD — count INSERT/UPDATE statements in `complianceResultToDbInput` path
Existing tests covering `agent_compliance_status` write path	TBD — `grep -rn "agent_compliance_status" server/tests`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unify compliance state: every storyboard run writes to one canonical path (heartbeat + Addie + dashboard tests) #4247

The bug — concrete narrative

The split

Goal

Quantify the work — fill in before opening PR 1

Tradeoff settled — option 1 (owner-only writes to canonical state)

Frozen reporting contract — semantic shift on `/api/registry/agents/:url/compliance`

Last-write-wins race rule — pin in PR 1

Plan — four small PRs, stacked

PR 1 — Owner-triggered storyboard runs unify with heartbeat writes

PR 2 — Dashboard reads from the single source

PR 3 — Deprecate `agent_test_history` (destructive — explicit soak gate)

PR 4 — Collapse `agent_contexts.last_test_*` into a derived view

Phasing

Acceptance criteria for the initiative

Refs

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Write path	Tables	Triggered by	Visible to `/api/registry/agents/:url/compliance`?
Heartbeat (`compliance-heartbeat.ts`)	`agent_compliance_status`, `agent_compliance_runs`, `agent_storyboard_status`, `agent_registry_metadata`	The 12h cron	✅
Addie on-demand (`evaluate_agent_quality`, `run_storyboard`)	`agent_contexts`, `agent_test_history` (via `recordTest`)	Owner clicking Run / asking Addie / `evaluate_agent_quality` MCP call	❌

PR	Ships when	Risk	Reversibility	Soak before next
PR 1	Immediately (closes the user-visible gap)	Low — additive write to existing canonical tables	Easy — feature-flag the owner-test write	≥ 14 days
PR 2	After PR 1's 14-day soak	Low — read-path refactor	Easy — swap reader back	≥ 7 days
PR 3	After PR 2's 7-day soak + backfill verified on staging + export complete	Medium — destructive migration	Hard — table drop (cold-storage export is the recovery path)	≥ 14 days
PR 4	After PR 3's 14-day soak + reader audit	Low — pure cleanup	Easy — restore columns from view	n/a

Unify compliance state: every storyboard run writes to one canonical path (heartbeat + Addie + dashboard tests) #4247

Description

The bug — concrete narrative

The split

Goal

Quantify the work — fill in before opening PR 1

Tradeoff settled — option 1 (owner-only writes to canonical state)

Frozen reporting contract — semantic shift on /api/registry/agents/:url/compliance

Last-write-wins race rule — pin in PR 1

Plan — four small PRs, stacked

PR 1 — Owner-triggered storyboard runs unify with heartbeat writes

PR 2 — Dashboard reads from the single source

PR 3 — Deprecate agent_test_history (destructive — explicit soak gate)

PR 4 — Collapse agent_contexts.last_test_* into a derived view

Phasing

Acceptance criteria for the initiative

Refs

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Frozen reporting contract — semantic shift on `/api/registry/agents/:url/compliance`

PR 3 — Deprecate `agent_test_history` (destructive — explicit soak gate)

PR 4 — Collapse `agent_contexts.last_test_*` into a derived view