The bug — concrete narrative
A member registered an agent at https://www.harvingupta.xyz/api/mcp through Addie. After registration:
- /dashboard/agents rendered the agent card with Run / Preview buttons next to every storyboard the registry catalog knows about (Capability discovery, Pagination cursor integrity, Determinism testing, etc.).
- GET /api/registry/agents/<url>/compliance returned { "status": "unknown", "last_checked_at": null, "tracks": {} }.
The owner's reasonable assumption: "the dashboard is showing me storyboards I'm running, why does the API say unknown?"
Pod audit (harvingupta.xyz/api/mcp):
| Table | Row count |
| --- | --- |
| agent_registry_metadata | 0 |
| agent_compliance_status | 0 |
| agent_compliance_runs | 0 |
| agent_storyboard_status | 0 |
| agent_capabilities_snapshot | 1 (inferred_type: sales) |
| agent_health_snapshot | 1 (online: true, 147ms) |
| agent_verification_badges | 0 |
| discovered_agents | 0 |
| agent_contexts | 1 (total_tests_run: 0) |
| agent_test_history | 0 |
So in this concrete case nobody had actually run a storyboard yet — but the bug is structural regardless: state is split across two write paths and one read path, and the read path reflects only one of the writers.
The split
| Write path | Tables | Triggered by | Visible to /api/registry/agents/:url/compliance? |
| --- | --- | --- | --- |
| Heartbeat (compliance-heartbeat.ts) | agent_compliance_status, agent_compliance_runs, agent_storyboard_status, agent_registry_metadata | The 12h cron | ✅ |
| Addie on-demand (evaluate_agent_quality, run_storyboard) | agent_contexts, agent_test_history (via recordTest) | Owner clicking Run / asking Addie / evaluate_agent_quality MCP call | ❌ |
Same activity (running a storyboard against an agent), different table, different read endpoint. The dashboard's compliance tile reads the first path; "your test history" reads the second; the public API reads the first only.
Three concrete consequences:
- An owner who runs evaluate_agent_quality from Addie chat sees results in chat, but the public registry keeps saying unknown until the heartbeat runs ~12h later.
- The dashboard's compliance tile and the dashboard's test history read disjoint state: if a row in one path drifts from the other, neither is canonical.
- agent_contexts.last_test_* duplicates agent_test_history, which duplicates what agent_compliance_status would say if the heartbeat had run. Three copies, three slightly different lifecycles.
Goal
One canonical compliance state. Every storyboard run — heartbeat, owner-triggered, dashboard-Run-button, Addie evaluate_agent_quality — writes to the same canonical tables (agent_compliance_status + agent_compliance_runs + agent_storyboard_status), distinguished by triggered_by and triggered_org_id, never by which table got written. Dashboard, public registry, and Addie all read from that one place.
Drop agent_test_history once writes are unified. Keep agent_contexts for what it actually carries (auth credentials, saved per-org context); stop using its last_test_* columns as duplicate state — derive them from a join against agent_compliance_runs instead.
Out of scope for this initiative: agent_capabilities_snapshot, agent_health_snapshot, discovered_agents, agent_verification_badges. Those track genuinely different lifecycle events (capability probe, liveness, crawler discovery, earned badges) and should stay separate. The unification is specifically about storyboard run results.
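As a minimal sketch of what "one canonical state" could look like, the TypeScript below models the three canonical tables in memory and routes every run through a single write function and a single read function. The class name, the field names other than triggered_by / triggered_org_id, and the in-memory maps are illustrative assumptions; the real schema and the real complianceResultToDbInput() signature are not shown in this document.

```typescript
// In-memory stand-ins for the three canonical tables. Field names other than
// triggered_by / triggered_org_id are illustrative assumptions.
type TriggeredBy = "heartbeat" | "owner_test";

interface StoryboardResult {
  storyboardId: string;
  passed: boolean;
}

interface ComplianceRun {
  agentUrl: string;
  status: string; // overall verdict; the real value set is not shown here
  checkedAt: string; // ISO timestamp
  triggeredBy: TriggeredBy;
  triggeredOrgId: string | null; // assumed null for heartbeat runs
  storyboards: StoryboardResult[];
}

class CanonicalComplianceStore {
  private status = new Map<string, { status: string; lastCheckedAt: string }>();
  private runs: ComplianceRun[] = [];
  private storyboardStatus = new Map<string, boolean>();

  // Single write path: every run, whatever its source, lands in the same tables.
  recordRun(run: ComplianceRun): void {
    this.runs.push(run); // agent_compliance_runs
    this.status.set(run.agentUrl, {
      status: run.status,
      lastCheckedAt: run.checkedAt,
    }); // agent_compliance_status
    for (const sb of run.storyboards) {
      // agent_storyboard_status, keyed per agent and storyboard
      this.storyboardStatus.set(`${run.agentUrl}|${sb.storyboardId}`, sb.passed);
    }
  }

  // Single read path, shaped like the compliance response quoted in the bug narrative.
  getCompliance(agentUrl: string): { status: string; last_checked_at: string | null } {
    const row = this.status.get(agentUrl);
    if (!row) return { status: "unknown", last_checked_at: null };
    return { status: row.status, last_checked_at: row.lastCheckedAt };
  }

  runsFor(agentUrl: string): ComplianceRun[] {
    return this.runs.filter((r) => r.agentUrl === agentUrl);
  }
}
```

The point of the sketch is that the source of a run is a column value on the run, not a choice of table, so the read path cannot miss a writer.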
Quantify the work — fill in before opening PR 1
Before any PR opens, the implementer fills in:
| Metric | Value |
| --- | --- |
| recordTest() call sites | TBD: git grep -n "recordTest" server/src and list |
| agent_test_history row count (owner-triggered) | TBD: SELECT COUNT(*) FROM agent_test_history WHERE triggered_by = 'user' |
| agent_test_history row count (third-party) | TBD: same query with triggered_by != 'user' |
| agent_test_history total | TBD |
| agent_contexts.last_test_* reader call sites | TBD: git grep -n "last_test_" server/src |
| DB writes added per owner test (PR 1) | TBD: count INSERT/UPDATE statements in complianceResultToDbInput path |
| Existing tests covering agent_compliance_status write path | TBD: grep -rn "agent_compliance_status" server/tests |
Numbers anchor the conversation. Don't open PR 1 until this table is filled.
Tradeoff settled — option 1 (owner-only writes to canonical state)
Non-owner runs return a session-scoped result in the response and never persist to agent_compliance_status. Owner-triggered runs (resolved via existing resolveOwnerMembership) write canonical. Fail-closed beats fail-open; respects the audit-grade-public-record invariant.
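A minimal sketch of the fail-closed routing, assuming a resolver with a simplified signature. OwnerResolver is a hypothetical stand-in for resolveOwnerMembership, whose real signature is not shown here; routeRun and RunOutcome are likewise illustrative names.

```typescript
// Where a run's result goes: canonical tables (owner) or the response only (everyone else).
type RunOutcome =
  | { persisted: true; triggeredBy: "owner_test"; triggeredOrgId: string }
  | { persisted: false; scope: "session" };

// Hypothetical stand-in for resolveOwnerMembership: returns the owning org id
// when the caller's org owns the agent, otherwise null.
type OwnerResolver = (callerOrgId: string, agentUrl: string) => string | null;

function routeRun(
  callerOrgId: string,
  agentUrl: string,
  resolveOwner: OwnerResolver,
): RunOutcome {
  let ownerOrgId: string | null;
  try {
    ownerOrgId = resolveOwner(callerOrgId, agentUrl);
  } catch {
    // Fail closed: if ownership cannot be established, never touch canonical state.
    ownerOrgId = null;
  }
  if (ownerOrgId === null) {
    return { persisted: false, scope: "session" };
  }
  return { persisted: true, triggeredBy: "owner_test", triggeredOrgId: ownerOrgId };
}
```

Treating a resolver error the same as "not an owner" is what makes this fail-closed: a lookup outage degrades to session-scoped results rather than to unverified writes on the public record.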
Frozen reporting contract — semantic shift on /api/registry/agents/:url/compliance
This endpoint is a public registry surface. What it reflects is changing, even though the field names aren't:
- Pre-fix: "this agent's last scheduled verdict" (heartbeat-only).
- Post-fix: "this agent's last verdict from any source" (heartbeat + owner_test).
A consumer who scraped this endpoint daily learned heartbeat history; post-fix they see any verdict, possibly an owner running it on demand to test a fix. Tell the caller, don't guess silently. PR 1 MUST do one of:
- Add verdict_source: 'heartbeat' | 'owner_test' to the response shape so consumers can filter. Recommended.
- Add a SHOULD-clause to the changeset explicitly naming the semantic shift for downstream scrapers, with a docs page link.
PR 1's PR description picks one and locks it in. The OperatorLookupResultSchema analog for compliance gets the additive field if option 1 wins.
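If the verdict_source option is taken, the additive field and a consumer-side filter might look roughly like this. The interface name and the choice to make the field optional (absent while status is unknown) are assumptions; the status, last_checked_at and tracks fields are the ones in the response quoted in the bug narrative.

```typescript
type VerdictSource = "heartbeat" | "owner_test";

// Assumed response shape: existing fields plus the additive verdict_source.
interface ComplianceResponse {
  status: string;
  last_checked_at: string | null;
  tracks: Record<string, unknown>;
  verdict_source?: VerdictSource; // assumed absent when no verdict exists yet
}

// A scraper that only wants scheduled verdicts keeps heartbeat-sourced responses.
function scheduledOnly(responses: ComplianceResponse[]): ComplianceResponse[] {
  return responses.filter((r) => r.verdict_source === "heartbeat");
}
```

Because the field is additive, consumers that ignore it keep working; they simply see verdicts from both sources.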
Last-write-wins race rule — pin in PR 1
Heartbeat and owner_test can fire within minutes. Two concurrent writers, one row in agent_compliance_status. The unification's conflict-resolution rule is last-write-wins on (agent_url): the latest verdict wins, regardless of source.
PR 1 includes a test that pins this:
// heartbeat at T, owner_test at T+1s, endpoint reflects T+1s
// then owner_test at T+2s, heartbeat at T+3s, endpoint reflects T+3s
A future refactor that switches to "first-write-wins" or "merge" silently changes the public contract. The test is the contract.
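A rough shape for that pinning test, written against an in-memory map rather than the real database. StatusRow, applyWrites and the placeholder URL are illustrative assumptions; the real test would drive the actual upsert on agent_compliance_status.

```typescript
type Source = "heartbeat" | "owner_test";

interface StatusRow {
  status: string;
  checkedAtMs: number; // verdict time, milliseconds since an arbitrary T
  source: Source;
}

// Applies writes in arrival order to a one-row-per-agent table and returns the
// row a reader would see afterwards. Each write unconditionally replaces the row.
function applyWrites(writes: StatusRow[]): StatusRow | undefined {
  const table = new Map<string, StatusRow>();
  const agentUrl = "https://agent.example/mcp"; // placeholder URL
  for (const w of writes) {
    table.set(agentUrl, w); // upsert keyed on agent_url
  }
  return table.get(agentUrl);
}

// Interleaving 1: heartbeat at T, owner_test at T+1s.
const afterOwnerLast = applyWrites([
  { status: "failing", checkedAtMs: 0, source: "heartbeat" },
  { status: "passing", checkedAtMs: 1000, source: "owner_test" },
]);

// Interleaving 2: owner_test at T+2s, heartbeat at T+3s.
const afterHeartbeatLast = applyWrites([
  { status: "passing", checkedAtMs: 2000, source: "owner_test" },
  { status: "failing", checkedAtMs: 3000, source: "heartbeat" },
]);
```

This sketch treats "last" as arrival order. Whether a late-arriving write that carries an older verdict timestamp should be rejected is not specified above, and the real test is a good place to decide it.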
Plan — four small PRs, stacked
PR 1 — Owner-triggered storyboard runs unify with heartbeat writes
What changes: evaluate_agent_quality and run_storyboard (when triggered by an owner of the agent) call the same DB writes the heartbeat does:
- complianceResultToDbInput() shared between both paths.
- agent_compliance_status upserted with the new verdict.
- agent_compliance_runs row inserted with triggered_by = 'owner_test' + triggered_org_id = <owner-org>.
- agent_storyboard_status rows updated per storyboard.
- Ownership resolution via existing resolveOwnerMembership: non-owner runs short-circuit to session-scoped (current agent_test_history write path retained for non-owner runs in this PR; deprecated in PR 3).
Schema:
- Add 'owner_test' to the triggered_by enum on agent_compliance_runs (CHECK constraint or enum type; match the existing pattern).
- Add verdict_source: 'heartbeat' | 'owner_test' to the compliance response shape (assuming option 1 above is taken).
Tests:
- Owner runs evaluate_agent_quality → agent_compliance_status reflects the result; public endpoint serves the new status without waiting for the heartbeat.
- Non-owner runs evaluate_agent_quality against someone else's agent → agent_compliance_status is unchanged; result flows through the existing session-scoped path.
- triggered_by = 'owner_test' and triggered_org_id correctly populated.
- Last-write-wins race pinned: heartbeat then owner_test, owner_test then heartbeat, both interleavings exercised.
- verdict_source correctly serialized (if option 1 taken).
User-visible win: owner runs a test and sees the result on /dashboard/agents and via /api/registry/agents/:url/compliance immediately. Closes the 12h gap.
PR 2 — Dashboard reads from the single source
What changes: /dashboard/agents stops reading agent_test_history for the "compliance" tile. Reads agent_compliance_status + agent_compliance_runs directly.
The dashboard's "your test history" view becomes a triggered_by IN ('heartbeat', 'owner_test') filter on agent_compliance_runs — same table, different lens. Owner-triggered tests appear interleaved with heartbeat runs in chronological order, distinguished by the triggered_by badge.
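The "same table, different lens" idea can be sketched as a filter over one list of runs. RunRow and testHistory are illustrative names, and the fields other than triggered_by are assumptions about what agent_compliance_runs carries.

```typescript
type TriggeredBy = "heartbeat" | "owner_test";

interface RunRow {
  agentUrl: string;
  triggeredBy: TriggeredBy;
  ranAtMs: number; // assumed timestamp column, as epoch milliseconds
  passed: boolean;
}

// One table, several lenses: with no filter the view interleaves both sources in
// chronological order; passing a source narrows it to a single strand.
function testHistory(runs: RunRow[], agentUrl: string, only?: TriggeredBy): RunRow[] {
  return runs
    .filter((r) => r.agentUrl === agentUrl)
    .filter((r) => only === undefined || r.triggeredBy === only)
    .sort((a, b) => a.ranAtMs - b.ranAtMs); // oldest first
}
```

For example, testHistory(runs, url) would back the interleaved view, while testHistory(runs, url, "owner_test") would back the owner-test strand.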
Cohort/timestamp drift during the PR 2 soak window: Before PR 3 backfills history, existing pre-PR-1 agent_test_history rows are orphans — visible neither in the canonical compliance tile nor in the new test-history strand. PR 2 ships with a one-time banner on the test-history view: "Tests run before [PR 1 deploy date] are archived under Test History (legacy) — see [link]." The banner stays until PR 3 backfills + drops; then it disappears automatically.
Schema: none.
Tests:
- Dashboard renders the same compliance verdict whether the last write was a heartbeat or an owner test.
- "Test history" filter shows owner_test runs as a distinct strand from heartbeat runs.
- Pre-PR-1 agent_test_history rows render under the legacy view, not the new strand.
User-visible win: dashboard test-history and dashboard compliance tile finally agree. No drift possible.
PR 3 — Deprecate agent_test_history (destructive — explicit soak gate)
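Pre-merge gate (load-bearing):
- […] agent_compliance_status row, no flap reports).
- […] agent_test_history rows pre-migration vs agent_compliance_runs rows added, assert the owner-triggered subset matches exactly.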
What changes: recordTest() callers migrated:
- Owner-triggered paths already write to agent_compliance_runs (PR 1).
- Non-owner paths drop the persistent write entirely and return session-scoped results only.
After all callers migrate:
- Backfill owner-triggered agent_test_history rows → agent_compliance_runs (with triggered_by = 'owner_test' retroactively).
- Export third-party agent_test_history rows to S3 cold storage before drop. This is data, not noise: even if the new policy refuses to write more, the historical audit trail of external testing has value. One-time export, JSONL format, cataloged in ops runbook. Do not silently lose history.
- Drop agent_test_history table only after backfill verification + export complete.
Schema: drop agent_test_history. The backfill migration is a separate file from the drop migration (two-phase) so a partial deploy can't run drop without backfill.
Migration safety: treat as destructive. Two-phase migration (backfill, drop). release_command runs phase 1 only; phase 2 ships in a follow-up release after verification. Row-count assertion in the runner.
Tests: post-drop, no remaining caller of recordTest. Backfill correctness verified with a row-count assertion in the migration runner. Export integrity verified with a checksum of the cold-storage JSONL.
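The backfill, the export and the row-count gate could be sketched as pure functions like the ones below. The row shapes and function names are illustrative assumptions; only triggered_by = 'user' for owner-triggered legacy rows and triggered_by = 'owner_test' for backfilled rows come from the plan. The checksum step is left out of the sketch.

```typescript
interface LegacyTestRow {
  id: number;
  agentUrl: string;
  triggeredBy: string;
  passed: boolean;
}

interface ComplianceRunRow {
  agentUrl: string;
  triggeredBy: "heartbeat" | "owner_test";
  passed: boolean;
}

const isOwnerTriggered = (r: LegacyTestRow): boolean => r.triggeredBy === "user";

// Phase 1a: map owner-triggered legacy rows onto the canonical run shape.
function backfillOwnerRows(legacy: LegacyTestRow[]): ComplianceRunRow[] {
  return legacy
    .filter(isOwnerTriggered)
    .map((r): ComplianceRunRow => ({
      agentUrl: r.agentUrl,
      triggeredBy: "owner_test",
      passed: r.passed,
    }));
}

// Phase 1b: serialize third-party rows as JSONL, one JSON object per line.
function exportThirdPartyJsonl(legacy: LegacyTestRow[]): string {
  return legacy
    .filter((r) => !isOwnerTriggered(r))
    .map((r) => JSON.stringify(r))
    .join("\n");
}

// Gate for phase 2: refuse to proceed unless every legacy row is accounted for,
// either as a backfilled run or as an exported JSONL line.
function assertAllRowsAccountedFor(
  legacy: LegacyTestRow[],
  backfilled: ComplianceRunRow[],
  jsonl: string,
): void {
  const exported = jsonl === "" ? 0 : jsonl.split("\n").length;
  if (backfilled.length + exported !== legacy.length) {
    throw new Error(
      `row-count mismatch: ${backfilled.length} backfilled + ${exported} exported != ${legacy.length} legacy`,
    );
  }
}
```

Keeping the gate as a function that throws mirrors the two-phase rule: the drop only ever runs after the assertion has passed on real counts.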
PR 4 — Collapse agent_contexts.last_test_* into a derived view
Pre-merge gate: Reader audit complete. Before opening PR 4, the implementer enumerates every reader of agent_contexts.last_test_scenario, last_test_passed, last_test_summary, last_tested_at, total_tests_run and lists in the PR description:
| Reader | File:line | Refactor target |
| --- | --- | --- |
| TBD | TBD | read view / read agent_compliance_runs directly |
PR 4 doesn't merge until every reader is migrated to the view OR refactored to query agent_compliance_runs directly.
What changes: agent_contexts.last_test_scenario, last_test_passed, last_test_summary, last_tested_at, total_tests_run are removed from the table. A view (agent_context_with_latest_test) joins agent_contexts to the latest agent_compliance_runs row per (org, agent_url) to derive these fields on read.
Schema: drop columns from agent_contexts. Create view.
Tests: existing readers of last_test_* continue to work via the view (or via the direct-query refactor).
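What the view computes can be sketched in TypeScript as a derivation over run rows. The last_test_* and total_tests_run names are the columns being removed; the RunRow fields they are derived from (scenario, summary, ranAtMs) are assumptions about agent_compliance_runs, as is counting every run for the pair toward total_tests_run.

```typescript
interface RunRow {
  orgId: string;
  agentUrl: string;
  scenario: string;
  passed: boolean;
  summary: string;
  ranAtMs: number; // epoch milliseconds
}

interface LatestTest {
  last_test_scenario: string | null;
  last_test_passed: boolean | null;
  last_test_summary: string | null;
  last_tested_at: number | null;
  total_tests_run: number;
}

// Derives the former agent_contexts.last_test_* columns from the run history
// for one (org, agent_url) pair instead of storing them as duplicate state.
function deriveLatestTest(runs: RunRow[], orgId: string, agentUrl: string): LatestTest {
  const mine = runs.filter((r) => r.orgId === orgId && r.agentUrl === agentUrl);
  if (mine.length === 0) {
    return {
      last_test_scenario: null,
      last_test_passed: null,
      last_test_summary: null,
      last_tested_at: null,
      total_tests_run: 0,
    };
  }
  // Latest run wins; on equal timestamps the earlier element is kept.
  const latest = mine.reduce((a, b) => (b.ranAtMs > a.ranAtMs ? b : a));
  return {
    last_test_scenario: latest.scenario,
    last_test_passed: latest.passed,
    last_test_summary: latest.summary,
    last_tested_at: latest.ranAtMs,
    total_tests_run: mine.length,
  };
}
```

A reader that previously selected last_test_passed from agent_contexts would get the same field name back, which is what lets existing readers keep working during the migration.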
User-visible win: none directly — pure schema-cleanup. Ships when PR 1-3 have soaked in prod (~2 weeks past PR 3).
Phasing
| PR | Ships when | Risk | Reversibility | Soak before next |
| --- | --- | --- | --- | --- |
| PR 1 | Immediately (closes the user-visible gap) | Low: additive write to existing canonical tables | Easy: feature-flag the owner-test write | ≥ 14 days |
| PR 2 | After PR 1's 14-day soak | Low: read-path refactor | Easy: swap reader back | ≥ 7 days |
| PR 3 | After PR 2's 7-day soak + backfill verified on staging + export complete | Medium: destructive migration | Hard: table drop (cold-storage export is the recovery path) | ≥ 14 days |
| PR 4 | After PR 3's 14-day soak + reader audit | Low: pure cleanup | Easy: restore columns from view | n/a |
PR 1 ships first because it closes the actual user-visible gap (owner runs reflect immediately on the dashboard). PR 2-4 are cleanup that doesn't change UX. The destructive migration in PR 3 is the only one that needs careful staging — gated on staging-first deploy + row-count verification + S3 export.
Acceptance criteria for the initiative
- Owner runs evaluate_agent_quality → public /api/registry/agents/:url/compliance reflects the verdict within seconds (PR 1)
- verdict_source: 'heartbeat' | 'owner_test' field on the compliance response so consumers can filter (PR 1, assuming option 1)
- agent_test_history no longer exists; agent_compliance_runs is the single history table (PR 3)
- Third-party agent_test_history rows exported to S3 cold storage before drop (PR 3)
- Every reader of agent_contexts.last_test_* migrated to the view or to agent_compliance_runs directly (PR 4)
- The public API (/api/registry/agents/:url/compliance) reflects every kind of storyboard run, distinguished by triggered_by (heartbeat / owner_test)
Refs
- compliance-heartbeat.ts and compliance-db.ts
- agent-context-db.ts:recordTest() and the evaluate_agent_quality MCP handler