
Unify compliance state: every storyboard run writes to one canonical path (heartbeat + Addie + dashboard tests) #4247

@EmmaLouise2018


The bug — concrete narrative

A member registered an agent at https://www.harvingupta.xyz/api/mcp through Addie. After registration:

  • /dashboard/agents rendered the agent card with Run / Preview buttons next to every storyboard the registry catalog knows about (Capability discovery, Pagination cursor integrity, Determinism testing, etc.).
  • GET /api/registry/agents/<url>/compliance returned { "status": "unknown", "last_checked_at": null, "tracks": {} }.

The owner's reasonable assumption: "the dashboard is showing me storyboards I'm running, why does the API say unknown?"

Pod audit (harvingupta.xyz/api/mcp):

| Table | Row count |
| --- | --- |
| agent_registry_metadata | 0 |
| agent_compliance_status | 0 |
| agent_compliance_runs | 0 |
| agent_storyboard_status | 0 |
| agent_capabilities_snapshot | 1 (inferred_type: sales) |
| agent_health_snapshot | 1 (online: true, 147ms) |
| agent_verification_badges | 0 |
| discovered_agents | 0 |
| agent_contexts | 1 (total_tests_run: 0) |
| agent_test_history | 0 |

So in this concrete case nobody had actually run a storyboard yet — but the bug is structural regardless: state is split across two write paths and one read path, and the read path reflects only one of the writers.

The split

| Write path | Tables | Triggered by | Visible to /api/registry/agents/:url/compliance? |
| --- | --- | --- | --- |
| Heartbeat (compliance-heartbeat.ts) | agent_compliance_status, agent_compliance_runs, agent_storyboard_status, agent_registry_metadata | The 12h cron | Yes |
| Addie on-demand (evaluate_agent_quality, run_storyboard) | agent_contexts, agent_test_history (via recordTest) | Owner clicking Run / asking Addie / evaluate_agent_quality MCP call | No |

Same activity (running a storyboard against an agent), different table, different read endpoint. The dashboard's compliance tile reads the first path; "your test history" reads the second; the public API reads the first only.

Three concrete consequences:

  1. An owner who runs evaluate_agent_quality from Addie chat sees results in chat, but the public registry keeps saying unknown until the heartbeat runs ~12h later.
  2. The dashboard's compliance tile and the dashboard's test history are reading disjoint state — if a row in one path drifts from the other, neither is canonical.
  3. agent_contexts.last_test_* duplicates agent_test_history which duplicates what agent_compliance_status would say if the heartbeat had run. Three copies, three slightly different lifecycles.

Goal

One canonical compliance state. Every storyboard run — heartbeat, owner-triggered, dashboard-Run-button, Addie evaluate_agent_quality — writes to the same canonical tables (agent_compliance_status + agent_compliance_runs + agent_storyboard_status), distinguished by triggered_by and triggered_org_id, never by which table got written. Dashboard, public registry, and Addie all read from that one place.

Drop agent_test_history once writes are unified. Keep agent_contexts for what it actually carries (auth credentials, saved per-org context); stop using its last_test_* columns as duplicate state — derive them from a join against agent_compliance_runs instead.

Out of scope for this initiative: agent_capabilities_snapshot, agent_health_snapshot, discovered_agents, agent_verification_badges. Those track genuinely different lifecycle events (capability probe, liveness, crawler discovery, earned badges) and should stay separate. The unification is specifically about storyboard run results.

Quantify the work — fill in before opening PR 1

Before any PR opens, the implementer fills in:

| Metric | Value |
| --- | --- |
| recordTest() call sites | TBD: git grep -n "recordTest" server/src and list |
| agent_test_history row count (owner-triggered) | TBD: SELECT COUNT(*) FROM agent_test_history WHERE triggered_by = 'user' |
| agent_test_history row count (third-party) | TBD: same query, with triggered_by != 'user' |
| agent_test_history total | TBD |
| agent_contexts.last_test_* reader call sites | TBD: git grep -n "last_test_" server/src |
| DB writes added per owner test (PR 1) | TBD: count INSERT/UPDATE statements in complianceResultToDbInput path |
| Existing tests covering agent_compliance_status write path | TBD: grep -rn "agent_compliance_status" server/tests |

Numbers anchor the conversation. Don't open PR 1 until this table is filled.

Tradeoff settled — option 1 (owner-only writes to canonical state)

Non-owner runs return a session-scoped result in the response and never persist to agent_compliance_status. Owner-triggered runs (resolved via existing resolveOwnerMembership) write canonical. Fail-closed beats fail-open; respects the audit-grade-public-record invariant.
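The owner-only rule can be sketched as a small routing function. This is a minimal illustration, not the actual implementation: the signature given to the resolver (standing in for resolveOwnerMembership), the RunResult shape, and the writeCanonical callback are all assumptions; only triggered_by and triggered_org_id are names from this issue.

```typescript
// Hypothetical shapes for illustration only.
interface RunResult {
  agentUrl: string;
  verdict: string;
}

type RunOutcome =
  | { persisted: true; triggered_by: "owner_test"; triggered_org_id: string; result: RunResult }
  | { persisted: false; scope: "session"; result: RunResult };

// Stand-in for resolveOwnerMembership: returns the owning org id when the
// caller's org owns the agent, otherwise null. The real signature is not
// shown in this issue.
type OwnerResolver = (callerOrgId: string, agentUrl: string) => string | null;

function routeRunResult(
  callerOrgId: string,
  result: RunResult,
  resolveOwner: OwnerResolver,
  writeCanonical: (r: RunResult, ownerOrgId: string) => void,
): RunOutcome {
  let ownerOrgId: string | null = null;
  try {
    ownerOrgId = resolveOwner(callerOrgId, result.agentUrl);
  } catch {
    // Fail closed: if ownership cannot be resolved, treat the caller as a non-owner.
    ownerOrgId = null;
  }

  if (ownerOrgId === null) {
    // Non-owner: return the result to the caller, never touch canonical state.
    return { persisted: false, scope: "session", result };
  }

  writeCanonical(result, ownerOrgId);
  return { persisted: true, triggered_by: "owner_test", triggered_org_id: ownerOrgId, result };
}
```

The resolver error path is the part worth noticing: an ownership lookup that throws yields a session-scoped result rather than a canonical write, which is what fail-closed means here.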

Frozen reporting contract — semantic shift on /api/registry/agents/:url/compliance

This endpoint is a public registry surface. What it reflects is changing, even though the field names aren't:

  • Pre-fix: "this agent's last scheduled verdict" (heartbeat-only).
  • Post-fix: "this agent's last verdict from any source" (heartbeat + owner_test).

A consumer who scraped this endpoint daily learned heartbeat history; post-fix they see any verdict, possibly an owner running it on demand to test a fix. Tell the caller, don't guess silently. PR 1 MUST do one of:

  1. Add verdict_source: 'heartbeat' | 'owner_test' to the response shape so consumers can filter. Recommended.
  2. Add a SHOULD-clause to the changeset explicitly naming the semantic shift for downstream scrapers, with a docs page link.

PR 1's PR description picks one and locks it in. The OperatorLookupResultSchema analog for compliance gets the additive field if option 1 wins.
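If option 1 is taken, the additive field might look like the sketch below. The status, last_checked_at and tracks fields mirror the example response quoted earlier in this issue; making verdict_source optional, and the helper name, are assumptions for illustration.

```typescript
type VerdictSource = "heartbeat" | "owner_test";

interface ComplianceResponse {
  status: string;                  // e.g. "unknown" before any run
  last_checked_at: string | null;
  tracks: Record<string, unknown>;
  // Additive field. Assumed optional so that a response with no verdict yet
  // keeps the same shape it has today.
  verdict_source?: VerdictSource;
}

// Hypothetical consumer-side helper: a scraper that only wants the pre-fix
// semantics ("last scheduled verdict") keeps heartbeat verdicts and ignores
// everything else.
function isScheduledVerdict(response: ComplianceResponse): boolean {
  return response.verdict_source === "heartbeat";
}
```

A downstream scraper that calls isScheduledVerdict before recording a data point keeps learning heartbeat history only, which is the behavior it had before the change.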

Last-write-wins race rule — pin in PR 1

Heartbeat and owner_test can fire within minutes. Two concurrent writers, one row in agent_compliance_status. The unification's conflict-resolution rule is last-write-wins on (agent_url): the latest verdict wins, regardless of source.

PR 1 includes a test that pins this:

// heartbeat at T, owner_test at T+1s, endpoint reflects T+1s
// then owner_test at T+2s, heartbeat at T+3s, endpoint reflects T+3s

A future refactor that switches to "first-write-wins" or "merge" silently changes the public contract. The test is the contract.
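A minimal sketch of that pinning test, run against an in-memory stand-in for agent_compliance_status rather than the real database. The row shape, the class, and the example URL are hypothetical. The sketch treats "last write" as arrival order; the issue does not say how writes that arrive out of timestamp order should be handled.

```typescript
type VerdictSource = "heartbeat" | "owner_test";

interface StatusRow {
  agent_url: string;
  status: string;
  verdict_source: VerdictSource;
  checked_at: number; // epoch milliseconds
}

// In-memory stand-in for agent_compliance_status: one row per agent_url.
class InMemoryComplianceStatus {
  private rows: Record<string, StatusRow> = {};

  // Upsert keyed on agent_url: the most recent call overwrites, regardless of source.
  upsert(row: StatusRow): void {
    this.rows[row.agent_url] = row;
  }

  get(agentUrl: string): StatusRow | undefined {
    return this.rows[agentUrl];
  }
}

function pinLastWriteWins(): void {
  const url = "https://agent.example/mcp"; // hypothetical agent
  const T = 1700000000000;
  const store = new InMemoryComplianceStatus();

  const expectLatest = (source: VerdictSource, checkedAt: number): void => {
    const row = store.get(url);
    if (!row || row.verdict_source !== source || row.checked_at !== checkedAt) {
      throw new Error(`expected ${source} at ${checkedAt}, got ${JSON.stringify(row)}`);
    }
  };

  // heartbeat at T, owner_test at T+1s, endpoint reflects T+1s
  store.upsert({ agent_url: url, status: "failing", verdict_source: "heartbeat", checked_at: T });
  store.upsert({ agent_url: url, status: "passing", verdict_source: "owner_test", checked_at: T + 1000 });
  expectLatest("owner_test", T + 1000);

  // then owner_test at T+2s, heartbeat at T+3s, endpoint reflects T+3s
  store.upsert({ agent_url: url, status: "passing", verdict_source: "owner_test", checked_at: T + 2000 });
  store.upsert({ agent_url: url, status: "failing", verdict_source: "heartbeat", checked_at: T + 3000 });
  expectLatest("heartbeat", T + 3000);
}
```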

Plan — four small PRs, stacked

PR 1 — Owner-triggered storyboard runs unify with heartbeat writes

What changes: evaluate_agent_quality and run_storyboard (when triggered by an owner of the agent) call the same DB writes the heartbeat does:

  • complianceResultToDbInput() shared between both paths.
  • agent_compliance_status upserted with the new verdict.
  • agent_compliance_runs row inserted with triggered_by = 'owner_test' + triggered_org_id = <owner-org>.
  • agent_storyboard_status rows updated per storyboard.
  • Ownership resolution via existing resolveOwnerMembership — non-owner runs short-circuit to session-scoped (current agent_test_history write path retained for non-owner runs in this PR; deprecated in PR 3).

Schema:

  • Add 'owner_test' to the triggered_by enum on agent_compliance_runs (CHECK constraint or enum type — match the existing pattern).
  • Add verdict_source: 'heartbeat' | 'owner_test' to the compliance response shape (assuming option 1 above is taken).

Tests:

  • Owner runs evaluate_agent_quality → agent_compliance_status reflects the result; public endpoint serves the new status without waiting for the heartbeat.
  • Non-owner runs evaluate_agent_quality against someone else's agent → agent_compliance_status is unchanged; result flows through the existing session-scoped path.
  • triggered_by = 'owner_test' and triggered_org_id correctly populated.
  • Last-write-wins race pinned — heartbeat then owner_test, owner_test then heartbeat, both interleavings exercised.
  • verdict_source correctly serialized (if option 1 taken).

User-visible win: owner runs a test and sees the result on /dashboard/agents and via /api/registry/agents/:url/compliance immediately. Closes the 12h gap.

PR 2 — Dashboard reads from the single source

What changes: /dashboard/agents stops reading agent_test_history for the "compliance" tile. Reads agent_compliance_status + agent_compliance_runs directly.

The dashboard's "your test history" view becomes a triggered_by IN ('heartbeat', 'owner_test') filter on agent_compliance_runs — same table, different lens. Owner-triggered tests appear interleaved with heartbeat runs in chronological order, distinguished by the triggered_by badge.
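The "same table, different lens" idea can be illustrated with a filter over run rows. This is a sketch over in-memory data; the ComplianceRun fields other than triggered_by and triggered_org_id (agent_url, verdict, ran_at) are assumed column names, and the real query would run in SQL against agent_compliance_runs.

```typescript
type TriggeredBy = "heartbeat" | "owner_test";

interface ComplianceRun {
  agent_url: string;
  triggered_by: TriggeredBy;
  triggered_org_id: string | null; // assumed null for heartbeat runs
  verdict: string;
  ran_at: string; // ISO-8601 timestamp
}

// Returns one agent's runs in chronological order, restricted to the given
// sources. The default lens shows heartbeat and owner_test runs interleaved.
function testHistory(
  runs: ComplianceRun[],
  agentUrl: string,
  sources: TriggeredBy[] = ["heartbeat", "owner_test"],
): ComplianceRun[] {
  return runs
    .filter((r) => r.agent_url === agentUrl && sources.indexOf(r.triggered_by) !== -1)
    .sort((a, b) => Date.parse(a.ran_at) - Date.parse(b.ran_at));
}
```

Passing ["owner_test"] as the third argument gives the owner-only strand; the triggered_by value on each returned row is what the badge would render.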

Cohort/timestamp drift during the PR 2 soak window: Before PR 3 backfills history, existing pre-PR-1 agent_test_history rows are orphans — visible neither in the canonical compliance tile nor in the new test-history strand. PR 2 ships with a one-time banner on the test-history view: "Tests run before [PR 1 deploy date] are archived under Test History (legacy) — see [link]." The banner stays until PR 3 backfills + drops; then it disappears automatically.

Schema: none.

Tests:

  • Dashboard renders the same compliance verdict whether the last write was a heartbeat or an owner test.
  • "Test history" filter shows owner_test runs as a distinct strand from heartbeat runs.
  • Pre-PR-1 agent_test_history rows render under the legacy view, not the new strand.

User-visible win: dashboard test-history and dashboard compliance tile finally agree. No drift possible.

PR 3 — Deprecate agent_test_history (destructive — explicit soak gate)

Pre-merge gate (load-bearing):

  • PR 1 has been live in prod for ≥ 14 days with zero canonical-write incidents (no owner_test write that produced a malformed agent_compliance_status row, no flap reports).
  • PR 2 has been live in prod for ≥ 7 days with the dashboard rendering identical verdicts from old vs new path on a hand-audited sample.
  • Backfill row-count delta is ±0 on staging. Run the migration on a staging clone of prod, count agent_test_history rows pre-migration vs agent_compliance_runs rows added, assert the owner-triggered subset matches exactly.
  • Third-party-row deletion volume quantified and acknowledged (see below).

What changes: recordTest() callers migrated:

  • Owner-triggered paths already write to agent_compliance_runs (PR 1).
  • Non-owner paths drop the persistent write entirely and return session-scoped results only.

After all callers migrate:

  • Backfill owner-triggered agent_test_history rows → agent_compliance_runs (with triggered_by = 'owner_test' retroactively).
  • Export third-party agent_test_history rows to S3 cold storage before drop. This is data, not noise — even if the new policy refuses to write more, the historical audit trail of external testing has value. One-time export, JSONL format, cataloged in ops runbook. Do not silently lose history.
  • Drop agent_test_history table only after backfill verification + export complete.
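The export-then-verify step might look like the following sketch. It serializes rows to JSONL and computes a SHA-256 checksum with Node's built-in crypto module; the row fields are hypothetical (the agent_test_history schema is not shown in this issue), and the upload to S3 is deliberately left out.

```typescript
import { createHash } from "node:crypto";

// Hypothetical subset of agent_test_history columns.
interface LegacyTestRow {
  id: number;
  agent_url: string;
  triggered_by: string;
  passed: boolean;
}

// One JSON object per line, newline-terminated.
function toJsonl(rows: LegacyTestRow[]): string {
  return rows.map((r) => JSON.stringify(r) + "\n").join("");
}

function sha256Hex(payload: string): string {
  return createHash("sha256").update(payload, "utf8").digest("hex");
}

// Re-checks a payload read back from cold storage against the checksum and
// row count recorded at export time.
function verifyExport(storedPayload: string, expectedChecksum: string, expectedRows: number): boolean {
  const lineCount = storedPayload.split("\n").filter((line) => line.length > 0).length;
  return sha256Hex(storedPayload) === expectedChecksum && lineCount === expectedRows;
}
```

Recording both the checksum and the row count in the ops runbook entry lets the drop migration refuse to run until verifyExport returns true for the object actually stored.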

Schema: drop agent_test_history. The backfill migration is a separate file from the drop migration (two-phase) so a partial deploy can't run drop without backfill.

Migration safety: treat as destructive. Two-phase migration (backfill, drop). release_command runs phase 1 only; phase 2 ships in a follow-up release after verification. Row-count assertion in the runner.
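The row-count assertion could be as small as the sketch below. It assumes, following the query in the metrics table above, that owner-triggered legacy rows are the ones with triggered_by = 'user'; the function name and report shape are hypothetical.

```typescript
interface LegacyRow {
  id: number;
  triggered_by: string;
}

interface BackfillReport {
  ownerTriggered: number; // rows that must be backfilled
  thirdParty: number;     // rows that are exported, not backfilled
  inserted: number;       // rows actually added to agent_compliance_runs
}

// Throws unless the number of inserted rows matches the owner-triggered
// subset exactly (a delta of ±0).
function assertBackfillDelta(legacy: LegacyRow[], insertedCount: number): BackfillReport {
  const ownerTriggered = legacy.filter((r) => r.triggered_by === "user").length;
  const thirdParty = legacy.length - ownerTriggered;

  if (insertedCount !== ownerTriggered) {
    const delta = insertedCount - ownerTriggered;
    throw new Error(
      `backfill row-count delta is ${delta}: expected ${ownerTriggered}, inserted ${insertedCount}`,
    );
  }
  return { ownerTriggered, thirdParty, inserted: insertedCount };
}
```

The thirdParty count in the report doubles as the third-party deletion volume that the pre-merge gate asks to be quantified.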

Tests: post-drop, no remaining caller of recordTest. Backfill correctness verified with a row-count assertion in the migration runner. Export integrity verified with a checksum of the cold-storage JSONL.

PR 4 — Collapse agent_contexts.last_test_* into a derived view

Pre-merge gate: Reader audit complete. Before opening PR 4, the implementer enumerates every reader of agent_contexts.last_test_scenario, last_test_passed, last_test_summary, last_tested_at, total_tests_run and lists in the PR description:

| Reader | File:line | Refactor target |
| --- | --- | --- |
| TBD | TBD | read view / read agent_compliance_runs directly |

PR 4 doesn't merge until every reader is migrated to the view OR refactored to query agent_compliance_runs directly.

What changes: agent_contexts.last_test_scenario, last_test_passed, last_test_summary, last_tested_at, total_tests_run are removed from the table. A view (agent_context_with_latest_test) joins agent_contexts to the latest agent_compliance_runs row per (org, agent_url) to derive these fields on read.
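What the view computes can be sketched in application code. This is an in-memory illustration of the join, not the view definition itself: the run columns scenario, passed, summary and ran_at are assumed names, and matching a run to a context through triggered_org_id is an assumption about how "per (org, agent_url)" is keyed.

```typescript
interface AgentContext {
  org_id: string;
  agent_url: string;
}

interface ComplianceRun {
  triggered_org_id: string | null;
  agent_url: string;
  scenario: string;
  passed: boolean;
  summary: string;
  ran_at: string; // ISO-8601 timestamp
}

interface ContextWithLatestTest extends AgentContext {
  last_test_scenario: string | null;
  last_test_passed: boolean | null;
  last_test_summary: string | null;
  last_tested_at: string | null;
  total_tests_run: number;
}

// Derives the former last_test_* columns from the runs for this (org, agent_url).
function withLatestTest(ctx: AgentContext, runs: ComplianceRun[]): ContextWithLatestTest {
  const mine = runs.filter(
    (r) => r.triggered_org_id === ctx.org_id && r.agent_url === ctx.agent_url,
  );
  const latest = mine.reduce<ComplianceRun | null>(
    (best, r) => (best === null || Date.parse(r.ran_at) > Date.parse(best.ran_at) ? r : best),
    null,
  );
  return {
    ...ctx,
    last_test_scenario: latest ? latest.scenario : null,
    last_test_passed: latest ? latest.passed : null,
    last_test_summary: latest ? latest.summary : null,
    last_tested_at: latest ? latest.ran_at : null,
    total_tests_run: mine.length,
  };
}
```

Under this keying, heartbeat runs with no triggered_org_id never match a context, so the derived fields reflect only that org's own runs. Whether heartbeat runs should count toward total_tests_run is a decision the reader audit would need to settle.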

Schema: drop columns from agent_contexts. Create view.

Tests: existing readers of last_test_* continue to work via the view (or via the direct-query refactor).

User-visible win: none directly — pure schema-cleanup. Ships when PR 1-3 have soaked in prod (~2 weeks past PR 3).

Phasing

| PR | Ships when | Risk | Reversibility | Soak before next |
| --- | --- | --- | --- | --- |
| PR 1 | Immediately (closes the user-visible gap) | Low: additive write to existing canonical tables | Easy: feature-flag the owner-test write | ≥ 14 days |
| PR 2 | After PR 1's 14-day soak | Low: read-path refactor | Easy: swap reader back | ≥ 7 days |
| PR 3 | After PR 2's 7-day soak + backfill verified on staging + export complete | Medium: destructive migration | Hard: table drop (cold-storage export is the recovery path) | ≥ 14 days |
| PR 4 | After PR 3's 14-day soak + reader audit | Low: pure cleanup | Easy: restore columns from view | n/a |

PR 1 ships first because it closes the actual user-visible gap (owner runs reflect immediately on the dashboard). PR 2-4 are cleanup that doesn't change UX. The destructive migration in PR 3 is the only one that needs careful staging — gated on staging-first deploy + row-count verification + S3 export.

Acceptance criteria for the initiative

  • Owner runs evaluate_agent_quality → public /api/registry/agents/:url/compliance reflects the verdict within seconds (PR 1)
  • verdict_source: 'heartbeat' | 'owner_test' field on the compliance response so consumers can filter (PR 1, assuming option 1)
  • Last-write-wins race pinned by a test (PR 1)
  • Dashboard compliance tile and dashboard test history read from the same table (PR 2)
  • agent_test_history no longer exists; agent_compliance_runs is the single history table (PR 3)
  • Third-party agent_test_history rows exported to S3 cold storage before drop (PR 3)
  • Every reader of agent_contexts.last_test_* migrated to the view or to agent_compliance_runs directly (PR 4)
  • One read endpoint (/api/registry/agents/:url/compliance) reflects every kind of storyboard run, distinguished by triggered_by (heartbeat / owner_test)

Refs

Metadata

Labels

addie: Issues related to Addie (via any channel)
claude-triaged: Issue has been triaged by the Claude Code triage routine. Remove to re-triage.
compliance-suite
