From de5d235a275f7b5abbb3d10a24273a76bc86ebe7 Mon Sep 17 00:00:00 2001
From: Stas Moreinis <stas.moreinis@scale.com>
Date: Fri, 22 May 2026 15:16:13 -0700
Subject: [PATCH 1/6] docs(spec): draft AgentEx subsection for SGP service
 decomposition ERD

Frames the AgentEx backend as a smaller monolith and decomposes it the
same way the parent ERD does: agentex-state and agentex-conversations
are extracted as Go services (high-write, latency-sensitive),
agentex-tasks stays Python, and agentex-control-plane is the residual
after the three extractions. Includes a Mongo->Postgres evaluation
(commit to Postgres only if performance is comparable), the extraction
sequencing, and four open questions.
---
 .../2026-05-22-agentex-erd-section-design.md  | 72 +++++++++++++++++++
 1 file changed, 72 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md

diff --git a/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md b/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md
new file mode 100644
index 00000000..919614e0
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md
@@ -0,0 +1,72 @@
+# AgentEx subsection — ERD: SGP Service Decomposition and Catalog
+
+| | |
+|---|---|
+| **Status** | Draft, in review |
+| **Date** | 2026-05-22 |
+| **Owner** | AgentEx team |
+| **Parent doc** | ERD: SGP Service Decomposition and Catalog |
+
+## Purpose of this document
+
+This is the proposed AgentEx subsection of the parent ERD, written as a mini-ERD applied to the AgentEx backend (the FastAPI + Temporal service in `scale-agentex/agentex/`). The parent ERD decomposes `egp-api-backend` into fourteen services; the AgentEx slot in the catalog is currently a stub. This document fills that stub by treating the AgentEx backend as a smaller monolith and proposing its own decomposition into four services, mirroring the parent ERD's structure (Problem Statement, Solution Statement, Service Inventory, per-service catalog bullets, Open Questions).
+
+Scope is the AgentEx backend only. `agentex-ui` and `agentex-agents/teams/*` are out of scope for this section.
+
+## Problem Statement
+
+The AgentEx backend (`scale-agentex/agentex/`) is a single FastAPI + Temporal service that today owns four distinct concerns inside one process and one set of data stores: a low-write **control plane** (agent registry, deployments, schedules, agent API keys), a **task lifecycle plane** (tasks, task agents, task tracker), a realtime **agent I/O surface** (messages, events, Redis-streams fanout), and an **agent-internal state surface** (per-task K/V state, LangGraph checkpoints). These four planes share one PostgreSQL cluster, one MongoDB cluster, one Redis instance, and one pod set sized for the most demanding of them.
+
+The current shape has measurable costs:
+
+- **Coupled scaling.** The agent I/O surface (messages + Redis streams) is realtime, fans out to UI subscribers, and scales with interactive session count. The agent-internal state surface scales with autonomous-agent step rate, which is much higher and bursty. The control plane is near-idle. Today all four planes share one pod set sized for the loudest of them.
+- **Shared failure domain.** A Mongo problem on the messages collection can block the agent-internal state surface (also Mongo). A noisy write path can starve other writers. The agent registry — which is on the critical path of every task creation — shares a process with the data plane. Operating both PostgreSQL and MongoDB also means two databases to deploy, monitor, and back up for what are largely append-and-key-value workloads.
+
+Spans are a related concern being addressed by other means: AgentEx will retire its `spans` table in favor of consuming `sgp-traces`, so spans are not a target of this decomposition.
+
+## Solution Statement
+
+Decompose the AgentEx backend into four services, each owning a coherent set of data and a coherent load profile. **`agentex-control-plane`** owns the agent registry and deployment registration surface — agents, deployments, deployment history, schedules, and agent API keys; low write rate, off the agent-execution hot path. **`agentex-tasks`** owns the task lifecycle — tasks, task→agent associations, and the agent task tracker; the canonical "what tasks exist and what state are they in" surface. **`agentex-conversations`** owns the realtime agent I/O surface — messages, events, and the Redis-streams fanout that carries them to UI subscribers; latency-sensitive, paired tightly with the streaming bus. **`agentex-state`** owns agent-internal data — per-task K/V state and LangGraph checkpoints; high write rate, written by the agent runtime for the agent runtime.
+
+As part of this migration, AgentEx **evaluates Postgres as the source of truth for messages and task states**, with the intent to commit to Postgres if performance is comparable to MongoDB. Today, MongoDB hosts both collections; co-location means a Mongo incident affects two unrelated services post-decomposition, and operating Mongo alongside Postgres is duplicated tooling, monitoring, and backup surface for what is effectively two ordered append/key-value workloads that Postgres can serve well. The decomposition is the natural moment to validate the swap: each new service is the boundary at which its store choice can change without rippling. If the evaluation succeeds, the resulting topology runs on one fewer database; if performance does not meet bar, the services keep their MongoDB stores and the boundary work still stands. Comparable performance is defined as **within 5–10% of the MongoDB baseline on write throughput and p99 latency**. A load-test baseline for the existing MongoDB infrastructure will be established prior to extraction 1; the same load test will be executed against the candidate Postgres topology to evaluate the swap.
+
+Sequence the extraction **leaves-first**, mirroring the parent ERD's protocol — pull `agentex-state` out first, then `agentex-conversations`, then `agentex-tasks`. After those three extractions, what remains in the agentex backend is `agentex-control-plane` by attrition; there is no separate extraction step for it, and further decomposition of the residual control plane can be revisited at that point if warranted. Each extraction follows the parent ERD's **per-service dress-rehearsal + physical-cutover** protocol: isolate the service's data within the shared source stores with restricted credentials, surface and fix cross-domain access violations, then cut over to dedicated infrastructure.
+
+Spans are not included in this decomposition. The `spans` table is being retired in favor of consuming `sgp-traces`: the AgentEx SDK has already removed local spans as a default tracing processor, but the receiving plumbing in the backend remains because legacy SDK versions are still in use. The deprecation path is to gradually no-op the spans endpoints as legacy agents roll off, allowing the code to be removed without breaking outstanding clients.
+
+## Service Inventory
+
+| # | Split | Owns | Go | Recommend | Notes |
+|---|---|---|---|---|---|
+| 1 | `agentex-state` | `task_states`\* + `checkpoints` + `checkpoint_blobs` + `checkpoint_writes` + `checkpoint_migrations`; state and checkpoint routes | **Yes** | Yes | **Extract first.** Simplest surface; test-drives the per-service extraction protocol. High write rate (per-agent-step) — Go rewrite target for throughput. LangGraph-checkpointer-over-HTTP work needed — see Open Questions. |
+| 2 | `agentex-conversations` | `messages`\* + `events` + Redis-streams fanout; message and event routes | **Yes** | Yes | **Extract second.** Realtime I/O surface — SDK's `send_message` / `send_event` / streaming `task_message_context` paths terminate here. High write + latency-sensitive — Go rewrite target. Second of the two Mongo-owning services, so the Mongo→Postgres evaluation completes here. |
+| 3 | `agentex-tasks` | `tasks` + `task_agents` + `agent_task_tracker`; task lifecycle routes | No | Yes | **Extract third.** Canonical "what tasks exist and what state are they in" surface. Mid-throughput lifecycle CRUD; Python is fine, no Go affinity. Depends only on `agentex-control-plane` for inbound FKs. |
+| 4 | `agentex-control-plane` | `agents` + `deployments` + `deployment_history` + `schedules` + `agent_api_keys`; agent registration, deployment registration, schedule, API-key routes | No | Yes | **Residual.** What remains in the agentex backend after rows 1–3 are extracted. No separate extraction step. Hosts the AgentEx Temporal worker on a single task queue. Further decomposition revisited at that point if warranted. |
+
+> \* `messages` and `task_states` live in MongoDB today. The Mongo→Postgres swap is committed-conditional-on-performance per the Solution Statement; if performance does not meet bar, these services keep their MongoDB stores and the service boundaries still stand.
+
+> Note: `agentex-auth` is an existing AgentEx-owned service that already runs separately and is not part of this decomposition. It is included in the per-service catalog bullets below for completeness.
+
+## Per-service catalog bullets
+
+These are the bullets that drop into the "All-up SGP Service Catalog" section of the parent ERD under the AgentEx subsection.
+
+> **`agentex-state`.** Owns `task_states` (Mongo today, Postgres pending evaluation) and the LangGraph checkpointer tables (`checkpoints`, `checkpoint_blobs`, `checkpoint_writes`, `checkpoint_migrations`). Per-task K/V state and graph-checkpoint storage for the AgentEx runtime; high write rate, written by the agent runtime for the agent runtime. Go service. Outbound: `agentex-tasks` (task scope validation), `agentex-auth` (authn / authz).
+>
+> **`agentex-conversations`.** Owns `messages` (Mongo today, Postgres pending evaluation), `events`, and the Redis-streams fanout that carries them to UI subscribers. The realtime agent I/O surface — the SDK's `send_message`, `send_event`, and streaming `task_message_context` paths terminate here. Latency-sensitive, paired tightly with the streaming bus. Go service. Outbound: `agentex-tasks` (task scope validation), `agentex-auth` (authn / authz).
+>
+> **`agentex-tasks`.** Owns `tasks`, `task_agents`, `agent_task_tracker`; task lifecycle routes including the cron-triggered task creation surface that pairs with control-plane schedules. The canonical "what tasks exist and what state are they in" surface; read by `agentex-state` and `agentex-conversations`. Python service. Outbound: `agentex-control-plane` (agent registry lookups), `agentex-auth` (authn / authz).
+>
+> **`agentex-control-plane`.** The residual AgentEx backend after the three extractions above. Owns `agents`, `deployments`, `deployment_history`, `schedules`, `agent_api_keys`; agent registration and lifecycle, deployment registration (the in-cluster registry of what version is currently registered and serving — distinct from `sgp-agent-deploy`'s build/push pipeline), cron schedules, and agent-scoped API key management. Hosts the AgentEx Temporal worker on a single task queue. Other AgentEx services do not use Temporal post-decomposition; if a future service needs it, it gets its own service-specific queue. Python service. Outbound: `agentex-auth` (authn / authz).
+>
+> **`agentex-auth`.** Existing AgentEx-owned authentication and authorization service; exposes `/v1/authn` and `/v1/authz/{grant,revoke,check,search}`. All AgentEx services authenticate and authorize requests against `agentex-auth` directly. Outbound: `sgp-identity` (delegated identity verification). Future direction may fold `agentex-auth` into `sgp-identity` as part of OneAuth — see Open Questions.
+
+## Open Questions
+
+1. **`agentex-auth` ↔ `sgp-identity` fold-in via OneAuth.** Today `agentex-auth` is a standalone AgentEx service that AgentEx services call directly and which delegates identity verification to `sgp-identity` underneath. The OneAuth direction may consolidate `agentex-auth` into `sgp-identity`, but the decision is not committed. Affects whether AgentEx services continue calling `agentex-auth` long-term or eventually call `sgp-identity` directly, and whether `agentex-auth`'s authz-policy responsibilities move with it.
+
+2. **LangGraph checkpointer transport.** LangGraph's stock `PostgresSaver` opens a direct Postgres connection from the agent process, which does not survive a service boundary when `agentex-state` becomes a separate Go service. Two options: (a) ship a custom HTTP-backed checkpointer in `agentex-sdk` that targets `agentex-state`'s API (preferred — clean boundary); or (b) expose Postgres protocol-compatible access from `agentex-state` (less SDK work, fuzzier service boundary).
+
+3. **Checkpoint lift-out trigger.** `agentex-state` bundles LangGraph checkpoints with general K/V state. Lift `checkpoints` into a separate `agentex-checkpoints` service if (a) non-LangGraph graph runtimes are adopted, or (b) checkpoint write rate dominates the service. Specific criteria to be sharpened in a follow-up.
+
+4. **`agentex-control-plane` ↔ `sgp-agent-deploy` boundary.** `sgp-agent-deploy` (per parent ERD) owns `agentex_cloud_builds`, `agentex_cloud_deploys`, `agentex_permissions` — the build/push pipeline. `agentex-control-plane` owns `deployments` + `deployment_history` — the in-cluster runtime registry of "what version is currently registered and serving." Pin the handoff between "build artifact ready" and "live agent registered and serving" so the contract is unambiguous.

From ef65fe4c4ffff36f21cd009c2ec09dbbdbafb5d0 Mon Sep 17 00:00:00 2001
From: Stas Moreinis <stas.moreinis@scale.com>
Date: Fri, 22 May 2026 15:27:59 -0700
Subject: [PATCH 2/6] docs(spec): drop LangGraph checkpoint open question;
 contract already exists

PR #146 already established the HTTP checkpoint boundary: the SDK calls
the backend's /checkpoints/* endpoints via HttpCheckpointSaver rather
than connecting to Postgres directly. The boundary therefore exists
today; when agentex-state becomes a separate Go service, it just needs
to serve the same contract. Renumber the remaining open questions
(down to 3).
---
 .../specs/2026-05-22-agentex-erd-section-design.md     | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md b/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md
index 919614e0..ce19d100 100644
--- a/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md
+++ b/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md
@@ -38,7 +38,7 @@ Spans are not included in this decomposition. The `spans` table is being retired
 
 | # | Split | Owns | Go | Recommend | Notes |
 |---|---|---|---|---|---|
-| 1 | `agentex-state` | `task_states`\* + `checkpoints` + `checkpoint_blobs` + `checkpoint_writes` + `checkpoint_migrations`; state and checkpoint routes | **Yes** | Yes | **Extract first.** Simplest surface; test-drives the per-service extraction protocol. High write rate (per-agent-step) — Go rewrite target for throughput. LangGraph-checkpointer-over-HTTP work needed — see Open Questions. |
+| 1 | `agentex-state` | `task_states`\* + `checkpoints` + `checkpoint_blobs` + `checkpoint_writes` + `checkpoint_migrations`; state and checkpoint routes | **Yes** | Yes | **Extract first.** Simplest surface; test-drives the per-service extraction protocol. High write rate (per-agent-step) — Go rewrite target for throughput. The HTTP checkpoint contract is already in place (PR #146): the SDK's `HttpCheckpointSaver` calls `/checkpoints/{get-tuple,put,put-writes,list,delete-thread}`, so the service boundary exists today — the Go service just needs to serve the same contract. |
 | 2 | `agentex-conversations` | `messages`\* + `events` + Redis-streams fanout; message and event routes | **Yes** | Yes | **Extract second.** Realtime I/O surface — SDK's `send_message` / `send_event` / streaming `task_message_context` paths terminate here. High write + latency-sensitive — Go rewrite target. Second of the two Mongo-owning services, so the Mongo→Postgres evaluation completes here. |
 | 3 | `agentex-tasks` | `tasks` + `task_agents` + `agent_task_tracker`; task lifecycle routes | No | Yes | **Extract third.** Canonical "what tasks exist and what state are they in" surface. Mid-throughput lifecycle CRUD; Python is fine, no Go affinity. Depends only on `agentex-control-plane` for inbound FKs. |
 | 4 | `agentex-control-plane` | `agents` + `deployments` + `deployment_history` + `schedules` + `agent_api_keys`; agent registration, deployment registration, schedule, API-key routes | No | Yes | **Residual.** What remains in the agentex backend after rows 1–3 are extracted. No separate extraction step. Hosts the AgentEx Temporal worker on a single task queue. Further decomposition revisited at that point if warranted. |
@@ -51,7 +51,7 @@ Spans are not included in this decomposition. The `spans` table is being retired
 
 These are the bullets that drop into the "All-up SGP Service Catalog" section of the parent ERD under the AgentEx subsection.
 
-> **`agentex-state`.** Owns `task_states` (Mongo today, Postgres pending evaluation) and the LangGraph checkpointer tables (`checkpoints`, `checkpoint_blobs`, `checkpoint_writes`, `checkpoint_migrations`). Per-task K/V state and graph-checkpoint storage for the AgentEx runtime; high write rate, written by the agent runtime for the agent runtime. Go service. Outbound: `agentex-tasks` (task scope validation), `agentex-auth` (authn / authz).
+> **`agentex-state`.** Owns `task_states` (Mongo today, Postgres pending evaluation) and the LangGraph checkpointer tables (`checkpoints`, `checkpoint_blobs`, `checkpoint_writes`, `checkpoint_migrations`). Per-task K/V state and graph-checkpoint storage for the AgentEx runtime; high write rate, written by the agent runtime for the agent runtime. The HTTP checkpoint contract is already in place — the AgentEx SDK calls the backend's `/checkpoints/*` endpoints via `HttpCheckpointSaver` rather than connecting to Postgres directly, so the service boundary exists today. Go service. Outbound: `agentex-tasks` (task scope validation), `agentex-auth` (authn / authz).
 >
 > **`agentex-conversations`.** Owns `messages` (Mongo today, Postgres pending evaluation), `events`, and the Redis-streams fanout that carries them to UI subscribers. The realtime agent I/O surface — the SDK's `send_message`, `send_event`, and streaming `task_message_context` paths terminate here. Latency-sensitive, paired tightly with the streaming bus. Go service. Outbound: `agentex-tasks` (task scope validation), `agentex-auth` (authn / authz).
 >
@@ -65,8 +65,6 @@ These are the bullets that drop into the "All-up SGP Service Catalog" section of
 
 1. **`agentex-auth` ↔ `sgp-identity` fold-in via OneAuth.** Today `agentex-auth` is a standalone AgentEx service that AgentEx services call directly and which delegates identity verification to `sgp-identity` underneath. The OneAuth direction may consolidate `agentex-auth` into `sgp-identity`, but the decision is not committed. Affects whether AgentEx services continue calling `agentex-auth` long-term or eventually call `sgp-identity` directly, and whether `agentex-auth`'s authz-policy responsibilities move with it.
 
-2. **LangGraph checkpointer transport.** LangGraph's stock `PostgresSaver` opens a direct Postgres connection from the agent process, which does not survive a service boundary when `agentex-state` becomes a separate Go service. Two options: (a) ship a custom HTTP-backed checkpointer in `agentex-sdk` that targets `agentex-state`'s API (preferred — clean boundary); or (b) expose Postgres protocol-compatible access from `agentex-state` (less SDK work, fuzzier service boundary).
+2. **Checkpoint lift-out trigger.** `agentex-state` bundles LangGraph checkpoints with general K/V state. Lift `checkpoints` into a separate `agentex-checkpoints` service if (a) non-LangGraph graph runtimes are adopted, or (b) checkpoint write rate dominates the service. Specific criteria to be sharpened in a follow-up.
 
-3. **Checkpoint lift-out trigger.** `agentex-state` bundles LangGraph checkpoints with general K/V state. Lift `checkpoints` into a separate `agentex-checkpoints` service if (a) non-LangGraph graph runtimes are adopted, or (b) checkpoint write rate dominates the service. Specific criteria to be sharpened in a follow-up.
-
-4. **`agentex-control-plane` ↔ `sgp-agent-deploy` boundary.** `sgp-agent-deploy` (per parent ERD) owns `agentex_cloud_builds`, `agentex_cloud_deploys`, `agentex_permissions` — the build/push pipeline. `agentex-control-plane` owns `deployments` + `deployment_history` — the in-cluster runtime registry of "what version is currently registered and serving." Pin the handoff between "build artifact ready" and "live agent registered and serving" so the contract is unambiguous.
+3. **`agentex-control-plane` ↔ `sgp-agent-deploy` boundary.** `sgp-agent-deploy` (per parent ERD) owns `agentex_cloud_builds`, `agentex_cloud_deploys`, `agentex_permissions` — the build/push pipeline. `agentex-control-plane` owns `deployments` + `deployment_history` — the in-cluster runtime registry of "what version is currently registered and serving." Pin the handoff between "build artifact ready" and "live agent registered and serving" so the contract is unambiguous.

From aa3b38ba2edc07421459728d0c6c44c06ee38621 Mon Sep 17 00:00:00 2001
From: Stas Moreinis <stas.moreinis@scale.com>
Date: Fri, 22 May 2026 15:30:26 -0700
Subject: [PATCH 3/6] docs(spec): note SDK-via-HTTP boundary applies to all
 data planes

The SDK already accesses task lifecycle, messages, task states,
checkpoints, events, and agent registration through HTTP endpoints
rather than direct database connections. Surface this as a general
property in the Solution Statement, and expand the agentex-state and
agentex-conversations bullets to reflect that the contract exists for
states and messages, not just checkpoints.
---
 .../specs/2026-05-22-agentex-erd-section-design.md        | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md b/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md
index ce19d100..127689ee 100644
--- a/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md
+++ b/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md
@@ -28,6 +28,8 @@ Spans are a related concern being addressed by other means: AgentEx will retire
 
 Decompose the AgentEx backend into four services, each owning a coherent set of data and a coherent load profile. **`agentex-control-plane`** owns the agent registry and deployment registration surface — agents, deployments, deployment history, schedules, and agent API keys; low write rate, off the agent-execution hot path. **`agentex-tasks`** owns the task lifecycle — tasks, task→agent associations, and the agent task tracker; the canonical "what tasks exist and what state are they in" surface. **`agentex-conversations`** owns the realtime agent I/O surface — messages, events, and the Redis-streams fanout that carries them to UI subscribers; latency-sensitive, paired tightly with the streaming bus. **`agentex-state`** owns agent-internal data — per-task K/V state and LangGraph checkpoints; high write rate, written by the agent runtime for the agent runtime.
 
+A key property of AgentEx's existing architecture makes this decomposition tractable: the AgentEx SDK already accesses backend-managed data exclusively through HTTP endpoints, not by connecting to Postgres, MongoDB, or Redis directly. Task lifecycle, messages, task states, checkpoints, events, and agent registration all flow through the FastAPI surface today. The service-boundary contracts for the four target services therefore already exist; each extraction is a matter of reimplementing the same HTTP contract in the target service (in Go for `agentex-state` and `agentex-conversations`, staying in Python for `agentex-tasks`), not of building a new API.
+
 As part of this migration, AgentEx **evaluates Postgres as the source of truth for messages and task states**, with the intent to commit to Postgres if performance is comparable to MongoDB. Today, MongoDB hosts both collections; co-location means a Mongo incident affects two unrelated services post-decomposition, and operating Mongo alongside Postgres is duplicated tooling, monitoring, and backup surface for what is effectively two ordered append/key-value workloads that Postgres can serve well. The decomposition is the natural moment to validate the swap: each new service is the boundary at which its store choice can change without rippling. If the evaluation succeeds, the resulting topology runs on one fewer database; if performance does not meet bar, the services keep their MongoDB stores and the boundary work still stands. Comparable performance is defined as **within 5–10% of the MongoDB baseline on write throughput and p99 latency**. A load-test baseline for the existing MongoDB infrastructure will be established prior to extraction 1; the same load test will be executed against the candidate Postgres topology to evaluate the swap.
 
 Sequence the extraction **leaves-first**, mirroring the parent ERD's protocol — pull `agentex-state` out first, then `agentex-conversations`, then `agentex-tasks`. After those three extractions, what remains in the agentex backend is `agentex-control-plane` by attrition; there is no separate extraction step for it, and further decomposition of the residual control plane can be revisited at that point if warranted. Each extraction follows the parent ERD's **per-service dress-rehearsal + physical-cutover** protocol: isolate the service's data within the shared source stores with restricted credentials, surface and fix cross-domain access violations, then cut over to dedicated infrastructure.
@@ -38,7 +40,7 @@ Spans are not included in this decomposition. The `spans` table is being retired
 
 | # | Split | Owns | Go | Recommend | Notes |
 |---|---|---|---|---|---|
-| 1 | `agentex-state` | `task_states`\* + `checkpoints` + `checkpoint_blobs` + `checkpoint_writes` + `checkpoint_migrations`; state and checkpoint routes | **Yes** | Yes | **Extract first.** Simplest surface; test-drives the per-service extraction protocol. High write rate (per-agent-step) — Go rewrite target for throughput. The HTTP checkpoint contract is already in place (PR #146): the SDK's `HttpCheckpointSaver` calls `/checkpoints/{get-tuple,put,put-writes,list,delete-thread}`, so the service boundary exists today — the Go service just needs to serve the same contract. |
+| 1 | `agentex-state` | `task_states`\* + `checkpoints` + `checkpoint_blobs` + `checkpoint_writes` + `checkpoint_migrations`; state and checkpoint routes | **Yes** | Yes | **Extract first.** Simplest surface; test-drives the per-service extraction protocol. High write rate (per-agent-step) — Go rewrite target for throughput. The HTTP contract exists today — SDK calls `/states/*` and `/checkpoints/*` — so the Go service just needs to serve the same contract. |
 | 2 | `agentex-conversations` | `messages`\* + `events` + Redis-streams fanout; message and event routes | **Yes** | Yes | **Extract second.** Realtime I/O surface — SDK's `send_message` / `send_event` / streaming `task_message_context` paths terminate here. High write + latency-sensitive — Go rewrite target. Second of the two Mongo-owning services, so the Mongo→Postgres evaluation completes here. |
 | 3 | `agentex-tasks` | `tasks` + `task_agents` + `agent_task_tracker`; task lifecycle routes | No | Yes | **Extract third.** Canonical "what tasks exist and what state are they in" surface. Mid-throughput lifecycle CRUD; Python is fine, no Go affinity. Depends only on `agentex-control-plane` for inbound FKs. |
 | 4 | `agentex-control-plane` | `agents` + `deployments` + `deployment_history` + `schedules` + `agent_api_keys`; agent registration, deployment registration, schedule, API-key routes | No | Yes | **Residual.** What remains in the agentex backend after rows 1–3 are extracted. No separate extraction step. Hosts the AgentEx Temporal worker on a single task queue. Further decomposition revisited at that point if warranted. |
@@ -51,9 +53,9 @@ Spans are not included in this decomposition. The `spans` table is being retired
 
 These are the bullets that drop into the "All-up SGP Service Catalog" section of the parent ERD under the AgentEx subsection.
 
-> **`agentex-state`.** Owns `task_states` (Mongo today, Postgres pending evaluation) and the LangGraph checkpointer tables (`checkpoints`, `checkpoint_blobs`, `checkpoint_writes`, `checkpoint_migrations`). Per-task K/V state and graph-checkpoint storage for the AgentEx runtime; high write rate, written by the agent runtime for the agent runtime. The HTTP checkpoint contract is already in place — the AgentEx SDK calls the backend's `/checkpoints/*` endpoints via `HttpCheckpointSaver` rather than connecting to Postgres directly, so the service boundary exists today. Go service. Outbound: `agentex-tasks` (task scope validation), `agentex-auth` (authn / authz).
+> **`agentex-state`.** Owns `task_states` (Mongo today, Postgres pending evaluation) and the LangGraph checkpointer tables (`checkpoints`, `checkpoint_blobs`, `checkpoint_writes`, `checkpoint_migrations`). Per-task K/V state and graph-checkpoint storage for the AgentEx runtime; high write rate, written by the agent runtime for the agent runtime. The HTTP boundary already exists — the AgentEx SDK calls the backend's `/states/*` and `/checkpoints/*` endpoints rather than connecting to MongoDB or Postgres directly. Go service. Outbound: `agentex-tasks` (task scope validation), `agentex-auth` (authn / authz).
 >
-> **`agentex-conversations`.** Owns `messages` (Mongo today, Postgres pending evaluation), `events`, and the Redis-streams fanout that carries them to UI subscribers. The realtime agent I/O surface — the SDK's `send_message`, `send_event`, and streaming `task_message_context` paths terminate here. Latency-sensitive, paired tightly with the streaming bus. Go service. Outbound: `agentex-tasks` (task scope validation), `agentex-auth` (authn / authz).
+> **`agentex-conversations`.** Owns `messages` (Mongo today, Postgres pending evaluation), `events`, and the Redis-streams fanout that carries them to UI subscribers. The realtime agent I/O surface — the SDK's `send_message`, `send_event`, and streaming `task_message_context` paths terminate here, all calling the backend's HTTP endpoints rather than connecting to MongoDB or Redis directly. Latency-sensitive, paired tightly with the streaming bus. Go service. Outbound: `agentex-tasks` (task scope validation), `agentex-auth` (authn / authz).
 >
 > **`agentex-tasks`.** Owns `tasks`, `task_agents`, `agent_task_tracker`; task lifecycle routes including the cron-triggered task creation surface that pairs with control-plane schedules. The canonical "what tasks exist and what state are they in" surface; read by `agentex-state` and `agentex-conversations`. Python service. Outbound: `agentex-control-plane` (agent registry lookups), `agentex-auth` (authn / authz).
 >

From bfa82115f98c946f543aa22e641da3708971a75c Mon Sep 17 00:00:00 2001
From: Stas Moreinis <stas.moreinis@scale.com>
Date: Fri, 22 May 2026 15:41:07 -0700
Subject: [PATCH 4/6] docs(spec): resolve sgp-agent-deploy boundary; move
 checkpoint lift-out to notes

Replace two open questions with concrete content. The boundary with
sgp-agent-deploy is now spelled out in a dedicated Boundaries section:
sgp-agent-deploy ends at "pod running" (build artifact + Flux CRD),
agentex-control-plane begins at agent self-registration, and
permissions split into deploy-time permissions vs runtime credentials.
Checkpoint lift-out becomes a forward-looking note rather than an open
question -- something to revisit as LangGraph usage grows. Open
Questions reduces to the one genuinely undecided item
(agentex-auth/sgp-identity via OneAuth).
---
 .../2026-05-22-agentex-erd-section-design.md  | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md b/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md
index 127689ee..6c7a0a95 100644
--- a/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md
+++ b/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md
@@ -63,10 +63,21 @@ These are the bullets that drop into the "All-up SGP Service Catalog" section of
 >
 > **`agentex-auth`.** Existing AgentEx-owned authentication and authorization service; exposes `/v1/authn` and `/v1/authz/{grant,revoke,check,search}`. All AgentEx services authenticate and authorize requests against `agentex-auth` directly. Outbound: `sgp-identity` (delegated identity verification). Future direction may fold `agentex-auth` into `sgp-identity` as part of OneAuth — see Open Questions.
 
-## Open Questions
+## Boundaries with adjacent services
 
-1. **`agentex-auth` ↔ `sgp-identity` fold-in via OneAuth.** Today `agentex-auth` is a standalone AgentEx service that AgentEx services call directly and which delegates identity verification to `sgp-identity` underneath. The OneAuth direction may consolidate `agentex-auth` into `sgp-identity`, but the decision is not committed. Affects whether AgentEx services continue calling `agentex-auth` long-term or eventually call `sgp-identity` directly, and whether `agentex-auth`'s authz-policy responsibilities move with it.
+### `agentex-control-plane` ↔ `sgp-agent-deploy`
+
+The two services own complementary halves of the agent deployment lifecycle, with no direct API call between them.
+
+- **`sgp-agent-deploy`** owns the build and deployment pipeline: building container images (`agentex_cloud_builds`), creating and updating the Flux CRDs / Kubernetes manifests that schedule agent pods (`agentex_cloud_deploys`), and the permissions governing who can deploy what (`agentex_permissions`). Its scope ends when the agent pod is running.
+- **`agentex-control-plane`** owns the runtime side: agent self-registration on pod startup, the in-cluster registry of what version is currently registered and serving (`deployments`, `deployment_history`), and runtime credentials (`agent_api_keys`). Its scope begins when the agent pod registers itself.
+
+The handoff is implicit: once `sgp-agent-deploy` reaches a "pod running" state via Flux, the agent process inside the pod calls `agentex-control-plane` to register itself. The two services never call each other directly. Permissions are similarly scope-divided: `agentex_permissions` in `sgp-agent-deploy` is a deploy-time concern (who can deploy what), distinct from `agent_api_keys` in `agentex-control-plane`, which is the runtime credential an agent uses to call back into the platform once it is running.
 
-2. **Checkpoint lift-out trigger.** `agentex-state` bundles LangGraph checkpoints with general K/V state. Lift `checkpoints` into a separate `agentex-checkpoints` service if (a) non-LangGraph graph runtimes are adopted, or (b) checkpoint write rate dominates the service. Specific criteria to be sharpened in a follow-up.
+## Forward-looking notes
 
-3. **`agentex-control-plane` ↔ `sgp-agent-deploy` boundary.** `sgp-agent-deploy` (per parent ERD) owns `agentex_cloud_builds`, `agentex_cloud_deploys`, `agentex_permissions` — the build/push pipeline. `agentex-control-plane` owns `deployments` + `deployment_history` — the in-cluster runtime registry of "what version is currently registered and serving." Pin the handoff between "build artifact ready" and "live agent registered and serving" so the contract is unambiguous.
+- **Checkpoint lift-out.** `agentex-state` bundles LangGraph checkpoints with general K/V state. If non-LangGraph graph runtimes are adopted, or if checkpoint write rate comes to dominate the service, consider lifting checkpoints into a separate `agentex-checkpoints` service. Revisit periodically as LangGraph usage grows.
+
+## Open Questions
+
+1. **`agentex-auth` ↔ `sgp-identity` fold-in via OneAuth.** Today `agentex-auth` is a standalone AgentEx service that AgentEx services call directly and which delegates identity verification to `sgp-identity` underneath. The OneAuth direction may consolidate `agentex-auth` into `sgp-identity`, but the decision is not committed. Affects whether AgentEx services continue calling `agentex-auth` long-term or eventually call `sgp-identity` directly, and whether `agentex-auth`'s authz-policy responsibilities move with it.

From 1280722169ce5f288c3c6702085888a095726aa0 Mon Sep 17 00:00:00 2001
From: Stas Moreinis <stas.moreinis@scale.com>
Date: Fri, 22 May 2026 15:44:32 -0700
Subject: [PATCH 5/6] docs(spec): close last open question; agentex-auth stays
 near-term

OneAuth fold-in is a longer-term consideration, not a blocking decision
for this ERD section. Move the awareness to Forward-looking notes
alongside the checkpoint lift-out trigger, and drop the Open Questions
section entirely.
---
 .../specs/2026-05-22-agentex-erd-section-design.md         | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md b/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md
index 6c7a0a95..add83dbe 100644
--- a/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md
+++ b/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md
@@ -61,7 +61,7 @@ These are the bullets that drop into the "All-up SGP Service Catalog" section of
 >
 > **`agentex-control-plane`.** The residual AgentEx backend after the three extractions above. Owns `agents`, `deployments`, `deployment_history`, `schedules`, `agent_api_keys`; agent registration and lifecycle, deployment registration (the in-cluster registry of what version is currently registered and serving — distinct from `sgp-agent-deploy`'s build/push pipeline), cron schedules, and agent-scoped API key management. Hosts the AgentEx Temporal worker on a single task queue. Other AgentEx services do not use Temporal post-decomposition; if a future service needs it, it gets its own service-specific queue. Python service. Outbound: `agentex-auth` (authn / authz).
 >
-> **`agentex-auth`.** Existing AgentEx-owned authentication and authorization service; exposes `/v1/authn` and `/v1/authz/{grant,revoke,check,search}`. All AgentEx services authenticate and authorize requests against `agentex-auth` directly. Outbound: `sgp-identity` (delegated identity verification). Future direction may fold `agentex-auth` into `sgp-identity` as part of OneAuth — see Open Questions.
+> **`agentex-auth`.** Existing AgentEx-owned authentication and authorization service; exposes `/v1/authn` and `/v1/authz/{grant,revoke,check,search}`. All AgentEx services authenticate and authorize requests against `agentex-auth` directly. Outbound: `sgp-identity` (delegated identity verification).
 
 ## Boundaries with adjacent services
 
@@ -77,7 +77,4 @@ The handoff is implicit: once `sgp-agent-deploy` reaches a "pod running" state v
 ## Forward-looking notes
 
 - **Checkpoint lift-out.** `agentex-state` bundles LangGraph checkpoints with general K/V state. If non-LangGraph graph runtimes are adopted, or if checkpoint write rate comes to dominate the service, consider lifting checkpoints into a separate `agentex-checkpoints` service. Revisit periodically as LangGraph usage grows.
-
-## Open Questions
-
-1. **`agentex-auth` ↔ `sgp-identity` fold-in via OneAuth.** Today `agentex-auth` is a standalone AgentEx service that AgentEx services call directly and which delegates identity verification to `sgp-identity` underneath. The OneAuth direction may consolidate `agentex-auth` into `sgp-identity`, but the decision is not committed. Affects whether AgentEx services continue calling `agentex-auth` long-term or eventually call `sgp-identity` directly, and whether `agentex-auth`'s authz-policy responsibilities move with it.
+- **`agentex-auth` ↔ `sgp-identity` via OneAuth.** `agentex-auth` remains a standalone AgentEx service in the near term; the OneAuth direction may eventually consolidate it into `sgp-identity`. If that happens, AgentEx services would call `sgp-identity` directly, and `agentex-auth`'s authz-policy responsibilities would need to move with it. Revisit when the OneAuth direction firms up.

From 134b179c4da34724aaa7950dab67e6a967e87cf9 Mon Sep 17 00:00:00 2001
From: Stas Moreinis <stas.moreinis@scale.com>
Date: Fri, 22 May 2026 15:51:17 -0700
Subject: [PATCH 6/6] docs(plan): coordination plan for landing AgentEx ERD
 subsection

Six discrete tasks: internal AgentEx team review, sgp-agent-deploy
boundary sign-off, OneAuth heads-up, mini-ERD location decision, paste
catalog bullets into parent ERD, announce. Coordination work rather
than software implementation -- subagent-driven and inline-execution
sub-skills do not apply.
---
 ...-05-22-agentex-erd-section-coordination.md | 142 ++++++++++++++++++
 1 file changed, 142 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-05-22-agentex-erd-section-coordination.md

diff --git a/docs/superpowers/plans/2026-05-22-agentex-erd-section-coordination.md b/docs/superpowers/plans/2026-05-22-agentex-erd-section-coordination.md
new file mode 100644
index 00000000..ef14eda5
--- /dev/null
+++ b/docs/superpowers/plans/2026-05-22-agentex-erd-section-coordination.md
@@ -0,0 +1,142 @@
+# AgentEx ERD Subsection Landing — Coordination Plan
+
+> **Note on format:** This is a coordination plan, not a software-implementation plan. Steps are discrete operator actions rather than TDD cycles. The subagent-driven-development and executing-plans skills do not apply; the operator runs this themselves.
+
+**Goal:** Land the AgentEx per-service catalog bullets in the parent ERD (`ERD: SGP Service Decomposition and Catalog`), replacing the current "Stub — to be populated by the AgentEx team" placeholder.
+
+**Approach:** Run an internal AgentEx team review of the full spec, get explicit sign-off from the `sgp-agent-deploy` owner on the boundary section, give OneAuth folks an informational heads-up, decide where the AgentEx-internal mini-ERD lives, then paste the catalog bullets into the parent Notion page.
+
+**Source artifact:** `docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md`
+
+---
+
+## Scope: what lands upstream vs. what stays internal
+
+The parent ERD is the SGP decomposition ERD. Only the per-service catalog bullets need to land there:
+
+- `agentex-state`
+- `agentex-conversations`
+- `agentex-tasks`
+- `agentex-control-plane`
+- `agentex-auth`
+
+The rest of the spec (Problem Statement, Solution Statement, Service Inventory, Boundaries section, Forward-looking notes) is the AgentEx-internal mini-ERD. It either stays as the `docs/superpowers/specs/` artifact only, or becomes a separate Notion page linked from the parent ERD. Decision is Task 4 below.
+
+---
+
+## Task 1: AgentEx team internal review
+
+**Goal:** AgentEx team consensus that the spec reflects our direction.
+
+- [ ] **Step 1:** Share the spec with the AgentEx team
+  - Post in the AgentEx team's primary review channel with: link to `docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md` (or the branch `agentex-erd-section-design` on GitHub once pushed), one-paragraph summary, deadline for feedback (suggested: one week).
+- [ ] **Step 2:** Collect feedback
+  - Address blocking feedback by editing the spec on the branch and committing.
+  - Items framed as "we should think about this later" go into the spec's Forward-looking notes section rather than the catalog bullets.
+- [ ] **Step 3:** Verify consensus
+  - Each blocking comment has been resolved (responded to in-thread or by a spec edit).
+  - No outstanding "we disagree with this whole direction" feedback.
+- [ ] **Step 4:** Commit any spec updates from review
+  - Use a single commit per round of feedback, message format: `docs(spec): incorporate AgentEx team review feedback - <one line>`
+
+---
+
+## Task 2: Cross-team alignment — `sgp-agent-deploy` boundary
+
+**Goal:** Written acknowledgment from the `sgp-agent-deploy` owner that the boundary section matches their understanding.
+
+- [ ] **Step 1:** Identify the `sgp-agent-deploy` owner
+  - From the parent ERD page properties, or by asking the parent ERD author. (The parent doc lists the service in the inventory; the owner is the natural review contact.)
+- [ ] **Step 2:** Share the Boundaries section
+  - Send the "Boundaries with adjacent services → `agentex-control-plane` ↔ `sgp-agent-deploy`" section text directly (not just a link — they may not want to read the full spec).
+  - Ask explicitly: "Does the handoff as described — `sgp-agent-deploy` ends at pod running, `agentex-control-plane` begins at agent self-registration, no direct API call between the two — match your understanding?"
+- [ ] **Step 3:** Capture the response
+  - If acknowledged: capture the confirmation (Slack permalink or Notion comment).
+  - If disputed: revise the Boundaries section accordingly, commit, and re-confirm.
+- [ ] **Step 4:** Verify
+  - Boundaries section has explicit sign-off from the `sgp-agent-deploy` owner.
+
+---
+
+## Task 3: Cross-team alignment — OneAuth direction (informational)
+
+**Goal:** OneAuth folks are aware that AgentEx treats the `agentex-auth` → `sgp-identity` fold-in as a forward-looking item. This task is informational, not a gate.
+
+- [ ] **Step 1:** Identify the OneAuth lead
+  - The parent ERD or `sgp-identity` documentation should name this person.
+- [ ] **Step 2:** Send a brief informational note
+  - Share the relevant Forward-looking notes bullet from the spec.
+  - Frame it as a heads-up: "AgentEx is keeping `agentex-auth` standalone in the near term; if OneAuth ends up consolidating, we'll need to revisit. No action needed from your side right now."
+- [ ] **Step 3:** Capture any pushback
+  - If the OneAuth lead has strong feelings about timing, or commits AgentEx to a specific direction, capture that in the spec's Forward-looking notes or as an addendum to the Boundaries section.
+
+---
+
+## Task 4: Decide where the AgentEx-internal mini-ERD lives
+
+**Goal:** Pick one of two locations for the Problem Statement / Solution Statement / Service Inventory / Boundaries / Forward-looking notes content.
+
+- [ ] **Step 1:** Pick one of:
+  - **Option A — stays in repo only.** Internal mini-ERD lives as `docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md`. No Notion presence beyond the catalog bullets in the parent ERD. Cheapest. Loses visibility for non-AgentEx folks who want to dig in.
+  - **Option B — Notion page linked from parent ERD.** Create a Notion page titled "AgentEx Decomposition" containing the same content; link to it from the AgentEx subsection of the parent ERD. Higher visibility, more upkeep (two sources of truth — pick one as canonical).
+- [ ] **Step 2:** If Option B: create the Notion page
+  - Copy the relevant sections from the spec into Notion. Mark the Notion page as canonical (and add a note at the top of the repo spec pointing to it).
+- [ ] **Step 3:** If Option A: skip — done.
+
+---
+
+## Task 5: Land the catalog bullets in the parent ERD
+
+**Goal:** "AgentEx services" subsection of the parent ERD's All-up SGP Service Catalog contains the five catalog bullets, replacing the current "Stub" placeholder.
+
+- [ ] **Step 1:** Confirm Tasks 1 through 4 are complete
+  - Internal review consensus ✓
+  - `sgp-agent-deploy` boundary acknowledged ✓
+  - OneAuth heads-up sent ✓
+  - Mini-ERD location decided (and Notion page created if Option B) ✓
+- [ ] **Step 2:** Get write access to the parent ERD page
+  - Either edit access directly, or coordinate with the parent ERD owner to paste on your behalf.
+- [ ] **Step 3:** Paste the five catalog bullets
+  - Copy verbatim from `docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md`, "Per-service catalog bullets" section.
+  - Replace the current "Stub — to be populated by the AgentEx team. AgentEx platform code lives in `scale-agentex/` (`agentex`, `agentex-ui`) and agent implementations live in `agentex-agents/teams/*`." line.
+- [ ] **Step 4:** If Task 4 chose Option B: add a "See also" link
+  - Below the catalog bullets, add: "See [AgentEx Decomposition](link) for the AgentEx-internal mini-ERD."
+- [ ] **Step 5:** Verify
+  - Catalog bullets render correctly in Notion.
+  - Service names are formatted the same way other catalog bullets reference services (e.g. backticks or Notion mentions).
+
+---
+
+## Task 6: Announce
+
+**Goal:** Notify AgentEx team and parent ERD audience that the section has landed.
+
+- [ ] **Step 1:** Post in the AgentEx team's primary channel
+  - One-paragraph note: "AgentEx subsection of the parent ERD is now populated. Link: [parent ERD section]. Internal mini-ERD: [Notion page or repo path]. Open to follow-up questions."
+- [ ] **Step 2:** Notify the parent ERD owner
+  - "AgentEx section is in. Stub line replaced with five catalog bullets. `sgp-agent-deploy` boundary signed off by [owner]. OneAuth heads-up sent. No outstanding open questions."
+- [ ] **Step 3:** Update the spec's status
+  - Edit the header table of `docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md`: change the **Status** row from `Draft, in review` to `Landed in parent ERD on YYYY-MM-DD`.
+  - Commit: `docs(spec): mark agentex ERD section as landed`
+
+---
+
+## Definition of done
+
+- AgentEx subsection in parent ERD contains the five catalog bullets (no longer "Stub").
+- `sgp-agent-deploy` owner has explicitly acknowledged the Boundaries section.
+- OneAuth folks have been informed of the forward-looking item.
+- Mini-ERD location decided (Option A or B).
+- AgentEx team and parent ERD owner notified.
+- Spec status updated.
+
+---
+
+## Out of scope (and why)
+
+These are deliberately not in this plan:
+
+- **Building the Go services.** Each extraction is its own multi-week initiative that needs its own design pass before it can be planned. This plan only lands the design document.
+- **Mongo→Postgres performance load tests.** These happen at the start of extraction 1, not during this coordination work.
+- **Per-extraction sequencing details (dress rehearsal protocol, cutover specifics).** Each extraction will have its own design and plan.
+- **Retiring the `spans` endpoints.** Already in progress separately; spec only notes the direction.