diff --git a/docs/superpowers/plans/2026-05-22-agentex-erd-section-coordination.md b/docs/superpowers/plans/2026-05-22-agentex-erd-section-coordination.md new file mode 100644 index 00000000..ef14eda5 --- /dev/null +++ b/docs/superpowers/plans/2026-05-22-agentex-erd-section-coordination.md @@ -0,0 +1,142 @@ +# AgentEx ERD Subsection Landing — Coordination Plan + +> **Note on format:** This is a coordination plan, not a software-implementation plan. Steps are discrete operator actions rather than TDD cycles. The subagent-driven-development and executing-plans skills do not apply; the operator runs this themselves. + +**Goal:** Land the AgentEx per-service catalog bullets in the parent ERD (`ERD: SGP Service Decomposition and Catalog`), replacing the current "Stub — to be populated by the AgentEx team" placeholder. + +**Approach:** Run an internal AgentEx team review of the full spec, get explicit sign-off from the `sgp-agent-deploy` owner on the boundary section, give OneAuth folks an informational heads-up, decide where the AgentEx-internal mini-ERD lives, then paste the catalog bullets into the parent Notion page. + +**Source artifact:** `docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md` + +--- + +## Scope: what lands upstream vs. what stays internal + +The parent ERD is the SGP decomposition ERD. Only the per-service catalog bullets need to land there: + +- `agentex-state` +- `agentex-conversations` +- `agentex-tasks` +- `agentex-control-plane` +- `agentex-auth` + +The rest of the spec (Problem Statement, Solution Statement, Service Inventory, Boundaries section, Forward-looking notes) is the AgentEx-internal mini-ERD. It either stays as the `docs/superpowers/specs/` artifact only, or becomes a separate Notion page linked from the parent ERD. Decision is Task 4 below. + +--- + +## Task 1: AgentEx team internal review + +**Goal:** AgentEx team consensus that the spec reflects our direction. 
+ +- [ ] **Step 1:** Share the spec with the AgentEx team + - Post in the AgentEx team's primary review channel with: link to `docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md` (or the branch `agentex-erd-section-design` on GitHub once pushed), one-paragraph summary, deadline for feedback (suggested: one week). +- [ ] **Step 2:** Collect feedback + - Address blocking feedback by editing the spec on the branch and committing. + - Items framed as "we should think about this later" go into the spec's Forward-looking notes section rather than the catalog bullets. +- [ ] **Step 3:** Verify consensus + - Each blocking comment has been resolved (responded to in-thread or by a spec edit). + - No outstanding "we disagree with this whole direction" feedback. +- [ ] **Step 4:** Commit any spec updates from review + - Use a single commit per round of feedback, message format: `docs(spec): incorporate AgentEx team review feedback - ` + +--- + +## Task 2: Cross-team alignment — `sgp-agent-deploy` boundary + +**Goal:** Written acknowledgment from the `sgp-agent-deploy` owner that the boundary section matches their understanding. + +- [ ] **Step 1:** Identify the `sgp-agent-deploy` owner + - From the parent ERD page properties, or by asking the parent ERD author. (The parent doc lists the service in the inventory; the owner is the natural review contact.) +- [ ] **Step 2:** Share the Boundaries section + - Send the "Boundaries with adjacent services → `agentex-control-plane` ↔ `sgp-agent-deploy`" section text directly (not just a link — they may not want to read the full spec). + - Ask explicitly: "Does the handoff as described — `sgp-agent-deploy` ends at pod running, `agentex-control-plane` begins at agent self-registration, no direct API call between the two — match your understanding?" +- [ ] **Step 3:** Capture the response + - If acknowledged: capture the confirmation (Slack permalink or Notion comment). 
+ - If disputed: revise the Boundaries section accordingly, commit, and re-confirm. +- [ ] **Step 4:** Verify + - Boundaries section has explicit sign-off from the `sgp-agent-deploy` owner. + +--- + +## Task 3: Cross-team alignment — OneAuth direction (informational) + +**Goal:** OneAuth folks are aware that AgentEx has the `agentex-auth` → `sgp-identity` fold-in as a forward-looking item; not a gate. + +- [ ] **Step 1:** Identify the OneAuth lead + - The parent ERD or `sgp-identity` documentation should name this person. +- [ ] **Step 2:** Send a brief informational note + - Share the relevant Forward-looking notes bullet from the spec. + - Frame it as a heads-up: "AgentEx is keeping `agentex-auth` standalone in the near term; if OneAuth ends up consolidating, we'll need to revisit. No action needed from your side right now." +- [ ] **Step 3:** Capture any pushback + - If the OneAuth lead has strong feelings about timing or commits AgentEx to a specific direction, capture in the spec's Forward-looking notes or as a Boundaries section addendum. + +--- + +## Task 4: Decide where the AgentEx-internal mini-ERD lives + +**Goal:** Pick one of two locations for the Problem Statement / Solution Statement / Service Inventory / Boundaries / Forward-looking notes content. + +- [ ] **Step 1:** Pick one of: + - **Option A — stays in repo only.** Internal mini-ERD lives as `docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md`. No Notion presence beyond the catalog bullets in the parent ERD. Cheapest. Loses visibility for non-AgentEx folks who want to dig in. + - **Option B — Notion page linked from parent ERD.** Create a Notion page titled "AgentEx Decomposition" containing the same content; link to it from the AgentEx subsection of the parent ERD. Higher visibility, more upkeep (two sources of truth — pick one as canonical). +- [ ] **Step 2:** If Option B: create the Notion page + - Copy the relevant sections from the spec into Notion. 
Mark the Notion page as canonical (and add a note at the top of the repo spec pointing to it). +- [ ] **Step 3:** If Option A: skip — done. + +--- + +## Task 5: Land the catalog bullets in the parent ERD + +**Goal:** "AgentEx services" subsection of the parent ERD's All-up SGP Service Catalog contains the five catalog bullets, replacing the current "Stub" placeholder. + +- [ ] **Step 1:** Confirm Tasks 1, 2, 3, 4 are complete + - Internal review consensus ✓ + - `sgp-agent-deploy` boundary acknowledged ✓ + - OneAuth heads-up sent ✓ + - Mini-ERD location decided (and Notion page created if Option B) ✓ +- [ ] **Step 2:** Get write access to the parent ERD page + - Either edit access directly, or coordinate with the parent ERD owner to paste on your behalf. +- [ ] **Step 3:** Paste the five catalog bullets + - Copy verbatim from `docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md`, "Per-service catalog bullets" section. + - Replace the current "Stub — to be populated by the AgentEx team. AgentEx platform code lives in `scale-agentex/` (`agentex`, `agentex-ui`) and agent implementations live in `agentex-agents/teams/*`." line. +- [ ] **Step 4:** If Task 4 chose Option B: add a "See also" link + - Below the catalog bullets, add: "See [AgentEx Decomposition](link) for the AgentEx-internal mini-ERD." +- [ ] **Step 5:** Verify + - Catalog bullets render correctly in Notion. + - Service names link consistently with how other catalog bullets handle service references (e.g. backticks or Notion mentions). + +--- + +## Task 6: Announce + +**Goal:** Notify AgentEx team and parent ERD audience that the section has landed. + +- [ ] **Step 1:** Post in the AgentEx team's primary channel + - One-paragraph note: "AgentEx subsection of the parent ERD is now populated. Link: [parent ERD section]. Internal mini-ERD: [Notion page or repo path]. Open to follow-up questions." +- [ ] **Step 2:** Notify the parent ERD owner + - "AgentEx section is in. 
Stub line replaced with five catalog bullets. `sgp-agent-deploy` boundary signed off by [owner]. OneAuth heads-up sent. No outstanding open questions." +- [ ] **Step 3:** Update the spec's status + - Edit `docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md` header: change `Status: Draft, in review` to `Status: Landed in parent ERD on YYYY-MM-DD`. + - Commit: `docs(spec): mark agentex ERD section as landed` + +--- + +## Definition of done + +- AgentEx subsection in parent ERD contains the five catalog bullets (no longer "Stub"). +- `sgp-agent-deploy` owner has explicitly acknowledged the Boundaries section. +- OneAuth folks have been informed of the forward-looking item. +- Mini-ERD location decided (Option A or B). +- AgentEx team and parent ERD owner notified. +- Spec status updated. + +--- + +## Out of scope (and why) + +These are deliberately not in this plan: + +- **Building the Go services.** Each extraction is its own multi-week initiative that needs its own design pass before it can be planned. This plan only lands the design document. +- **Mongo→Postgres performance load tests.** These happen at the start of extraction 1, not during this coordination work. +- **Per-extraction sequencing details (dress rehearsal protocol, cutover specifics).** Each extraction will have its own design and plan. +- **Retiring the `spans` endpoints.** Already in progress separately; spec only notes the direction. 
diff --git a/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md b/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md new file mode 100644 index 00000000..add83dbe --- /dev/null +++ b/docs/superpowers/specs/2026-05-22-agentex-erd-section-design.md @@ -0,0 +1,80 @@ +# AgentEx subsection — ERD: SGP Service Decomposition and Catalog + +| | | +|---|---| +| **Status** | Draft, in review | +| **Date** | 2026-05-22 | +| **Owner** | AgentEx team | +| **Parent doc** | ERD: SGP Service Decomposition and Catalog | + +## Purpose of this document + +This is the proposed AgentEx subsection of the parent ERD, written as a mini-ERD applied to the AgentEx backend (the FastAPI + Temporal service in `scale-agentex/agentex/`). The parent ERD decomposes `egp-api-backend` into fourteen services; the AgentEx slot in the catalog is currently a stub. This document fills that stub by treating the AgentEx backend as a smaller monolith and proposing its own decomposition into four services, mirroring the parent ERD's structure (Problem Statement, Solution Statement, Service Inventory, per-service catalog bullets, Open Questions). + +Scope is the AgentEx backend only. `agentex-ui` and `agentex-agents/teams/*` are out of scope for this section. + +## Problem Statement + +The AgentEx backend (`scale-agentex/agentex/`) is a single FastAPI + Temporal service that today owns four distinct concerns inside one process and one set of data stores: a low-write **control plane** (agent registry, deployments, schedules, agent API keys), a **task lifecycle plane** (tasks, task agents, task tracker), a realtime **agent I/O surface** (messages, events, Redis-streams fanout), and an **agent-internal state surface** (per-task K/V state, LangGraph checkpoints). These four planes share one PostgreSQL cluster, one MongoDB cluster, one Redis instance, and one pod set sized for the worst of them. 
+ +The current shape has measurable costs: + +- **Coupled scaling.** The agent I/O surface (messages + Redis streams) is realtime, fans out to UI subscribers, and scales with interactive session count. The agent-internal state surface scales with autonomous-agent step rate, which is much higher and bursty. The control plane is near-idle. Today all of these planes, together with the task lifecycle plane, share one pod set sized for the loudest of them. +- **Shared failure domain.** A Mongo problem on the messages collection can block the agent-internal state surface (also Mongo). A noisy write path can starve other writers. The agent registry, which is on the critical path of every task creation, shares a process with the data plane. Operating both PostgreSQL and MongoDB also means two databases to deploy, monitor, and back up for what are largely append-and-key-value workloads. + +Spans are a related concern being addressed by other means: AgentEx will retire its `spans` table in favor of consuming `sgp-traces`, so spans are not a target of this decomposition. + +## Solution Statement + +Decompose the AgentEx backend into four services, each owning a coherent set of data and a coherent load profile. **`agentex-control-plane`** owns the agent registry and deployment registration surface: agents, deployments, deployment history, schedules, and agent API keys; low write rate, off the agent-execution hot path. **`agentex-tasks`** owns the task lifecycle: tasks, task→agent associations, and the agent task tracker; the canonical "what tasks exist and what state are they in" surface. **`agentex-conversations`** owns the realtime agent I/O surface: messages, events, and the Redis-streams fanout that carries them to UI subscribers; latency-sensitive, paired tightly with the streaming bus. **`agentex-state`** owns agent-internal data: per-task K/V state and LangGraph checkpoints; high write rate, written by the agent runtime for the agent runtime. 
+ +A key property of AgentEx's existing architecture makes this decomposition tractable: the AgentEx SDK already accesses backend-managed data exclusively through HTTP endpoints, not by connecting to Postgres, MongoDB, or Redis directly. Task lifecycle, messages, task states, checkpoints, events, and agent registration all flow through the FastAPI surface today. The service-boundary contracts for the four target services therefore already exist; each extraction is a matter of reimplementing the same HTTP contract in the target service (Go for `agentex-state` and `agentex-conversations`, Python continuation for `agentex-tasks`), not of building a new API. + +As part of this migration, AgentEx **evaluates Postgres as the source of truth for messages and task states**, with the intent to commit to Postgres if performance is comparable to MongoDB. Today, MongoDB hosts both collections; co-location means a Mongo incident would affect two otherwise unrelated services post-decomposition, and operating Mongo alongside Postgres duplicates tooling, monitoring, and backup surface for what are effectively two ordered append/key-value workloads that Postgres can serve well. The decomposition is the natural moment to validate the swap: each new service is the boundary at which its store choice can change without rippling into other services. If the evaluation succeeds, the resulting topology runs on one fewer database; if performance does not meet the bar, the services keep their MongoDB stores and the boundary work still stands. Comparable performance is defined as **within 5–10% of the MongoDB baseline on write throughput and p99 latency**. A load-test baseline for the existing MongoDB infrastructure will be established prior to extraction 1; the same load test will be executed against the candidate Postgres topology to evaluate the swap. + +Sequence the extractions **leaves-first**, mirroring the parent ERD's protocol: pull `agentex-state` out first, then `agentex-conversations`, then `agentex-tasks`. 
After those three extractions, what remains in the AgentEx backend is `agentex-control-plane` by attrition; there is no separate extraction step for it, and further decomposition of the residual control plane can be revisited at that point if warranted. Each extraction follows the parent ERD's **per-service dress-rehearsal + physical-cutover** protocol: isolate the service's data within the shared source stores with restricted credentials, surface and fix cross-domain access violations, then cut over to dedicated infrastructure. + +Spans are not included in this decomposition. The `spans` table is being retired in favor of consuming `sgp-traces`: the AgentEx SDK has already removed local spans as a default tracing processor, but the receiving plumbing in the backend remains because legacy SDK versions are still in use. The deprecation path is to gradually no-op the spans endpoints as legacy agents roll off, allowing the code to be removed without breaking outstanding clients. + +## Service Inventory + +| # | Split | Owns | Go rewrite | Recommend | Notes | +|---|---|---|---|---|---| +| 1 | `agentex-state` | `task_states`\* + `checkpoints` + `checkpoint_blobs` + `checkpoint_writes` + `checkpoint_migrations`; state and checkpoint routes | **Yes** | Yes | **Extract first.** Simplest surface; test-drives the per-service extraction protocol. High write rate (per-agent-step), so a Go rewrite target for throughput. The HTTP contract exists today (the SDK calls `/states/*` and `/checkpoints/*`), so the Go service just needs to serve the same contract. | +| 2 | `agentex-conversations` | `messages`\* + `events` + Redis-streams fanout; message and event routes | **Yes** | Yes | **Extract second.** Realtime I/O surface: the SDK's `send_message` / `send_event` / streaming `task_message_context` paths terminate here. High write + latency-sensitive, so a Go rewrite target. Second of the two Mongo-owning services, so the Mongo→Postgres evaluation completes here. 
| +| 3 | `agentex-tasks` | `tasks` + `task_agents` + `agent_task_tracker`; task lifecycle routes | No | Yes | **Extract third.** Canonical "what tasks exist and what state are they in" surface. Mid-throughput lifecycle CRUD; Python is fine, no Go affinity. Depends only on `agentex-control-plane` for inbound FKs. | +| 4 | `agentex-control-plane` | `agents` + `deployments` + `deployment_history` + `schedules` + `agent_api_keys`; agent registration, deployment registration, schedule, API-key routes | No | Yes | **Residual.** What remains in the AgentEx backend after rows 1–3 are extracted. No separate extraction step. Hosts the AgentEx Temporal worker on a single task queue. Further decomposition can be revisited at that point if warranted. | + +> \* `messages` and `task_states` live in MongoDB today. The Mongo→Postgres swap is conditional on performance, per the Solution Statement; if performance does not meet the bar, these services keep their MongoDB stores and the service boundaries still stand. + +> Note: `agentex-auth` is an existing AgentEx-owned service that already runs separately and is not part of this decomposition. It is included in the per-service catalog bullets below for completeness. + +## Per-service catalog bullets + +These are the bullets that drop into the "All-up SGP Service Catalog" section of the parent ERD under the AgentEx subsection. + +> **`agentex-state`.** Owns `task_states` (Mongo today, Postgres pending evaluation) and the LangGraph checkpointer tables (`checkpoints`, `checkpoint_blobs`, `checkpoint_writes`, `checkpoint_migrations`). Per-task K/V state and graph-checkpoint storage for the AgentEx runtime; high write rate, written by the agent runtime for the agent runtime. The HTTP boundary already exists: the AgentEx SDK calls the backend's `/states/*` and `/checkpoints/*` endpoints rather than connecting to MongoDB or Postgres directly. Go service. Outbound: `agentex-tasks` (task scope validation), `agentex-auth` (authn / authz). 
+> +> **`agentex-conversations`.** Owns `messages` (Mongo today, Postgres pending evaluation), `events`, and the Redis-streams fanout that carries them to UI subscribers. The realtime agent I/O surface — the SDK's `send_message`, `send_event`, and streaming `task_message_context` paths terminate here, all calling the backend's HTTP endpoints rather than connecting to MongoDB or Redis directly. Latency-sensitive, paired tightly with the streaming bus. Go service. Outbound: `agentex-tasks` (task scope validation), `agentex-auth` (authn / authz). +> +> **`agentex-tasks`.** Owns `tasks`, `task_agents`, `agent_task_tracker`; task lifecycle routes including the cron-triggered task creation surface that pairs with control-plane schedules. The canonical "what tasks exist and what state are they in" surface; read by `agentex-state` and `agentex-conversations`. Python service. Outbound: `agentex-control-plane` (agent registry lookups), `agentex-auth` (authn / authz). +> +> **`agentex-control-plane`.** The residual AgentEx backend after the three extractions above. Owns `agents`, `deployments`, `deployment_history`, `schedules`, `agent_api_keys`; agent registration and lifecycle, deployment registration (the in-cluster registry of what version is currently registered and serving — distinct from `sgp-agent-deploy`'s build/push pipeline), cron schedules, and agent-scoped API key management. Hosts the AgentEx Temporal worker on a single task queue. Other AgentEx services do not use Temporal post-decomposition; if a future service needs it, it gets its own service-specific queue. Python service. Outbound: `agentex-auth` (authn / authz). +> +> **`agentex-auth`.** Existing AgentEx-owned authentication and authorization service; exposes `/v1/authn` and `/v1/authz/{grant,revoke,check,search}`. All AgentEx services authenticate and authorize requests against `agentex-auth` directly. Outbound: `sgp-identity` (delegated identity verification). 
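The outbound edges listed in the bullets above can be checked against the leaves-first sequence. The following is a minimal sketch, assuming those edges are the only calls between the four services being decomposed; `OUTBOUND` and `leaves_first_order` are illustrative names, not part of the AgentEx codebase, and `agentex-auth` is left out because it already runs separately.

```python
from graphlib import TopologicalSorter

# Outbound calls as listed in the catalog bullets (caller -> callees).
# Assumption: agentex-auth is omitted because it is not part of the extraction.
OUTBOUND = {
    "agentex-state": {"agentex-tasks"},
    "agentex-conversations": {"agentex-tasks"},
    "agentex-tasks": {"agentex-control-plane"},
    "agentex-control-plane": set(),
}


def leaves_first_order(outbound):
    """Order services so that each one is extracted before anything it calls."""
    # TopologicalSorter expects node -> predecessors and emits predecessors
    # first, so invert the edges: the callers of a service are its predecessors.
    callers = {service: set() for service in outbound}
    for caller, callees in outbound.items():
        for callee in callees:
            callers.setdefault(callee, set()).add(caller)
    return list(TopologicalSorter(callers).static_order())


order = leaves_first_order(OUTBOUND)
print(order)
```

The graph alone does not decide whether `agentex-state` or `agentex-conversations` goes first, since neither is called by another service here; the spec picks `agentex-state` first because it is the simplest surface.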
+ +## Boundaries with adjacent services + +### `agentex-control-plane` ↔ `sgp-agent-deploy` + +The two services own complementary halves of the agent deployment lifecycle, with no direct API call between them. + +- **`sgp-agent-deploy`** owns the build and deployment pipeline: building container images (`agentex_cloud_builds`), creating and updating the Flux CRDs / Kubernetes manifests that schedule agent pods (`agentex_cloud_deploys`), and the permissions governing who can deploy what (`agentex_permissions`). Its scope ends when the agent pod is running. +- **`agentex-control-plane`** owns the runtime side: agent self-registration on pod startup, the in-cluster registry of what version is currently registered and serving (`deployments`, `deployment_history`), and runtime credentials (`agent_api_keys`). Its scope begins when the agent pod registers itself. + +The handoff is implicit: once `sgp-agent-deploy` reaches a "pod running" state via Flux, the agent process inside the pod calls `agentex-control-plane` to register itself. The two services never call each other directly. Permissions are similarly scope-divided: `agentex_permissions` in `sgp-agent-deploy` is a deploy-time concern (who can deploy what), distinct from `agent_api_keys` in `agentex-control-plane`, which is the runtime credential an agent uses to call back into the platform once it is running. + +## Forward-looking notes + +- **Checkpoint lift-out.** `agentex-state` bundles LangGraph checkpoints with general K/V state. If non-LangGraph graph runtimes are adopted, or if checkpoint write rate comes to dominate the service, consider lifting checkpoints into a separate `agentex-checkpoints` service. Revisit periodically as LangGraph usage grows. +- **`agentex-auth` ↔ `sgp-identity` via OneAuth.** `agentex-auth` remains a standalone AgentEx service in the near term; the OneAuth direction may eventually consolidate it into `sgp-identity`. 
If that happens, AgentEx services would call `sgp-identity` directly, and `agentex-auth`'s authz-policy responsibilities would need to move into `sgp-identity` as well. Revisit when the OneAuth direction firms up.
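The comparable-performance bar from the Solution Statement (within 5–10% of the MongoDB baseline on write throughput and p99 latency) can be written down as a simple check. This is a sketch under stated assumptions: "within" is read as "no worse than the baseline by more than the tolerance", the tolerance is a parameter because the spec gives a range, and `LoadTestResult`, `is_comparable`, and the numbers below are hypothetical placeholders, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadTestResult:
    write_throughput: float  # writes per second, higher is better
    p99_latency_ms: float    # milliseconds, lower is better


def is_comparable(baseline, candidate, tolerance=0.10):
    """Return True if the candidate is no more than `tolerance` worse on both metrics."""
    throughput_floor = baseline.write_throughput * (1 - tolerance)
    latency_ceiling = baseline.p99_latency_ms * (1 + tolerance)
    return (
        candidate.write_throughput >= throughput_floor
        and candidate.p99_latency_ms <= latency_ceiling
    )


# Placeholder numbers for illustration only; real values come from the load tests.
mongo_baseline = LoadTestResult(write_throughput=10_000.0, p99_latency_ms=40.0)
postgres_candidate = LoadTestResult(write_throughput=9_300.0, p99_latency_ms=43.0)

print(is_comparable(mongo_baseline, postgres_candidate, tolerance=0.10))
print(is_comparable(mongo_baseline, postgres_candidate, tolerance=0.05))
```

With these placeholder inputs the candidate passes at a 10% tolerance and fails at 5%, which is why the choice within the 5–10% range is worth fixing before the baseline run.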