Skip to content

feat(runtime/acp): multi-session-per-agentrun#2

Open
wjueyao wants to merge 16 commits into
zoumo:mainfrom
wjueyao:feat/multi-session-per-agentrun
Open

feat(runtime/acp): multi-session-per-agentrun#2
wjueyao wants to merge 16 commits into
zoumo:mainfrom
wjueyao:feat/multi-session-per-agentrun

Conversation

@wjueyao
Copy link
Copy Markdown
Contributor

@wjueyao wjueyao commented May 18, 2026

Summary

Adds multi-session support to ACP agentruns. One agent process can now host multiple ACP sessions opened via agentrun new-session, each with its own cwd / model state / message history — multiplexed onto the same fork instead of paying fork+exec per task.

Motivation

A downstream caller (an Agent evaluation harness) spawns 3 ACP agentruns per case (executor / responder / reviewer). Running 41 cases means 123 fork+exec cycles of bunx claude-agent-acp (~200MB Node + ~300MB claude binary + MCP servers each). On memory-pressured dev hosts this thrashes swap and 2-min socket-wait deadlines start failing — the workflow becomes unusable.

With multi-session, the caller can keep 3 long-lived agentruns and open a fresh session per case via agentrun new-session. Total fork+exec drops from 3N → 3.

Design

ACP natively supports multiple session/new calls on a single agent process — each session has its own sessionId, cwd, and model state, multiplexed onto the same stdin/stdout. This PR builds the runtime + RPC + CLI layers on top of that primitive.

Per-session state. Manager.sessions map[acp.SessionId]*sessionState tracks each open session's (id, cwd, models). NewSession adds entries; EndSession removes them. The agent process owns its own session map; the runtime mirrors only what it needs to route RPCs.

Initial-session compat. Create() opens the first session via the normal handshake and populates both the new sessions map and the legacy m.sessionID / m.models fields. Existing callers of Prompt / Cancel / SetModel (which don't pass sessionId) keep working — these are now thin shims that delegate to the *Session variants with an empty sessionId.

Empty-sessionId convention. Client and wire layers pass sessionId: "" to mean "the initial session". Resolution happens in exactly one place: Manager's PromptSession / CancelSession / SetModelSession substitute m.sessionID when sessionId is empty. The Service layer is a thin pass-through — no per-layer "empty → initial" branching.

MCP scoping. Bundle-level mcpServers apply to every session; per-session NewSession(cwd, mcpServers) layers session-scoped servers on top (current behavior: append, no dedup — caller responsibility).

In-flight protection. EndSession refuses to release a session with active PromptSession work, tracked via a per-session refcount in Manager.inflight. Callers must CancelSession (or wait) before EndSession; otherwise the prompt completes against a session no longer in the map and bookkeeping drifts. The initial session cannot be EndSession'd at all — kill the whole agent to release it.

session/cancel wire compat. The method evolved from NullaryCommand (no params) to taking optional sessionId. To keep callers that send no params field working, the jsonrpc framework gains an OptionalUnaryCommand helper that delivers a zero-value Req instead of -32602 InvalidParams when params is absent.

Out of scope (follow-ups). state.json is still single-session — Phase and Session.Models reflect the initial session only. Per-session persistence is a follow-up; short-lived sessions don't need it, longer-lived multi-session deployments do. No max-sessions cap or idle timeout either; caller drives lifecycle today.

What changed

Runtime layer (pkg/agentrun/runtime/acp/runtime.go):

  • Manager.sessions map[acp.SessionId]*sessionState — replaces single-session field; each entry tracks id / cwd / per-session model state.
  • Manager.inflight map[acp.SessionId]int — per-session active-prompt counter; EndSession refuses to release a busy session.
  • Backward-compat: keep m.sessionID / m.models pointing to the first session so existing Prompt / Cancel / SetModel shims keep working unchanged.
  • New methods: NewSession(ctx, cwd, mcpServers), EndSession(sessionId), Sessions(), plus PromptSession / CancelSession / SetModelSession variants taking explicit sessionId (empty resolved to initial).

API layer (pkg/agentrun/api, pkg/ari/server):

  • Method constants + RPC types for agentrun/new-session, agentrun/end-session, agentrun/list-sessions.
  • Server handlers wire through to Manager. Existing agentrun/prompt / cancel / set-model gain optional sessionId — empty = initial session (back-compat).

Framework (pkg/jsonrpc):

  • ErrNoParams sentinel returned by the unmarshal callback when the request has no params field.
  • OptionalUnaryCommand helper — tolerates ErrNoParams, surfaces real JSON decode errors as InvalidParams. Used for session/cancel.

CLI (cmd/massctl/commands/agentrun):

  • massctl agentrun new-session <name> -w <ws> --cwd <path> — returns sessionId on stdout.
  • massctl agentrun end-session <name> -w <ws> --session-id <sid>.
  • massctl agentrun list-sessions <name> -w <ws>.
  • agentrun prompt gains --session-id flag.

ARI client (pkg/ari/client): typed wrappers for the new RPCs.

Tests:

  • pkg/agentrun/runtime/acp/runtime_internal_test.go: bookkeeping tests for sessions map (add / remove / list / busy / state isolation).
  • pkg/jsonrpc/server_test.go: wire-compat tests for OptionalUnaryCommand (missing params tolerated; malformed params surfaced).

Bundled fix (pkg/ari/api/domain.go): add explicit yaml: tags to AgentSpec fields. Without them, gopkg.in/yaml.v3 lowercases Go field names and silently drops camelCase keys like startupTimeoutSeconds on massctl agent apply -f round-trip. Tangentially related (camelCase fields matter for multi-session pool YAML), small enough to fold in.

Verification

Local go test ./... clean (one flake in TestAgentdRestartRecovery under full-suite contention; passes in isolation and on rerun — same behavior on main, not introduced by this PR).

Compat / migration

All existing single-session call sites unchanged. New methods are additive. State schema additive (sessions map) — no migration needed for daemons that don't open extra sessions. Pre-multi-session clients calling session/cancel with no params field continue to work via the OptionalUnaryCommand helper.

Next

A downstream pool adapter (wrapping new-session/end-session for the evaluation harness above) is implemented and unit-tested but blocked on this PR landing.

wjueyao and others added 8 commits May 18, 2026 17:56
…ons map)

First step of the multi-session refactor. ACP protocol natively supports
multiple sessions per agent process (per spec: session/new is baseline),
but mass runtime's Manager pinned a single sessionID, forcing callers
to fork+kill an agent process per session — wasteful for use cases
like mindpowers self-EDD that run N tasks back-to-back.

This commit adds the structural foundation without breaking existing
single-session callers:

1. New sessionState type capturing per-session protocol metadata
   (id / models / cwd). Mirrors the agent-side session map.

2. Manager.sessions map[acp.SessionId]*sessionState added. Create()'s
   initial session/new now populates both the legacy single-session
   fields (sessionID / models — kept for back-compat) AND the map.

3. NewSession(ctx, cwd, mcpServers) → SessionId opens additional ACP
   sessions on the running agent. Takes per-session cwd + optional
   per-session MCP server overrides layered on bundle defaults.

4. EndSession(sessionId) clears runtime tracking. Note: ACP has no
   explicit "end session" RPC yet, so this only removes the local map
   entry — the agent process retains its session state until cancelled
   or process exits. Initial session (Create's first one) cannot be
   ended via this method (kill the agent instead).

5. Sessions() returns active session IDs snapshot for callers that
   need to enumerate / clean up.

Out of scope for this commit (follow-ups):
- Prompt/Cancel/SetModel sessionId parameter (PromptSession etc.)
- State.json schema with sessions map (currently still single SessionID)
- ARI server: agentrun/new-session + agentrun/end-session RPC
- massctl: agentrun new-session/end-session subcommands
- Tests for the new methods (this is scaffold; tests follow with the
  RPC layer so we can exercise via the same interface mindpowers uses)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…riants

Step 2 of multi-session refactor (step 1: c16e57a sessions map scaffold).

Existing Prompt/Cancel/SetModel methods didn't take sessionId — they
used the Manager's single m.sessionID, which broke for callers holding
multiple sessions via the new NewSession method.

This commit adds explicit-sessionId variants and refactors the original
methods to be thin shims that pass m.sessionID through. Backward
compatibility preserved: existing callers (mass daemon ARI, massctl)
keep working without changes; new multi-session callers use the
*Session variants.

API additions:
  PromptSession(ctx, sessionId, prompt) → PromptResponse
  CancelSession(ctx, sessionId) error
  SetModelSession(ctx, sessionId, modelId) error

All three validate sessionId is in Manager.sessions map and return a
clear "session not found" error otherwise — avoids confusing low-level
ACP errors when callers pass a stale or wrong sessionId.

Per-session models state is now updated in sessions[id].models. The
legacy single-session Manager.models field is still mirrored for the
initial session only (matches existing GetState / SessionID() behavior).

State.json updates remain process-wide (single Phase field) — per-
session phase tracking is a follow-up state-schema change.

Compile-clean (go vet + go test -run NoneShouldMatch across pkg/agentrun).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 3 of multi-session refactor (steps 1-2: c16e57a, c17f298).

Add wire types so RPC callers can address specific sessions:
- SessionPromptParams: new optional SessionID field (empty = initial)
- SessionCancelParams: new struct, optional SessionID
- SessionSetModelParams: new optional SessionID field
- SessionNewParams + SessionNewResult: open additional session
- SessionEndParams + SessionEndResult: release runtime tracking
- SessionListResult: enumerate active sessions

Add method constants (api/methods.go):
- session/new, session/end, session/list

Backward-compat preserved: existing single-session callers pass empty
SessionID (or no params at all where the old Cancel had none), which
routes to the initial session — runtime layer (c17f298) maps empty
sessionId → m.sessionID.

No behavior change yet — server handlers + client methods are step 4-5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Steps 4-5 of multi-session refactor (steps 1-3: c16e57a, c17f298, 36594b8).

Server side (pkg/agentrun/server):
- Handler interface: Cancel signature change (now takes
  *SessionCancelParams), add NewSession / EndSession / ListSessions
- register.go: route session/new, session/end, session/list to new handlers
- service.go: Prompt/Cancel/SetModel now check params.SessionID
  - empty → call Manager's legacy shim (initial session)
  - non-empty → call Manager's *Session variant (validates membership)
- New handler impls translate wire params (SessionNewMcpServer →
  acp.McpServerStdio etc.) to runtime layer types

Client side (pkg/agentrun/client/client.go):
- Cancel sends *SessionCancelParams instead of nil (compat with new
  handler signature; empty SessionID still routes to initial session)
- New methods: NewSession, EndSession, ListSessions, CancelSession
- Existing Cancel kept for backward compatibility (delegates to initial)

Tests updated to match new Handler interface (stubRunService +
replayService in client tests).

Backward compatibility:
- Old clients calling Cancel with nil params: server handler tolerates
  nil req (treats as empty SessionID → initial session)
- Old clients calling Prompt without SessionID field: same routing
- All existing tests still pass; no callers broken

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sessions

Step 6 of multi-session refactor (steps 1-5: c16e57a, c17f298, 36594b8, e933264, f16480a inlining).

Adds daemon-level RPCs that forward to the running agent-run process:

API (pkg/ari/api):
- methods.go: MethodAgentRunNewSession / EndSession / ListSessions
- types.go: AgentRunNewSessionParams/Result + AgentRunEndSessionParams/Result
  + AgentRunListSessionsParams/Result
- AgentRunPromptParams: new optional SessionID field — forwards to
  agent-run's session/prompt SessionID

Server (pkg/ari/server):
- AgentRunService interface: 3 new methods
- service.go RegisterAgentRunService: route new-session / end-session /
  list-sessions on agentrun service
- server.go agentRunAdapter: handler impls that Connect to agent-run
  and call the runapi client methods added in e933264. Prompt forwards
  the new SessionID field through to agent-run.

Backward compatibility:
- AgentRunPromptParams.SessionID is optional (omitempty); pre-multi-
  session callers pass nothing → routes to agent's initial session

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 7 of multi-session refactor.

ARI client (pkg/ari/client):
- interfaces.go AgentRunOps: 4 new methods — PromptSession, NewSession,
  EndSession, ListSessions. Prompt() now delegates to PromptSession("")
  for backward compat.
- client.go agentRunOps: impl using new ARI methods from f16480a

massctl (cmd/massctl/commands/agentrun):
- new-session.go: `massctl ar new-session <name> -w <ws> --cwd <path>`
  prints sessionId to stdout (script-friendly)
- end-session.go: `massctl ar end-session <name> -w <ws> --session-id X`
- list-sessions.go: `massctl ar list-sessions <name> -w <ws>` prints
  active session ids one per line
- prompt.go: new --session-id flag routes prompts to specific session
- command.go: register the 3 new subcommands

Mocks (cmd/massctl/commands/{agent,workspace,workspace/create}/mock_test.go):
- mockAgentRunOps stubs the 4 new interface methods (no-ops; not
  exercised by callers in those packages)

Backward compatibility:
- `massctl ar prompt` without --session-id → routes to initial session
  (same as before via PromptSession with empty sessionId)
- existing tests pass; new commands have no dedicated tests yet (added
  in next commit)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Step 8 of multi-session refactor — final code commit.

Covers the parts of multi-session not exercised by existing tests
(which all need a real ACP agent process and are pre-existing flaky):

- TestSessions_EmptyWhenNoneRegistered — Sessions() on fresh Manager
- TestSessions_SnapshotIncludesRegistered — Sessions() enumerates map
- TestEndSession_RemovesFromMap — happy path
- TestEndSession_RefusesInitialSession — invariant: initial session
  cannot be ended without killing the agent
- TestEndSession_UnknownSessionErrors — clear error for stale ids
- TestPromptSession_RejectsBeforeAgentStarted — nil conn → clean error
- TestCancelSession_RejectsBeforeAgentStarted — same for cancel

Uses a test helper `newManagerForSessionTest` that bypasses Create()
to test bookkeeping logic without spawning an agent. Manager methods
that go through ACP (anything touching m.conn) return a documented
"agent not started" error for callers that race or skip Create().

The pre-existing TestRuntimeSuite tests using a fake bash agent
remain flaky (peer disconnect race) — confirmed unrelated to this
PR by running on origin/main baseline. Not addressed here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AgentSpec fields rely on gopkg.in/yaml.v3 for round-trip via
'massctl agent apply -f'. Without explicit `yaml:` tags, the library
lowercases Go field names (StartupTimeoutSeconds → startuptimeoutseconds),
silently dropping the camelCase keys that users and tooling write to
match the JSON tag spelling.

Add explicit yaml tags matching each json tag so that camelCase YAML
input is preserved byte-for-byte on apply.
Copy link
Copy Markdown
Contributor Author

@wjueyao wjueyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed end-to-end — commits, full diff, and downstream impact. Posting a structured set of findings so they're discoverable as follow-ups.

What's solid

  • Back-compat layering: keeping m.sessionID / m.models pointing at the initial session and turning legacy Prompt/Cancel/SetModel into shims around the *Session variants is the right shape — existing callers stay unchanged, new callers opt into multi-session.
  • Commit hygiene: 8 atomic commits, each compile-clean with a focused scope (scaffold → variants → wire types → server → client → ARI → CLI → tests). Easy to review / bisect.
  • In-code documentation: sessionState rationale, why EndSession refuses initial, optional SessionID semantics — all spelled out in doc comments rather than hidden behind an external design doc.
  • Type-boundary discipline: every wire boundary casts acp.SessionIdstring explicitly. No bare strings leaking into the runtime.

Suggestions worth addressing in this PR

M1 — Service.Prompt (and Cancel / SetModel) dispatch duplication

In pkg/agentrun/server/service.go:

if req.SessionID == "" {
    resp, err = s.mgr.Prompt(ctx, req.Prompt)
} else {
    resp, err = s.mgr.PromptSession(ctx, resolveSessionID(req.SessionID), req.Prompt)
}

Manager.Prompt is now itself a shim around PromptSession(initialSessionID), so this two-branch dispatch does the same thing twice. The "empty string ⇒ initial session" rule is duplicated across client, Service, and Manager layers — three places to update if/when the convention changes.

Suggest consolidating at the Manager boundary, e.g.:

func (m *Manager) PromptSession(ctx, sessionID acp.SessionId, prompt) (...) {
    m.mu.Lock()
    if sessionID == "" {
        sessionID = m.sessionID
    }
    // ... existing logic
}

Then Service / Manager.Prompt collapse to one-liners, and the eventual deprecation of the empty-string convention is a one-line change. Same pattern applies to Cancel / SetModel.

M2 — session/cancel wire shape change (NullaryCommand → UnaryCommand)

pkg/agentrun/server/register.go switches "cancel" from NullaryCommand to UnaryCommand. The Service handler nil-guards req, which is good — but the wire-level behavior depends on what jsonrpc.UnaryCommand does when a JSON-RPC request has no params field (vs params: null vs params: {}).

The PR description's compat claim ("Old clients calling Cancel with nil params: server handler tolerates nil req") covers the Go-side req *T == nil case, but not necessarily the JSON envelope path where the framework decides what reaches the handler. A quick look at pkg/jsonrpc/server.go's unmarshal callback suggests it returns "missing params" when req.Params == nil, which UnaryCommand then wraps as ErrInvalidParams — meaning callers omitting the params field never reach the Service nil-guard.

Suggest verifying with a wire-level test that pins the contract regardless of jsonrpc-framework changes:

  • raw JSON-RPC client sends {"jsonrpc":"2.0","id":1,"method":"session/cancel"} (no params field) → expect 200, cancel routed to initial session.

If the framework does reject the request, a small OptionalUnaryCommand helper (or equivalent) in the jsonrpc package keeps the migration purely additive.

M3 — EndSession vs. in-flight prompt — contract is ambiguous

func (m *Manager) EndSession(sessionID acp.SessionId) error {
    // ... delete(m.sessions, sessionID) — no in-flight check
}

The doc comment says callers "must still call CancelSession (or wait for prompt completion) before EndSession" but EndSession itself doesn't enforce or detect it. If a caller ends a session mid-prompt:

  1. The in-flight PromptSession already passed its _, known := m.sessions[sessionID] check at start, so it continues to completion.
  2. The prompt's defer writes state.Phase = PhaseIdle after EndSession returned — bookkeeping diverges silently from the map.
  3. Trailing log lines reference a sessionID no longer in the map.

Suggest one of:

  • (a) Track an "in-flight prompts per session" counter; EndSession returns EBUSY if non-zero.
  • (b) Keep current behavior but drop the misleading "(or wait for prompt completion)" hint from the docstring — make it explicit that the caller is solely responsible for prompt ordering and the runtime does no enforcement.

For callers that fully own session lifecycle, (b) is acceptable; (a) is a nicer interface for less disciplined consumers.

Smaller items (could be follow-ups)

  • m1. MCP server merge has no dedup: merged := bundle; merged = append(merged, sessionLevel...) — if a session-scoped MCP server shares a name with a bundle-level one, both register. Consider dedup-by-Stdio.Name with session-scoped winning.
  • m2. ListSessions returns IDs only: for debugging it'd be nicer to return []{id, cwd, currentModelId} so operators can see what each session is doing without further RPCs. Cheap upgrade — Manager already has the data.
  • m3. Audit args for initial session omit sessionId: if sessionID != "" { auditArgs["sessionId"] = ... } — when replaying audit logs, initial-session cancels look like "untargeted cancel". Suggest always emitting the resolved session id (including initial).
  • m4. AgentSpec yaml-tag bundling: clean bug-fix and the PR description is honest about it, but semantically unrelated to multi-session. Cleaner as its own commit/PR — easier to revert independently if needed. Non-blocking.

Test coverage gaps

  • Unit tests for agentrun/server Service handlers: NewSession / EndSession / ListSessions validation paths (missing cwd, missing sessionId, empty req) have no direct coverage.
  • Unit tests for the ari/server adapter handlers: same gap; NewSession / EndSession / ListSessions validation + agent not running path uncovered.
  • Wire-compat test for M2 (above): paramless session/cancel request.
  • The TestRuntimeSuite flakiness self-disclosed in the PR description is confirmed pre-existing — agree it shouldn't block this PR, but worth a tracking issue so it doesn't become normalized.

Known follow-ups already self-disclosed (not blocking)

  • state.json schema has no per-session Phase / Models — multi-session sessions are not crash-recoverable. Fine for short-lived sessions; needs design before long-lived use.
  • No max-sessions cap; no idle timeout. Pathological callers can leak sessions until process OOM. Not urgent.

Bottom line

Design is sound, motivation grounded in measured fork+exec cost, and the back-compat layering lets the migration happen one caller at a time. M1 is structural cleanup, M2 is the back-compat claim most worth pinning down with a test, M3 is contract clarity. None of the three should require a substantial rework — likely a half-day to address.

wjueyao and others added 2 commits May 18, 2026 21:16
session/cancel evolved from NullaryCommand (no params) to taking an
optional sessionId. UnaryCommand strict-parses params and returns
InvalidParams when the field is absent — which means pre-multi-session
clients (sending the old paramless cancel) get a wire-level rejection
before the Service handler's nil-guard ever runs.

This commit makes the absent-params case typed and tolerable:

1. jsonrpc.ErrNoParams sentinel returned by the dispatcher's unmarshal
   callback when req.Params == nil. Replaces an ad-hoc fmt.Errorf —
   UnaryCommand semantics unchanged (still wraps as InvalidParams).

2. OptionalUnaryCommand helper — same shape as UnaryCommand but
   tolerates ErrNoParams (handler runs with zero-value Req). Malformed
   JSON unmarshal errors still surface as InvalidParams; tolerance is
   scoped to absent params.

3. agentrun/server/register.go: "session/cancel" switches to
   OptionalUnaryCommand. The Service handler reads req.SessionID
   directly — req is always non-nil from the helper.

Tests pin the contract:
- TestOptionalUnaryCommand_TolerateMissingParams: paramless call
  reaches handler with zero-value Req, no -32602 error.
- TestOptionalUnaryCommand_SurfacesUnmarshalError: malformed params
  still return InvalidParams.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…uard

Addresses two issues from PR review feedback.

M1 — collapse empty-sessionID → initial resolution to Manager layer.

Before, the "empty sessionId means initial session" convention was
implemented in three places: client (passes ""), Service (branches on
"" vs non-empty and calls Manager.Prompt vs PromptSession), Manager
(Manager.Prompt is a shim that reads m.sessionID and calls PromptSession).
Three layers to keep in sync if the convention ever changes.

PromptSession / CancelSession / SetModelSession now do the resolution
themselves via a small resolveSessionLocked helper. Service collapses
to a one-line pass-through. Manager.Prompt / Cancel / SetModel reduce
to single-line shims around the *Session variants with "".

M3 — EndSession refuses sessions with in-flight prompt work.

Previously EndSession only checked map membership; ending a session
mid-prompt left the prompt running against a session no longer tracked,
silently drifting state.json Phase / log lines from the sessions map.

Add a per-session `inflight` counter on sessionState (collapsed from a
parallel `Manager.inflight` map suggested in initial draft — keeps
session bookkeeping in one place). PromptSession increments on entry
under m.mu, decrements via defer (looking up the session by id again
under lock to tolerate teardown). EndSession returns a busy error if
inflight > 0; caller must CancelSession (or wait) first.

Tests:
- TestEndSession_RefusesBusySession: inflight=1 → busy error, session
  stays in map.
- TestEndSession_AllowsAfterRefcountClears: once inflight returns to 0,
  EndSession succeeds — busy is per-state, not permanent.
- TestPromptSession_EmptySessionIDResolvesToInitial: empty sessionID
  reaches a "session not found" branch iff resolution didn't happen;
  test asserts it doesn't.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
wjueyao and others added 4 commits May 19, 2026 11:25
Two blocking bugs surfaced by second-pass review:

1. state.Phase concurrent write race.

   Two PromptSession calls on different sessions race on completion:
   T1 finishes and writes Phase=Idle while T2 is still running. The
   field is single-valued but multi-session lifts that assumption,
   and the code blindly stamps Idle at every prompt end. state.json
   then lies about whether the agent is busy.

   Fix: Manager.activePrompts counts total in-flight PromptSession
   calls across all sessions. PromptSession bumps it under m.mu;
   the deferred decrement does the same. State writes use a new
   phaseFromActivePromptsLocked apply callback that reads the
   counter under writeState's lock — Phase reflects "any session
   running" rather than the last caller's local view.

2. Kill() / process-exit goroutine don't reset session bookkeeping.

   After the agent process dies, m.sessions retains every entry,
   m.conn stays non-nil, and m.sessionID still names a dead session.
   Sessions() returns stale IDs, EndSession silently succeeds on
   dead entries, and PromptSession passes its conn != nil guard
   before failing on the dead pipe — invariants broken across the
   teardown boundary.

   Fix: clearSessions() resets m.sessions, m.conn, m.sessionID,
   m.models, and m.activePrompts under m.mu. Called from Kill()
   and from the goroutine that observes process exit. In-flight
   RPCs hold their own captured conn pointer (read under m.mu at
   entry, used without lock); nil-ing m.conn here guards only
   *future* calls. Existing goroutines will fail on the closed
   pipe, which is the expected behavior.

Tests:
- TestPhaseFromActivePrompts: table-driven over the new apply
  callback; pins (activePrompts=0 → Idle), (1 → Running), (5 → Running).
- TestClearSessions: pre-populates Manager with sessions, sessionID,
  models, activePrompts; verifies all fields are zeroed after the
  helper runs.

The pre-existing TestRuntimeSuite flake (acp initialize: peer
disconnected before response, noted in zoumo#2's verification section)
is unaffected by these changes — internal unit tests pass
deterministically with -race -count=3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two ari-layer issues from second review pass:

1. Method-name divergence between agentrun and ari (D1).

   agentrun used `session/{new,end,list}` (resource/verb hierarchy),
   ari used `agentrun/{new,end,list}-session` (verb-on-noun). Both
   names are wire-visible; once shipped, third-party clients lock in
   whichever they hit.

   ari already uses `agentrun/task/{do,get,list,retry}` for task ops.
   Multi-session is the parallel concern, so rename to
   `agentrun/session/{new,end,list}` — one resource/verb hierarchy at
   this layer. Constants in ari/api/methods.go updated; ari/server/
   service.go registration and ari/server/server.go log messages
   follow. agentrun layer is unchanged (its `session/*` shape was
   already consistent with its scope).

2. ARI new/end/list-session bypass the recovery gate (B4).

   Existing Prompt handler goes through reserveIdleAgent which
   short-circuits with CodeRecoveryBlocked while
   ProcessManager.IsRecovering() is true — sessions opened mid-
   recovery would race with the recovery sweep's view of the agent
   process.

   The new session-lifecycle handlers skipped this check. Can't use
   reserveIdleAgent directly because it also requires Idle status,
   and multi-session permits a new session while another is mid-
   prompt (Status=Running). Add a focused rejectIfRecovering helper
   on agentRunAdapter: NewSession / EndSession / ListSessions short-
   circuit early; existing flow otherwise unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ions

Several smaller items surfaced in second review:

D5. Dead doc TODO. runtime.go's sessionState block referenced
docs/design/multi-session-per-agentrun.md (TODO: add design doc).
The doc was never written and the PR description / commit messages
carry the rationale now. Drop the dangling reference.

D2. Cwd contract divergence. Manager.NewSession allowed empty cwd
with a silent fallback to ResolveAgentRoot(bundleDir, cfg), but every
wire-level layer (CLI / ARI / agentrun service) already rejects empty
cwd before reaching runtime. The fallback was dead code from any real
caller's perspective; an in-process consumer relying on it would have
fragile semantics (cwd resolves differently per bundle). Reject empty
cwd at the runtime boundary — one contract, four enforcement points
instead of three plus a quiet exception.

B5. SetModelSession stale-pointer write. The pre-fix code captured
sess at the first lock, released the lock, ran the RPC, then re-
acquired the lock and mutated sess.models — but the session could
have been ended (or the whole map cleared by Kill) in the window.
The write succeeded against an orphan struct, silently no-op'ing the
in-memory model update. Re-lookup `m.sessions[sessionID]` under the
second lock and skip the mutation when the session is gone.

D6. Sessions() naming. The method returns []acp.SessionId, not
[]*sessionState — name read like it returned the full session
objects. Rename to SessionIDs(). Two callers (service.go + internal
test) updated. Test functions renamed in parallel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before this fix, the Translator stamped every emitted AgentRunEvent
with the initial-session id (cached in t.sessionID). Multi-session
broke this assumption silently: agent_message events from a non-
initial session were labelled with the initial id, and `massctl
agentrun prompt --wait --session-id X` had no way to distinguish
session X's stream from a concurrent session Y's — events crossed,
and a turn_end from Y exited X's collector early.

Translator (pkg/agentrun/server/translator.go):
- NotifyTurnStart / NotifyUserPrompt / NotifyTurnEnd gain a leading
  sessionID parameter. Empty resolves to the initial session via a
  small resolveSessionLocked helper — back-compat for any caller
  outside the multi-session paths.
- broadcastEvent's per-event flow split into a generic
  broadcastEventForSession(sid, ev) and a one-line broadcastEvent
  wrapper that stamps the initial session (still used by error /
  operationAudit / stateChange / metadata events, which are
  process-wide).
- The run() goroutine consumes acp.SessionNotification (which
  carries SessionId per ACP spec); content events now stamp
  n.SessionId rather than t.sessionID.

Service (pkg/agentrun/server/service.go): Prompt forwards
req.SessionID into the three Notify calls.

cmd/mass/commands/run: the bootstrap seed-prompt explicitly targets
the initial session — pass "" to let the Translator resolve.

CLI (cmd/massctl/commands/agentrun/prompt.go): `--wait` mode filters
incoming events by --session-id when non-empty. Default --session-id=""
matches all events (single-session legacy behavior preserved).

Tests:
- TestService_NewSession_MissingCwdValidation — symmetric to runtime's
  cwd-required guard; pins the wire-layer rejection.
- TestService_EndSession_MissingSessionIDValidation — same.
- TestNotifyTurnStart_StampsSessionID / EmptySessionIDResolvesToInitial
  — Translator pins both the explicit-stamp and fallback paths.
- TestRun_StampsSessionIDFromNotification — content-event path
  carries n.SessionId; the regression most likely to bite under
  real multi-session traffic.

Existing translator_test.go callsites bulk-updated for the new
Notify* signatures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
wjueyao and others added 2 commits May 19, 2026 13:15
… race

When Kill() races with an in-flight PromptSession, clearSessions resets
m.activePrompts to 0 before the prompt's deferred cleanup runs. The
deferred decrement then drove the counter to -1 — harmless for
phaseFromActivePromptsLocked's `> 0` check (still reads Idle) but a
broken invariant that the next contributor reading the code would
rightly question.

Extract the deferred cleanup into a named helper
decrementPromptInflight(sessionID) and guard the activePrompts
decrement with `if > 0`. sess.inflight already had a map-lookup guard
for the same race; this matches the symmetry.

Tests:
- TestDecrementPromptInflight_GuardsAgainstNegativeAfterClear pins the
  post-clearSessions race contract.
- TestDecrementPromptInflight_NormalPath confirms the happy path still
  decrements both fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tate

Companion to c96eca8 (activePrompts negative-guard). Same race class:

Kill() → clearSessions() nils m.conn and writes Phase=Stopped.
Meanwhile a PromptSession that was mid-conn.Prompt finally observes the
dead pipe and returns an error. Its outer code then runs:

    _ = m.writeState(m.phaseFromActivePromptsLocked, reason)

phaseFromActivePromptsLocked reads m.activePrompts and writes Phase
based on it (0 → Idle, >0 → Running) — silently clobbering Stopped.
state.json then says Idle/Running even though the agent is dead, and
operators / recovery code make the wrong call.

Fix: phaseFromActivePromptsLocked early-returns when m.conn is nil.
Kill / clearSessions owns lifecycle phase past that point; the apply
callback respects it instead of fighting it.

Tests:
- TestPhaseFromActivePromptsLocked_SkipsAfterKill pins the new contract:
  Phase=Stopped with conn=nil + activePrompts=1 stays Stopped.
- TestPhaseFromActivePrompts updated to set a non-nil conn so it
  exercises the live-agent branch as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant