feat(runtime/acp): multi-session-per-agentrun by wjueyao · Pull Request #2 · zoumo/mass

wjueyao · 2026-05-18T12:25:38Z

Summary

Adds multi-session support to ACP agentruns. One agent process can now host multiple ACP sessions opened via agentrun new-session, each with its own cwd / model state / message history — multiplexed onto the same fork instead of paying fork+exec per task.

Motivation

A downstream caller (an Agent evaluation harness) spawns 3 ACP agentruns per case (executor / responder / reviewer). Running 41 cases means 123 fork+exec cycles of bunx claude-agent-acp (~200MB Node + ~300MB claude binary + MCP servers each). On memory-pressured dev hosts this thrashes swap and 2-min socket-wait deadlines start failing — the workflow becomes unusable.

With multi-session, the caller can keep 3 long-lived agentruns and open a fresh session per case via agentrun new-session. Total fork+exec drops from 3N → 3.

Design

ACP natively supports multiple session/new calls on a single agent process — each session has its own sessionId, cwd, and model state, multiplexed onto the same stdin/stdout. This PR builds the runtime + RPC + CLI layers on top of that primitive.

Per-session state. Manager.sessions map[acp.SessionId]*sessionState tracks each open session's (id, cwd, models). NewSession adds entries; EndSession removes them. The agent process owns its own session map; the runtime mirrors only what it needs to route RPCs.

Initial-session compat. Create() opens the first session via the normal handshake and populates both the new sessions map and the legacy m.sessionID / m.models fields. Existing callers of Prompt / Cancel / SetModel (which don't pass sessionId) keep working — these are now thin shims that delegate to the *Session variants with an empty sessionId.

Empty-sessionId convention. Client and wire layers pass sessionId: "" to mean "the initial session". Resolution happens in exactly one place: Manager's PromptSession / CancelSession / SetModelSession substitute m.sessionID when sessionId is empty. The Service layer is a thin pass-through — no per-layer "empty → initial" branching.

MCP scoping. Bundle-level mcpServers apply to every session; per-session NewSession(cwd, mcpServers) layers session-scoped servers on top (current behavior: append, no dedup — caller responsibility).

In-flight protection. EndSession refuses to release a session with active PromptSession work, tracked via a per-session refcount in Manager.inflight. Callers must CancelSession (or wait) before EndSession; otherwise the prompt completes against a session no longer in the map and bookkeeping drifts. The initial session cannot be EndSession'd at all — kill the whole agent to release it.

session/cancel wire compat. The method evolved from NullaryCommand (no params) to taking optional sessionId. To keep callers that send no params field working, the jsonrpc framework gains an OptionalUnaryCommand helper that delivers a zero-value Req instead of -32602 InvalidParams when params is absent.

Out of scope (follow-ups). state.json is still single-session — Phase and Session.Models reflect the initial session only. Per-session persistence is a follow-up; short-lived sessions don't need it, longer-lived multi-session deployments do. No max-sessions cap or idle timeout either; caller drives lifecycle today.

What changed

Runtime layer (pkg/agentrun/runtime/acp/runtime.go):

Manager.sessions map[acp.SessionId]*sessionState — replaces single-session field; each entry tracks id / cwd / per-session model state.
Manager.inflight map[acp.SessionId]int — per-session active-prompt counter; EndSession refuses to release a busy session.
Backward-compat: keep m.sessionID / m.models pointing to the first session so existing Prompt / Cancel / SetModel shims keep working unchanged.
New methods: NewSession(ctx, cwd, mcpServers), EndSession(sessionId), Sessions(), plus PromptSession / CancelSession / SetModelSession variants taking explicit sessionId (empty resolved to initial).

API layer (pkg/agentrun/api, pkg/ari/server):

Method constants + RPC types for agentrun/new-session, agentrun/end-session, agentrun/list-sessions.
Server handlers wire through to Manager. Existing agentrun/prompt / cancel / set-model gain optional sessionId — empty = initial session (back-compat).

Framework (pkg/jsonrpc):

ErrNoParams sentinel returned by the unmarshal callback when the request has no params field.
OptionalUnaryCommand helper — tolerates ErrNoParams, surfaces real JSON decode errors as InvalidParams. Used for session/cancel.

CLI (cmd/massctl/commands/agentrun):

massctl agentrun new-session <name> -w <ws> --cwd <path> — returns sessionId on stdout.
massctl agentrun end-session <name> -w <ws> --session-id <sid>.
massctl agentrun list-sessions <name> -w <ws>.
agentrun prompt gains --session-id flag.

ARI client (pkg/ari/client): typed wrappers for the new RPCs.

Tests:

pkg/agentrun/runtime/acp/runtime_internal_test.go: bookkeeping tests for sessions map (add / remove / list / busy / state isolation).
pkg/jsonrpc/server_test.go: wire-compat tests for OptionalUnaryCommand (missing params tolerated; malformed params surfaced).

Bundled fix (pkg/ari/api/domain.go): add explicit yaml: tags to AgentSpec fields. Without them, gopkg.in/yaml.v3 lowercases Go field names and silently drops camelCase keys like startupTimeoutSeconds on massctl agent apply -f round-trip. Tangentially related (camelCase fields matter for multi-session pool YAML), small enough to fold in.

Verification

Local go test ./... clean (one flake in TestAgentdRestartRecovery under full-suite contention; passes in isolation and on rerun — same behavior on main, not introduced by this PR).

Compat / migration

All existing single-session call sites unchanged. New methods are additive. State schema additive (sessions map) — no migration needed for daemons that don't open extra sessions. Pre-multi-session clients calling session/cancel with no params field continue to work via the OptionalUnaryCommand helper.

…ons map) First step of the multi-session refactor. ACP protocol natively supports multiple sessions per agent process (per spec: session/new is baseline), but mass runtime's Manager pinned a single sessionID, forcing callers to fork+kill an agent process per session — wasteful for use cases like mindpowers self-EDD that run N tasks back-to-back. This commit adds the structural foundation without breaking existing single-session callers: 1. New sessionState type capturing per-session protocol metadata (id / models / cwd). Mirrors the agent-side session map. 2. Manager.sessions map[acp.SessionId]*sessionState added. Create()'s initial session/new now populates both the legacy single-session fields (sessionID / models — kept for back-compat) AND the map. 3. NewSession(ctx, cwd, mcpServers) → SessionId opens additional ACP sessions on the running agent. Takes per-session cwd + optional per-session MCP server overrides layered on bundle defaults. 4. EndSession(sessionId) clears runtime tracking. Note: ACP has no explicit "end session" RPC yet, so this only removes the local map entry — the agent process retains its session state until cancelled or process exits. Initial session (Create's first one) cannot be ended via this method (kill the agent instead). 5. Sessions() returns active session IDs snapshot for callers that need to enumerate / clean up. Out of scope for this commit (follow-ups): - Prompt/Cancel/SetModel sessionId parameter (PromptSession etc.) - State.json schema with sessions map (currently still single SessionID) - ARI server: agentrun/new-session + agentrun/end-session RPC - massctl: agentrun new-session/end-session subcommands - Tests for the new methods (this is scaffold; tests follow with the RPC layer so we can exercise via the same interface mindpowers uses) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…riants Step 2 of multi-session refactor (step 1: c16e57a sessions map scaffold). Existing Prompt/Cancel/SetModel methods didn't take sessionId — they used the Manager's single m.sessionID, which broke for callers holding multiple sessions via the new NewSession method. This commit adds explicit-sessionId variants and refactors the original methods to be thin shims that pass m.sessionID through. Backward compatibility preserved: existing callers (mass daemon ARI, massctl) keep working without changes; new multi-session callers use the *Session variants. API additions: PromptSession(ctx, sessionId, prompt) → PromptResponse CancelSession(ctx, sessionId) error SetModelSession(ctx, sessionId, modelId) error All three validate sessionId is in Manager.sessions map and return a clear "session not found" error otherwise — avoids confusing low-level ACP errors when callers pass a stale or wrong sessionId. Per-session models state is now updated in sessions[id].models. The legacy single-session Manager.models field is still mirrored for the initial session only (matches existing GetState / SessionID() behavior). State.json updates remain process-wide (single Phase field) — per- session phase tracking is a follow-up state-schema change. Compile-clean (go vet + go test -run NoneShouldMatch across pkg/agentrun). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Step 3 of multi-session refactor (steps 1-2: c16e57a, c17f298). Add wire types so RPC callers can address specific sessions: - SessionPromptParams: new optional SessionID field (empty = initial) - SessionCancelParams: new struct, optional SessionID - SessionSetModelParams: new optional SessionID field - SessionNewParams + SessionNewResult: open additional session - SessionEndParams + SessionEndResult: release runtime tracking - SessionListResult: enumerate active sessions Add method constants (api/methods.go): - session/new, session/end, session/list Backward-compat preserved: existing single-session callers pass empty SessionID (or no params at all where the old Cancel had none), which routes to the initial session — runtime layer (c17f298) maps empty sessionId → m.sessionID. No behavior change yet — server handlers + client methods are step 4-5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Steps 4-5 of multi-session refactor (steps 1-3: c16e57a, c17f298, 36594b8). Server side (pkg/agentrun/server): - Handler interface: Cancel signature change (now takes *SessionCancelParams), add NewSession / EndSession / ListSessions - register.go: route session/new, session/end, session/list to new handlers - service.go: Prompt/Cancel/SetModel now check params.SessionID - empty → call Manager's legacy shim (initial session) - non-empty → call Manager's *Session variant (validates membership) - New handler impls translate wire params (SessionNewMcpServer → acp.McpServerStdio etc.) to runtime layer types Client side (pkg/agentrun/client/client.go): - Cancel sends *SessionCancelParams instead of nil (compat with new handler signature; empty SessionID still routes to initial session) - New methods: NewSession, EndSession, ListSessions, CancelSession - Existing Cancel kept for backward compatibility (delegates to initial) Tests updated to match new Handler interface (stubRunService + replayService in client tests). Backward compatibility: - Old clients calling Cancel with nil params: server handler tolerates nil req (treats as empty SessionID → initial session) - Old clients calling Prompt without SessionID field: same routing - All existing tests still pass; no callers broken Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…sessions Step 6 of multi-session refactor (steps 1-5: c16e57a, c17f298, 36594b8, e933264, f16480a inlining). Adds daemon-level RPCs that forward to the running agent-run process: API (pkg/ari/api): - methods.go: MethodAgentRunNewSession / EndSession / ListSessions - types.go: AgentRunNewSessionParams/Result + AgentRunEndSessionParams/Result + AgentRunListSessionsParams/Result - AgentRunPromptParams: new optional SessionID field — forwards to agent-run's session/prompt SessionID Server (pkg/ari/server): - AgentRunService interface: 3 new methods - service.go RegisterAgentRunService: route new-session / end-session / list-sessions on agentrun service - server.go agentRunAdapter: handler impls that Connect to agent-run and call the runapi client methods added in e933264. Prompt forwards the new SessionID field through to agent-run. Backward compatibility: - AgentRunPromptParams.SessionID is optional (omitempty); pre-multi- session callers pass nothing → routes to agent's initial session Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Step 7 of multi-session refactor. ARI client (pkg/ari/client): - interfaces.go AgentRunOps: 4 new methods — PromptSession, NewSession, EndSession, ListSessions. Prompt() now delegates to PromptSession("") for backward compat. - client.go agentRunOps: impl using new ARI methods from f16480a massctl (cmd/massctl/commands/agentrun): - new-session.go: `massctl ar new-session <name> -w <ws> --cwd <path>` prints sessionId to stdout (script-friendly) - end-session.go: `massctl ar end-session <name> -w <ws> --session-id X` - list-sessions.go: `massctl ar list-sessions <name> -w <ws>` prints active session ids one per line - prompt.go: new --session-id flag routes prompts to specific session - command.go: register the 3 new subcommands Mocks (cmd/massctl/commands/{agent,workspace,workspace/create}/mock_test.go): - mockAgentRunOps stubs the 4 new interface methods (no-ops; not exercised by callers in those packages) Backward compatibility: - `massctl ar prompt` without --session-id → routes to initial session (same as before via PromptSession with empty sessionId) - existing tests pass; new commands have no dedicated tests yet (added in next commit) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Step 8 of multi-session refactor — final code commit. Covers the parts of multi-session not exercised by existing tests (which all need a real ACP agent process and are pre-existing flaky): - TestSessions_EmptyWhenNoneRegistered — Sessions() on fresh Manager - TestSessions_SnapshotIncludesRegistered — Sessions() enumerates map - TestEndSession_RemovesFromMap — happy path - TestEndSession_RefusesInitialSession — invariant: initial session cannot be ended without killing the agent - TestEndSession_UnknownSessionErrors — clear error for stale ids - TestPromptSession_RejectsBeforeAgentStarted — nil conn → clean error - TestCancelSession_RejectsBeforeAgentStarted — same for cancel Uses a test helper `newManagerForSessionTest` that bypasses Create() to test bookkeeping logic without spawning an agent. Manager methods that go through ACP (anything touching m.conn) return a documented "agent not started" error for callers that race or skip Create(). The pre-existing TestRuntimeSuite tests using a fake bash agent remain flaky (peer disconnect race) — confirmed unrelated to this PR by running on origin/main baseline. Not addressed here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AgentSpec fields rely on gopkg.in/yaml.v3 for round-trip via 'massctl agent apply -f'. Without explicit `yaml:` tags, the library lowercases Go field names (StartupTimeoutSeconds → startuptimeoutseconds), silently dropping the camelCase keys that users and tooling write to match the JSON tag spelling. Add explicit yaml tags matching each json tag so that camelCase YAML input is preserved byte-for-byte on apply.

wjueyao

Reviewed end-to-end — commits, full diff, and downstream impact. Posting a structured set of findings so they're discoverable as follow-ups.

What's solid

Back-compat layering: keeping m.sessionID / m.models pointing at the initial session and turning legacy Prompt/Cancel/SetModel into shims around the *Session variants is the right shape — existing callers stay unchanged, new callers opt into multi-session.
Commit hygiene: 8 atomic commits, each compile-clean with a focused scope (scaffold → variants → wire types → server → client → ARI → CLI → tests). Easy to review / bisect.
In-code documentation: sessionState rationale, why EndSession refuses initial, optional SessionID semantics — all spelled out in doc comments rather than hidden behind an external design doc.
Type-boundary discipline: every wire boundary casts acp.SessionId ↔ string explicitly. No bare strings leaking into the runtime.

Suggestions worth addressing in this PR

M1 — `Service.Prompt` (and `Cancel` / `SetModel`) dispatch duplication

In pkg/agentrun/server/service.go:

if req.SessionID == "" {
    resp, err = s.mgr.Prompt(ctx, req.Prompt)
} else {
    resp, err = s.mgr.PromptSession(ctx, resolveSessionID(req.SessionID), req.Prompt)
}

Manager.Prompt is now itself a shim around PromptSession(initialSessionID), so this two-branch dispatch does the same thing twice. The "empty string ⇒ initial session" rule is duplicated across client, Service, and Manager layers — three places to update if/when the convention changes.

Suggest consolidating at the Manager boundary, e.g.:

func (m *Manager) PromptSession(ctx, sessionID acp.SessionId, prompt) (...) {
    m.mu.Lock()
    if sessionID == "" {
        sessionID = m.sessionID
    }
    // ... existing logic
}

Then Service / Manager.Prompt collapse to one-liners, and the eventual deprecation of the empty-string convention is a one-line change. Same pattern applies to Cancel / SetModel.

M2 — `session/cancel` wire shape change (NullaryCommand → UnaryCommand)

pkg/agentrun/server/register.go switches "cancel" from NullaryCommand to UnaryCommand. The Service handler nil-guards req, which is good — but the wire-level behavior depends on what jsonrpc.UnaryCommand does when a JSON-RPC request has no params field (vs params: null vs params: {}).

The PR description's compat claim ("Old clients calling Cancel with nil params: server handler tolerates nil req") covers the Go-side req *T == nil case, but not necessarily the JSON envelope path where the framework decides what reaches the handler. A quick look at pkg/jsonrpc/server.go's unmarshal callback suggests it returns "missing params" when req.Params == nil, which UnaryCommand then wraps as ErrInvalidParams — meaning callers omitting the params field never reach the Service nil-guard.

Suggest verifying with a wire-level test that pins the contract regardless of jsonrpc-framework changes:

raw JSON-RPC client sends {"jsonrpc":"2.0","id":1,"method":"session/cancel"} (no params field) → expect 200, cancel routed to initial session.

If the framework does reject the request, a small OptionalUnaryCommand helper (or equivalent) in the jsonrpc package keeps the migration purely additive.

M3 — `EndSession` vs. in-flight prompt — contract is ambiguous

func (m *Manager) EndSession(sessionID acp.SessionId) error {
    // ... delete(m.sessions, sessionID) — no in-flight check
}

The doc comment says callers "must still call CancelSession (or wait for prompt completion) before EndSession" but EndSession itself doesn't enforce or detect it. If a caller ends a session mid-prompt:

The in-flight PromptSession already passed its _, known := m.sessions[sessionID] check at start, so it continues to completion.
The prompt's defer writes state.Phase = PhaseIdle after EndSession returned — bookkeeping diverges silently from the map.
Trailing log lines reference a sessionID no longer in the map.

Suggest one of:

(a) Track an "in-flight prompts per session" counter; EndSession returns EBUSY if non-zero.
(b) Keep current behavior but drop the misleading "(or wait for prompt completion)" hint from the docstring — make it explicit that the caller is solely responsible for prompt ordering and the runtime does no enforcement.

For callers that fully own session lifecycle, (b) is acceptable; (a) is a nicer interface for less disciplined consumers.

Smaller items (could be follow-ups)

m1. MCP server merge has no dedup: merged := bundle; merged = append(merged, sessionLevel...) — if a session-scoped MCP server shares a name with a bundle-level one, both register. Consider dedup-by-Stdio.Name with session-scoped winning.
m2. ListSessions returns IDs only: for debugging it'd be nicer to return []{id, cwd, currentModelId} so operators can see what each session is doing without further RPCs. Cheap upgrade — Manager already has the data.
m3. Audit args for initial session omit sessionId: if sessionID != "" { auditArgs["sessionId"] = ... } — when replaying audit logs, initial-session cancels look like "untargeted cancel". Suggest always emitting the resolved session id (including initial).
m4. AgentSpec yaml-tag bundling: clean bug-fix and the PR description is honest about it, but semantically unrelated to multi-session. Cleaner as its own commit/PR — easier to revert independently if needed. Non-blocking.

Test coverage gaps

Unit tests for agentrun/server Service handlers: NewSession / EndSession / ListSessions validation paths (missing cwd, missing sessionId, empty req) have no direct coverage.
Unit tests for the ari/server adapter handlers: same gap; NewSession / EndSession / ListSessions validation + agent not running path uncovered.
Wire-compat test for M2 (above): paramless session/cancel request.
The TestRuntimeSuite flakiness self-disclosed in the PR description is confirmed pre-existing — agree it shouldn't block this PR, but worth a tracking issue so it doesn't become normalized.

Known follow-ups already self-disclosed (not blocking)

state.json schema has no per-session Phase / Models — multi-session sessions are not crash-recoverable. Fine for short-lived sessions; needs design before long-lived use.
No max-sessions cap; no idle timeout. Pathological callers can leak sessions until process OOM. Not urgent.

Bottom line

Design is sound, motivation grounded in measured fork+exec cost, and the back-compat layering lets the migration happen one caller at a time. M1 is structural cleanup, M2 is the back-compat claim most worth pinning down with a test, M3 is contract clarity. None of the three should require a substantial rework — likely a half-day to address.

session/cancel evolved from NullaryCommand (no params) to taking an optional sessionId. UnaryCommand strict-parses params and returns InvalidParams when the field is absent — which means pre-multi-session clients (sending the old paramless cancel) get a wire-level rejection before the Service handler's nil-guard ever runs. This commit makes the absent-params case typed and tolerable: 1. jsonrpc.ErrNoParams sentinel returned by the dispatcher's unmarshal callback when req.Params == nil. Replaces an ad-hoc fmt.Errorf — UnaryCommand semantics unchanged (still wraps as InvalidParams). 2. OptionalUnaryCommand helper — same shape as UnaryCommand but tolerates ErrNoParams (handler runs with zero-value Req). Malformed JSON unmarshal errors still surface as InvalidParams; tolerance is scoped to absent params. 3. agentrun/server/register.go: "session/cancel" switches to OptionalUnaryCommand. The Service handler reads req.SessionID directly — req is always non-nil from the helper. Tests pin the contract: - TestOptionalUnaryCommand_TolerateMissingParams: paramless call reaches handler with zero-value Req, no -32602 error. - TestOptionalUnaryCommand_SurfacesUnmarshalError: malformed params still return InvalidParams. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…uard Addresses two issues from PR review feedback. M1 — collapse empty-sessionID → initial resolution to Manager layer. Before, the "empty sessionId means initial session" convention was implemented in three places: client (passes ""), Service (branches on "" vs non-empty and calls Manager.Prompt vs PromptSession), Manager (Manager.Prompt is a shim that reads m.sessionID and calls PromptSession). Three layers to keep in sync if the convention ever changes. PromptSession / CancelSession / SetModelSession now do the resolution themselves via a small resolveSessionLocked helper. Service collapses to a one-line pass-through. Manager.Prompt / Cancel / SetModel reduce to single-line shims around the *Session variants with "". M3 — EndSession refuses sessions with in-flight prompt work. Previously EndSession only checked map membership; ending a session mid-prompt left the prompt running against a session no longer tracked, silently drifting state.json Phase / log lines from the sessions map. Add a per-session `inflight` counter on sessionState (collapsed from a parallel `Manager.inflight` map suggested in initial draft — keeps session bookkeeping in one place). PromptSession increments on entry under m.mu, decrements via defer (looking up the session by id again under lock to tolerate teardown). EndSession returns a busy error if inflight > 0; caller must CancelSession (or wait) first. Tests: - TestEndSession_RefusesBusySession: inflight=1 → busy error, session stays in map. - TestEndSession_AllowsAfterRefcountClears: once inflight returns to 0, EndSession succeeds — busy is per-state, not permanent. - TestPromptSession_EmptySessionIDResolvesToInitial: empty sessionID reaches a "session not found" branch iff resolution didn't happen; test asserts it doesn't. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two blocking bugs surfaced by second-pass review: 1. state.Phase concurrent write race. Two PromptSession calls on different sessions race on completion: T1 finishes and writes Phase=Idle while T2 is still running. The field is single-valued but multi-session lifts that assumption, and the code blindly stamps Idle at every prompt end. state.json then lies about whether the agent is busy. Fix: Manager.activePrompts counts total in-flight PromptSession calls across all sessions. PromptSession bumps it under m.mu; the deferred decrement does the same. State writes use a new phaseFromActivePromptsLocked apply callback that reads the counter under writeState's lock — Phase reflects "any session running" rather than the last caller's local view. 2. Kill() / process-exit goroutine don't reset session bookkeeping. After the agent process dies, m.sessions retains every entry, m.conn stays non-nil, and m.sessionID still names a dead session. Sessions() returns stale IDs, EndSession silently succeeds on dead entries, and PromptSession passes its conn != nil guard before failing on the dead pipe — invariants broken across the teardown boundary. Fix: clearSessions() resets m.sessions, m.conn, m.sessionID, m.models, and m.activePrompts under m.mu. Called from Kill() and from the goroutine that observes process exit. In-flight RPCs hold their own captured conn pointer (read under m.mu at entry, used without lock); nil-ing m.conn here guards only *future* calls. Existing goroutines will fail on the closed pipe, which is the expected behavior. Tests: - TestPhaseFromActivePrompts: table-driven over the new apply callback; pins (activePrompts=0 → Idle), (1 → Running), (5 → Running). - TestClearSessions: pre-populates Manager with sessions, sessionID, models, activePrompts; verifies all fields are zeroed after the helper runs. The pre-existing TestRuntimeSuite flake (acp initialize: peer disconnected before response, noted in zoumo#2's verification section) is unaffected by these changes — internal unit tests pass deterministically with -race -count=3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two ari-layer issues from second review pass: 1. Method-name divergence between agentrun and ari (D1). agentrun used `session/{new,end,list}` (resource/verb hierarchy), ari used `agentrun/{new,end,list}-session` (verb-on-noun). Both names are wire-visible; once shipped, third-party clients lock in whichever they hit. ari already uses `agentrun/task/{do,get,list,retry}` for task ops. Multi-session is the parallel concern, so rename to `agentrun/session/{new,end,list}` — one resource/verb hierarchy at this layer. Constants in ari/api/methods.go updated; ari/server/ service.go registration and ari/server/server.go log messages follow. agentrun layer is unchanged (its `session/*` shape was already consistent with its scope). 2. ARI new/end/list-session bypass the recovery gate (B4). Existing Prompt handler goes through reserveIdleAgent which short-circuits with CodeRecoveryBlocked while ProcessManager.IsRecovering() is true — sessions opened mid- recovery would race with the recovery sweep's view of the agent process. The new session-lifecycle handlers skipped this check. Can't use reserveIdleAgent directly because it also requires Idle status, and multi-session permits a new session while another is mid- prompt (Status=Running). Add a focused rejectIfRecovering helper on agentRunAdapter: NewSession / EndSession / ListSessions short- circuit early; existing flow otherwise unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ions Several smaller items surfaced in second review: D5. Dead doc TODO. runtime.go's sessionState block referenced docs/design/multi-session-per-agentrun.md (TODO: add design doc). The doc was never written and the PR description / commit messages carry the rationale now. Drop the dangling reference. D2. Cwd contract divergence. Manager.NewSession allowed empty cwd with a silent fallback to ResolveAgentRoot(bundleDir, cfg), but every wire-level layer (CLI / ARI / agentrun service) already rejects empty cwd before reaching runtime. The fallback was dead code from any real caller's perspective; an in-process consumer relying on it would have fragile semantics (cwd resolves differently per bundle). Reject empty cwd at the runtime boundary — one contract, four enforcement points instead of three plus a quiet exception. B5. SetModelSession stale-pointer write. The pre-fix code captured sess at the first lock, released the lock, ran the RPC, then re- acquired the lock and mutated sess.models — but the session could have been ended (or the whole map cleared by Kill) in the window. The write succeeded against an orphan struct, silently no-op'ing the in-memory model update. Re-lookup `m.sessions[sessionID]` under the second lock and skip the mutation when the session is gone. D6. Sessions() naming. The method returns []acp.SessionId, not []*sessionState — name read like it returned the full session objects. Rename to SessionIDs(). Two callers (service.go + internal test) updated. Test functions renamed in parallel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Before this fix, the Translator stamped every emitted AgentRunEvent with the initial-session id (cached in t.sessionID). Multi-session broke this assumption silently: agent_message events from a non- initial session were labelled with the initial id, and `massctl agentrun prompt --wait --session-id X` had no way to distinguish session X's stream from a concurrent session Y's — events crossed, and a turn_end from Y exited X's collector early. Translator (pkg/agentrun/server/translator.go): - NotifyTurnStart / NotifyUserPrompt / NotifyTurnEnd gain a leading sessionID parameter. Empty resolves to the initial session via a small resolveSessionLocked helper — back-compat for any caller outside the multi-session paths. - broadcastEvent's per-event flow split into a generic broadcastEventForSession(sid, ev) and a one-line broadcastEvent wrapper that stamps the initial session (still used by error / operationAudit / stateChange / metadata events, which are process-wide). - The run() goroutine consumes acp.SessionNotification (which carries SessionId per ACP spec); content events now stamp n.SessionId rather than t.sessionID. Service (pkg/agentrun/server/service.go): Prompt forwards req.SessionID into the three Notify calls. cmd/mass/commands/run: the bootstrap seed-prompt explicitly targets the initial session — pass "" to let the Translator resolve. CLI (cmd/massctl/commands/agentrun/prompt.go): `--wait` mode filters incoming events by --session-id when non-empty. Default --session-id="" matches all events (single-session legacy behavior preserved). Tests: - TestService_NewSession_MissingCwdValidation — symmetric to runtime's cwd-required guard; pins the wire-layer rejection. - TestService_EndSession_MissingSessionIDValidation — same. - TestNotifyTurnStart_StampsSessionID / EmptySessionIDResolvesToInitial — Translator pins both the explicit-stamp and fallback paths. - TestRun_StampsSessionIDFromNotification — content-event path carries n.SessionId; the regression most likely to bite under real multi-session traffic. Existing translator_test.go callsites bulk-updated for the new Notify* signatures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… race When Kill() races with an in-flight PromptSession, clearSessions resets m.activePrompts to 0 before the prompt's deferred cleanup runs. The deferred decrement then drove the counter to -1 — harmless for phaseFromActivePromptsLocked's `> 0` check (still reads Idle) but a broken invariant that the next contributor reading the code would rightly question. Extract the deferred cleanup into a named helper decrementPromptInflight(sessionID) and guard the activePrompts decrement with `if > 0`. sess.inflight already had a map-lookup guard for the same race; this matches the symmetry. Tests: - TestDecrementPromptInflight_GuardsAgainstNegativeAfterClear pins the post-clearSessions race contract. - TestDecrementPromptInflight_NormalPath confirms the happy path still decrements both fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tate Companion to c96eca8 (activePrompts negative-guard). Same race class: Kill() → clearSessions() nils m.conn and writes Phase=Stopped. Meanwhile a PromptSession that was mid-conn.Prompt finally observes the dead pipe and returns an error. Its outer code then runs: _ = m.writeState(m.phaseFromActivePromptsLocked, reason) phaseFromActivePromptsLocked reads m.activePrompts and writes Phase based on it (0 → Idle, >0 → Running) — silently clobbering Stopped. state.json then says Idle/Running even though the agent is dead, and operators / recovery code make the wrong call. Fix: phaseFromActivePromptsLocked early-returns when m.conn is nil. Kill / clearSessions owns lifecycle phase past that point; the apply callback respects it instead of fighting it. Tests: - TestPhaseFromActivePromptsLocked_SkipsAfterKill pins the new contract: Phase=Stopped with conn=nil + activePrompts=1 stays Stopped. - TestPhaseFromActivePrompts updated to set a non-nil conn so it exercises the live-agent branch as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

wjueyao and others added 8 commits May 18, 2026 17:56

wjueyao commented May 18, 2026

View reviewed changes

wjueyao and others added 2 commits May 18, 2026 21:16

wjueyao mentioned this pull request May 19, 2026

Flaky tests: send-on-closed-channel race in sourcegraph/jsonrpc2 Conn under full-suite load #3

Open

wjueyao and others added 4 commits May 19, 2026 11:25

This was referenced May 19, 2026

ListSessions returns IDs only — add cwd, currentModelId, inflight for operator visibility #4

Open

Per-session MCP overrides only support stdio — bundle supports stdio/sse/http #5

Open

NewSession ctx-cancel race can leak agent-side session state #6

Open

wjueyao and others added 2 commits May 19, 2026 13:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(runtime/acp): multi-session-per-agentrun#2

feat(runtime/acp): multi-session-per-agentrun#2
wjueyao wants to merge 16 commits into
zoumo:mainfrom
wjueyao:feat/multi-session-per-agentrun

wjueyao commented May 18, 2026 •

edited

Loading

Uh oh!

wjueyao left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

wjueyao commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Design

What changed

Verification

Compat / migration

Next

Uh oh!

wjueyao left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

What's solid

Suggestions worth addressing in this PR

M1 — Service.Prompt (and Cancel / SetModel) dispatch duplication

M2 — session/cancel wire shape change (NullaryCommand → UnaryCommand)

M3 — EndSession vs. in-flight prompt — contract is ambiguous

Smaller items (could be follow-ups)

Test coverage gaps

Known follow-ups already self-disclosed (not blocking)

Bottom line

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

wjueyao commented May 18, 2026 •

edited

Loading

wjueyao left a comment •

edited

Loading

M1 — `Service.Prompt` (and `Cancel` / `SetModel`) dispatch duplication

M2 — `session/cancel` wire shape change (NullaryCommand → UnaryCommand)

M3 — `EndSession` vs. in-flight prompt — contract is ambiguous