Skip to content

fix(gateway): make /add-project recoverable and shutdown-aware#695

Merged
marcusrbrown merged 1 commit into
mainfrom
fix/add-project-recoverability
May 29, 2026
Merged

fix(gateway): make /add-project recoverable and shutdown-aware#695
marcusrbrown merged 1 commit into
mainfrom
fix/add-project-recoverability

Conversation

@marcusrbrown
Copy link
Copy Markdown
Collaborator

Makes /fro-bot add-project recover cleanly from partial failures and behave correctly during gateway shutdown.

Idempotent resume

When a repo is already cloned but has no channel binding — the state left behind when a prior run failed after cloning but before writing the binding — re-running /add-project now resumes from channel creation instead of dead-ending on repo-exists. Clones are atomic (the workspace-agent renames a temp dir onto the final path), so repo-exists reliably means a completed clone; the command re-reads the bindings store and only resumes when a binding is confirmed absent. A store-read failure reports an internal error rather than resuming, avoiding orphaned channels during an outage.

Shutdown awareness

New /add-project invocations are refused with a clear retry message while the gateway is draining for shutdown, instead of starting work that gets cut off mid-flight. Any run that is hard-cut is healed by the next retry via the resume path above.

Also

  • Fixed a missing early return on the disk-full clone-error branch.
  • Refreshed the orchestration solution doc to describe the idempotent recovery contract.

Gateway test suite: 295 passing.

Copy link
Copy Markdown
Owner

@fro-bot fro-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: CONDITIONAL

One blocking issue found; everything else is solid.

Blocking issues

**Uninitialized when and **

In the new control flow, workspacePath is declared as let workspacePath: string (no initializer). Inside the cloneResult.success === false branch:

  • errorKind === 'clone-error' with code === 'repo-exists' + confirmed-absent binding: assigns workspacePath and falls through ✓
  • errorKind === 'clone-error' with code === 'repo-exists' + binding found or error: returns ✓
  • errorKind === 'clone-error' with other codes (disk-full, generic): returns ✓
  • errorKind === 'timeout', 'response-mismatch', or the final else: returns ✓

So at runtime, workspacePath is always either assigned or the function has returned — the TypeScript compiler should catch this at the type level if it doesn't. Please confirm that tsc --noEmit passes cleanly (or that CI already runs type-check). If TypeScript doesn't error but there's a path where workspacePath is used uninitialized due to a future refactor removing a return, this is a latent maintenance trap.

Recommended fix: initialize to a sentinel or reassert the invariant:

let workspacePath = '' // assigned in both live branches below; empty string is unreachable

or narrow the type with a post-branch assertion. This makes the intent explicit and removes the reliance on TypeScript's definite-assignment analysis holding across future edits.


Non-blocking concerns

Shutdown gate timing window — The gate checks isShuttingDown() once after deferReply. If SIGTERM arrives between the gate check and guild.channels.create (the first real side effect), the command proceeds. This is documented as an accepted trade-off (resume path heals it), but worth noting that isShuttingDown() is not re-checked at the side-effect boundary. Fine for v1; worth a comment if/when this code grows more phases.

EXPECTED_WORKSPACE_ROOT export — Exporting the constant alongside workspaceRepoPath widens the public surface slightly. Since workspaceRepoPath is now the canonical reference, callers should use the function, not the constant. Consider keeping the constant unexported and only exporting the function. Non-blocking since workspaceRepoPath is used internally too.

Orphan channel on concurrent loser — The reply says "bound by a concurrent request / manual cleanup may be needed" but doesn't tell the operator which channel to delete or where to find it. This is fine for v1 (the PR doc acknowledges it), but a follow-up to include the orphan channel ID in the message would improve operability.

Missing tests

  • No test for SIGTERM arriving between the shutdown-gate check and guild.channels.create — i.e., a command that passes the gate but then gets hard-killed. The resume path is tested, so this is low priority, but the window is real.
  • The "isShuttingDown absent (default) → proceeds normally" test doesn't assert on the final reply state, only that clone was called. Fine for its purpose, but it would be more confidence-inducing if it also checked that the function resolved without throwing.

Risk assessment (MED):

The overall direction is correct and the logic is sound. The one blocking item is a type-level question that is likely already caught by the compiler — but confirming it is cheap and removes a maintenance hazard. The concurrent-resume race is well-documented and bounded. The shutdown gate is a net improvement over the prior hard-kill behavior. No security exposure introduced.


Run Summary
Field Value
Event pull_request
Repository fro-bot/agent
Run ID 26663885769
Cache hit
Session ses_18a4a1a7dffeBEYWEHZyjDGOqZ

Copy link
Copy Markdown
Owner

@fro-bot fro-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: CONDITIONAL

One blocking issue found; everything else is solid.

Blocking issues

Uninitialized workspacePath — confirm definite-assignment holds

In the new control flow, workspacePath is declared as let workspacePath: string (no initializer). Inside the cloneResult.success === false branch, every sub-path either assigns workspacePath and falls through (repo-exists + confirmed-absent binding), or explicitly returns. At runtime the variable is always either assigned or the function has returned.

However, this relies entirely on TypeScript's definite-assignment analysis holding across future edits. If a future refactor drops a return in one of the error sub-branches, workspacePath will be an empty/undefined string at guild.channels.create with no compile-time guard.

Recommended fix: initialize to a sentinel so the invariant is explicit:

let workspacePath = '' // assigned in both live paths below before use

Please confirm tsc --noEmit passes cleanly in CI — if the compiler accepts the uninitialized form (strict mode / definite-assignment checks), this is lower priority, but the sentinel init costs nothing and survives refactors.


Non-blocking concerns

Shutdown gate timing window — The gate checks isShuttingDown() once after deferReply. If SIGTERM arrives between the gate check and guild.channels.create (the first real side effect), the command proceeds. Documented as accepted v1 trade-off (resume path heals it). Worth a comment at the side-effect boundary if this code grows more phases.

EXPECTED_WORKSPACE_ROOT export — Exporting the constant alongside workspaceRepoPath widens the public surface. Since workspaceRepoPath is now the canonical reference, consider keeping the constant unexported. Non-blocking since workspaceRepoPath is what callers should use.

Orphan channel on concurrent loser — The "bound by a concurrent request / manual cleanup may be needed" reply doesn't include the orphan channel ID. Fine for v1 (explicitly documented), but a follow-up that includes the channel ID in the message would improve operator experience.

Missing tests

  • No test for SIGTERM arriving between the shutdown-gate check and the first side effect (a command that passes the gate but then gets hard-killed). The resume path covers healing, so this is low priority.
  • The "isShuttingDown absent -> proceeds normally" test only asserts clone was called, not that the function resolved cleanly. Sufficient for its purpose but could be tightened.

Risk assessment (MED)

The logic is sound and the resume path is well-tested. The blocking item is a type-level question likely already caught by the compiler — confirming it is cheap and removes a maintenance hazard. The concurrent-resume race is bounded and documented. The shutdown gate is a net improvement over prior hard-kill behavior. No security exposure introduced.


Run Summary
Field Value
Event pull_request
Repository fro-bot/agent
Run ID 26663885769
Cache hit
Session ses_18a4a1a7dffeBEYWEHZyjDGOqZ

- Resume from CREATING_CHANNEL when a repo is already cloned but has no
  binding, so re-running /add-project after a post-clone failure heals
  instead of dead-ending on repo-exists. Clones are atomic, so repo-exists
  always means a complete clone; the canonical workspace path is
  reconstructed via a shared workspaceRepoPath helper.
- Refuse new /add-project invocations while the gateway is draining for
  shutdown, with a clear retry message, instead of starting work that gets
  hard-killed mid-flight.
- Fix a missing return on the disk-full clone-error branch that would have
  fallen through with an unset workspace path.
@marcusrbrown marcusrbrown force-pushed the fix/add-project-recoverability branch from 7a3c89f to 2510b48 Compare May 29, 2026 21:56
@marcusrbrown
Copy link
Copy Markdown
Collaborator Author

On the blocking item: keeping let workspacePath: string uninitialized is intentional. TypeScript's definite-assignment analysis is the compile-time guard here — every path either assigns it or returns, and tsc --noEmit passes clean. Initializing to a sentinel (= '') would defeat that analysis: a future edit that dropped a return in an error branch would silently pass an empty path to channel creation instead of failing the build. I've added a comment at the declaration documenting the invariant so it survives refactors. Also un-exported EXPECTED_WORKSPACE_ROOT (callers use workspaceRepoPath()) and added a comment at the shutdown-gate boundary per the non-blocking notes.

Copy link
Copy Markdown
Owner

@fro-bot fro-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: PASS

All feedback from the prior review has been addressed.

Blocking issues

None — the workspacePath invariant question is resolved. The author's reasoning is correct: leaving let workspacePath: string uninitialized is the right TypeScript pattern. tsc --noEmit passes clean, which means the compiler's definite-assignment analysis is the guard. A sentinel (= '') would defeat that — a future edit dropping a return would silently pass an empty string to channel creation instead of a build error. The added comment at the declaration documents the invariant explicitly.

Non-blocking concerns

None — all three non-blocking items from the prior review were addressed:

  • EXPECTED_WORKSPACE_ROOT is now module-private (comment updated, constant unexported)
  • Shutdown gate boundary has a comment noting the timing window
  • workspacePath declaration comment documents the invariant

Missing tests

None — test coverage is thorough: resume path, store-error path, rejection path, concurrent-loser path, shutdown gate on/off, and the full end-to-end retry-after-partial-failure scenario.

Risk assessment (LOW)

Logic is sound, all paths are tested, the concurrent-resume race is bounded and documented, and the shutdown gate is a clean improvement over hard-kill behavior. No security exposure introduced. Good to merge.


Run Summary
Field Value
Event pull_request
Repository fro-bot/agent
Run ID 26664255810
Cache hit
Session ses_18a4a1a7dffeBEYWEHZyjDGOqZ

@marcusrbrown marcusrbrown merged commit 992e91f into main May 29, 2026
10 checks passed
@marcusrbrown marcusrbrown deleted the fix/add-project-recoverability branch May 29, 2026 22:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants