fix(gateway): make /add-project recoverable and shutdown-aware#695
Conversation
fro-bot
left a comment
There was a problem hiding this comment.
Verdict: CONDITIONAL
One blocking issue found; everything else is solid.
Blocking issues
**Uninitialized when and **
In the new control flow, workspacePath is declared as let workspacePath: string (no initializer). Inside the cloneResult.success === false branch:
errorKind === 'clone-error'withcode === 'repo-exists'+ confirmed-absent binding: assignsworkspacePathand falls through ✓errorKind === 'clone-error'withcode === 'repo-exists'+ binding found or error:returns ✓errorKind === 'clone-error'with other codes (disk-full, generic):returns ✓errorKind === 'timeout','response-mismatch', or the finalelse:returns ✓
So at runtime, workspacePath is always either assigned or the function has returned — the TypeScript compiler should catch this at the type level if it doesn't. Please confirm that tsc --noEmit passes cleanly (or that CI already runs type-check). If TypeScript doesn't error but there's a path where workspacePath is used uninitialized due to a future refactor removing a return, this is a latent maintenance trap.
Recommended fix: initialize to a sentinel or reassert the invariant:
let workspacePath = '' // assigned in both live branches below; empty string is unreachableor narrow the type with a post-branch assertion. This makes the intent explicit and removes the reliance on TypeScript's definite-assignment analysis holding across future edits.
Non-blocking concerns
Shutdown gate timing window — The gate checks isShuttingDown() once after deferReply. If SIGTERM arrives between the gate check and guild.channels.create (the first real side effect), the command proceeds. This is documented as an accepted trade-off (resume path heals it), but worth noting that isShuttingDown() is not re-checked at the side-effect boundary. Fine for v1; worth a comment if/when this code grows more phases.
EXPECTED_WORKSPACE_ROOT export — Exporting the constant alongside workspaceRepoPath widens the public surface slightly. Since workspaceRepoPath is now the canonical reference, callers should use the function, not the constant. Consider keeping the constant unexported and only exporting the function. Non-blocking since workspaceRepoPath is used internally too.
Orphan channel on concurrent loser — The reply says "bound by a concurrent request / manual cleanup may be needed" but doesn't tell the operator which channel to delete or where to find it. This is fine for v1 (the PR doc acknowledges it), but a follow-up to include the orphan channel ID in the message would improve operability.
Missing tests
- No test for SIGTERM arriving between the shutdown-gate check and
guild.channels.create— i.e., a command that passes the gate but then gets hard-killed. The resume path is tested, so this is low priority, but the window is real. - The "isShuttingDown absent (default) → proceeds normally" test doesn't assert on the final reply state, only that
clonewas called. Fine for its purpose, but it would be more confidence-inducing if it also checked that the function resolved without throwing.
Risk assessment (MED):
The overall direction is correct and the logic is sound. The one blocking item is a type-level question that is likely already caught by the compiler — but confirming it is cheap and removes a maintenance hazard. The concurrent-resume race is well-documented and bounded. The shutdown gate is a net improvement over the prior hard-kill behavior. No security exposure introduced.
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26663885769 |
| Cache | hit |
| Session | ses_18a4a1a7dffeBEYWEHZyjDGOqZ |
fro-bot
left a comment
There was a problem hiding this comment.
Verdict: CONDITIONAL
One blocking issue found; everything else is solid.
Blocking issues
Uninitialized workspacePath — confirm definite-assignment holds
In the new control flow, workspacePath is declared as let workspacePath: string (no initializer). Inside the cloneResult.success === false branch, every sub-path either assigns workspacePath and falls through (repo-exists + confirmed-absent binding), or explicitly returns. At runtime the variable is always either assigned or the function has returned.
However, this relies entirely on TypeScript's definite-assignment analysis holding across future edits. If a future refactor drops a return in one of the error sub-branches, workspacePath will be an empty/undefined string at guild.channels.create with no compile-time guard.
Recommended fix: initialize to a sentinel so the invariant is explicit:
let workspacePath = '' // assigned in both live paths below before usePlease confirm tsc --noEmit passes cleanly in CI — if the compiler accepts the uninitialized form (strict mode / definite-assignment checks), this is lower priority, but the sentinel init costs nothing and survives refactors.
Non-blocking concerns
Shutdown gate timing window — The gate checks isShuttingDown() once after deferReply. If SIGTERM arrives between the gate check and guild.channels.create (the first real side effect), the command proceeds. Documented as accepted v1 trade-off (resume path heals it). Worth a comment at the side-effect boundary if this code grows more phases.
EXPECTED_WORKSPACE_ROOT export — Exporting the constant alongside workspaceRepoPath widens the public surface. Since workspaceRepoPath is now the canonical reference, consider keeping the constant unexported. Non-blocking since workspaceRepoPath is what callers should use.
Orphan channel on concurrent loser — The "bound by a concurrent request / manual cleanup may be needed" reply doesn't include the orphan channel ID. Fine for v1 (explicitly documented), but a follow-up that includes the channel ID in the message would improve operator experience.
Missing tests
- No test for SIGTERM arriving between the shutdown-gate check and the first side effect (a command that passes the gate but then gets hard-killed). The resume path covers healing, so this is low priority.
- The "isShuttingDown absent -> proceeds normally" test only asserts
clonewas called, not that the function resolved cleanly. Sufficient for its purpose but could be tightened.
Risk assessment (MED)
The logic is sound and the resume path is well-tested. The blocking item is a type-level question likely already caught by the compiler — confirming it is cheap and removes a maintenance hazard. The concurrent-resume race is bounded and documented. The shutdown gate is a net improvement over prior hard-kill behavior. No security exposure introduced.
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26663885769 |
| Cache | hit |
| Session | ses_18a4a1a7dffeBEYWEHZyjDGOqZ |
- Resume from CREATING_CHANNEL when a repo is already cloned but has no binding, so re-running /add-project after a post-clone failure heals instead of dead-ending on repo-exists. Clones are atomic, so repo-exists always means a complete clone; the canonical workspace path is reconstructed via a shared workspaceRepoPath helper. - Refuse new /add-project invocations while the gateway is draining for shutdown, with a clear retry message, instead of starting work that gets hard-killed mid-flight. - Fix a missing return on the disk-full clone-error branch that would have fallen through with an unset workspace path.
7a3c89f to
2510b48
Compare
|
On the blocking item: keeping |
fro-bot
left a comment
There was a problem hiding this comment.
Verdict: PASS
All feedback from the prior review has been addressed.
Blocking issues
None — the workspacePath invariant question is resolved. The author's reasoning is correct: leaving let workspacePath: string uninitialized is the right TypeScript pattern. tsc --noEmit passes clean, which means the compiler's definite-assignment analysis is the guard. A sentinel (= '') would defeat that — a future edit dropping a return would silently pass an empty string to channel creation instead of a build error. The added comment at the declaration documents the invariant explicitly.
Non-blocking concerns
None — all three non-blocking items from the prior review were addressed:
EXPECTED_WORKSPACE_ROOTis now module-private (comment updated, constant unexported)- Shutdown gate boundary has a comment noting the timing window
workspacePathdeclaration comment documents the invariant
Missing tests
None — test coverage is thorough: resume path, store-error path, rejection path, concurrent-loser path, shutdown gate on/off, and the full end-to-end retry-after-partial-failure scenario.
Risk assessment (LOW)
Logic is sound, all paths are tested, the concurrent-resume race is bounded and documented, and the shutdown gate is a clean improvement over hard-kill behavior. No security exposure introduced. Good to merge.
Run Summary
| Field | Value |
|---|---|
| Event | pull_request |
| Repository | fro-bot/agent |
| Run ID | 26664255810 |
| Cache | hit |
| Session | ses_18a4a1a7dffeBEYWEHZyjDGOqZ |
Makes
/fro-bot add-projectrecover cleanly from partial failures and behave correctly during gateway shutdown.Idempotent resume
When a repo is already cloned but has no channel binding — the state left behind when a prior run failed after cloning but before writing the binding — re-running
/add-projectnow resumes from channel creation instead of dead-ending onrepo-exists. Clones are atomic (the workspace-agent renames a temp dir onto the final path), sorepo-existsreliably means a completed clone; the command re-reads the bindings store and only resumes when a binding is confirmed absent. A store-read failure reports an internal error rather than resuming, avoiding orphaned channels during an outage.Shutdown awareness
New
/add-projectinvocations are refused with a clear retry message while the gateway is draining for shutdown, instead of starting work that gets cut off mid-flight. Any run that is hard-cut is healed by the next retry via the resume path above.Also
Gateway test suite: 295 passing.