fro-bot · marcusrbrown · May 30, 2026 · May 29, 2026 · May 29, 2026 · May 29, 2026
@@ -0,0 +1,215 @@
+---
+title: Discord slash command orchestration — permission checks, bootstrap wiring, and token safety
+date: 2026-05-27
+category: best-practices
+module: gateway
+problem_type: best_practice
+component: assistant
+severity: medium
+applies_when:
+  - A discord.js guild interaction checks bot permissions without the GuildMembers privileged intent
+  - A command uses dependency injection via a factory but real dispatch/bootstrap wiring is not integration-tested
+  - A short-lived credential (installation access token) crosses a process or HTTP boundary
+  - A multi-phase command can partially fail and needs documented recovery
+  - A find-or-create loop runs against a cache that concurrent operations mutate
+tags:
+  - discordjs
+  - app-permissions
+  - dependency-injection
+  - bootstrap-wiring
+  - integration-tests
+  - installation-access-token
+  - multi-phase-orchestration
+  - cache-race
+  - gateway
+---
+
+# Discord slash command orchestration — permission checks, bootstrap wiring, and token safety
+
+## Context
+
+Building the first user-facing gateway feature — the `/fro-bot add-project` Discord slash command — surfaced several non-obvious traps. The command orchestrates four subsystems: GitHub App authentication, a workspace clone over HTTP, Discord channel creation, and an S3 binding write. Each integration boundary hid a failure mode that the initial implementation and its unit tests missed, and that only surfaced under architecture review (Oracle) and automated review (Fro Bot).
+
+These are reusable patterns for any agent-driven Discord command, any factory-injected feature, and any service-to-service flow carrying a credential.
+
+## Guidance
+
+### 1. Discord permission checks: use `interaction.appPermissions`, not `guild.members.cache.get()`
+
+This is the highest-value learning. A permission check that resolves the bot's own `GuildMember` from the cache returns `undefined` whenever the bot runs **without** the privileged `GuildMembers` intent — which is the default, security-conscious posture. The result is a false-negative: the check fails even when the bot genuinely holds the permission, and the command aborts on a correctly-configured bot.
+
+**Broken pattern:**
+
+```ts
+function botHasRequiredPermissions(guild: Guild, botUserId: string): boolean {
+  const member = guild.members.cache.get(botUserId) // undefined without GuildMembers intent
+  if (member === undefined) return false            // false-negative abort
+  return member.permissions.has(PermissionFlagsBits.ManageChannels)
+}
+```
+
+**Correct pattern** — `interaction.appPermissions` is a `PermissionsBitField | null` that Discord populates reliably for every guild interaction, independent of gateway intents:
+
+```ts
+function botHasRequiredPermissions(appPermissions: PermissionsBitField | null): boolean {
+  if (appPermissions === null) return false // DM interaction — command is guild-only
+  return (
+    appPermissions.has(PermissionFlagsBits.ManageChannels) &&
+    appPermissions.has(PermissionFlagsBits.SendMessages)
+  )
+}
+
+// Call site:
+if (botHasRequiredPermissions(interaction.appPermissions) === false) {
+  await interaction.editReply({
+    content: `fro-bot needs **Manage Channels** and **Send Messages**. Re-invite at: ${installUrl}`,
+  })
+  return
+}
+```
+
+### 2. Test the real dispatch path, not just the handler
+
+The command was refactored to receive its dependencies through a factory (`createFroBotCommand(deps)`), but the bootstrap in `program.ts` never constructed and passed those deps. Every test still passed — because the tests called the orchestration function (`executeAddProject`) **directly**, bypassing the Discord dispatch path entirely. The wiring gap would have shipped broken-in-production code with a green suite. An architecture reviewer caught it, not the tests.
+
+The fix is an integration-level test that builds the **real** registry and dispatches a **real** interaction:
+
+```ts
+export function createFroBotCommand(deps: AddProjectDeps): SlashCommand {
+  const execute = (interaction: ChatInputCommandInteraction): Effect.Effect<void, Error> => {
+    const subcommand = interaction.options.getSubcommand(true)
+    if (subcommand === 'add-project') return executeAddProject(interaction, deps)
+    return Effect.fail(new Error(`Unknown subcommand: ${subcommand}`))
+  }
+  return {data, execute}
+}
+
+// Test the dispatch seam, not just executeAddProject:
+const registry = getCommandRegistry(makeMockDeps())
+await Effect.runPromise(dispatchCommand(mockInteraction, registry))
+expect(mockDeps.workspaceClient.clone).toHaveBeenCalled()
+```
+
+`getCommandRegistry` and `dispatchCommand` live in `packages/gateway/src/discord/commands/index.ts`; the factory and bootstrap wiring are in `commands/fro-bot.ts` and `program.ts` respectively.
+
+### 3. Installation access token (IAT) handling across an HTTP boundary
+
+The IAT (`ghs_*`, ~1hr lifetime) flows from GitHub App auth into an HTTP request body to the workspace service. Three rules emerged:
+
+- **Never log request or response body**, even on retry or error. The body carries the token. Mark it as a module invariant.
+- **Captured-logger test** asserting no `ghs_` string appears in any log line, iterating **all** error paths (timeout, network, http, parse, response-mismatch, clone-error, malformed-success) — not just the happy path.
+- **Response-path integrity check** — the response must correspond to the requested `owner/repo`, defending against a misbehaving or compromised service.
+- **`AbortSignal.timeout`** on the fetch (native fetch has no default timeout).
+
+```ts
+// SECURITY INVARIANT: never log request/response body — it carries the IAT.
+const response = await fetch(`${baseUrl}/clone`, {
+  method: 'POST',
+  headers: {'Content-Type': 'application/json'},
+  body, // contains token
+  signal: AbortSignal.timeout(timeoutMs),
+})
+
+// Compare the full path, not just the suffix. owner/repo are already
+// lowercased upstream, so the expected path is canonical; do not lowercase
+// the response path or a case-variant root could slip through.
+const expectedPath = `${EXPECTED_WORKSPACE_ROOT}/${owner}/${repo}`
+if (parsed.path !== expectedPath) {
+  return err({kind: 'response-mismatch'})
+}
+```
+
+A suffix-only check is the weaker form to avoid — `parsed.path.endsWith('/owner/repo')` accepts adversarial prefixes like `/etc/passwd/owner/repo`, letting a misbehaving agent bind a repo to an arbitrary filesystem location. Validate against the full expected path rooted at the known workspace root.
+
+```ts
+// Captured-logger security test — iterate ALL error paths:
+for (const line of spy.lines) {
+  expect(line).not.toContain('ghs_')
+}
+```
+
+### 4. Multi-phase orchestration with idempotent self-healing recovery
+
+The flow is a five-phase state machine: `PRE_FLIGHT → CLONING → CREATING_CHANNEL → WRITING_BINDING → READY`. The primary recovery path for the clone-exists/no-binding partial-failure shape is **re-running the command** — `/add-project` is idempotent for this case:
+
+- **Atomic clone guarantee**: the workspace-agent renames the temp dir to the final dest atomically. `repo-exists` from a clone call means the clone is fully complete — not partial.
+- **Resume keyed on confirmed absent binding**: when `repo-exists` is returned, the handler re-reads the bindings store. It only resumes (falls through to `CREATING_CHANNEL`) when the store confirms `data === null`. A store error (thrown or `success === false`) returns an internal-error reply and does **not** resume — resuming on an unreadable store would risk orphan channels if the binding write also fails.
+- **A defensive permission re-check at the side-effect boundary** (`CREATING_CHANNEL`) — permissions can change between `PRE_FLIGHT` and the first mutation.
+- **Preserve expensive prior work** — a failure after `CLONING` keeps the clone on disk; retrying the command will detect `repo-exists`, confirm no binding, and resume from `CREATING_CHANNEL`.
+
+Accepted v1 fallout and boundaries:
+
+- **Orphan channel on concurrent resume**: if two invocations both resume simultaneously, both create a channel, then both race to write the binding. The loser's `createBinding` is rejected with `BINDING_EXISTS_ERROR` (atomic IfNoneMatch write). The loser gets a "bound by a concurrent request / manual cleanup may be needed" reply. The orphan channel is a bounded, operator-visible artifact — not a silent data integrity issue.
+- **Shutdown gating**: new invocations are refused during drain (after `deferReply` so Discord gets its mandatory ack). In-flight runs that are hard-cut during a process restart are healed by the next retry via the resume path.
+
+```ts
+// Resume only on CONFIRMED absent binding — store errors do NOT resume:
+let existing: Awaited<ReturnType<typeof bindingsStore.getBindingByRepo>>
+try {
+  existing = await bindingsStore.getBindingByRepo(owner, repo)
+} catch {
+  await interaction.editReply({content: 'Internal error checking existing bindings. Please retry in a moment.'})
+  return
+}
+if (existing.success === false) {
+  await interaction.editReply({content: 'Internal error checking existing bindings. Please retry in a moment.'})
+  return
+}
+if (existing.data !== null) {
+  // Already bound — redirect.
+  await interaction.editReply({content: `\`${owner}/${repo}\` is already set up in <#${existing.data.channelId}>.`})
+  return
+}
+// Confirmed absent binding — resume from CREATING_CHANNEL.
+workspacePath = workspaceRepoPath(owner, repo)
+// fall through — do NOT return
+
+// Partial-write recovery names both keys for manual cleanup:
+if (error.code === 'BINDING_PARTIAL_WRITE_ERROR') {
+  await interaction.editReply({
+    content: `Partial write: primary written, index failed. Manual S3 cleanup:\n- Primary: \`${error.primaryKey}\`\n- Index: \`${error.indexKey}\``,
+  })
+}
+```
+
+### 5. Re-read live caches inside find-or-create collision loops
+
+Snapshotting `guild.channels.cache` once and iterating suffix candidates (`name`, `name-2`, …) against the frozen snapshot is a race. discord.js runs in a single Node process, so this is not OS-level concurrency — it's inter-await interleaving: between two `await` calls in the same event loop, the cache can be mutated by an incoming `channelCreate` gateway event or by another in-flight interaction. Two in-flight interactions interleaved across awaits can see the same snapshot, both attempt to create the same name, and the loser's Discord 50035 (duplicate name) rejection gets silently swallowed as a "transient error" — the binding ends up pointing to the wrong suffixed channel with no operator-visible warning.
+
+```ts
+for (const candidate of candidates) {
+  // Re-read the LIVE cache each iteration — a concurrent setup may have
+  // created this candidate since the loop began.
+  const existing = guild.channels.cache.find(
+    c => c.name.toLowerCase() === candidate.toLowerCase() && c.type === ChannelType.GuildText,
+  )
+  if (existing !== undefined) continue
+  const result = await tryCreate(guild, candidate)
+  // ...
+}
+```
+
+Pair this with a `tryCreate` that distinguishes **name-taken** (advance to next suffix) from **permission-denied** (short-circuit) from **create-failed / rate-limited** (surface a distinct error instead of burning through all suffixes).
+
+## Why This Matters
+
+| Trap | Cost if unknown |
+| --- | --- |
+| Wrong permission source | False-negative abort makes the command unusable on a correctly-configured bot |
+| Handler-only tests | Green CI ships broken-in-production bootstrap wiring |
+| IAT leakage | Security incident — a short-lived credential on disk/in logs |
+| No partial-failure recovery | Operators stranded mid-flow with orphaned channels and clones |
+| Cache-snapshot race | Silently binds to the wrong channel under concurrent use |
+
+## When to Apply
+
+- Any discord.js slash command that checks bot permissions
+- Any feature whose dependencies are injected via a factory and wired at bootstrap
+- Any service-to-service flow carrying a short-lived credential
+- Any multi-step orchestration with persistent side effects
+- Any find-or-create loop against a cache that concurrent operations mutate
+
+## Related
+
+- `docs/solutions/code-quality/architectural-issues-type-safety-and-resource-cleanup.md` — adjacent: deterministic recovery and `finally`-style cleanup orchestration (different subsystem, same discipline of making partial-failure paths explicit)
+- GitHub issue #646 — Gateway intent-posture flip (privileged intents opt-in), the posture that makes the `appPermissions` trap (learning #1) reachable
@@ -74,6 +74,17 @@ Existing deployments that need the privileged set must set this on the next
 deploy. The allowlist is intentionally narrow — operators cannot enable
 arbitrary Discord intents via this knob.
 
+## Known limitations (v1)
+
+- **`add-project` is Discord-only.** The orchestration runs inside the slash
+  command handler and requires a `ChatInputCommandInteraction`; there is no
+  programmatic surface (HTTP endpoint, CLI, or agent tool) that triggers the same
+  outcome. An autonomous agent cannot bind a repo without going through Discord.
+  Recovery is via idempotent retry — re-running the command resumes a partial
+  setup — rather than agent-callable recovery primitives. Extracting a
+  Discord-independent `addProject(request, deps)` primitive is deferred until a
+  non-Discord caller exists.
+
 ## Build
 
 ```bash