Skip to content

feat(gateway): signed announce webhook for control-plane presence messages#697

Merged
marcusrbrown merged 8 commits into
mainfrom
feat/gateway-announce-webhook
May 30, 2026
Merged

feat(gateway): signed announce webhook for control-plane presence messages#697
marcusrbrown merged 8 commits into
mainfrom
feat/gateway-announce-webhook

Conversation

@marcusrbrown
Copy link
Copy Markdown
Collaborator

Closes #671.

Adds POST /v1/announce to the gateway — the first HTTP ingress — so the control plane can post presence messages in Discord as the Fro Bot user (surveys completing, collaboration invitations accepted) rather than through a webhook bot. The gateway already holds the Discord token and a live client, so it turns a signed request into a Discord embed.

Authentication

HMAC-SHA256 over timestamp + "." + rawBody (Stripe-style) with a shared secret, constant-time compared. The signature binds the timestamp, and the X-Gateway-Timestamp header must equal the body fired_at by exact-string match. Requests outside a ±5 minute window are rejected. Signature verification, timestamp expiry, and replay all return an identical 401 so a caller cannot tell which check failed.

Replay and abuse protection

An in-memory seen-signature cache reserves a signature before posting and commits it only after a successful post — so concurrent duplicates can't both post, and a failed post still allows a legitimate retry. Rate limiting is keyed on the TCP socket address (not the spoofable X-Forwarded-For) and the tracked-key set is bounded. Request bodies are capped at 8 KB before any HMAC work.

Posting

The payload is validated with a versioned schema (two v1 event types; unknown types are rejected). The message is rendered into a Discord embed with an event-type accent color and posted to a fixed channel from GATEWAY_PRESENCE_CHANNEL_ID (never from the payload), with allowedMentions: {parse: []} so payload text cannot trigger pings. Descriptions are truncated to Discord's limit. A non-null rendered_text is honored verbatim for forward compatibility.

Lifecycle

The HTTP server starts during gateway boot and closes in the shutdown drain alongside the Discord client; a server-close failure is logged without masking client teardown. New requests are refused with 503 while draining.

Configuration (new)

  • GATEWAY_WEBHOOK_SECRET (required), GATEWAY_PRESENCE_CHANNEL_ID (required), GATEWAY_HTTP_PORT (optional, default 3000).
  • The deploy wiring in the infra repo and the control-plane signing side are separate, follow-on work.

Gateway test suite: 423 passing.

Adds hono@4.12.23 + @hono/node-server@1.19.14 (matching apps/workspace-agent's
pinned versions) so the gateway can host the announce webhook HTTP server.
Includes the implementation plan.
…e posting

Building blocks for the POST /v1/announce control-plane presence webhook:
- Config: GATEWAY_WEBHOOK_SECRET, GATEWAY_PRESENCE_CHANNEL_ID, GATEWAY_HTTP_PORT
- HMAC-SHA256 verification over timestamp + '.' + rawBody (Stripe-style), with
  constant-time compare, length guards, and a replay-window check
- Effect Schema payload validation (two v1 event types; unknown rejected) with
  content-free error reasons
- Presence channel posting helper that resolves a channel by ID and sends an
  embed with allowedMentions:{parse:[]}
Maps each announce event_type to a Discord embed with an accent color and
in-character text. Honors a non-null rendered_text override verbatim (v1 emits
null). Includes reserved color stubs for the not-yet-emitted fast-follower
event types.
Adds the POST /v1/announce Hono server and the request pipeline: 8 KB size
cap, per-source rate limiting, HMAC verification, replay-window check, an
in-memory replay cache, exact-string timestamp cross-check, schema decode,
embed render, and presence post. Authentication failures return an identical
401 so callers cannot tell which check failed; the replay signature is
recorded only after a successful Discord post so retries after a post failure
still succeed. Reject logging records the reason only, never the body.
Starts the announce HTTP server during gateway boot and closes it in the
shutdown drain alongside the Discord client; a server-close failure is logged
without masking client teardown. New announce requests are refused with 503
while the gateway is draining for shutdown.
…ate-limit spoofing

- Replace the replay check/record pair with an atomic reserve/commit/release
  so two concurrent requests carrying the same signature can no longer both
  post; the reservation is released on every failure path so a legitimate
  retry after a failed post still succeeds.
- Key rate limiting on the TCP socket remote address instead of the
  caller-supplied X-Forwarded-For header, and bound the limiter to a maximum
  number of tracked keys to prevent unbounded pre-auth memory growth.
- Truncate embed descriptions to Discord's 4096-character limit so an
  over-long payload cannot wedge the control plane into endless retries.
- Drop an unreachable parse-error classification branch, share the body-size
  constant, and make the payload schema value module-private.
Updates the gateway agent notes for the new src/http/ announce webhook: the
fail-closed request pipeline, the generic-401 no-oracle auth behavior, the
presence posting helper, the shutdown drain that closes the HTTP server, and
the GATEWAY_ webhook config vars. Marks Effect Schema as now in use.
Copy link
Copy Markdown
Owner

@fro-bot fro-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: CONDITIONAL

Solid, security-conscious implementation. The fail-closed pipeline ordering, identical-401 auth oracle suppression, atomic reserve/commit replay handling, socket-keyed rate limiting, and mandatory allowedMentions:{parse:[]} are all done correctly and well-tested (423 passing). Two issues should be addressed before merge; both are narrow but real.

Blocking issues

  1. Unbounded body buffering before the 8 KB size check (server.ts:88-90). The handler's ANNOUNCE_MAX_BODY_BYTES guard runs on rawBody.byteLengthafter await c.req.arrayBuffer() has already buffered the entire request into memory. The content-length pre-check (server.ts:79-86) is the only thing gating allocation, and it is trivially bypassed by omitting or understating the Content-Length header (chunked transfer encoding, or a lying client). An unauthenticated caller can therefore force the gateway to allocate arbitrarily large buffers before any rejection — the rate limiter (60/min/IP) bounds frequency but not per-request size. This contradicts the PR description's claim that "Request bodies are capped at 8 KB before any HMAC work": the cap is before HMAC, but not before full-body buffering. Add a hard streaming body limit (e.g. Hono's bodyLimit middleware, or enforce maxRequestBodySize on the node server) so oversized requests are rejected during read, not after.

Non-blocking concerns

  1. Empty-string rendered_text yields an empty embed (announce-schema.ts:34,47 + templates.ts:77-78). The schema accepts rendered_text: "" (Schema.NullOr(Schema.String)), and renderEmbed uses any non-null value verbatim. An empty description produces an embed with no renderable content; Discord's API rejects such embeds, so the post fails and the handler returns 500 (and releases the reservation, so retries also fail). Exploitability is low because the body is HMAC-signed by a trusted control plane, but it is a correctness gap with no test coverage. Treat empty/whitespace-only rendered_text as "fall back to the templated description."

  2. Reserved replay-cache entries are never time-evicted (replay-cache.ts:69-92). evictExpired deliberately skips RESERVED entries, relying on release()/commit() always running. The handler does call release() on every post-reserve early return, but if postPresenceEmbed never settles (e.g. a hung Discord fetch with no timeout — presence.ts:56 has no timeout on channels.fetch), the reservation leaks for the process lifetime and permanently blocks that exact signature's retries. Consider a reservation timestamp + safety TTL, or a timeout around the Discord post.

  3. unknown_event_type classification is fragile but harmless (announce-schema.ts:82-89). Requiring every issue path to start with event_type is brittle against Union error shapes, but since the handler maps both unknown_event_type and bad_request to the same 400 body (announce-handler.ts:180-183), the only consequence is a log label. Fine as-is.

  4. evictExpired is O(n) per request in both rate-limit and replay caches. Bounded by MAX_KEYS=10_000, so acceptable at v1 scale; flagging for awareness if traffic grows.

Missing tests

  • No test for rendered_text: "" (empty string) through renderEmbed or end-to-end through the handler (the empty-embed → Discord-reject path).
  • No test for an oversized body sent without a truthful Content-Length (the streaming-buffer bypass). server.test.ts covers the 413 content-length precheck but not the header-omitted case.
  • No test for a leaked/stuck reservation (hung Discord post) blocking subsequent retries.

Risk assessment (MED)

  • Likelihood of regression: LOW — new isolated http/ module, no changes to existing command/runtime paths beyond additive lifecycle wiring in program.ts/shutdown.ts, which are covered.
  • Security exposure: MED — the body-buffering gap is an unauthenticated memory-pressure DoS vector reachable before any auth check; rate limiting mitigates but does not close it. All other auth/replay/mention controls are sound.
  • Blast radius: LOW — single new endpoint, single-replica constraint documented, no public API breakage. The Discord token is already held by the gateway, so the new ingress does not widen the credential surface.

Run Summary
Field Value
Event pull_request
Repository fro-bot/agent
Run ID 26671488999
Cache hit
Session ses_1895e896affe6dP28TjXboKU2r

…the Discord post

- Enforce the 8 KB announce body cap with streaming bodyLimit middleware so an
  oversized request is rejected while the body is read, not after the whole
  payload has been buffered into memory. The earlier content-length precheck
  was bypassable with chunked transfer encoding, leaving an unauthenticated
  memory-pressure path; the streaming limit closes it.
- Bound the Discord fetch-and-post with a timeout so a hung call can no longer
  leave a replay reservation pinned for the process lifetime, which would have
  permanently blocked retries for that signature.
- Fall back to the templated description when rendered_text is empty or
  whitespace-only, avoiding an empty embed that Discord would reject.
Copy link
Copy Markdown
Owner

@fro-bot fro-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: PASS

All blocking and non-blocking issues from my prior review (commit 50f4456) are resolved with correct fixes and dedicated test coverage. The implementation is sound, security-conscious, and well-tested.

Blocking issues

None — the previously-blocking memory-DoS gap is fixed.

  • Body buffering before size check → fixed. bodyLimit middleware (server.ts:77-82) now enforces ANNOUNCE_MAX_BODY_BYTES during the streaming read, so a caller cannot bypass the cap via chunked transfer encoding or an omitted/understated Content-Length. The content-length fast-path and the handler's byteLength guard remain as defense-in-depth. Hono caches the body, so the later c.req.arrayBuffer() returns the already-bounded bytes. Covered by server.test.ts:412 (oversized body with no truthful Content-Length → 413, Discord never called).

Non-blocking concerns

All previously-raised concerns are now addressed:

  1. Empty rendered_text → fixed. renderEmbed now falls through to the per-event template when rendered_text is null OR trims to empty (templates.ts:80), so Discord never receives an empty embed description. Covered by templates.test.ts:147 (empty string) and :162 (whitespace-only), with a :177 regression guard that non-empty text is still used verbatim.
  2. Leaked replay reservation on a hung Discord call → fixed. postPresenceEmbed now races the fetch+send against a 10 s timeout (presence.ts:104-118), returning send-failed so the handler always reaches replayCache.release(). The dangling-rejection is swallowed with a no-op .catch, and clearTimeout runs on the win path. Covered by presence.test.ts:200 (fetch never resolves → "discord post timed out").
  3. unknown_event_type classification fragility — remains harmless (both branches map to the same 400 body); no change needed.
  4. evictExpired O(n) per request — bounded by MAX_KEYS=10_000; acceptable at v1 scale.

One minor forward-looking note (non-blocking, no action required): the timeout in postPresenceEmbed bounds the reservation leak but the timed-out Discord op may still complete in the background after release() — a rare double-post window if the control plane retries the same signature within the window. Acceptable for v1 given the trusted, HMAC-signed caller and single-replica constraint.

Missing tests

None — the three gaps I flagged previously now have explicit coverage (413 streaming-bypass, empty/whitespace rendered_text, hung-Discord timeout).

Risk assessment (LOW)

  • Likelihood of regression: LOW — isolated new http/ module; existing command/runtime paths untouched beyond covered additive lifecycle wiring.
  • Security exposure: LOW — the pre-auth memory-DoS vector is closed at the streaming boundary; auth, replay, rate-limit, and mention controls are all sound and tested.
  • Blast radius: LOW — single new endpoint, single-replica constraint documented, no public API breakage, no widening of the credential surface.

Run Summary
Field Value
Event pull_request
Repository fro-bot/agent
Run ID 26671760842
Cache hit
Session ses_1895e896affe6dP28TjXboKU2r

@marcusrbrown marcusrbrown merged commit 88cddce into main May 30, 2026
10 checks passed
@marcusrbrown marcusrbrown deleted the feat/gateway-announce-webhook branch May 30, 2026 02:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fro Bot presence webhook: POST /v1/announce for control-plane events

2 participants