
feat(whatsapp): inbound + outbound media attachments #174

Merged
williamwa merged 2 commits into main from feat/whatsapp-attachments on May 29, 2026


@williamwa williamwa commented May 29, 2026

First channel in the attachment-support rollout (plan: WhatsApp → Discord → Slack → Telegram inbound).

WhatsApp was text-only — inbound media was dropped (only captions kept) and the agent couldn't send files. This wires the channel to the already-complete core attachment infrastructure in both directions. No core/protocol/gateway changes.

Inbound (WhatsApp → agent)

  • Images / video / documents → downloaded via Baileys downloadMediaMessage, uploaded as session attachments (images become vision input automatically), captions kept as text.
  • Voice / audio notes → transcribed via client.transcribeAudio; replies to a voice note are spoken back as a native WhatsApp voice note via client.synthesizeAudio when TTS is configured (mirrors Telegram).
  • Media-only messages now trigger the agent instead of being silently dropped.
  • Media over the 25 MiB attachment cap is skipped (logged).
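
The inbound rules above can be sketched as follows. This is a minimal illustration, not the adapter's actual code: `RawContentSketch` is a simplified, hypothetical stand-in for the Baileys message content, and `extractMediaSketch`, `extensionFor`, `isWithinAttachmentCap`, the fallback MIME types, and the synthesized filename scheme are all assumptions made for the example. Only the four media kinds and the 25 MiB cap come from the description.

```typescript
type MediaKind = "image" | "video" | "audio" | "document";

// Hypothetical, simplified message content; the real Baileys type is far richer.
interface RawContentSketch {
  imageMessage?: { mimetype?: string; caption?: string };
  videoMessage?: { mimetype?: string; caption?: string };
  audioMessage?: { mimetype?: string };
  documentMessage?: { mimetype?: string; fileName?: string; caption?: string };
}

interface ExtractedMedia {
  kind: MediaKind;
  mimeType: string;
  filename: string;
  caption?: string;
}

// 25 MiB attachment cap, as stated in the PR description.
const MAX_ATTACHMENT_BYTES = 25 * 1024 * 1024;

// Derive a file extension from the MIME subtype, ignoring parameters (assumed scheme).
function extensionFor(mimeType: string): string {
  const subtype = (mimeType.split(";")[0] ?? "").split("/")[1] ?? "";
  return subtype.trim().toLowerCase() || "bin";
}

// Detect the media kind and synthesize a filename when the message has none.
function extractMediaSketch(content: RawContentSketch, messageId: string): ExtractedMedia | undefined {
  const build = (kind: MediaKind, mimeType: string, caption?: string, fileName?: string): ExtractedMedia => ({
    kind,
    mimeType,
    filename: fileName ?? `${messageId}.${extensionFor(mimeType)}`,
    ...(caption ? { caption } : {}),
  });
  const { imageMessage, videoMessage, audioMessage, documentMessage } = content;
  if (imageMessage) return build("image", imageMessage.mimetype ?? "image/jpeg", imageMessage.caption);
  if (videoMessage) return build("video", videoMessage.mimetype ?? "video/mp4", videoMessage.caption);
  if (audioMessage) return build("audio", audioMessage.mimetype ?? "audio/ogg");
  if (documentMessage) {
    return build(
      "document",
      documentMessage.mimetype ?? "application/octet-stream",
      documentMessage.caption,
      documentMessage.fileName,
    );
  }
  return undefined; // text-only message
}

// Media strictly over the cap is skipped; exactly 25 MiB is still accepted.
function isWithinAttachmentCap(byteLength: number): boolean {
  return byteLength <= MAX_ATTACHMENT_BYTES;
}
```

A caller would check `isWithinAttachmentCap(bytes.byteLength)` after download and log-and-skip when it returns `false`.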

Outbound (agent → WhatsApp)

  • The agent's attachment_send deliveries (SSE attachment event) are routed to the matching Baileys send: image / video / document, and ogg/opus audio as a push-to-talk voice note.
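
A rough sketch of that outbound routing, under stated assumptions: the payload keys (`image`, `video`, `audio` with `ptt`, `document` with `mimetype` and `fileName`) mirror the content keys Baileys' `sendMessage` accepts, but `OutboundMediaSketch`, `toSendPayload`, the `"file"` fallback name, and the prefix-based ogg check are hypothetical and not taken from `whatsapp-api.ts`. The real adapter presumably passes a `Buffer` rather than a `Uint8Array`.

```typescript
type OutboundKind = "image" | "video" | "audio" | "document";

// Hypothetical outbound media shape for this sketch.
interface OutboundMediaSketch {
  kind: OutboundKind;
  bytes: Uint8Array;
  mimeType: string;
  filename?: string;
  caption?: string;
}

// Payload shapes modelled on Baileys message content (simplified).
type SendPayloadSketch =
  | { image: Uint8Array; caption?: string }
  | { video: Uint8Array; caption?: string }
  | { audio: Uint8Array; mimetype: string; ptt: boolean }
  | { document: Uint8Array; mimetype: string; fileName: string; caption?: string };

function toSendPayload(media: OutboundMediaSketch): SendPayloadSketch {
  // Only attach a caption key when a caption is actually present.
  const caption = media.caption ? { caption: media.caption } : {};
  switch (media.kind) {
    case "image":
      return { image: media.bytes, ...caption };
    case "video":
      return { video: media.bytes, ...caption };
    case "audio":
      // Assumption: ogg audio is treated as opus and sent as a push-to-talk voice note.
      return {
        audio: media.bytes,
        mimetype: media.mimeType,
        ptt: media.mimeType.trim().toLowerCase().startsWith("audio/ogg"),
      };
    case "document":
      return {
        document: media.bytes,
        mimetype: media.mimeType,
        fileName: media.filename ?? "file",
        ...caption,
      };
  }
}
```

Keeping the mapping in a pure function like this makes the payload shapes easy to assert in unit tests without a live socket.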

Implementation

  • whatsapp-api.ts: new sendMedia (outbound routing) + downloadMedia (Baileys download).
  • bot.ts: new extractMedia; downloads bytes; forwards media-only messages.
  • bridge.ts: STT/TTS path, uploadAttachment + postMessage({attachments}), and an attachment case in the SSE loop.
  • Reuses SDK uploadAttachment, postMessage({attachments}), downloadAttachmentBytes, transcribeAudio, synthesizeAudio.

Tests

35 unit tests pass (added extractMedia, media-only toIncomingMessage, and sendMedia routing coverage).

Bumps @openhermit/channel-whatsapp 0.2.0 → 0.3.0. Docs updated (README, channel-adapter, manual).

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • WhatsApp channel supports media attachments (images, video, documents) with captions preserved; images are available as vision inputs.
    • Voice notes are transcribed to text and agent replies can be sent back as voice (when TTS is enabled).
    • Outbound attachment sends route to WhatsApp, and media-only messages are suppressed if download fails.
  • Documentation

    • Updated docs and README to describe media/voice support, 25 MiB attachment cap, fresh-session /new, optional allow-list, and unsupported platform features.


WhatsApp was text-only: inbound media was dropped (only captions kept)
and the agent couldn't send files. Wire the channel to the existing
attachment infrastructure in both directions.

Inbound: images/video/documents are downloaded via Baileys
downloadMediaMessage and uploaded as session attachments (images become
vision input); captions ride along as text. Voice/audio notes are
transcribed via the agent's STT, and replies to a voice note are spoken
back as a WhatsApp voice note when TTS is configured. Media-only messages
now trigger the agent instead of being dropped. Media over the 25 MiB cap
is skipped.

Outbound: the agent's attachment_send deliveries (SSE 'attachment' event)
are routed to the matching Baileys send — image/video/document, and
ogg/opus audio as a native push-to-talk voice note.

No core/protocol/gateway changes — uses SDK uploadAttachment,
postMessage({attachments}), downloadAttachmentBytes, transcribeAudio,
synthesizeAudio. Bumps 0.2.0 -> 0.3.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai Bot commented May 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 72ffcde6-e7e6-47c5-909d-762687ff52ce

📥 Commits

Reviewing files that changed from the base of the PR and between d0b39e5 and 9c350a7.

📒 Files selected for processing (4)
  • apps/channels/whatsapp/package.json
  • apps/channels/whatsapp/src/bridge.ts
  • apps/channels/whatsapp/src/whatsapp-api.ts
  • apps/channels/whatsapp/test/whatsapp-api.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • apps/channels/whatsapp/test/whatsapp-api.test.ts
  • apps/channels/whatsapp/package.json
  • apps/channels/whatsapp/src/whatsapp-api.ts

📝 Walkthrough


Adds inbound media extraction and 25 MiB cap enforcement, STT for voice notes, upload of non-audio media as agent attachments, outbound media/TTS delivery and attachment routing, and tests/docs plus a package version bump.

Changes

WhatsApp Media and Voice Handling

| Layer / File(s) | Summary |
|---|---|
| Inbound media extraction and size enforcement: `apps/channels/whatsapp/src/bot.ts`, `apps/channels/whatsapp/test/bot.test.ts` | Media download and 25 MiB cap enforcement in `handleRawMessage`; exported `extractMedia` detects image/video/audio/document and synthesizes filenames; `toIncomingMessage` gates on text-or-media and normalizes output. Tests cover extraction and caption handling. |
| WhatsApp API media methods: `apps/channels/whatsapp/src/whatsapp-api.ts`, `apps/channels/whatsapp/test/whatsapp-api.test.ts` | Introduces `WhatsAppOutboundMedia` and public `sendMedia`/`downloadMedia` methods. `sendMedia` maps kind/mime to Baileys payloads (images/documents/caption handling; ogg/opus audio -> ptt voice notes). Tests assert sent payload shapes. |
| Inbound bridge, STT and attachment upload: `apps/channels/whatsapp/src/bridge.ts` | Adds `WhatsAppIncomingMedia` and optional media on incoming messages. `handleIncomingInner` transcribes audio via agent STT (injects voice instruction + transcript) and uploads non-audio media as attachments for the agent. |
| Outbound TTS and attachment delivery: `apps/channels/whatsapp/src/bridge.ts` | After the agent response, attempts TTS voice replies when appropriate (`shouldSpeak` gating); adds `deliverAttachment` to process SSE attachment frames, download bytes from the agent, resolve kind/name/caption, and forward to WhatsApp; `waitForAgentResponse` accepts the target JID. |
| Documentation and version bump: `apps/channels/whatsapp/README.md`, `apps/channels/whatsapp/package.json`, `docs/channel-adapter.md`, `docs/manual/17-channels.md` | README and docs updated to document inbound/outbound media, STT/TTS, and the 25 MiB skip rule; package version bumped 0.2.0 → 0.3.0 and description revised. |

Sequence Diagrams

sequenceDiagram
  participant WhatsApp as WhatsApp Message
  participant Bot as bot.ts Handler
  participant Extract as extractMedia
  participant Incoming as Incoming Event
  participant Bridge as Bridge Handler
  participant Agent as Agent Client
  participant API as WhatsApp API
  
  WhatsApp->>Bot: raw message with image
  Bot->>Bot: downloadMedia bytes
  Bot->>Bot: enforce 25 MiB cap
  Bot->>Extract: parse extracted media
  Extract->>Incoming: return kind/mimeType/filename
  Incoming->>Bridge: WhatsAppIncomingMessage with media
  
  alt inbound is audio
    Bridge->>Agent: transcribeAudio(bytes) STT
    Agent->>Bridge: transcript text
    Bridge->>Agent: postMessage(voiceInstruction + transcript)
  else inbound is other media
    Bridge->>Agent: uploadAttachment(bytes)
    Agent->>Bridge: attachmentId
    Bridge->>Agent: postMessage(attachmentRef)
  end
  
  Agent->>Bridge: SSE response frames
  alt audio inbound + suitable text
    Bridge->>Agent: synthesizeAudio(text) TTS
    Agent->>Bridge: audio ogg bytes
    Bridge->>API: sendMedia(kind=audio, ptt=true)
    API->>WhatsApp: voice note message
  else agent attachment frame
    Bridge->>Agent: downloadAttachmentBytes(id)
    Agent->>Bridge: media bytes + kind + filename
    Bridge->>API: sendMedia(kind, bytes, filename)
    API->>WhatsApp: media message
  else text only
    Bridge->>API: sendMessage(text)
    API->>WhatsApp: text message
  end
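
The inbound half of the diagram above (audio goes through STT, other media is uploaded as an attachment) can be approximated with the sketch below. The method names `transcribeAudio`, `uploadAttachment`, and `postMessage` come from the PR, but their signatures here are assumptions, as are `AgentClientSketch`, `forwardInbound`, and the wording of `VOICE_INSTRUCTION`.

```typescript
interface IncomingMediaSketch {
  kind: "image" | "video" | "audio" | "document";
  mimeType: string;
  filename: string;
  bytes: Uint8Array;
}

interface IncomingMessageSketch {
  text?: string;
  media?: IncomingMediaSketch;
}

// Hypothetical client surface; the real SDK signatures may differ.
interface AgentClientSketch {
  transcribeAudio(bytes: Uint8Array, mimeType: string): Promise<string>;
  uploadAttachment(file: { bytes: Uint8Array; mimeType: string; filename: string }): Promise<{ id: string }>;
  postMessage(message: { text: string; attachments?: string[] }): Promise<void>;
}

// Assumed wording; the PR only says a "voice instruction" is injected.
const VOICE_INSTRUCTION = "[The user sent a voice note; its transcript follows.]";

async function forwardInbound(client: AgentClientSketch, msg: IncomingMessageSketch): Promise<void> {
  const text = msg.text ?? "";
  if (!msg.media) {
    await client.postMessage({ text });
    return;
  }
  if (msg.media.kind === "audio") {
    // Voice/audio notes: transcribe, then post the transcript as text.
    const transcript = await client.transcribeAudio(msg.media.bytes, msg.media.mimeType);
    await client.postMessage({ text: `${VOICE_INSTRUCTION}\n${transcript}` });
    return;
  }
  // Images, video, documents: upload, then reference the attachment; the caption rides along as text.
  const { id } = await client.uploadAttachment({
    bytes: msg.media.bytes,
    mimeType: msg.media.mimeType,
    filename: msg.media.filename,
  });
  await client.postMessage({ text, attachments: [id] });
}
```

Because the client is passed in as an interface, the branching can be exercised with a recording fake instead of a live agent.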

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly Related PRs

  • HCF-STUDIOS/openhermit#124: Introduces the AgentLocalClient attachment upload/download and transcribeAudio/synthesizeAudio APIs used by this PR's bridge inbound/outbound flow.
  • HCF-STUDIOS/openhermit#155: The initial WhatsApp adapter implementation that this PR extends with media extraction and send/downloadMedia paths.
  • HCF-STUDIOS/openhermit#164: Also modifies bridge.ts's waitForAgentResponse and agent response handling, touching related response-routing logic.

Poem

🐰 a rabbit hums
Media hops in, capped at twenty-five,
Voice notes whisper, transcripts come alive,
The agent sends back songs or files to say—
Hooray for WhatsApp’s brighter play!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(whatsapp): inbound + outbound media attachments' clearly and concisely summarizes the main change, adding bidirectional media support to the WhatsApp channel, and is directly reflected throughout the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/channels/whatsapp/package.json`:
- Around line 3-4: Update the package.json "description" field in the WhatsApp
channel package to remove the outdated "Text-only v1." wording and reflect
current media/voice support introduced in v0.3.0; locate the "description" key
in apps/channels/whatsapp/package.json and replace the string to something
accurate (e.g., mention media and voice support and the current version/tag) so
npm metadata no longer misstates capabilities.

In `@apps/channels/whatsapp/src/bridge.ts`:
- Around line 174-177: The code currently sends raw STT error text back to the
user; instead, keep the detailed error in logs and send a generic, user-safe
message. In the STT error handling block (where err/msg, this.log, and this.send
are used) log the full error message with this.log(`stt failed for chat
${event.chatJid}: ${msg}`) but change the await this.send(...) call to send a
generic text like "Voice transcription failed. Please try again later." (do not
include msg or err in the user-facing string); keep sessionId and event.chatJid
unchanged.

In `@apps/channels/whatsapp/src/whatsapp-api.ts`:
- Around line 97-98: Normalize media.mimeType before computing ptt: trim and
lower-case media.mimeType and strip any MIME parameters after a semicolon (e.g.,
"audio/ogg; codecs=opus") into a normalizedMime variable, then use that
normalized value when setting ptt (instead of raw media.mimeType) and when
building content (audio: buffer, mimetype: media.mimeType or normalizedMime as
appropriate); update the ptt check and content assignment around the ptt/ptt
true logic (the existing ptt variable and content assignment) to reference
normalizedMime.
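
For illustration, the normalization described in the last comment could look like the sketch below. `normalizeMime` and `isVoiceNoteMime` are hypothetical names, and treating plain `audio/ogg` as the push-to-talk trigger is an assumption about the adapter's check.

```typescript
// Trim, lower-case, and drop any parameters after the first semicolon.
function normalizeMime(mimeType: string): string {
  return (mimeType.split(";")[0] ?? "").trim().toLowerCase();
}

// Assumption: ogg audio is what gets sent as a push-to-talk voice note.
function isVoiceNoteMime(mimeType: string): boolean {
  return normalizeMime(mimeType) === "audio/ogg";
}

console.log(normalizeMime("audio/ogg; codecs=opus")); // "audio/ogg"
console.log(isVoiceNoteMime(" Audio/OGG ")); // true
console.log(isVoiceNoteMime("audio/mpeg")); // false
```

Comparing the raw string against `"audio/ogg"` would miss `audio/ogg; codecs=opus`, which is the case the review comment calls out.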

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fd62bc4f-eeef-4f1e-bb1d-42566b5add92

📥 Commits

Reviewing files that changed from the base of the PR and between 38db5d9 and d0b39e5.

📒 Files selected for processing (9)
  • apps/channels/whatsapp/README.md
  • apps/channels/whatsapp/package.json
  • apps/channels/whatsapp/src/bot.ts
  • apps/channels/whatsapp/src/bridge.ts
  • apps/channels/whatsapp/src/whatsapp-api.ts
  • apps/channels/whatsapp/test/bot.test.ts
  • apps/channels/whatsapp/test/whatsapp-api.test.ts
  • docs/channel-adapter.md
  • docs/manual/17-channels.md

Comment thread apps/channels/whatsapp/package.json Outdated
Comment thread apps/channels/whatsapp/src/bridge.ts
Comment thread apps/channels/whatsapp/src/whatsapp-api.ts Outdated
Three review fixes:
- package.json description no longer says "Text-only v1." (stale npm
  metadata after the 0.3.0 media/voice rollout).
- On STT failure, log the detail but show the user a generic message
  instead of forwarding the raw error text into the chat.
- Normalize the audio MIME (strip params like `; codecs=opus`) before
  push-to-talk detection, so `audio/ogg; codecs=opus` still sends as a
  voice note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@williamwa
Collaborator Author

Addressed all three review comments in the latest commit: updated the stale package description, masked the raw STT error behind a generic user-facing message (detail still logged), and normalized the audio MIME (strip ; codecs=...) before push-to-talk detection so audio/ogg; codecs=opus still sends as a voice note. Added a test for the MIME-param case. Build + typecheck + tests green.
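
The STT error-masking fix can be sketched roughly as below. The log line and the generic user message follow the wording suggested in the review; `transcribeOrNotify` and the `deps` shape are hypothetical and stand in for the bridge's own `this.log` / `this.send` plumbing.

```typescript
// Hypothetical dependency bundle standing in for the bridge's logger and sender.
interface SttDeps {
  chatJid: string;
  transcribe: () => Promise<string>;
  log: (line: string) => void;
  send: (text: string) => Promise<void>;
}

// Returns the transcript, or undefined after logging the detail and notifying the user generically.
async function transcribeOrNotify(deps: SttDeps): Promise<string | undefined> {
  try {
    return await deps.transcribe();
  } catch (err) {
    const msg = err instanceof Error ? err.message : String(err);
    // Full detail goes to the log only.
    deps.log(`stt failed for chat ${deps.chatJid}: ${msg}`);
    // The user sees a generic message with no raw error text.
    await deps.send("Voice transcription failed. Please try again later.");
    return undefined;
  }
}
```

Returning `undefined` lets the caller stop processing the voice note without throwing a second time.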

@williamwa williamwa merged commit 9d50a9e into main May 29, 2026
1 check passed
@williamwa williamwa deleted the feat/whatsapp-attachments branch May 29, 2026 11:53