feat(whatsapp): inbound + outbound media attachments by williamwa · Pull Request #174 · HCF-STUDIOS/openhermit

williamwa · 2026-05-29T10:01:23Z

First channel in the attachment-support rollout (plan: WhatsApp → Discord → Slack → Telegram inbound).

WhatsApp was text-only — inbound media was dropped (only captions kept) and the agent couldn't send files. This wires the channel to the already-complete core attachment infrastructure in both directions. No core/protocol/gateway changes.

Inbound (WhatsApp → agent)

Images / video / documents → downloaded via Baileys downloadMediaMessage, uploaded as session attachments (images become vision input automatically), captions kept as text.
Voice / audio notes → transcribed via client.transcribeAudio; replies to a voice note are spoken back as a native WhatsApp voice note via client.synthesizeAudio when TTS is configured (mirrors Telegram).
Media-only messages now trigger the agent instead of being silently dropped.
Media over the 25 MiB attachment cap is skipped (logged).
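The inbound steps above can be sketched as follows. This is a minimal illustration, not the code in bot.ts: `RawMessage` is a hypothetical, trimmed-down stand-in for the Baileys message body, and the synthesized filenames, default MIME types, and `withinCap` helper are assumptions made for the example. Only the `extractMedia` name and the 25 MiB cap come from the PR.

```typescript
// Hypothetical, simplified message shape; the real Baileys type has many more fields.
interface RawMessage {
  imageMessage?: { mimetype?: string; caption?: string };
  videoMessage?: { mimetype?: string; caption?: string };
  audioMessage?: { mimetype?: string };
  documentMessage?: { mimetype?: string; fileName?: string; caption?: string };
}

type MediaKind = "image" | "video" | "audio" | "document";

interface ExtractedMedia {
  kind: MediaKind;
  mimeType: string;
  filename: string;
  caption?: string;
}

// 25 MiB attachment cap, as stated in the PR description.
const MAX_ATTACHMENT_BYTES = 25 * 1024 * 1024;

// Derive a file extension from a MIME type, ignoring parameters (assumed helper).
function extensionFor(mimeType: string): string {
  const base = mimeType.split(";")[0] ?? "";
  const subtype = base.split("/")[1];
  return subtype ? subtype.trim() : "bin";
}

// Detect which media kind (if any) a message carries and synthesize a filename.
function extractMedia(msg: RawMessage): ExtractedMedia | undefined {
  if (msg.imageMessage) {
    const mimeType = msg.imageMessage.mimetype ?? "image/jpeg";
    return { kind: "image", mimeType, filename: `image.${extensionFor(mimeType)}`, caption: msg.imageMessage.caption };
  }
  if (msg.videoMessage) {
    const mimeType = msg.videoMessage.mimetype ?? "video/mp4";
    return { kind: "video", mimeType, filename: `video.${extensionFor(mimeType)}`, caption: msg.videoMessage.caption };
  }
  if (msg.audioMessage) {
    const mimeType = msg.audioMessage.mimetype ?? "audio/ogg";
    return { kind: "audio", mimeType, filename: `audio.${extensionFor(mimeType)}` };
  }
  if (msg.documentMessage) {
    const mimeType = msg.documentMessage.mimetype ?? "application/octet-stream";
    const filename = msg.documentMessage.fileName ?? `document.${extensionFor(mimeType)}`;
    return { kind: "document", mimeType, filename, caption: msg.documentMessage.caption };
  }
  return undefined; // text-only message
}

// Media larger than the cap is skipped (and logged) rather than uploaded.
function withinCap(byteLength: number): boolean {
  return byteLength <= MAX_ATTACHMENT_BYTES;
}
```

A caller would run `extractMedia` first, download the bytes, and forward the message only when `withinCap(bytes.byteLength)` holds.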

Outbound (agent → WhatsApp)

The agent's attachment_send deliveries (SSE attachment event) are routed to the matching Baileys send: image / video / document, and ogg/opus audio as a push-to-talk voice note.
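The routing described above might look roughly like the sketch below. It builds the message content object that would be handed to the Baileys socket's `sendMessage`; the `OutboundMedia` type, the `toMessageContent` name, the fallback filename, and the rule that non-ogg audio is sent as ordinary audio are assumptions for illustration, not the PR's actual `sendMedia`.

```typescript
type OutboundKind = "image" | "video" | "audio" | "document";

// Hypothetical outbound descriptor; field names are assumed.
interface OutboundMedia {
  kind: OutboundKind;
  bytes: Uint8Array;
  mimeType: string;
  filename?: string;
  caption?: string;
}

// Content shapes modeled on the Baileys media message payloads.
type MediaContent =
  | { image: Uint8Array; caption?: string }
  | { video: Uint8Array; caption?: string }
  | { audio: Uint8Array; mimetype: string; ptt: boolean }
  | { document: Uint8Array; mimetype: string; fileName: string; caption?: string };

function toMessageContent(media: OutboundMedia): MediaContent {
  switch (media.kind) {
    case "image":
      return { image: media.bytes, caption: media.caption };
    case "video":
      return { video: media.bytes, caption: media.caption };
    case "audio": {
      // ogg audio goes out as a push-to-talk voice note; MIME parameters are ignored.
      const base = (media.mimeType.split(";")[0] ?? "").trim().toLowerCase();
      return { audio: media.bytes, mimetype: media.mimeType, ptt: base === "audio/ogg" };
    }
    case "document":
      return {
        document: media.bytes,
        mimetype: media.mimeType,
        fileName: media.filename ?? "file", // assumed fallback name
        caption: media.caption,
      };
  }
}
```

Keeping the mapping a pure function makes the payload shapes easy to assert in unit tests without a live WhatsApp connection.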

Implementation

whatsapp-api.ts: new sendMedia (outbound routing) + downloadMedia (Baileys download).
bot.ts: new extractMedia; downloads bytes; forwards media-only messages.
bridge.ts: STT/TTS path, uploadAttachment + postMessage({attachments}), and an attachment case in the SSE loop.
Reuses SDK uploadAttachment, postMessage({attachments}), downloadAttachmentBytes, transcribeAudio, synthesizeAudio.

Tests

35 unit tests pass (added extractMedia, media-only toIncomingMessage, and sendMedia routing coverage).

Bumps @openhermit/channel-whatsapp 0.2.0 → 0.3.0. Docs updated (README, channel-adapter, manual).

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- WhatsApp channel supports media attachments (images, video, documents) with captions preserved; images are available as vision inputs.
- Voice notes are transcribed to text and agent replies can be sent back as voice (when TTS is enabled).
- Outbound attachment sends route to WhatsApp, and media-only messages are suppressed if download fails.
Documentation
- Updated docs and README to describe media/voice support, 25 MiB attachment cap, fresh-session /new, optional allow-list, and unsupported platform features.

WhatsApp was text-only: inbound media was dropped (only captions kept) and the agent couldn't send files. Wire the channel to the existing attachment infrastructure in both directions.

Inbound: images/video/documents are downloaded via Baileys downloadMediaMessage and uploaded as session attachments (images become vision input); captions ride along as text. Voice/audio notes are transcribed via the agent's STT, and replies to a voice note are spoken back as a WhatsApp voice note when TTS is configured. Media-only messages now trigger the agent instead of being dropped. Media over the 25 MiB cap is skipped.

Outbound: the agent's attachment_send deliveries (SSE 'attachment' event) are routed to the matching Baileys send: image/video/document, and ogg/opus audio as a native push-to-talk voice note.

No core/protocol/gateway changes; uses SDK uploadAttachment, postMessage({attachments}), downloadAttachmentBytes, transcribeAudio, synthesizeAudio. Bumps 0.2.0 -> 0.3.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai · 2026-05-29T10:01:37Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 72ffcde6-e7e6-47c5-909d-762687ff52ce

📥 Commits

Reviewing files that changed from the base of the PR and between d0b39e5 and 9c350a7.

📒 Files selected for processing (4)

apps/channels/whatsapp/package.json
apps/channels/whatsapp/src/bridge.ts
apps/channels/whatsapp/src/whatsapp-api.ts
apps/channels/whatsapp/test/whatsapp-api.test.ts

🚧 Files skipped from review as they are similar to previous changes (3)

apps/channels/whatsapp/test/whatsapp-api.test.ts
apps/channels/whatsapp/package.json
apps/channels/whatsapp/src/whatsapp-api.ts

📝 Walkthrough

Adds inbound media extraction and 25 MiB cap enforcement, STT for voice notes, upload of non-audio media as agent attachments, outbound media/TTS delivery and attachment routing, and tests/docs plus a package version bump.

Changes

WhatsApp Media and Voice Handling

| Layer / File(s) | Summary |
| --- | --- |
| Inbound media extraction and size enforcement: `apps/channels/whatsapp/src/bot.ts`, `apps/channels/whatsapp/test/bot.test.ts` | Media download and 25 MiB cap enforcement in `handleRawMessage`; exported `extractMedia` detects image/video/audio/document and synthesizes filenames; `toIncomingMessage` gates on text-or-media and normalizes output. Tests cover extraction and caption handling. |
| WhatsApp API media methods: `apps/channels/whatsapp/src/whatsapp-api.ts`, `apps/channels/whatsapp/test/whatsapp-api.test.ts` | Introduces `WhatsAppOutboundMedia` and public `sendMedia`/`downloadMedia` methods. `sendMedia` maps kind/mime to Baileys payloads (images/documents/caption handling; ogg/opus audio -> ptt voice notes). Tests assert sent payload shapes. |
| Inbound bridge, STT and attachment upload: `apps/channels/whatsapp/src/bridge.ts` | Adds `WhatsAppIncomingMedia` and optional `media` on incoming messages. `handleIncomingInner` transcribes audio via agent STT (injects voice instruction + transcript) and uploads non-audio media as attachments for the agent. |
| Outbound TTS and attachment delivery: `apps/channels/whatsapp/src/bridge.ts` | After agent response, attempts TTS voice replies when appropriate (`shouldSpeak` gating); adds `deliverAttachment` to process SSE attachment frames, download bytes from agent, resolve kind/name/caption, and forward to WhatsApp; `waitForAgentResponse` accepts target JID. |
| Documentation and version bump: `apps/channels/whatsapp/README.md`, `apps/channels/whatsapp/package.json`, `docs/channel-adapter.md`, `docs/manual/17-channels.md` | README and docs updated to document inbound/outbound media, STT/TTS, and 25 MiB skip rule; package version bumped 0.2.0 → 0.3.0 and description revised. |
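The inbound bridge behavior summarized in the table (transcribe audio, upload everything else) could be sketched like this. The method names `transcribeAudio`, `uploadAttachment`, and `postMessage` come from the PR, but their signatures here, the `AgentClient` interface, the `forwardInbound` function, and the `VOICE_INSTRUCTION` text are all assumptions for illustration.

```typescript
interface IncomingMedia {
  kind: "image" | "video" | "audio" | "document";
  bytes: Uint8Array;
  mimeType: string;
  filename: string;
}

interface IncomingMessage {
  text?: string;
  media?: IncomingMedia;
}

// Hypothetical client surface; real SDK signatures may differ.
interface AgentClient {
  transcribeAudio(bytes: Uint8Array, mimeType: string): Promise<string>;
  uploadAttachment(
    sessionId: string,
    file: { bytes: Uint8Array; mimeType: string; filename: string },
  ): Promise<{ id: string }>;
  postMessage(sessionId: string, body: { text: string; attachments?: string[] }): Promise<void>;
}

// Placeholder for the "voice instruction" the bridge injects; actual wording unknown.
const VOICE_INSTRUCTION = "[voice note]";

async function forwardInbound(client: AgentClient, sessionId: string, msg: IncomingMessage): Promise<void> {
  const media = msg.media;
  if (media?.kind === "audio") {
    // Audio is never uploaded as an attachment; it becomes text via STT.
    const transcript = await client.transcribeAudio(media.bytes, media.mimeType);
    await client.postMessage(sessionId, { text: `${VOICE_INSTRUCTION} ${transcript}` });
    return;
  }
  if (media) {
    // Non-audio media is uploaded first, then referenced from the message.
    const { id } = await client.uploadAttachment(sessionId, media);
    await client.postMessage(sessionId, { text: msg.text ?? "", attachments: [id] });
    return;
  }
  await client.postMessage(sessionId, { text: msg.text ?? "" });
}
```

Note that a media-only message (no caption) still reaches `postMessage`, which matches the PR's statement that such messages now trigger the agent.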

Sequence Diagrams

sequenceDiagram
  participant WhatsApp as WhatsApp Message
  participant Bot as bot.ts Handler
  participant Extract as extractMedia
  participant Incoming as Incoming Event
  participant Bridge as Bridge Handler
  participant Agent as Agent Client
  participant API as WhatsApp API
  
  WhatsApp->>Bot: raw message with image
  Bot->>Bot: downloadMedia bytes
  Bot->>Bot: enforce 25 MiB cap
  Bot->>Extract: parse extracted media
  Extract->>Incoming: return kind/mimeType/filename
  Incoming->>Bridge: WhatsAppIncomingMessage with media
  
  alt inbound is audio
    Bridge->>Agent: transcribeAudio(bytes) STT
    Agent->>Bridge: transcript text
    Bridge->>Agent: postMessage(voiceInstruction + transcript)
  else inbound is other media
    Bridge->>Agent: uploadAttachment(bytes)
    Agent->>Bridge: attachmentId
    Bridge->>Agent: postMessage(attachmentRef)
  end
  
  Agent->>Bridge: SSE response frames
  alt audio inbound + suitable text
    Bridge->>Agent: synthesizeAudio(text) TTS
    Agent->>Bridge: audio ogg bytes
    Bridge->>API: sendMedia(kind=audio, ptt=true)
    API->>WhatsApp: voice note message
  else agent attachment frame
    Bridge->>Agent: downloadAttachmentBytes(id)
    Agent->>Bridge: media bytes + kind + filename
    Bridge->>API: sendMedia(kind, bytes, filename)
    API->>WhatsApp: media message
  else text only
    Bridge->>API: sendMessage(text)
    API->>WhatsApp: text message
  end
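The three outbound branches in the diagram can be expressed as a small decision function. The PR names `shouldSpeak` but does not spell out its criteria, so the conditions below (inbound was a voice note, TTS is configured, reply text is non-empty) are an assumption drawn from the description; the `Frame` and `Delivery` shapes and `planDelivery` are likewise hypothetical.

```typescript
// Hypothetical SSE frame shapes; only the 'attachment' event name appears in the PR.
type Frame =
  | { type: "text"; text: string }
  | { type: "attachment"; attachmentId: string };

type Delivery =
  | { via: "voice"; text: string }
  | { via: "text"; text: string }
  | { via: "media"; attachmentId: string };

interface ReplyContext {
  inboundWasVoice: boolean;
  ttsConfigured: boolean;
}

// Assumed gating: speak only when replying to a voice note with TTS available.
function shouldSpeak(ctx: ReplyContext, text: string): boolean {
  return ctx.inboundWasVoice && ctx.ttsConfigured && text.trim().length > 0;
}

function planDelivery(frame: Frame, ctx: ReplyContext): Delivery {
  if (frame.type === "attachment") {
    // Attachment frames always go through the media path.
    return { via: "media", attachmentId: frame.attachmentId };
  }
  return shouldSpeak(ctx, frame.text)
    ? { via: "voice", text: frame.text }
    : { via: "text", text: frame.text };
}
```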

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly Related PRs

HCF-STUDIOS/openhermit#124: Introduces the AgentLocalClient attachment upload/download and transcribeAudio/synthesizeAudio APIs used by this PR's bridge inbound/outbound flow.
HCF-STUDIOS/openhermit#155: The initial WhatsApp adapter implementation that this PR extends with media extraction and send/downloadMedia paths.
HCF-STUDIOS/openhermit#164: Also modifies bridge.ts's waitForAgentResponse and agent response handling, touching related response-routing logic.

Poem

🐰 a rabbit hums
Media hops in, capped at twenty-five,
Voice notes whisper, transcripts come alive,
The agent sends back songs or files to say—
Hooray for WhatsApp’s brighter play!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(whatsapp): inbound + outbound media attachments' clearly and concisely summarizes the main change (adding bidirectional media support to the WhatsApp channel) and is directly reflected throughout the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/channels/whatsapp/package.json`:
- Around line 3-4: Update the package.json "description" field in the WhatsApp
channel package to remove the outdated "Text-only v1." wording and reflect
current media/voice support introduced in v0.3.0; locate the "description" key
in apps/channels/whatsapp/package.json and replace the string to something
accurate (e.g., mention media and voice support and the current version/tag) so
npm metadata no longer misstates capabilities.

In `@apps/channels/whatsapp/src/bridge.ts`:
- Around line 174-177: The code currently sends raw STT error text back to the
user; instead, keep the detailed error in logs and send a generic, user-safe
message. In the STT error handling block (where err/msg, this.log, and this.send
are used) log the full error message with this.log(`stt failed for chat
${event.chatJid}: ${msg}`) but change the await this.send(...) call to send a
generic text like "Voice transcription failed. Please try again later." (do not
include msg or err in the user-facing string); keep sessionId and event.chatJid
unchanged.

In `@apps/channels/whatsapp/src/whatsapp-api.ts`:
- Around line 97-98: Normalize media.mimeType before computing ptt: trim and
lower-case media.mimeType and strip any MIME parameters after a semicolon (e.g.,
"audio/ogg; codecs=opus") into a normalizedMime variable, then use that
normalized value when setting ptt (instead of raw media.mimeType) and when
building content (audio: buffer, mimetype: media.mimeType or normalizedMime as
appropriate); update the ptt check and content assignment around the ptt/ptt
true logic (the existing ptt variable and content assignment) to reference
normalizedMime.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fd62bc4f-eeef-4f1e-bb1d-42566b5add92

📥 Commits

Reviewing files that changed from the base of the PR and between 38db5d9 and d0b39e5.

📒 Files selected for processing (9)

apps/channels/whatsapp/README.md
apps/channels/whatsapp/package.json
apps/channels/whatsapp/src/bot.ts
apps/channels/whatsapp/src/bridge.ts
apps/channels/whatsapp/src/whatsapp-api.ts
apps/channels/whatsapp/test/bot.test.ts
apps/channels/whatsapp/test/whatsapp-api.test.ts
docs/channel-adapter.md
docs/manual/17-channels.md

Three review fixes:
- package.json description no longer says "Text-only v1." (stale npm metadata after the 0.3.0 media/voice rollout).
- On STT failure, log the detail but show the user a generic message instead of forwarding the raw error text into the chat.
- Normalize the audio MIME (strip params like `; codecs=opus`) before push-to-talk detection, so `audio/ogg; codecs=opus` still sends as a voice note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
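The STT failure handling from this commit can be illustrated with a small wrapper. This is a sketch under assumptions: `transcribeOrNotify` and its callback parameters are hypothetical, while the log line format and the generic user-facing text follow the wording suggested in the review comment.

```typescript
// Generic, user-safe message; no provider detail leaks into the chat.
const GENERIC_STT_FAILURE = "Voice transcription failed. Please try again later.";

async function transcribeOrNotify(
  transcribe: () => Promise<string>,
  log: (line: string) => void,
  send: (text: string) => Promise<void>,
  chatJid: string,
): Promise<string | undefined> {
  try {
    return await transcribe();
  } catch (err) {
    const msg = err instanceof Error ? err.message : String(err);
    // Full detail goes to the log only.
    log(`stt failed for chat ${chatJid}: ${msg}`);
    // The user sees a fixed string that never includes msg or err.
    await send(GENERIC_STT_FAILURE);
    return undefined;
  }
}
```

Returning `undefined` on failure lets the caller skip posting to the agent without needing its own try/catch.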

williamwa · 2026-05-29T11:27:33Z

Addressed all three review comments in the latest commit: updated the stale package description, masked the raw STT error behind a generic user-facing message (detail still logged), and normalized the audio MIME (strip `; codecs=...`) before push-to-talk detection so `audio/ogg; codecs=opus` still sends as a voice note. Added a test for the MIME-param case. Build + typecheck + tests green.
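A minimal sketch of the MIME normalization described here, assuming the push-to-talk check compares against the base type `audio/ogg`. The function names `normalizeMime` and `isVoiceNoteMime` are illustrative and not taken from whatsapp-api.ts.

```typescript
// Trim, lower-case, and drop any parameters after the first semicolon.
function normalizeMime(mimeType: string): string {
  const base = mimeType.split(";")[0] ?? "";
  return base.trim().toLowerCase();
}

// Assumed push-to-talk rule: ogg audio is sent as a voice note.
function isVoiceNoteMime(mimeType: string): boolean {
  return normalizeMime(mimeType) === "audio/ogg";
}
```

With this in place, `audio/ogg`, `audio/ogg; codecs=opus`, and `Audio/OGG ;codecs=opus` are all treated the same, while `audio/mpeg` is not a voice note.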

williamwa mentioned this pull request May 29, 2026

feat(discord): inbound + outbound media attachments #175

Merged

coderabbitai Bot reviewed May 29, 2026

Comment thread apps/channels/whatsapp/package.json Outdated

Comment thread apps/channels/whatsapp/src/bridge.ts

Comment thread apps/channels/whatsapp/src/whatsapp-api.ts Outdated

This was referenced May 29, 2026

feat(slack): inbound + outbound media attachments #176

Merged

feat(telegram): inbound image / document / video #177

Merged

williamwa merged commit 9d50a9e into main May 29, 2026
1 check passed

williamwa deleted the feat/whatsapp-attachments branch May 29, 2026 11:53

williamwa mentioned this pull request May 29, 2026

chore(cli): bump to 0.9.2 #178

Merged

williamwa merged 2 commits into main from feat/whatsapp-attachments
