feat(whatsapp): inbound + outbound media attachments #174
WhatsApp was text-only: inbound media was dropped (only captions kept)
and the agent couldn't send files. Wire the channel to the existing
attachment infrastructure in both directions.
Inbound: images/video/documents are downloaded via Baileys
downloadMediaMessage and uploaded as session attachments (images become
vision input); captions ride along as text. Voice/audio notes are
transcribed via the agent's STT, and replies to a voice note are spoken
back as a WhatsApp voice note when TTS is configured. Media-only messages
now trigger the agent instead of being dropped. Media over the 25 MiB cap
is skipped.
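A minimal sketch of the size check described above, assuming the cap is applied to the downloaded bytes and that a payload exactly at the limit is still accepted. `MAX_MEDIA_BYTES`, `CapResult`, and `enforceMediaCap` are hypothetical names for illustration, not identifiers from this PR; only the 25 MiB figure comes from the commit message.

```typescript
// Hypothetical helper: only the 25 MiB figure is taken from the PR text.
const MAX_MEDIA_BYTES = 25 * 1024 * 1024; // 25 MiB

type CapResult =
  | { ok: true; bytes: Uint8Array }
  | { ok: false; reason: string };

// Returns the bytes unchanged when they fit, or a reason the caller can log
// before skipping the media. "Over the cap" is read as strictly greater than.
function enforceMediaCap(
  bytes: Uint8Array,
  limit: number = MAX_MEDIA_BYTES,
): CapResult {
  if (bytes.byteLength > limit) {
    return {
      ok: false,
      reason: `media is ${bytes.byteLength} bytes, over the ${limit}-byte cap`,
    };
  }
  return { ok: true, bytes };
}
```

Returning a result object rather than throwing keeps the "skip and continue" behaviour explicit at the call site.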
Outbound: the agent's attachment_send deliveries (SSE 'attachment' event)
are routed to the matching Baileys send — image/video/document, and
ogg/opus audio as a native push-to-talk voice note.
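One way the "matching send" could be chosen is by the attachment's MIME type. The sketch below is an assumption about that mapping, not the PR's code: `OutboundKind` and `kindForMime` are hypothetical names, and the fallback to `document` for unrecognised types is a design guess.

```typescript
// Hypothetical mapping from a MIME type to the kind of WhatsApp send.
type OutboundKind = "image" | "video" | "audio" | "document";

function kindForMime(mimeType: string): OutboundKind {
  const mime = mimeType.trim().toLowerCase();
  if (mime.startsWith("image/")) return "image";
  if (mime.startsWith("video/")) return "video";
  if (mime.startsWith("audio/")) return "audio";
  // Assumption: anything else (PDFs, archives, unknown types) goes out as a file.
  return "document";
}
```

Falling back to `document` means an unexpected type is still delivered as a downloadable file instead of being dropped.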
No core/protocol/gateway changes — uses SDK uploadAttachment,
postMessage({attachments}), downloadAttachmentBytes, transcribeAudio,
synthesizeAudio. Bumps 0.2.0 -> 0.3.0.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
No actionable comments were generated in the recent review.
📝 Walkthrough
Adds inbound media extraction and 25 MiB cap enforcement, STT for voice notes, upload of non-audio media as agent attachments, outbound media/TTS delivery and attachment routing, and tests/docs plus a package version bump.
Changes
WhatsApp Media and Voice Handling
Sequence Diagrams
sequenceDiagram
participant WhatsApp as WhatsApp Message
participant Bot as bot.ts Handler
participant Extract as extractMedia
participant Incoming as Incoming Event
participant Bridge as Bridge Handler
participant Agent as Agent Client
participant API as WhatsApp API
WhatsApp->>Bot: raw message with image
Bot->>Bot: downloadMedia bytes
Bot->>Bot: enforce 25 MiB cap
Bot->>Extract: parse extracted media
Extract->>Incoming: return kind/mimeType/filename
Incoming->>Bridge: WhatsAppIncomingMessage with media
alt inbound is audio
Bridge->>Agent: transcribeAudio(bytes) STT
Agent->>Bridge: transcript text
Bridge->>Agent: postMessage(voiceInstruction + transcript)
else inbound is other media
Bridge->>Agent: uploadAttachment(bytes)
Agent->>Bridge: attachmentId
Bridge->>Agent: postMessage(attachmentRef)
end
Agent->>Bridge: SSE response frames
alt audio inbound + suitable text
Bridge->>Agent: synthesizeAudio(text) TTS
Agent->>Bridge: audio ogg bytes
Bridge->>API: sendMedia(kind=audio, ptt=true)
API->>WhatsApp: voice note message
else agent attachment frame
Bridge->>Agent: downloadAttachmentBytes(id)
Agent->>Bridge: media bytes + kind + filename
Bridge->>API: sendMedia(kind, bytes, filename)
API->>WhatsApp: media message
else text only
Bridge->>API: sendMessage(text)
API->>WhatsApp: text message
end
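The inbound half of the diagram amounts to a three-way branch. A rough sketch of that decision as a pure function follows; the types and `planInbound` are hypothetical and only mirror the field names shown in the diagram (kind, mimeType, filename), so the real handler may be shaped differently.

```typescript
// Hypothetical types mirroring the fields named in the diagram.
type MediaKind = "image" | "video" | "document" | "audio";

interface IncomingMedia {
  kind: MediaKind;
  mimeType: string;
  filename?: string;
  bytes: Uint8Array;
}

interface Incoming {
  chatJid: string;
  text: string; // caption or message body; may be empty for media-only messages
  media?: IncomingMedia;
}

type InboundPlan =
  | { path: "transcribe"; media: IncomingMedia } // audio: STT, then postMessage
  | { path: "upload"; media: IncomingMedia; caption: string } // uploadAttachment, then postMessage
  | { path: "text"; text: string }
  | { path: "ignore" };

function planInbound(msg: Incoming): InboundPlan {
  if (msg.media?.kind === "audio") {
    return { path: "transcribe", media: msg.media };
  }
  if (msg.media) {
    // Media-only messages (empty caption) still reach the agent.
    return { path: "upload", media: msg.media, caption: msg.text };
  }
  if (msg.text.trim() !== "") {
    return { path: "text", text: msg.text };
  }
  return { path: "ignore" };
}
```

Keeping the decision separate from the network calls makes the branch easy to unit test without a live agent client.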
Estimated Code Review Effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/channels/whatsapp/package.json`:
- Around line 3-4: Update the package.json "description" field in the WhatsApp
channel package to remove the outdated "Text-only v1." wording and reflect
current media/voice support introduced in v0.3.0; locate the "description" key
in apps/channels/whatsapp/package.json and replace the string to something
accurate (e.g., mention media and voice support and the current version/tag) so
npm metadata no longer misstates capabilities.
In `@apps/channels/whatsapp/src/bridge.ts`:
- Around line 174-177: The code currently sends raw STT error text back to the
user; instead, keep the detailed error in logs and send a generic, user-safe
message. In the STT error handling block (where err/msg, this.log, and this.send
are used) log the full error message with this.log(`stt failed for chat
${event.chatJid}: ${msg}`) but change the await this.send(...) call to send a
generic text like "Voice transcription failed. Please try again later." (do not
include msg or err in the user-facing string); keep sessionId and event.chatJid
unchanged.
In `@apps/channels/whatsapp/src/whatsapp-api.ts`:
- Around line 97-98: Normalize media.mimeType before computing ptt: trim and
lower-case media.mimeType and strip any MIME parameters after a semicolon (e.g.,
"audio/ogg; codecs=opus") into a normalizedMime variable, then use that
normalized value when setting ptt (instead of raw media.mimeType) and when
building content (audio: buffer, mimetype: media.mimeType or normalizedMime as
appropriate); update the ptt check and content assignment around the ptt/ptt
true logic (the existing ptt variable and content assignment) to reference
normalizedMime.
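The bridge.ts and whatsapp-api.ts findings above can be sketched as two small pure helpers. This is an illustration under stated assumptions, not the code in the PR: `describeSttFailure`, `normalizeMime`, and `isVoiceNoteMime` are hypothetical names, and treating `audio/ogg` as the push-to-talk type is an assumption based on the "ogg/opus" wording in the description.

```typescript
// Hypothetical helper: keep the detail for logs, give the user a generic message.
interface SttFailureReport {
  logLine: string;
  userText: string;
}

function describeSttFailure(err: unknown, chatJid: string): SttFailureReport {
  const msg = err instanceof Error ? err.message : String(err);
  return {
    logLine: `stt failed for chat ${chatJid}: ${msg}`,
    // The raw error text is deliberately left out of the user-facing string.
    userText: "Voice transcription failed. Please try again later.",
  };
}

// Hypothetical helper: drop MIME parameters, surrounding whitespace, and case.
function normalizeMime(raw: string): string {
  const base = raw.split(";")[0] ?? "";
  return base.trim().toLowerCase();
}

// Assumption: an Ogg container is what qualifies audio as a voice note.
function isVoiceNoteMime(raw: string): boolean {
  return normalizeMime(raw) === "audio/ogg";
}
```

With these, `audio/ogg; codecs=opus` and `Audio/OGG` are both treated as voice notes, while `audio/mpeg` would be sent as ordinary audio.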
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fd62bc4f-eeef-4f1e-bb1d-42566b5add92
📒 Files selected for processing (9)
- apps/channels/whatsapp/README.md
- apps/channels/whatsapp/package.json
- apps/channels/whatsapp/src/bot.ts
- apps/channels/whatsapp/src/bridge.ts
- apps/channels/whatsapp/src/whatsapp-api.ts
- apps/channels/whatsapp/test/bot.test.ts
- apps/channels/whatsapp/test/whatsapp-api.test.ts
- docs/channel-adapter.md
- docs/manual/17-channels.md
Three review fixes:
- package.json description no longer says "Text-only v1." (stale npm metadata after the 0.3.0 media/voice rollout).
- On STT failure, log the detail but show the user a generic message instead of forwarding the raw error text into the chat.
- Normalize the audio MIME (strip params like `; codecs=opus`) before push-to-talk detection, so `audio/ogg; codecs=opus` still sends as a voice note.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Addressed all three review comments in the latest commit: updated the stale package description, masked the raw STT error behind a generic user-facing message (detail still logged), and normalized the audio MIME (strip |
First channel in the attachment-support rollout (plan: WhatsApp → Discord → Slack → Telegram inbound).
WhatsApp was text-only — inbound media was dropped (only captions kept) and the agent couldn't send files. This wires the channel to the already-complete core attachment infrastructure in both directions. No core/protocol/gateway changes.
Inbound (WhatsApp → agent)
- Images/video/documents are downloaded via Baileys downloadMediaMessage, uploaded as session attachments (images become vision input automatically), captions kept as text.
- Voice/audio notes are transcribed via client.transcribeAudio; replies to a voice note are spoken back as a native WhatsApp voice note via client.synthesizeAudio when TTS is configured (mirrors Telegram).
Outbound (agent → WhatsApp)
- attachment_send deliveries (SSE attachment event) are routed to the matching Baileys send: image / video / document, and ogg/opus audio as a push-to-talk voice note.
Implementation
- whatsapp-api.ts: new sendMedia (outbound routing) + downloadMedia (Baileys download).
- bot.ts: new extractMedia; downloads bytes; forwards media-only messages.
- bridge.ts: STT/TTS path, uploadAttachment + postMessage({attachments}), and an attachment case in the SSE loop.
- Uses SDK uploadAttachment, postMessage({attachments}), downloadAttachmentBytes, transcribeAudio, synthesizeAudio.
Tests
35 unit tests pass (added extractMedia, media-only toIncomingMessage, and sendMedia routing coverage).
Bumps @openhermit/channel-whatsapp 0.2.0 → 0.3.0. Docs updated (README, channel-adapter, manual).
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
/new, optional allow-list, and unsupported platform features.