A minimal example showing how to bridge Twilio Programmable Voice with AssemblyAI's Voice Agent API. Caller ↔ Twilio Media Streams ↔ this server ↔ wss://agents.assemblyai.com/v1/realtime.
Example only — not hardened for production. No retry logic, rate limiting, or call-state persistence.
- A caller dials your Twilio number.
- Twilio hits POST /twiml on this server. The server returns TwiML containing a <Stream> pointed at wss://<your-hostname>/media-stream/<callId>.
- Twilio opens a Media Streams WebSocket and starts sending the caller's audio (G.711 μ-law, 8 kHz).
- The server opens a parallel WebSocket to the Voice Agent API and sends a session.update with the system prompt, voice, greeting, tools, and audio format set to audio/pcmu.
- Once session.ready fires, the server forwards μ-law payloads in both directions:
  - Twilio → AssemblyAI: each Twilio media event becomes an input.audio event.
  - AssemblyAI → Twilio: each reply.audio event becomes a Twilio media action.
- When the user barges in (input.speech.started), the server sends a Twilio clear action so the bot stops talking immediately.
Because Twilio's native format and the Voice Agent API's audio/pcmu are byte-compatible, audio is forwarded as base64 with zero transcoding.
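The forwarding described above can be sketched as two pure mapping functions. This is a minimal illustration, not the code in src/index.ts. The Twilio message shapes follow Twilio's Media Streams protocol; the Voice Agent API field names audio and data are assumptions made for this sketch, so check them against the API reference before relying on them.

```typescript
// Twilio Media Streams shapes (only the fields this sketch needs).
interface TwilioMediaEvent {
  event: "media";
  streamSid: string;
  media: { payload: string }; // base64-encoded G.711 μ-law
}

type TwilioAction =
  | { event: "media"; streamSid: string; media: { payload: string } }
  | { event: "clear"; streamSid: string };

// ASSUMPTION: the `audio` and `data` field names are illustrative only.
interface AgentInputAudio {
  type: "input.audio";
  audio: string;
}

interface AgentServerEvent {
  type: string;
  data?: string;
}

// Twilio -> AssemblyAI: the base64 payload is passed through untouched.
function twilioToAgent(msg: TwilioMediaEvent): AgentInputAudio {
  return { type: "input.audio", audio: msg.media.payload };
}

// AssemblyAI -> Twilio: audio becomes a media action, barge-in becomes clear.
// Returns null for events that need no Twilio-side action.
function agentToTwilio(evt: AgentServerEvent, streamSid: string): TwilioAction | null {
  switch (evt.type) {
    case "reply.audio":
      return evt.data
        ? { event: "media", streamSid, media: { payload: evt.data } }
        : null;
    case "input.speech.started":
      return { event: "clear", streamSid };
    default:
      return null;
  }
}
```

Keeping the mapping free of socket code makes it easy to unit test; the WebSocket handlers then only parse JSON, call one of these functions, and send the result.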
- A Twilio account with a phone number (buy one)
- An AssemblyAI API key with Voice Agent access (dashboard)
- ngrok (for local development)
- Node.js 20+
git clone <this-repo>
cd twilio-voice-agent
npm install
Twilio needs a public URL to send webhooks to and to open the Media Streams WebSocket against. ngrok exposes your local server on a public HTTPS endpoint.
ngrok http 3000
Copy the https://...ngrok.app URL.
cp .env.example .env
Open .env and fill in:
ASSEMBLYAI_API_KEY=your-assemblyai-api-key
HOSTNAME=https://your-ngrok-domain.ngrok.app
npm run dev
You should see Server running on http://localhost:3000.
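As a rough sketch of how these two variables might be validated at startup, the helper below fails fast when either is missing and reduces HOSTNAME to a bare host so it can be reused in a wss:// URL. The names loadConfig and BridgeConfig are hypothetical, and whether the example's own code strips the scheme this way is not stated here.

```typescript
interface BridgeConfig {
  apiKey: string;
  publicHost: string; // host only, no scheme or trailing slash
}

// Takes the environment as a plain object so it can be tested without
// touching real process state; pass process.env in the server.
function loadConfig(env: Record<string, string | undefined>): BridgeConfig {
  const apiKey = env["ASSEMBLYAI_API_KEY"];
  const hostname = env["HOSTNAME"];
  if (!apiKey) throw new Error("ASSEMBLYAI_API_KEY is not set");
  if (!hostname) throw new Error("HOSTNAME is not set");

  // "https://abc.ngrok.app/" -> "abc.ngrok.app"
  const publicHost = hostname.replace(/^[a-z]+:\/\//i, "").replace(/\/+$/, "");
  return { apiKey, publicHost };
}
```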
In the Twilio Console, open your phone number's Voice configuration and set:
- A call comes in → Webhook → POST → https://<your-hostname>/twiml
- Call status changes (optional) → Webhook → POST → https://<your-hostname>/call-status
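For reference, the response that the /twiml webhook returns could look roughly like the output of the helper below. It is a sketch under assumptions: buildStreamTwiml and escapeXml are hypothetical names, the use of <Connect> around <Stream> reflects Twilio's usual pattern for bidirectional streams rather than anything confirmed about this example, and the call ID is assumed to be any URL-safe identifier such as the CallSid.

```typescript
// Escape the characters that are unsafe inside an XML attribute value.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build TwiML that tells Twilio to open a Media Streams WebSocket
// against wss://<publicHost>/media-stream/<callId>.
function buildStreamTwiml(publicHost: string, callId: string): string {
  const url = `wss://${publicHost}/media-stream/${encodeURIComponent(callId)}`;
  return (
    `<?xml version="1.0" encoding="UTF-8"?>` +
    `<Response><Connect><Stream url="${escapeXml(url)}" /></Connect></Response>`
  );
}
```

The route handler would send this string with a Content-Type of text/xml.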
Dial your Twilio number from any phone. You should hear the greeting, then have a normal back-and-forth conversation. Watch the server logs to see the event stream.
Set the Twilio credentials in .env:
TWILIO_ACCOUNT_SID=ACxxxxxxxxxxxxxxxx
TWILIO_AUTH_TOKEN=your-twilio-auth-token
TWILIO_PHONE_NUMBER=+15551234567
TARGET_PHONE_NUMBER=+15557654321
With the server still running, in a new terminal:
npm run outbound
This places a call from your Twilio number to the target. Twilio fetches /outbound-twiml, which connects the call to /outbound-stream. The agent uses the OUTBOUND_PROMPT and speaks first via the greeting field. No extra signaling is required, because the Voice Agent API plays the configured greeting automatically when session.ready fires.
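To show what placing the call involves, here is a minimal sketch of the underlying Twilio REST request (a form-encoded POST to the Calls resource with HTTP basic auth). It is not the contents of src/outbound.ts, which may well use the Twilio SDK instead; buildOutboundCallRequest is a hypothetical helper, and the twimlUrl value is assumed to be your public hostname plus /outbound-twiml.

```typescript
interface OutboundCallRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build (but do not send) the request that creates an outbound call.
function buildOutboundCallRequest(opts: {
  accountSid: string;
  authToken: string;
  from: string; // TWILIO_PHONE_NUMBER
  to: string; // TARGET_PHONE_NUMBER
  twimlUrl: string; // e.g. https://<your-hostname>/outbound-twiml
}): OutboundCallRequest {
  const form = new URLSearchParams({
    To: opts.to,
    From: opts.from,
    Url: opts.twimlUrl,
  });
  return {
    url: `https://api.twilio.com/2010-04-01/Accounts/${opts.accountSid}/Calls.json`,
    method: "POST",
    headers: {
      Authorization: `Basic ${btoa(`${opts.accountSid}:${opts.authToken}`)}`,
      "Content-Type": "application/x-www-form-urlencoded",
    },
    body: form.toString(),
  };
}
```

Sending it is then a single call such as fetch(req.url, req), with a check on the response status.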
| File | Purpose |
|---|---|
| src/index.ts | Express server, inbound + outbound flows, Twilio ↔ Voice Agent API bridge |
| src/twilio.ts | Typed wrapper around the Twilio Media Streams WebSocket |
| src/bot.ts | System prompt, default greeting, tool definitions, tool dispatch |
| src/outbound.ts | Standalone script that places a call via the Twilio REST API |
The example registers one tool, generate_random_number. When the model decides to call it:
- Voice Agent API → server: tool.call event with call_id, name, args.
- Server runs the tool (in bot.ts → runTool) and returns a JSON string.
- Server → Voice Agent API: tool.result event with the same call_id.
- The model continues the conversation, naturally working the result into its next reply.
Add new tools by extending the TOOLS array and adding a case to runTool.
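The round trip above might look something like the following. The names TOOLS, runTool, generate_random_number, call_id, name, and args come from this README; everything else is an assumption for illustration, including the tool schema shape, the min and max arguments, and the result field on the tool.result event.

```typescript
// ASSUMPTION: this schema shape is illustrative, not the API's documented format.
interface ToolDefinition {
  name: string;
  description: string;
  parameters: Record<string, unknown>;
}

const TOOLS: ToolDefinition[] = [
  {
    name: "generate_random_number",
    description: "Return a random integer between min and max (inclusive).",
    parameters: {
      type: "object",
      properties: { min: { type: "number" }, max: { type: "number" } },
    },
  },
];

// Dispatch by tool name and always return a JSON string.
function runTool(name: string, args: Record<string, unknown>): string {
  switch (name) {
    case "generate_random_number": {
      const min = typeof args["min"] === "number" ? args["min"] : 1;
      const max = typeof args["max"] === "number" ? args["max"] : 100;
      const value = Math.floor(Math.random() * (max - min + 1)) + min;
      return JSON.stringify({ value });
    }
    default:
      // Unknown tools still produce a result so the model is never left waiting.
      return JSON.stringify({ error: `Unknown tool: ${name}` });
  }
}

// Turn a tool.call event into the matching tool.result event.
function toToolResult(call: { call_id: string; name: string; args: Record<string, unknown> }) {
  return { type: "tool.result", call_id: call.call_id, result: runTool(call.name, call.args) };
}
```

Adding a tool then means one new entry in TOOLS and one new case in the switch.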
Both directions use audio/pcmu (G.711 μ-law, 8 kHz, mono). This is Twilio's native phone-call codec, so no resampling or transcoding is needed — the server forwards base64 payloads as-is. The full audio path on the call has only one encoding step (the caller's mic → μ-law on Twilio's edge) and one decoding step (μ-law → speaker at the listener's end).
If you adapt this example to a non-telephony transport (e.g. browser via WebRTC), switch to audio/pcm at 24 kHz on both sides.
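To make the difference between the two formats concrete, the arithmetic below compares per-frame payload sizes. It assumes 20 ms frames and 16-bit samples for audio/pcm; both figures are assumptions for this sketch rather than values taken from this README.

```typescript
// Raw bytes in one audio frame.
function frameBytes(sampleRateHz: number, bytesPerSample: number, frameMs: number): number {
  return (sampleRateHz * bytesPerSample * frameMs) / 1000;
}

// Length of the base64 string that encodes byteCount bytes (with padding).
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

// audio/pcmu: 8 kHz, 1 byte per sample.
const mulawFrame = frameBytes(8000, 1, 20); // 160 bytes
// audio/pcm: 24 kHz, assumed 2 bytes per sample.
const pcmFrame = frameBytes(24000, 2, 20); // 960 bytes

console.log(mulawFrame, base64Length(mulawFrame)); // 160 216
console.log(pcmFrame, base64Length(pcmFrame)); // 960 1280
```

Under these assumptions a PCM frame is six times larger than a μ-law frame, which is worth keeping in mind when sizing buffers for a non-telephony transport.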
- Twilio call connects but you hear nothing. Check that HOSTNAME matches your ngrok domain and that your server is reachable. Watch ngrok's request log for the incoming Media Streams WebSocket.
- session.error with code invalid_value on the voice field. Voice names are case-sensitive; use lowercase (ivy, claire, dawn, etc.).
- Greeting plays but later replies don't. Make sure your tool handler always sends a tool.result back. The model waits for it before continuing.
- Audio is choppy or echoey. Twilio echo cancellation runs on the carrier side, so software AEC isn't needed on this server. If you're hearing echo locally during testing, it's likely your speakerphone; use a headset.
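One way to guard against the stalled-reply case is to wrap every tool invocation so that an exception or a hang still yields a tool.result. The wrapper below is a sketch: safeToolResult is a hypothetical helper, the 5 second default is arbitrary, and the result field name on the event is an assumption.

```typescript
interface ToolResultEvent {
  type: "tool.result";
  call_id: string;
  result: string; // JSON string
}

// Run a tool and always resolve with a tool.result event, even if the
// tool throws or takes longer than timeoutMs.
async function safeToolResult(
  callId: string,
  run: () => Promise<string> | string,
  timeoutMs = 5000,
): Promise<ToolResultEvent> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<string>((resolve) => {
    timer = setTimeout(
      () => resolve(JSON.stringify({ error: "tool timed out" })),
      timeoutMs,
    );
  });

  const result = await Promise.race([Promise.resolve().then(run), timeout])
    // Synchronous throws and rejections both land here.
    .catch((err: unknown) =>
      JSON.stringify({ error: err instanceof Error ? err.message : String(err) }),
    )
    .finally(() => clearTimeout(timer));

  return { type: "tool.result", call_id: callId, result };
}
```

Reporting the failure as a result lets the model explain to the caller that something went wrong instead of going silent.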