An agent-native voice-and-vision framework. Turn any audio/visual device -- earbuds, smart glasses, pendants, phones -- into an always-on AI companion that can perceive, understand, and act on your behalf.
Built by Intentlabs.
Supported platforms: iOS (iPhone) and Android
Today's voice AI apps (ChatGPT Voice, Gemini Live, Sesame) are conversational but not agentic. They can talk to you, but they cannot act for you. When they try to do complex tasks (search, multi-step workflows, API calls), they go silent for 10-30 seconds -- broken UX.
Meanwhile, agent frameworks (OpenClaw, Manus, Claude Code) can execute complex tasks but have no real-time voice interface.
No consumer product today combines real-time voice conversation with general-purpose agent execution. Matcha fills this gap.
Matcha separates real-time voice interaction from asynchronous task execution, allowing both to run simultaneously without blocking each other.
```
                          +------------------------------+
                          |         MATCHA CORE          |
                          |                              |
User ---- Audio ------->  |  +------------------------+  |
Device    Stream          |  |      VOICE AGENT       |  |
(glasses,                 |  |     (synchronous)      |  |
 earbuds,                 |  |                        |  |
 pendant,                 |  |  Real-time voice       |  |
 phone)                   |  |  conversation.         |  |
     <--- Audio --------  |  |  Always responsive.    |  |
          Response        |  |  Never blocked.        |  |
                          |  +-----------+------------+  |
                          |              |               |
                          |       delegates tasks        |
                          |              |               |
                          |  +-----------v------------+  |
                          |  |      ACTION AGENT      |  |
                          |  |     (asynchronous)     |  |
User ---- Video ------->  |  |                        |  |
Device    Frames          |  |  Web search, API       |  |
(camera   (~1fps)         |  |  calls, messaging,     |  |
 on                       |  |  smart home, etc.      |  |
 glasses,                 |  |                        |  |
 phone)                   |  |  Reports results       |  |
                          |  |  back to Voice         |  |
                          |  |  Agent when ready.     |  |
                          |  +------------------------+  |
                          |                              |
                          +------------------------------+
```
Voice Agent -- maintains real-time bidirectional audio with the user. Sub-second latency. Never blocked by tasks. Powered by Gemini Live API or OpenAI Realtime API.
Action Agent -- receives task delegations from Voice Agent. Executes complex, multi-step tasks in the background via either E2B cloud sandboxes (Claude Agent SDK) or OpenClaw (56+ skills: web search, messaging, smart home, notes, reminders, etc.). Reports results back to Voice Agent when ready.
Example flow:
- User: "Find me the best ramen places in SF that are open late"
- Voice Agent: "Sure, let me search for late-night ramen spots."
- Action Agent begins web search in background
- User: "Oh also, I want somewhere with vegetarian options"
- Voice Agent: "Got it, I'll filter for vegetarian-friendly places too."
- Action Agent returns results
- Voice Agent speaks the answer conversationally
The user is never left in silence. The agent is never limited to shallow answers.
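The flow above can be sketched as a minimal coordinator: the voice path acknowledges immediately and queues work, while the action path drains the queue separately. The types and names below are illustrative only; the real implementation is `AgentCoordinator.swift` in `Core/Agents/`.

```swift
// Illustrative sketch of the dual-agent split; not the framework's actual API.
struct AgentTask { let description: String }
struct AgentResult { let summary: String }

final class MiniCoordinator {
    private var pending: [AgentTask] = []
    var spokenLines: [String] = []

    // Voice path: acknowledge right away, queue the task for the Action Agent.
    func handleUserUtterance(_ text: String) {
        spokenLines.append("Sure, working on: \(text)")
        pending.append(AgentTask(description: text))
    }

    // Action path: drain queued tasks and report results back to the voice path.
    func runPendingTasks(execute: (AgentTask) -> AgentResult) {
        while !pending.isEmpty {
            let task = pending.removeFirst()
            spokenLines.append(execute(task).summary)
        }
    }
}
```

Because acknowledgement and execution are decoupled, the user can keep adding constraints ("also vegetarian") while earlier tasks are still in flight.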
Matcha is device-agnostic. It connects to any audio I/O device:
| Device | Audio In | Audio Out | Video In | Status |
|---|---|---|---|---|
| Phone (built-in) | Mic | Speaker | Camera | Working |
| AirPods / earbuds | Mic | Speaker | -- | Working |
| Meta Ray-Ban glasses | Mic | Speaker | Camera (via DAT SDK) | Working |
| Any Bluetooth audio | Mic | Speaker | -- | Working |
| Sesame glasses | Mic | Speaker | Camera | Planned |
| Apple glasses | Mic | Speaker | Camera | Planned |
| Pendant devices | Mic | Speaker | Camera | Planned |
Matcha is model-agnostic:
| Provider | Model | Status |
|---|---|---|
| Google | Gemini 2.0 Flash (Live API) | Working |
| OpenAI | GPT-4o Realtime API | Planned |
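Model-agnosticism hinges on a provider abstraction. The sketch below is a guess at the shape of that interface; the actual definition lives in `Core/Protocols/VoiceModelProvider.swift` and may differ.

```swift
import Foundation

// Hypothetical shape of the model-agnostic provider interface.
protocol VoiceModelProvider {
    var name: String { get }
    func connect() throws
    func sendAudioChunk(_ pcm: Data)
}

// A stub standing in for the Gemini Live adapter (GeminiLiveProvider.swift).
struct GeminiLiveStub: VoiceModelProvider {
    let name = "gemini-2.0-flash-live"
    func connect() throws {
        // The real adapter opens a WebSocket to the Live API here.
    }
    func sendAudioChunk(_ pcm: Data) {
        // The real adapter forwards 16 kHz PCM frames over the socket.
    }
}
```

Adding the planned OpenAI Realtime provider would then be a second conforming type, with no changes to the Voice Agent.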
```
git clone https://github.com/Intent-Lab/matcha.git
cd matcha/samples/CameraAccess
open CameraAccess.xcodeproj
cp CameraAccess/Secrets.swift.example CameraAccess/Secrets.swift
```

Edit Secrets.swift with your Gemini API key (required) and optional E2B/OpenClaw/WebRTC config.
Select your iPhone as the target device and hit Run (Cmd+R).
Without glasses (iPhone mode):
- Tap "Start on iPhone" -- uses your iPhone's back camera
- Tap the AI button to start a voice session
- Talk to the AI -- it can see through your iPhone camera and execute tasks
With Meta Ray-Ban glasses:
First, enable Developer Mode in the Meta AI app:
- Open the Meta AI app on your iPhone
- Go to Settings (gear icon, bottom left)
- Tap App Info
- Tap the App version number 5 times -- this unlocks Developer Mode
- Go back to Settings -- you'll now see a Developer Mode toggle. Turn it on.
Then in the app:
- Tap "Start Streaming"
- Tap the AI button for voice + vision conversation
```
git clone https://github.com/Intent-Lab/matcha.git
```

Open samples/CameraAccessAndroid/ in Android Studio.
The Meta DAT Android SDK is distributed via GitHub Packages. You need a GitHub Personal Access Token with read:packages scope.
- Go to GitHub > Settings > Developer Settings > Personal Access Tokens and create a classic token with `read:packages` scope
- In `samples/CameraAccessAndroid/local.properties`, add:

```
github_token=YOUR_GITHUB_TOKEN
```

Then copy the secrets template:

```
cd samples/CameraAccessAndroid/app/src/main/java/com/meta/wearable/dat/externalsampleapps/cameraaccess/
cp Secrets.kt.example Secrets.kt
```

Edit Secrets.kt with your Gemini API key (required) and optional E2B/OpenClaw/WebRTC config.
- Let Gradle sync in Android Studio
- Select your Android phone as the target device
- Click Run (Shift+F10)
Without glasses (Phone mode):
- Tap "Start on Phone" -- uses your phone's back camera
- Tap the AI button to start a voice session
- Talk to the AI -- it can see through your phone camera and execute tasks
With Meta Ray-Ban glasses:
Enable Developer Mode in the Meta AI app (same steps as iOS above), then:
- Tap "Start Streaming" in the app
- Tap the AI button for voice + vision conversation
Matcha supports two agent backends for task execution. You can switch between them at runtime in the in-app Settings > Agent Backend picker.
| Backend | Description | Best for |
|---|---|---|
| E2B | Cloud-hosted sandbox (E2B + Claude Agent SDK). Deploy the agent/ directory to Vercel. Supports streaming tool progress. | Production, multi-user |
| OpenClaw | Local Mac gateway with 56+ skills. Runs on your local network. | Development, personal use |
Without either backend configured, the AI is voice + vision only (no task execution).
The E2B backend runs a Claude Agent SDK sandbox in the cloud. It supports real-time streaming of tool execution progress (which tools are running, their results, etc.).
Deploy the agent/ directory to Vercel:
```
cd agent
vercel deploy
```

iOS -- In Secrets.swift:

```swift
static let agentBaseURL = "https://your-deployment.vercel.app"
static let agentToken = "your-shared-secret-token"
```

Android -- In Secrets.kt:

```kotlin
const val agentBaseURL = "https://your-deployment.vercel.app"
const val agentToken = "your-shared-secret-token"
```

Open Settings in the app and set Agent Backend to E2B.
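Once the base URL and token are configured, the app can build authenticated requests against the deployment. The sketch below shows the likely shape of such a request; the `/task` path and JSON body are illustrative assumptions, not the actual route defined by the agent/ deployment.

```swift
import Foundation

// Illustrative request model; the real bridge is AgentBridge.swift.
struct AgentRequest {
    let url: URL
    let headers: [String: String]
    let body: [String: String]
}

// Builds a request to the deployed agent backend. The "/task" route is a
// hypothetical placeholder for whatever the agent/ deployment exposes.
func makeAgentRequest(baseURL: String, token: String, task: String) -> AgentRequest? {
    guard let url = URL(string: baseURL + "/task") else { return nil }
    return AgentRequest(
        url: url,
        headers: [
            "Authorization": "Bearer \(token)",   // shared-secret token from Secrets
            "Content-Type": "application/json"
        ],
        body: ["task": task]
    )
}
```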
OpenClaw gives Matcha the ability to take real-world actions: send messages, search the web, manage lists, control smart home devices, and more.
Follow the OpenClaw setup guide. Make sure the gateway is enabled:
In ~/.openclaw/openclaw.json:
```json
{
  "gateway": {
    "port": 18789,
    "bind": "lan",
    "auth": {
      "mode": "token",
      "token": "your-gateway-token-here"
    },
    "http": {
      "endpoints": {
        "chatCompletions": { "enabled": true }
      }
    }
  }
}
```

iOS -- In Secrets.swift:

```swift
static let openClawHost = "http://Your-Mac.local"
static let openClawPort = 18789
static let openClawGatewayToken = "your-gateway-token-here"
```

Android -- In Secrets.kt:

```kotlin
const val openClawHost = "http://Your-Mac.local"
const val openClawPort = 18789
const val openClawGatewayToken = "your-gateway-token-here"
```

Open Settings in the app and set Agent Backend to OpenClaw. You can use the Test Connection button to verify connectivity.
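From the host and port above, the app forms the gateway endpoint. The sketch below assumes the enabled chatCompletions endpoint follows the usual OpenAI-compatible `/v1/chat/completions` path convention; check the OpenClaw docs for the actual route.

```swift
import Foundation

// Hypothetical endpoint builder; the real routing lives in the OpenClaw bridge.
// The "/v1/chat/completions" path is an assumption based on the
// "chatCompletions" endpoint name in the gateway config.
func gatewayEndpoint(host: String, port: Int) -> URL? {
    URL(string: "\(host):\(port)/v1/chat/completions")
}
```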
After editing the config, restart the gateway:

```
openclaw gateway restart
```

iOS project layout:

```
samples/CameraAccess/CameraAccess/
  Core/                           # Dual-agent framework
    Protocols/
      VoiceModelProvider.swift    # Abstract voice model interface
      AgentProtocol.swift         # AgentTask, AgentResult types
    Models/
      GeminiLiveProvider.swift    # Gemini Live API adapter
    Agents/
      VoiceAgent.swift            # Real-time voice session manager
      ActionAgent.swift           # Async task executor
      AgentCoordinator.swift      # Dual-agent orchestrator
  Agent/                          # Agent backend (E2B + OpenClaw)
    AgentBridge.swift             # Dual-backend bridge (E2B sandbox streaming + OpenClaw)
    AgentConfig.swift             # E2B configuration
    ToolCallModels.swift          # Tool declarations (execute, capture_photo), data types
  Gemini/                         # Voice model infrastructure
    GeminiLiveService.swift       # WebSocket client for Gemini Live API
    AudioManager.swift            # Mic capture (PCM 16kHz) + playback (PCM 24kHz)
    GeminiSessionViewModel.swift  # Session lifecycle (delegates to AgentCoordinator)
    GeminiConfig.swift            # API keys, model config, system prompt
  OpenClaw/                       # Tool call routing
    ToolCallRouter.swift          # Routes Gemini tool calls to agent backend
  iPhone/                         # Phone camera fallback
    IPhoneCameraManager.swift
  WebRTC/                         # Live streaming (glasses POV to browser)
    WebRTCClient.swift
    SignalingClient.swift
  Settings/
    SettingsManager.swift         # Persisted settings (agent backend, API keys, session)
    SettingsView.swift            # Settings UI with backend picker, connection test
```
Android project layout:

```
samples/CameraAccessAndroid/app/src/main/java/.../cameraaccess/
  gemini/                       # Voice model infrastructure
    GeminiLiveService.kt        # WebSocket client for Gemini Live API
    AudioManager.kt             # Mic capture (PCM 16kHz) + playback (PCM 24kHz)
    GeminiSessionViewModel.kt   # Session lifecycle, auto-reconnect, capture_photo handler
    GeminiConfig.kt             # API keys, model config, system prompt
  openclaw/                     # Agent backend (E2B + OpenClaw)
    OpenClawBridge.kt           # Dual-backend bridge (E2B sandbox streaming + OpenClaw)
    ToolCallRouter.kt           # Routes Gemini tool calls to agent backend
    ToolCallModels.kt           # Tool declarations (execute, capture_photo), data types
  settings/                     # Settings
    SettingsManager.kt          # Persisted settings (agent backend, API keys, session)
  ui/                           # Compose UI
    SettingsScreen.kt           # Settings UI with backend picker, connection test
    StreamScreen.kt             # Main streaming view
  stream/                       # Camera streaming
    StreamViewModel.kt          # Camera frame management
    StreamingMode.kt            # Glasses vs phone mode
  webrtc/                       # Live streaming (glasses POV to browser)
    WebRTCClient.kt
    WebRTCSessionViewModel.kt
  wearables/                    # Meta glasses integration (DAT SDK)
    WearablesViewModel.kt
```
- Input: Mic -> AudioManager (PCM Int16, 16kHz mono, 100ms chunks) -> Voice Model WebSocket
- Output: Voice Model WebSocket -> AudioManager playback queue -> Speaker
- Echo cancellation: Aggressive AEC (`voiceChat`) when the speaker is on the phone; mild AEC (`videoChat`) when using glasses
- Mic muting: Automatically mutes the mic while the AI speaks when the speaker and mic are co-located
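The pipeline's chunking arithmetic is worth pinning down: 16-bit PCM means 2 bytes per sample, so a 100 ms mic chunk at 16 kHz mono is 1600 samples, i.e. 3200 bytes, and a 100 ms playback chunk at 24 kHz is 4800 bytes. A small helper makes the math explicit:

```swift
// Bytes per chunk for interleaved Int16 PCM:
// rate (samples/s) * channels * 2 bytes * duration (ms) / 1000.
func pcmChunkBytes(sampleRateHz: Int, channels: Int, chunkMs: Int) -> Int {
    let bytesPerSample = 2  // Int16 PCM
    return sampleRateHz * channels * bytesPerSample * chunkMs / 1000
}
```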
- User says "Add eggs to my shopping list"
- Voice Agent acknowledges: "Sure, adding that now"
- Voice Agent delegates task to Action Agent
- Action Agent routes to the selected backend:
- E2B: Initializes sandbox, streams execution via SSE (with tool progress updates), falls back to Vercel if sandbox unavailable
- OpenClaw: Sends HTTP POST to local gateway
- Backend executes the task
- Action Agent returns result to Voice Agent
- Voice Agent speaks the confirmation
The Voice Agent remains responsive throughout -- the user can continue talking while tasks execute.
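Step 4 of the flow above is a simple dispatch on the selected backend. The sketch below is illustrative only; the real dispatch lives in ToolCallRouter and the backend bridge.

```swift
// Mirrors the backend routing described above, including the
// "no backend configured" case (voice + vision only).
enum AgentBackend { case e2b, openClaw }

func describeRoute(_ backend: AgentBackend?) -> String {
    switch backend {
    case .e2b:
        return "stream task via E2B sandbox (SSE), fall back to Vercel if unavailable"
    case .openClaw:
        return "HTTP POST task to the local OpenClaw gateway"
    case nil:
        return "no backend configured: voice + vision only"
    }
}
```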
- Dual-agent architecture (Voice Agent + Action Agent)
- VoiceModelProvider protocol (model-agnostic)
- Gemini Live provider
- OpenClaw integration for task execution
- E2B cloud sandbox integration with SSE streaming
- Agent backend switcher (E2B / OpenClaw) on both platforms
- Auto-reconnect with exponential backoff
- Camera photo capture via voice command
- iOS and Android apps (feature parity)
- OpenAI Realtime provider
- Device provider abstraction
- Camera-based intent inference
- Proactive assistance (auto-translate foreign text, surface contextual info)
- Cross-frame memory ("What was that sign I saw 2 minutes ago?")
- Gaze-based intent prediction (with eye-tracking hardware)
- iOS 17.0+
- Xcode 15.0+
- Gemini API key (get one free)
- Meta Ray-Ban glasses (optional -- use iPhone mode for testing)
- OpenClaw on your Mac (optional -- for task execution)
- Android 14+ (API 34+)
- Android Studio Ladybug or newer
- GitHub account with a `read:packages` token (for the DAT SDK)
- Gemini API key (get one free)
- Meta Ray-Ban glasses (optional -- use Phone mode for testing)
- OpenClaw on your Mac (optional -- for task execution)
AI doesn't hear me -- Check that microphone permission is granted. Speak clearly and at normal volume.
OpenClaw connection timeout -- Make sure your phone and Mac are on the same Wi-Fi network, the gateway is running (openclaw gateway restart), and the hostname matches your Mac's Bonjour name.
"Gemini API key not configured" -- Add your API key in Secrets.swift/Secrets.kt or in the in-app Settings.
Echo/feedback in iPhone mode -- The app mutes the mic while the AI is speaking. If you still hear echo, try turning down the volume.
Android: Gradle sync fails with 401 -- Your GitHub token is missing or doesn't have read:packages scope. Check local.properties. Generate a new token at github.com/settings/tokens.
For DAT SDK issues, see the developer documentation or the discussions forum.
See CONTRIBUTING.md.
This source code is licensed under the license found in the LICENSE file in the root directory of this source tree.