feat(voice): dual-provider TTS (Supertonic local + ElevenLabs cloud) with Linux support #1301

Open
Trei-D wants to merge 1 commit into danielmiessler:main from Trei-D:feat/dual-provider-voice-linux

Trei-D commented May 24, 2026

Problem

The v5.0.0 voice module is macOS-only (uses afplay + osascript) and ElevenLabs-only (requires API key + quota). This means:

  1. Linux PAI users have no voice: afplay doesn't exist on Linux
  2. Voice costs money — every notification burns ElevenLabs API credits
  3. Voice requires internet — no offline/local option

Solution

Dual-provider TTS architecture with cross-platform audio playback.


New: Supertonic as local-first provider

Zero cost, zero internet, zero API key. Supertonic runs TTS inference on CPU using ONNX models that auto-download on first use.

Installation

cd ~/.claude/PAI/PULSE/VoiceServer

# Create Python venv and install Supertonic
python3 -m venv .venv
.venv/bin/pip install supertonic

# Verify installation
.venv/bin/python supertonic-tts.py --text "Hello from PAI" --voice M1 --output /tmp/test.wav

# Play the result (Linux)
paplay /tmp/test.wav
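
As a rough sketch of how the server side might invoke the wrapper, the helper below assembles the same command line as the verification step above. The `--text`, `--voice`, and `--output` flags come from that command; the function name `buildSupertonicCommand`, the `SynthCommand` shape, and the example directory are hypothetical and not taken from `voice.ts`.

```typescript
// Hypothetical helper (not from voice.ts): builds the argv for supertonic-tts.py.
// Only the three flags are taken from the verification command in this PR.
interface SynthCommand {
  command: string;
  args: string[];
}

function buildSupertonicCommand(
  serverDir: string,
  text: string,
  voice: string,
  outputPath: string,
): SynthCommand {
  return {
    // Use the venv interpreter so the installed supertonic package is found.
    command: `${serverDir}/.venv/bin/python`,
    args: [
      `${serverDir}/supertonic-tts.py`,
      "--text", text,
      "--voice", voice,
      "--output", outputPath,
    ],
  };
}

// Example directory is illustrative only.
const synthCommand = buildSupertonicCommand(
  "/opt/VoiceServer",
  "Hello from PAI",
  "M1",
  "/tmp/test.wav",
);
```

Passing the text as a separate argv element, rather than interpolating it into a shell string, avoids quoting problems when a notification contains quotes or other shell metacharacters.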

Requirements:

  • Python 3.10+ (tested with 3.12)
  • ~158 MB disk for the venv
  • ~386 MB disk for model cache (~/.cache/supertonic3/, downloaded on first run)

Available voices

| Voice | Gender | Notes |
|-------|--------|-------|
| M1–M5 | Male | 5 distinct male voices |
| F1–F5 | Female | 5 distinct female voices |

Configure in settings.json:

{
  "daidentity": {
    "voices": {
      "provider": "supertonic",
      "main": {
        "supertonicVoice": "M1"
      }
    }
  }
}
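
To illustrate how this fragment could be read, here is a minimal sketch that parses the settings and validates the voice name against the M1–M5 / F1–F5 range. The `VoiceProvider` type name appears in the commit message; everything else (`readVoiceSettings`, the `VoiceSettings` shape, and the defaults of `"supertonic"` and `"M1"`) is an assumption made for the example, not the PR's actual behavior.

```typescript
type VoiceProvider = "supertonic" | "elevenlabs";

interface VoiceSettings {
  provider: VoiceProvider;
  supertonicVoice: string;
}

// Matches the documented voices: M1-M5 and F1-F5.
const SUPERTONIC_VOICE_PATTERN = /^[MF][1-5]$/;

// Hypothetical reader. The fallback values are assumptions for this sketch.
function readVoiceSettings(settingsJson: string): VoiceSettings {
  const parsed = JSON.parse(settingsJson);
  const voices = parsed?.daidentity?.voices ?? {};

  const provider: VoiceProvider =
    voices.provider === "elevenlabs" ? "elevenlabs" : "supertonic";

  const requested = voices.main?.supertonicVoice;
  const supertonicVoice =
    typeof requested === "string" && SUPERTONIC_VOICE_PATTERN.test(requested)
      ? requested
      : "M1";

  return { provider, supertonicVoice };
}

const exampleSettings = readVoiceSettings(
  '{"daidentity":{"voices":{"provider":"supertonic","main":{"supertonicVoice":"F3"}}}}',
);
```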

Performance (CPU-only, no GPU required)

Benchmarked on a 2-core Intel Skylake VM (worst case — most desktops will be faster):

| Message | Synthesis time | End-to-end (+ playback) |
|---------|----------------|-------------------------|
| Short (3 words) | ~1.6s | ~3.5s |
| Medium (8 words) | ~2.0s | ~5.5s |
| Long (12 words) | ~2.0s | ~5.5s |

  • First run: adds ~10–30s for model download (~386 MB), then cached permanently
  • CPU usage: saturates all available cores during synthesis (~8s user time on 2 cores, i.e. fully parallel), then returns to idle
  • Memory: ~200 MB RSS during synthesis

For comparison, ElevenLabs cloud TTS takes ~1–2s network round-trip but costs $0.30/1K characters.
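
To put the per-character price in context, the arithmetic can be sketched as below. The $0.30 per 1K characters rate is the figure quoted above; the function name and the example workload (50-character messages, 100 notifications a day, 30 days) are illustrative assumptions, not measurements.

```typescript
// Rate quoted in this PR description; actual pricing depends on the plan.
const ELEVENLABS_USD_PER_1K_CHARS = 0.30;

// Hypothetical estimator: cost = characters / 1000 * rate.
function estimateCloudCostUsd(
  charsPerMessage: number,
  messagesPerDay: number,
  days: number,
): number {
  const totalChars = charsPerMessage * messagesPerDay * days;
  return (totalChars / 1000) * ELEVENLABS_USD_PER_1K_CHARS;
}

// 50 chars * 100 messages * 30 days = 150,000 chars, roughly $45 at this rate.
const monthlyCostUsd = estimateCloudCostUsd(50, 100, 30);
```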


New: Cross-platform audio playback

Audio player discovery chain (first available wins):

| Player | Platform | Package |
|--------|----------|---------|
| paplay | Linux (PulseAudio/PipeWire) | pulseaudio-utils or pipewire-pulse |
| ffplay | Universal (FFmpeg) | ffmpeg |
| afplay | macOS | Built-in |
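
The "first available wins" rule can be sketched as a simple ordered scan. The player order comes from the table above; `pickAudioPlayer` and the injected `isAvailable` check are hypothetical names, and how `voice.ts` actually probes for binaries is not shown in this description.

```typescript
// Discovery order taken from the table in this PR description.
const PLAYER_CHAIN = ["paplay", "ffplay", "afplay"] as const;
type AudioPlayer = (typeof PLAYER_CHAIN)[number];

// Hypothetical helper: returns the first player the probe reports as present.
// The probe is injected so the selection logic stays testable without a PATH lookup.
function pickAudioPlayer(
  isAvailable: (binary: string) => boolean,
): AudioPlayer | null {
  for (const player of PLAYER_CHAIN) {
    if (isAvailable(player)) {
      return player;
    }
  }
  return null; // Corresponds to the "No audio player found" case.
}

// Example: a machine with FFmpeg and afplay but no PulseAudio tools.
const installedBinaries = new Set(["ffplay", "afplay"]);
const chosenPlayer = pickAudioPlayer((binary) => installedBinaries.has(binary));
// chosenPlayer is "ffplay"
```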

Linux system dependencies:

# Ubuntu/Debian
sudo apt install pulseaudio-utils libnotify-bin

# Fedora
sudo dnf install pulseaudio-utils libnotify

# Arch
sudo pacman -S libpulse libnotify

New: Linux desktop notifications

  • notify-send on Linux (libnotify) — visual popup alongside audio
  • osascript on macOS (existing behavior preserved)
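
A minimal sketch of the per-platform split follows. `notify-send <title> <body>` and `osascript -e 'display notification ... with title ...'` are the standard invocations of those tools; the function name, the `NotifyCommand` shape, and the escaping helper are assumptions for illustration rather than the code in `voice.ts`.

```typescript
type DesktopPlatform = "linux" | "darwin";

interface NotifyCommand {
  command: string;
  args: string[];
}

// Escape backslashes and double quotes for an AppleScript string literal.
function escapeAppleScript(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
}

// Hypothetical builder: returns the command to spawn for a visual notification.
function buildNotifyCommand(
  platform: DesktopPlatform,
  title: string,
  message: string,
): NotifyCommand {
  if (platform === "linux") {
    // notify-send takes the summary and body as positional arguments.
    return { command: "notify-send", args: [title, message] };
  }
  const script =
    `display notification "${escapeAppleScript(message)}" ` +
    `with title "${escapeAppleScript(title)}"`;
  return { command: "osascript", args: ["-e", script] };
}

const linuxNotification = buildNotifyCommand("linux", "PAI", "Task complete");
const macNotification = buildNotifyCommand("darwin", "PAI", 'Say "hi"');
```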

Homeserver → Desktop audio routing

For users running PAI on a headless server (VM, NAS, homelab), voice audio can play on a remote desktop machine via PulseAudio/PipeWire network streaming:

On the desktop (audio sink):

# PulseAudio: allow network connections
pactl load-module module-native-protocol-tcp auth-anonymous=1

# PipeWire: add to ~/.config/pipewire/pipewire-pulse.conf.d/network.conf
# context.modules = [{ name = libpipewire-module-protocol-pulse
#   args = { server.address = ["unix:native", "tcp:4713"] } }]

On the server (PAI host):

# Add to ~/.claude/.env or shell profile
export PULSE_SERVER=tcp:<DESKTOP_IP>:4713

Audio from paplay/ffplay on the server routes to the desktop's speakers over the LAN. Works with both WAV (Supertonic) and MP3 (ElevenLabs).
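
If the player is spawned from the voice server rather than from a shell, the same routing can be expressed by adding `PULSE_SERVER` to the child process environment. The sketch below only builds that environment. Port 4713 and the `tcp:<ip>:<port>` form come from the steps above; the function names are hypothetical, and the address handling assumes an IPv4 address or hostname.

```typescript
// Hypothetical helper: formats the PULSE_SERVER value for a remote desktop.
// Assumes an IPv4 address or hostname; IPv6 literals are not handled here.
function pulseServerAddress(desktopHost: string, port: number = 4713): string {
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError(`Invalid TCP port: ${port}`);
  }
  return `tcp:${desktopHost}:${port}`;
}

// Returns a copy of the base environment, adding PULSE_SERVER only when
// a remote desktop is configured. The input object is not mutated.
function playbackEnv(
  baseEnv: Record<string, string>,
  desktopHost?: string,
): Record<string, string> {
  if (!desktopHost) {
    return { ...baseEnv };
  }
  return { ...baseEnv, PULSE_SERVER: pulseServerAddress(desktopHost) };
}

// Example with an illustrative LAN address.
const remoteEnv = playbackEnv({ HOME: "/home/pai" }, "192.168.1.20");
```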


Troubleshooting

| Issue | Fix |
|-------|-----|
| `No audio player found` | Install pulseaudio-utils (Linux) or ffmpeg |
| `Supertonic TTS failed` | Check .venv/bin/python exists; re-run pip install supertonic |
| `Voice: Supertonic not installed — falling back to elevenlabs` | Normal if you haven't installed Supertonic; set provider: "elevenlabs" to suppress |
| No sound on remote server | Set PULSE_SERVER=tcp:<desktop-ip>:4713 in .env |
| Connection refused on PulseAudio TCP | Run pactl load-module module-native-protocol-tcp on the desktop |

Backward compatibility

  • ElevenLabs users: set "provider": "elevenlabs" in settings.json — everything works exactly as before
  • macOS users: afplay + osascript still in the discovery chain — zero behavior change
  • No Supertonic installed: auto-fallback to ElevenLabs with a log warning
  • All existing HTTP endpoints (/notify, /notify/personality, /voice, /voice/health) unchanged
  • 3-tier config resolution preserved (caller body → voice_id lookup → defaults)
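
The fallback and 3-tier rules in the list above can be sketched as two small functions. This is an illustration under assumptions: the `VoiceProvider` type name is from the commit message, but `resolveProvider`, `resolveVoiceParams`, and the field names `voice_settings` and `supertonicVoice` on the request body are hypothetical stand-ins for whatever `voice.ts` actually uses.

```typescript
type VoiceProvider = "supertonic" | "elevenlabs";

// Auto-fallback: Supertonic is used only when it is both requested and installed.
// onFallback lets the caller log the warning; it is invoked at most once per call.
function resolveProvider(
  requested: VoiceProvider,
  supertonicInstalled: boolean,
  onFallback: () => void,
): VoiceProvider {
  if (requested === "supertonic" && !supertonicInstalled) {
    onFallback();
    return "elevenlabs";
  }
  return requested;
}

// Hypothetical shapes for the 3-tier resolution sketch.
interface VoiceParams {
  supertonicVoice: string;
}

interface NotifyBody {
  voice_id?: string;
  voice_settings?: VoiceParams;
}

// Tier 1: settings in the caller body. Tier 2: voice_id lookup. Tier 3: defaults.
function resolveVoiceParams(
  body: NotifyBody,
  byVoiceId: Record<string, VoiceParams>,
  defaults: VoiceParams,
): VoiceParams {
  if (body.voice_settings) {
    return body.voice_settings;
  }
  const found =
    body.voice_id !== undefined ? byVoiceId[body.voice_id] : undefined;
  return found ?? defaults;
}
```

Keeping the fallback decision separate from playback means an ElevenLabs-only setup never touches the Supertonic path.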

Files changed

| File | Change |
|------|--------|
| VoiceServer/voice.ts | Dual-provider architecture, Linux audio/notification support |
| VoiceServer/supertonic-tts.py | New: Python wrapper for Supertonic TTS synthesis |

Testing

Verified on:

  • Ubuntu 24.04 (paplay + notify-send) with Supertonic provider
  • Homeserver → desktop audio routing via PulseAudio TCP (VM → desktop over LAN)
  • ElevenLabs fallback when Supertonic not installed
  • macOS compatibility preserved (afplay + osascript in discovery chain)

…with Linux support

- Add Supertonic as local CPU-based TTS provider (zero cost, no API key needed)
- Add Linux audio playback: paplay (PulseAudio) → ffplay (FFmpeg) → afplay (macOS)
- Add Linux desktop notifications via notify-send
- Add VoiceProvider type for provider selection in settings.json
- Add per-voice Supertonic voice mapping (M1-M5, F1-F5)
- Add supertonic-tts.py wrapper script
- Preserve full backward compatibility with ElevenLabs-only setups
- Auto-fallback: if Supertonic not installed, falls back to ElevenLabs