Commit c6dd481

feat(tts): Play/Stop toggle, generating state, timestamped logs, CJK/Hindi → Web Speech API
- Merged Play/Stop into single toggle button (▷ Play ↔ ■ Stop)
- Run button shows ⏳ Generating… state, disables all other buttons
- Added _ttsT() elapsed-time timestamps to all console logs
- Fixed critical bug: synthesizing status was resetting modelReady=false
- Moved Japanese, Chinese, Hindi to Web Speech API (espeak-ng WASM can't phonemize non-Latin scripts)
- generate() now handles Web Speech API completion via polling
- Updated Kokoro model ID from v1.1-zh to v1.0
- CSS: generating pulse animation, disabled button styles, dark mode
1 parent 1721d1b commit c6dd481

9 files changed

Lines changed: 403 additions & 92 deletions

README.md

Lines changed: 3 additions & 3 deletions
@@ -28,7 +28,7 @@
 | **🎬 Media Embedding** | Video playback via `![alt](video.mp4)` image syntax (`.mp4`, `.webm`, `.ogg`, `.mov`, `.m4v`); YouTube/Vimeo embeds auto-detected; `embed` code block for responsive media grids (`cols=1-4`, `height=N`); Video.js v10 lazy-loaded with native `<video>` fallback; website URLs render as rich link preview cards with favicon + "Open ↗" button |
 | **🤖 AI Assistant** | 3 local Qwen 3.5 sizes (0.8B / 2B / 4B via WebGPU/WASM), Gemini 3.1 Flash Lite, Groq Llama 3.3 70B, OpenRouter — summarize, expand, rephrase, grammar-fix, explain, simplify, auto-complete; AI writing tags (Polish, Formalize, Elaborate, Shorten, Image); enhanced context menu; per-card model selection; concurrent block generation; inline review with accept/reject/regenerate; AI-powered image generation; **smart model loading UX** — cache vs download detection (📦/⬇️), HuggingFace source location display, delete cached models from browser storage; all models hosted on [`textagent` HuggingFace org](https://huggingface.co/textagent) with automatic fallback |
 | **🎤 Voice Dictation** | Dual-engine speech-to-text: **Voxtral Mini 3B** (WebGPU, primary, 13 languages, ~2.7 GB) or **Whisper Large V3 Turbo** (WASM fallback, ~800 MB) with consensus scoring; download consent popup with model info before first use; 50+ Markdown-aware voice commands — natural phrases ("heading one", "bold…end bold", "add table", "undo"); auto-punctuation via AI refinement or built-in fallback; streaming partial results |
-| **🔊 Text-to-Speech** | Hybrid Kokoro TTS engine — English/Chinese via [Kokoro 82M v1.1-zh ONNX](https://huggingface.co/textagent/Kokoro-82M-v1.1-zh-ONNX) (~80 MB, off-thread WebWorker), Japanese & 10+ languages via Web Speech API fallback; TTS card with separate ▶ Run (generate audio) / ▷ Play (replay) / 💾 Save (WAV download) buttons; hover any preview text and click 🔊 to hear pronunciation; voice auto-selection by language |
+| **🔊 Text-to-Speech** | Hybrid Kokoro TTS engine — 9 languages (English, Japanese, Chinese, Spanish, French, Hindi, Italian, Portuguese) via [Kokoro 82M v1.0 ONNX](https://huggingface.co/textagent/Kokoro-82M-v1.0-ONNX) (~80 MB, off-thread WebWorker), Korean, German & others via Web Speech API fallback; TTS card with separate ▶ Run (generate audio) / ▷ Play (replay) / 💾 Save (WAV download) buttons; hover any preview text and click 🔊 to hear pronunciation; voice auto-selection by language |
 | **Import** | MD, DOCX, XLSX/XLS, CSV, HTML, JSON, XML, PDF — drag & drop or click to import |
 | **Export** | Markdown, self-contained styled HTML, PDF (smart page-breaks, shared rendering pipeline), LLM Memory (5 formats: XML, JSON, Compact JSON, Markdown, Plain Text + shareable link) |
 | **Sharing** | AES-256-GCM encrypted sharing via Firebase; read-only shared links, optional passphrase protection — decryption key stays in URL fragment (never sent to server) |
@@ -62,7 +62,7 @@ TextAgent includes a built-in AI assistant panel with **three local model sizes*
 | **Gemini 3.1 Flash Lite** | Google (free tier) | ☁️ Cloud — 1M tokens/min | 🚀 Very Fast |
 | **Llama 3.3 70B** | Groq (free tier) | ☁️ Cloud — ultra-low latency | ⚡ Ultra Fast |
 | **Auto · Best Free** | OpenRouter (free tier) | ☁️ Cloud — multi-model routing | 🧠 Powerful |
-| **Kokoro TTS (82M)** | Local (WebWorker) | 🔒 Private — English + Chinese · ~80 MB | 🔊 Speech |
+| **Kokoro TTS (82M)** | Local (WebWorker) | 🔒 Private — 9 Languages · ~80 MB | 🔊 Speech |
 | **Voxtral STT (3B)** | Local (WebGPU) | 🔒 Private — 13 languages · ~2.7 GB | 🎤 Dictation |
 | **Granite Docling (258M)** | Local (WebGPU/WASM) | 🔒 Private — document OCR · ~500 MB | 📄 Document |
 | **Florence-2 (230M)** | Local (WebGPU/WASM) | 🔒 Private — OCR + captioning · ~230 MB | 📷 Vision |
@@ -479,7 +479,7 @@ TextAgent has undergone significant evolution since its inception. What started
 | **2026-03-12** | `f7ca256` | 🎤 **Voxtral STT** — [Voxtral Mini 3B](https://huggingface.co/textagent/Voxtral-Mini-3B-2507-ONNX) as primary speech-to-text engine on WebGPU (~2.7 GB, q4, 13 languages, streaming partial output via `TextStreamer`); Whisper Large V3 Turbo as WASM fallback (~800 MB, q8); `voxtral-worker.js` new WebWorker with `VoxtralForConditionalGeneration` + `VoxtralProcessor`; `speechToText.js` WebGPU detection + dual-worker routing; download consent popup (`showSttConsentPopup`) with model name/size/privacy info before first download; `STT_CONSENTED` localStorage key; model duplicated to `textagent/` HuggingFace org with `onnx-community/` fallback |
 | **2026-03-12** | `0f58296` | 🛡️ **Code Audit Fixes** — sandboxed `jsAdapter` in `exec-sandbox.js` (was raw `eval()` on main thread, now iframe-sandboxed); `mirror-models.sh` model IDs updated to `textagent`, Kokoro v1.0→v1.1-zh, GitLab refs removed; Whisper speech worker forwarded user's language selection instead of hardcoded English; shared `ai-worker-common.js` module extracts `TOKEN_LIMITS` + `buildMessages()` from 3 workers; cloud workers load as ES modules |
 | **2026-03-12** | `591467b` | 🏠 **Model Hosting Migration** — all 7 ONNX models (Qwen 3.5 0.8B/2B/4B, Qwen 3 4B Thinking, Whisper Large V3 Turbo, Kokoro 82M v1.0/v1.1-zh) duplicated to self-owned [`textagent` HuggingFace org](https://huggingface.co/textagent); model IDs updated from `onnx-community/` to `textagent/` across all workers; automatic fallback to `onnx-community/` namespace if textagent models unavailable; GitLab mirror removed from runtime code |
-| **2026-03-12** | `7b9f846` | 🔊 **Kokoro TTS** — hybrid text-to-speech engine: English/Chinese via [Kokoro 82M v1.1-zh ONNX](https://huggingface.co/textagent/Kokoro-82M-v1.1-zh-ONNX) (~80 MB, off-thread WebWorker via `kokoro-js`), Japanese & 10+ languages via Web Speech API fallback; hover preview text → click 🔊 for pronunciation; voice auto-selection by language; `textToSpeech.js` main module + `tts-worker.js` WebWorker + `tts.css` styling; model-hosts.js for configurable hosting with auto-fallback |
+| **2026-03-12** | `7b9f846` | 🔊 **Kokoro TTS** — hybrid text-to-speech engine: 9 languages (English, Japanese, Chinese, Spanish, French, Hindi, Italian, Portuguese) via [Kokoro 82M v1.0 ONNX](https://huggingface.co/textagent/Kokoro-82M-v1.0-ONNX) (~80 MB, off-thread WebWorker via `kokoro-js`), Korean, German & others via Web Speech API fallback; hover preview text → click 🔊 for pronunciation; voice auto-selection by language; `textToSpeech.js` main module + `tts-worker.js` WebWorker + `tts.css` styling; model-hosts.js for configurable hosting with auto-fallback |
 | **2026-03-12** | `7b9f846` | 📷 **OCR Tag** — new `{{@OCR:}}` document tag for image-to-text extraction; amber-accented card with mode pills (Text/Math/Table); 📎 image upload with `@upload:` editor sync; Qwen model default; vision-capable model flags (`supportsVision`) on Qwen 3.5 Flash, 35B-A3B, and DeepSeek V3.2 |
 | **2026-03-12** | `7b9f846`, `1ec8b90` | 🏗️ **Model Architecture** — ai-worker.js refactored for architecture-aware loading (`qwen3` text-only vs `qwen3_5` vision); `setModelId` accepts `architecture` + `dtype` params; automatic fallback to HuggingFace when primary host fails; `moonshine-medium-worker.js` deleted (replaced by unified `speech-worker.js`); Language Learning template with TTS pronunciation tips; SQLite-compatible SQL in Technical template |
 | **2026-03-11** | `7b9f846` | **Run All Notebook Engine** — one-click `▶ Run All` button executes every code/tag block in document order; 11 runtime adapters (bash, math, python, html, js, sql, docgen-ai, docgen-image, docgen-agent, api, linux-script); Block Registry with FNV-1a stable IDs; Execution Controller with fixed-bottom progress bar, per-block status badges (pending/running/done/error), and abort support; SQLite `_exec_results` context store for cross-block data sharing; DocGen/API adapters use auto-accept mode (skip review panel); Linux adapter submits to Judge0 CE; deferred adapter queue for module loading order; `exec-engine.css` styling; 12 new Playwright tests (191 total) |
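
The README rows above describe a hybrid engine: some languages go to the local Kokoro model and the rest fall back to the Web Speech API. The commit message adds that Japanese, Chinese, and Hindi now take the fallback path. A minimal routing sketch under those statements follows. The `pickEngine` helper, the short language codes, the exact members of `KOKORO_LANGS`, and the `zh-CN` entry are assumptions for illustration; only the names `KOKORO_LANGS` and `WEB_SPEECH_LANG_MAP` and the tags `hi-IN` and `ja-JP` appear in this commit.

```javascript
// Hypothetical sketch, not the committed code.
// Languages assumed to stay on the local Kokoro engine (Latin-script only).
const KOKORO_LANGS = new Set(['en', 'es', 'fr', 'it', 'pt']);

// BCP-47 tags passed to the Web Speech API. 'hi-IN' and 'ja-JP' are named in
// the changelog; 'zh-CN' is an assumed entry added for illustration.
const WEB_SPEECH_LANG_MAP = { hi: 'hi-IN', ja: 'ja-JP', zh: 'zh-CN' };

// Decide which engine should speak text in the given short language code.
function pickEngine(lang) {
  if (KOKORO_LANGS.has(lang)) {
    return { engine: 'kokoro', lang };
  }
  // Everything else falls back to the Web Speech API, using a mapped
  // BCP-47 tag when one exists and the raw code otherwise.
  return { engine: 'web-speech', lang: WEB_SPEECH_LANG_MAP[lang] ?? lang };
}
```

With this shape, `pickEngine('ja')` selects the Web Speech API with the tag `ja-JP`, while `pickEngine('en')` stays on Kokoro.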

changelogs/CHANGELOG-tts-ux.md

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+# Changelog: TTS Card UX & Multilingual Routing
+
+**Date:** 2026-03-15
+
+## Summary
+
+Major overhaul of the TTS card user experience: merged Play/Stop into a single toggle button, added a generating state that disables all buttons during synthesis, added detailed timestamped console logs, and fixed a critical bug where non-Latin languages (Japanese, Chinese, Hindi) couldn't be phonemized by Kokoro's espeak-ng WASM. These languages now route to Web Speech API for proper pronunciation.
+
+## Changes
+
+### `js/textToSpeech.js` (+146 lines)
+- **`_isGenerating` state** — tracks whether audio synthesis is in progress
+- **`onGenerateComplete` callback** — one-shot callback for UI to know when generation finishes
+- **`_ttsT()` timestamped logs** — every TTS log now shows elapsed time since page load (`🔊 [TTS +12.3s]`)
+- **Fixed synthesizing status bug** — `loadingPhase: 'synthesizing'` from worker no longer resets `modelReady=false` (was breaking all subsequent TTS)
+- **Moved Japanese, Chinese, Hindi out of `KOKORO_LANGS`** — espeak-ng WASM can't phonemize CJK/Devanagari scripts; routes to Web Speech API instead
+- **`generate()` handles Web Speech API** — polls `speechSynthesis.speaking` for completion, fires callback to re-enable UI buttons
+- **Added `hi-IN`, `ja-JP` to `WEB_SPEECH_LANG_MAP`** — proper BCP-47 codes for Hindi and Japanese
+
+### `js/tts-worker.js` (+108 lines)
+- **Synthesis timing logs** — logs when speak request is received, voice selected, and synthesis duration
+- **`loadingPhase: 'synthesizing'` status** — progress message during audio generation
+
+### `js/ai-docgen.js` (+117 lines)
+- **Play/Stop → single toggle button** (`ai-tts-play-toggle`) — ▷ Play ↔ ■ Stop with auto-reset on playback finish
+- **Run button generating state** — text changes to "⏳ Generating…", all other buttons disabled during synthesis
+- **`onGenerateComplete` integration** — restores UI state when generation completes or errors
+- **Web Speech API toast** — shows "Spoken via Web Speech API" for non-Kokoro languages
+
+### `css/tts.css` (+59 lines)
+- **Play/Stop toggle styles** — purple (Play) ↔ red (Stop) with smooth transitions
+- **Generating state animation** — pulsing amber border + disabled button styles
+- **Dark mode support** — updated selectors for toggle states
+
+### `js/ai-models.js` (minor)
+- Updated Kokoro model description to "9 Languages" and changed model ID from `v1.1-zh-ONNX` to `v1.0-ONNX`
+
+### `js/model-hosts.js` (minor)
+- Updated comment reference from `v1.1-zh-ONNX` to `v1.0-ONNX`
+
+### `scripts/mirror-models.sh` (minor)
+- Updated mirror script for Kokoro v1.0 model ID
+
+### `README.md` (minor)
+- Updated TTS feature description and model table for 9-language Kokoro + Web Speech API hybrid
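
The changelog above mentions three mechanisms whose code is not part of this view: elapsed-time log prefixes, polling `speechSynthesis.speaking` for completion, and a one-shot `onGenerateComplete` callback that restores the UI. The sketch below shows one way these could fit together. It is an illustration under assumptions, not the committed implementation: `formatTtsPrefix`, `waitForSpeechEnd`, `oneShot`, `runWithGeneratingState`, the `ui.setGenerating` hook, and the 100 ms poll interval are all hypothetical names and values.

```javascript
// Hypothetical sketch, not the committed code.

// Format an elapsed-time log prefix in the style "🔊 [TTS +12.3s]".
function formatTtsPrefix(elapsedMs) {
  return `🔊 [TTS +${(elapsedMs / 1000).toFixed(1)}s]`;
}

// performance.now() reports milliseconds since the time origin (roughly page
// load in a browser), so a helper in the spirit of _ttsT() could be:
function ttsTimestamp() {
  return formatTtsPrefix(performance.now());
}

// Resolve once the synthesizer stops speaking. `synth` is any object with a
// boolean `speaking` property, for example window.speechSynthesis.
function waitForSpeechEnd(synth, intervalMs = 100) {
  return new Promise((resolve) => {
    const timer = setInterval(() => {
      if (!synth.speaking) {
        clearInterval(timer);
        resolve();
      }
    }, intervalMs);
  });
}

// Wrap a callback so that it runs at most once.
function oneShot(callback) {
  let fired = false;
  return (...args) => {
    if (fired) return;
    fired = true;
    callback(...args);
  };
}

// Put the UI into the generating state, run generation, and restore the UI
// whether generation succeeds or throws.
async function runWithGeneratingState(ui, generate) {
  const onGenerateComplete = oneShot(() => ui.setGenerating(false));
  ui.setGenerating(true);
  try {
    await generate();
  } finally {
    onGenerateComplete();
  }
}
```

In a browser, a call such as `runWithGeneratingState(ui, () => waitForSpeechEnd(window.speechSynthesis))` would keep the buttons disabled until speech ends. Polling is used here because the changelog describes polling; listening for the utterance's `end` event is an alternative this sketch does not cover.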

css/tts.css

Lines changed: 48 additions & 11 deletions
@@ -112,32 +112,38 @@
     box-shadow: 0 0 0 2px rgba(139, 92, 246, 0.15);
 }
 
-/* Play / Stop buttons */
-.ai-tts-play,
-.ai-tts-stop {
+/* Play/Stop toggle button */
+.ai-tts-play-toggle {
     font-weight: 600;
     font-size: 0.72rem;
     letter-spacing: 0.02em;
-}
-
-.ai-tts-play {
     color: #8b5cf6 !important;
     border-color: rgba(139, 92, 246, 0.3) !important;
+    transition: color 0.2s ease, border-color 0.2s ease, background 0.2s ease;
 }
 
-.ai-tts-play:hover {
+.ai-tts-play-toggle:hover:not(:disabled) {
     background: rgba(139, 92, 246, 0.12) !important;
 }
 
-.ai-tts-stop {
+/* Playing state — red Stop */
+.ai-tts-play-toggle.ai-tts-playing {
     color: #ef4444 !important;
     border-color: rgba(239, 68, 68, 0.3) !important;
 }
 
-.ai-tts-stop:hover {
+.ai-tts-play-toggle.ai-tts-playing:hover:not(:disabled) {
     background: rgba(239, 68, 68, 0.12) !important;
 }
 
+/* Run button */
+.ai-tts-run {
+    font-weight: 600;
+    font-size: 0.72rem;
+    letter-spacing: 0.02em;
+    transition: opacity 0.2s ease;
+}
+
 /* Speaking state — pulse animation */
 .ai-tts-speaking {
     box-shadow: 0 0 0 2px rgba(139, 92, 246, 0.2);
@@ -149,6 +155,33 @@
     50% { box-shadow: 0 0 0 4px rgba(139, 92, 246, 0.25); }
 }
 
+/* Generating state — pulsing border while synthesizing */
+.ai-tts-generating {
+    border-left-color: #f59e0b !important;
+    animation: tts-generating-pulse 1s ease-in-out infinite;
+}
+
+@keyframes tts-generating-pulse {
+    0%, 100% { box-shadow: 0 0 0 2px rgba(245, 158, 11, 0.15); }
+    50% { box-shadow: 0 0 0 4px rgba(245, 158, 11, 0.3); }
+}
+
+/* Disabled buttons during generation */
+.ai-tts-card .ai-placeholder-btn:disabled,
+.ai-tts-card select:disabled {
+    opacity: 0.4;
+    cursor: not-allowed;
+    pointer-events: none;
+}
+
+.ai-tts-card .ai-tts-run:disabled {
+    opacity: 0.7;
+    cursor: wait;
+    pointer-events: none;
+    color: #f59e0b !important;
+    border-color: rgba(245, 158, 11, 0.3) !important;
+}
+
 /* Toolbar TTS button accent */
 .fmt-tts-btn {
     color: #8b5cf6 !important;
@@ -169,14 +202,18 @@
     border-color: rgba(167, 139, 250, 0.3);
 }
 
-[data-theme="dark"] .ai-tts-play {
+[data-theme="dark"] .ai-tts-play-toggle {
     color: #a78bfa !important;
 }
 
-[data-theme="dark"] .ai-tts-stop {
+[data-theme="dark"] .ai-tts-play-toggle.ai-tts-playing {
     color: #f87171 !important;
 }
 
+[data-theme="dark"] .ai-tts-run:disabled {
+    color: #fbbf24 !important;
+}
+
 /* ============================================
    TTS Download Button
    ============================================ */
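
The stylesheet changes above rely on script code adding and removing the `ai-tts-playing` class on the `.ai-tts-play-toggle` button. The `js/ai-docgen.js` diff is not shown in this view, so the following is only a sketch of how such a toggle might be driven. The helper names `setPlayToggle` and `wirePlayToggle` are hypothetical, as is the assumption that playback goes through a standard audio element; the class name and the "▷ Play" / "■ Stop" labels come from the commit.

```javascript
// Hypothetical sketch, not the committed code.

// Switch the toggle button between its Play and Stop appearance.
function setPlayToggle(button, playing) {
  // The ai-tts-playing class selects the red Stop styling in tts.css.
  button.classList.toggle('ai-tts-playing', playing);
  button.textContent = playing ? '■ Stop' : '▷ Play';
}

// Wire the button to an audio element (assumed here to be an
// HTMLAudioElement) so the button resets when playback finishes.
function wirePlayToggle(button, audio) {
  button.addEventListener('click', () => {
    if (audio.paused) {
      audio.play();
      setPlayToggle(button, true);
    } else {
      audio.pause();
      audio.currentTime = 0; // rewind so the next click replays from the start
      setPlayToggle(button, false);
    }
  });
  // Auto-reset to Play when the clip reaches its end.
  audio.addEventListener('ended', () => setPlayToggle(button, false));
}
```

Keeping the visual state in a single class means the stylesheet, including the dark-mode selectors, needs no knowledge of the playback logic.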
