Commit f7ca256
feat: Voxtral Mini 3B as primary STT (WebGPU) + Whisper WASM fallback + download consent popup
- New `voxtral-worker.js`: `VoxtralForConditionalGeneration` + `VoxtralProcessor`, q4 quantized, streaming partial output
- `speech-worker.js` simplified to WASM-only Whisper fallback
- `speechToText.js`: WebGPU detection, dual-worker routing, dynamic engine labels
- Download consent popup: model name, size, device, privacy info before first use
- Model mirrored to `textagent/Voxtral-Mini-3B-2507-ONNX` with `onnx-community` fallback
- `ai-models.js`: `voxtral-stt` entry with `requiresWebGPU` flag
- `storage-keys.js`: `STT_CONSENTED` key
- README updated with Voxtral STT in features, models table, and release notes
1 parent ce6051d commit f7ca256

9 files changed

Lines changed: 511 additions & 45 deletions


README.md

Lines changed: 4 additions & 2 deletions
@@ -27,7 +27,7 @@
 | **Rendering** | GitHub-style Markdown, syntax highlighting (180+ languages), LaTeX math (MathJax), Mermaid diagrams (zoom/pan/export), PlantUML diagrams, callout blocks, footnotes, emoji, anchor links |
 | **🎬 Media Embedding** | Video playback via `![alt](video.mp4)` image syntax (`.mp4`, `.webm`, `.ogg`, `.mov`, `.m4v`); YouTube/Vimeo embeds auto-detected; `embed` code block for responsive media grids (`cols=1-4`, `height=N`); Video.js v10 lazy-loaded with native `<video>` fallback; website URLs render as rich link preview cards with favicon + "Open ↗" button |
 | **🤖 AI Assistant** | 3 local Qwen 3.5 sizes (0.8B / 2B / 4B via WebGPU/WASM), Gemini 3.1 Flash Lite, Groq Llama 3.3 70B, OpenRouter — summarize, expand, rephrase, grammar-fix, explain, simplify, auto-complete; AI writing tags (Polish, Formalize, Elaborate, Shorten, Image); enhanced context menu; per-card model selection; concurrent block generation; inline review with accept/reject/regenerate; AI-powered image generation |
-| **🎤 Voice Dictation** | Dual-engine speech-to-text (Web Speech API + Whisper Large V3 Turbo ONNX) with consensus scoring; WebGPU acceleration (fp16) with WASM fallback; 50+ Markdown-aware voice commands — natural phrases ("heading one", "bold…end bold", "add table", "undo"); auto-punctuation via AI refinement or built-in fallback; hallucination filtering; streaming partial results |
+| **🎤 Voice Dictation** | Dual-engine speech-to-text: **Voxtral Mini 3B** (WebGPU, primary, 13 languages, ~2.7 GB) or **Whisper Large V3 Turbo** (WASM fallback, ~800 MB) with consensus scoring; download consent popup with model info before first use; 50+ Markdown-aware voice commands — natural phrases ("heading one", "bold…end bold", "add table", "undo"); auto-punctuation via AI refinement or built-in fallback; streaming partial results |
 | **🔊 Text-to-Speech** | Hybrid Kokoro TTS engine — English/Chinese via [Kokoro 82M v1.1-zh ONNX](https://huggingface.co/onnx-community/Kokoro-82M-v1.1-zh-ONNX) (~80 MB, off-thread WebWorker), Japanese & 10+ languages via Web Speech API fallback; hover any preview text and click 🔊 to hear pronunciation; voice auto-selection by language |
 | **Import** | MD, DOCX, XLSX/XLS, CSV, HTML, JSON, XML, PDF — drag & drop or click to import |
 | **Export** | Markdown, self-contained styled HTML, PDF (smart page-breaks, shared rendering pipeline), LLM Memory (5 formats: XML, JSON, Compact JSON, Markdown, Plain Text + shareable link) |
@@ -62,6 +62,7 @@ TextAgent includes a built-in AI assistant panel with **three local model sizes*
 | **Llama 3.3 70B** | Groq (free tier) | ☁️ Cloud — ultra-low latency | ⚡ Ultra Fast |
 | **Auto · Best Free** | OpenRouter (free tier) | ☁️ Cloud — multi-model routing | 🧠 Powerful |
 | **Kokoro TTS (82M)** | Local (WebWorker) | 🔒 Private — English + Chinese · ~80 MB | 🔊 Speech |
+| **Voxtral STT (3B)** | Local (WebGPU) | 🔒 Private — 13 languages · ~2.7 GB | 🎤 Dictation |
 
 **AI Actions:** Summarize · Expand · Rephrase · Fix Grammar · Explain · Simplify · Auto-complete · Generate Markdown · Polish · Formalize · Elaborate · Shorten
 
@@ -247,7 +248,7 @@ Import files directly — they're auto-converted to Markdown client-side:
 <details open>
 <summary><strong>🎤 Voice Dictation — Speak Your Markdown</strong></summary>
 
-**Hands-free writing with Markdown awareness.** Dual-engine ASR combines Web Speech API and Whisper Large V3 Turbo (WER ~7.7%) with consensus scoring. WebGPU GPU acceleration with WASM fallback. 50+ voice commands with natural phrases — say "heading one" or "title" for H1, "bold text end bold" for **text**, "add table" for a markdown table, "undo" to take it back. Auto-punctuation adds capitalization and periods, with LLM refinement when a model is loaded.
+**Hands-free writing with Markdown awareness.** Dual-engine ASR pairs the Web Speech API with Voxtral Mini 3B (WebGPU, primary, 13 languages) or Whisper Large V3 Turbo (WASM fallback), combined via consensus scoring. A download consent popup shows model size and privacy info before first use. 50+ voice commands with natural phrases — say "heading one" or "title" for H1, "bold text end bold" for **text**, "add table" for a markdown table, "undo" to take it back. Auto-punctuation adds capitalization and periods, with LLM refinement when a model is loaded.
 
 <img src="public/assets/demos/14_voice_dictation.webp" alt="Voice Dictation — speech-to-text with Markdown-aware commands" width="100%">
 
@@ -455,6 +456,7 @@ TextAgent has undergone significant evolution since its inception. What started
 
 | Date | Commits | Feature / Update |
 |------|---------|-----------------|
+| **2026-03-12** || 🎤 **Voxtral STT** — [Voxtral Mini 3B](https://huggingface.co/textagent/Voxtral-Mini-3B-2507-ONNX) as primary speech-to-text engine on WebGPU (~2.7 GB, q4, 13 languages, streaming partial output via `TextStreamer`); Whisper Large V3 Turbo as WASM fallback (~800 MB, q8); `voxtral-worker.js` new WebWorker with `VoxtralForConditionalGeneration` + `VoxtralProcessor`; `speechToText.js` WebGPU detection + dual-worker routing; download consent popup (`showSttConsentPopup`) with model name/size/privacy info before first download; `STT_CONSENTED` localStorage key; model duplicated to `textagent/` HuggingFace org with `onnx-community/` fallback |
 | **2026-03-12** || 🛡️ **Code Audit Fixes** — sandboxed `jsAdapter` in `exec-sandbox.js` (was raw `eval()` on main thread, now iframe-sandboxed); `mirror-models.sh` model IDs updated to `textagent`, Kokoro v1.0→v1.1-zh, GitLab refs removed; Whisper speech worker forwarded user's language selection instead of hardcoded English; shared `ai-worker-common.js` module extracts `TOKEN_LIMITS` + `buildMessages()` from 3 workers; cloud workers load as ES modules |
 | **2026-03-12** || 🏠 **Model Hosting Migration** — all 7 ONNX models (Qwen 3.5 0.8B/2B/4B, Qwen 3 4B Thinking, Whisper Large V3 Turbo, Kokoro 82M v1.0/v1.1-zh) duplicated to self-owned [`textagent` HuggingFace org](https://huggingface.co/textagent); model IDs updated from `onnx-community/` to `textagent/` across all workers; automatic fallback to `onnx-community/` namespace if textagent models unavailable; GitLab mirror removed from runtime code |
 | **2026-03-12** || 🔊 **Kokoro TTS** — hybrid text-to-speech engine: English/Chinese via [Kokoro 82M v1.1-zh ONNX](https://huggingface.co/textagent/Kokoro-82M-v1.1-zh-ONNX) (~80 MB, off-thread WebWorker via `kokoro-js`), Japanese & 10+ languages via Web Speech API fallback; hover preview text → click 🔊 for pronunciation; voice auto-selection by language; `textToSpeech.js` main module + `tts-worker.js` WebWorker + `tts.css` styling; model-hosts.js for configurable hosting with auto-fallback |
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+# CHANGELOG — Voxtral STT Integration
+
+## Date: 2026-03-12
+
+### Summary
+
+Integrated Voxtral Mini 3B as the primary speech-to-text engine on WebGPU-capable devices, keeping Whisper Large V3 Turbo as the WASM fallback for non-WebGPU browsers. Added a download consent popup that informs users of model size, device, and privacy before initiating any model download.
+
+### Changes
+
+#### New Files
+- **`js/voxtral-worker.js`** — New WebWorker for Voxtral Mini 3B (WebGPU). Uses `VoxtralForConditionalGeneration` + `VoxtralProcessor` from `@huggingface/transformers` with q4 quantization. Supports streaming partial output via `TextStreamer` for real-time interim text. Primary source: `textagent/Voxtral-Mini-3B-2507-ONNX`, fallback: `onnx-community/`.
+
+#### Modified Files
+- **`js/speech-worker.js`** — Simplified to WASM-only Whisper fallback. Removed WebGPU detection logic (now handled by `speechToText.js`). Always uses `device: 'wasm'`, `dtype: 'q8'`.
+- **`js/speechToText.js`** — Added WebGPU detection at module load. Dual-worker routing: spawns `voxtral-worker.js` on WebGPU, `speech-worker.js` on WASM. Added download consent popup (`showSttConsentPopup`) that shows model name, size (~2.7 GB / ~800 MB), device (GPU/CPU), and privacy info before download. Consent remembered via `localStorage`. Dynamic engine labels throughout.
+- **`js/ai-models.js`** — Added `voxtral-stt` model entry for the models card UI with `requiresWebGPU: true`.
+- **`js/storage-keys.js`** — Added `STT_CONSENTED` key for tracking user consent to STT model download.
+- **`css/speech.css`** — Added polished consent popup CSS (glassmorphism overlay, info table, gradient download button, responsive mobile layout).
+- **`scripts/mirror-models.sh`** — Added `textagent/Voxtral-Mini-3B-2507-ONNX` to the self-hosted mirror list.
+
+### HuggingFace
+- Duplicated `onnx-community/Voxtral-Mini-3B-2507-ONNX` → `textagent/Voxtral-Mini-3B-2507-ONNX` on HuggingFace.
+
+### Architecture
+- WebGPU detection → Voxtral (q4 WebGPU, ~2.7 GB) as primary
+- Non-WebGPU → Whisper V3 Turbo (q8 WASM, ~800 MB) as fallback
+- Web Speech API runs immediately on both paths
+- Download consent popup shown on first mic click (remembered in localStorage)
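The routing rules in the Architecture notes can be sketched as a small decision function. This is an illustrative sketch only — the function name and return shape are hypothetical and do not appear in `speechToText.js`; the worker filenames, quantization levels, and sizes come from the changelog above.

```javascript
// Hypothetical sketch of the STT engine routing described in the changelog.
// Not the actual speechToText.js implementation.
function pickSttEngine(hasWebGPU, consented) {
  if (!consented) {
    // First mic click: show the consent popup before any download starts.
    return { engine: null, action: 'show-consent-popup' };
  }
  return hasWebGPU
    ? { engine: 'voxtral', worker: 'voxtral-worker.js', device: 'webgpu', dtype: 'q4', size: '~2.7 GB' }
    : { engine: 'whisper', worker: 'speech-worker.js', device: 'wasm', dtype: 'q8', size: '~800 MB' };
}
```

The Web Speech API path is not gated here because, per the notes above, it runs immediately on both branches and downloads nothing.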

css/speech.css

Lines changed: 165 additions & 0 deletions
@@ -362,4 +362,169 @@
   width: auto;
   max-height: 60vh;
 }
+}
+
+/* --- STT Model Download Consent Popup --- */
+.stt-consent-overlay {
+  position: fixed;
+  inset: 0;
+  background: rgba(0, 0, 0, 0.5);
+  backdrop-filter: blur(4px);
+  z-index: 10000;
+  display: flex;
+  align-items: center;
+  justify-content: center;
+  opacity: 0;
+  transition: opacity 0.25s ease;
+}
+
+.stt-consent-overlay.stt-consent-show {
+  opacity: 1;
+}
+
+.stt-consent-popup {
+  background: var(--bg-color, #fff);
+  border: 1px solid var(--border-color, #ddd);
+  border-radius: 16px;
+  box-shadow: 0 20px 60px rgba(0, 0, 0, 0.25);
+  max-width: 420px;
+  width: 90%;
+  overflow: hidden;
+  transform: scale(0.95);
+  transition: transform 0.25s ease;
+}
+
+.stt-consent-show .stt-consent-popup {
+  transform: scale(1);
+}
+
+[data-theme="dark"] .stt-consent-popup {
+  background: #1a1a2e;
+  border-color: #2d2d4a;
+  box-shadow: 0 20px 60px rgba(0, 0, 0, 0.6);
+}
+
+.stt-consent-header {
+  padding: 16px 20px;
+  font-size: 16px;
+  font-weight: 700;
+  color: var(--text-color, #333);
+  border-bottom: 1px solid var(--border-color, #eee);
+  display: flex;
+  align-items: center;
+  gap: 8px;
+}
+
+.stt-consent-header i {
+  color: #6366f1;
+  font-size: 18px;
+}
+
+[data-theme="dark"] .stt-consent-header {
+  border-bottom-color: #2d2d4a;
+}
+
+.stt-consent-body {
+  padding: 16px 20px;
+}
+
+.stt-consent-body p {
+  margin: 0 0 12px 0;
+  font-size: 13px;
+  color: var(--text-color, #555);
+  line-height: 1.5;
+}
+
+.stt-consent-info {
+  width: 100%;
+  border-collapse: collapse;
+  margin: 12px 0;
+}
+
+.stt-consent-info td {
+  padding: 6px 0;
+  font-size: 13px;
+  color: var(--text-color, #555);
+  border-bottom: 1px solid var(--border-color, #f0f0f0);
+}
+
+.stt-consent-info td:first-child {
+  width: 80px;
+  opacity: 0.6;
+  font-size: 12px;
+}
+
+[data-theme="dark"] .stt-consent-info td {
+  border-bottom-color: #2d2d4a;
+}
+
+.stt-consent-note {
+  font-size: 11px !important;
+  opacity: 0.6;
+  margin-top: 8px !important;
+}
+
+.stt-consent-actions {
+  padding: 12px 20px 16px;
+  display: flex;
+  justify-content: flex-end;
+  gap: 10px;
+}
+
+.stt-consent-cancel {
+  padding: 8px 18px;
+  border-radius: 8px;
+  border: 1px solid var(--border-color, #ddd);
+  background: transparent;
+  color: var(--text-color, #666);
+  font-size: 13px;
+  cursor: pointer;
+  transition: all 0.15s;
+}
+
+.stt-consent-cancel:hover {
+  background: var(--button-hover, #f5f5f5);
+}
+
+[data-theme="dark"] .stt-consent-cancel:hover {
+  background: #2d2d4a;
+}
+
+.stt-consent-download {
+  padding: 8px 20px;
+  border-radius: 8px;
+  border: none;
+  background: linear-gradient(135deg, #6366f1, #8b5cf6);
+  color: #fff;
+  font-size: 13px;
+  font-weight: 600;
+  cursor: pointer;
+  display: flex;
+  align-items: center;
+  gap: 6px;
+  transition: all 0.2s;
+  box-shadow: 0 2px 8px rgba(99, 102, 241, 0.3);
+}
+
+.stt-consent-download:hover {
+  background: linear-gradient(135deg, #4f46e5, #7c3aed);
+  box-shadow: 0 4px 12px rgba(99, 102, 241, 0.4);
+  transform: translateY(-1px);
+}
+
+@media (max-width: 480px) {
+  .stt-consent-popup {
+    width: 95%;
+    border-radius: 12px;
+  }
+
+  .stt-consent-actions {
+    flex-direction: column-reverse;
+  }
+
+  .stt-consent-download,
+  .stt-consent-cancel {
+    width: 100%;
+    justify-content: center;
+  }
 }
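The stylesheet implies a specific DOM shape for the popup. Here is a hypothetical markup builder showing that structure — the class names and the `bi bi-mic-fill` icon come from this commit's CSS and `ai-models.js`, but the builder function, its labels, and the privacy wording are illustrative, not code from the repo.

```javascript
// Hypothetical DOM structure targeted by the consent-popup CSS above.
// Class names are real; everything else is illustrative.
function buildConsentPopupHtml({ name, size, device }) {
  return [
    '<div class="stt-consent-overlay">',
    '  <div class="stt-consent-popup">',
    '    <div class="stt-consent-header"><i class="bi bi-mic-fill"></i>Download speech model?</div>',
    '    <div class="stt-consent-body">',
    `      <table class="stt-consent-info"><tr><td>Model</td><td>${name}</td></tr>` +
      `<tr><td>Size</td><td>${size}</td></tr><tr><td>Device</td><td>${device}</td></tr></table>`,
    '      <p class="stt-consent-note">Runs fully on-device after download (illustrative wording).</p>',
    '    </div>',
    '    <div class="stt-consent-actions">',
    '      <button class="stt-consent-cancel">Cancel</button>',
    '      <button class="stt-consent-download">Download</button>',
    '    </div>',
    '  </div>',
    '</div>',
  ].join('\n');
}
```

Toggling the `stt-consent-show` class on the overlay drives the fade/scale transition defined above.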

js/ai-models.js

Lines changed: 15 additions & 0 deletions
@@ -280,5 +280,20 @@
     downloadSize: '~80 MB',
   },
 
+  // ── Local: Voxtral Mini 3B STT (Speech-to-Text, WebGPU) ──
+  'voxtral-stt': {
+    label: 'Voxtral STT · Local',
+    badge: 'Voxtral STT · Local',
+    icon: 'bi bi-mic-fill',
+    statusReady: 'Voxtral Mini 3B STT · Local',
+    dropdownName: 'Voxtral STT (3B)',
+    dropdownDesc: 'Local · WebGPU · 13 Languages · ~2.7 GB',
+    isLocal: true,
+    isSttModel: true,
+    localModelId: 'textagent/Voxtral-Mini-3B-2507-ONNX',
+    downloadSize: '~2.7 GB',
+    requiresWebGPU: true,
+  },
+
 };
 })();
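A `requiresWebGPU: true` entry only makes sense if the UI filters it out on non-WebGPU devices. The sketch below shows one way such gating could work — the `availableModels` helper and the two-entry model map are hypothetical, not the actual `ai-models.js` code; only the field names (`dropdownName`, `isLocal`, `requiresWebGPU`) mirror the entry added in this commit.

```javascript
// Illustrative capability gating for model entries carrying requiresWebGPU.
const MODELS = {
  'kokoro-tts':  { dropdownName: 'Kokoro TTS (82M)', isLocal: true },
  'voxtral-stt': { dropdownName: 'Voxtral STT (3B)', isLocal: true, requiresWebGPU: true },
};

// Return the model ids that can run on the current device.
function availableModels(models, hasWebGPU) {
  return Object.entries(models)
    .filter(([, m]) => !m.requiresWebGPU || hasWebGPU)
    .map(([id]) => id);
}
```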

js/speech-worker.js

Lines changed: 9 additions & 22 deletions
@@ -1,8 +1,9 @@
 // ============================================
-// whisper-worker.js — Whisper Large V3 Turbo ASR WebWorker
+// speech-worker.js — Whisper Large V3 Turbo ASR WebWorker (WASM fallback)
+// Used when WebGPU is NOT available. WebGPU devices use voxtral-worker.js.
 // Runs textagent/whisper-large-v3-turbo via @huggingface/transformers
 // off the main thread for jank-free transcription.
-// WER ~7.7% (batched) — significant upgrade over Moonshine Base (~9.66%)
+// WER ~7.7% (batched)
 // ============================================
 import { pipeline, env } from '@huggingface/transformers';
 
@@ -18,26 +19,11 @@ self.addEventListener('message', async (e) => {
 
   if (type === 'init') {
     try {
-      // Detect best available device
-      let device = 'wasm';
-      let dtype = 'q8';
-      if (typeof navigator !== 'undefined' && navigator.gpu) {
-        try {
-          const adapter = await navigator.gpu.requestAdapter();
-          if (adapter) {
-            device = 'webgpu';
-            dtype = 'fp16'; // WebGPU works best with fp16
-            self.postMessage({ type: 'status', status: 'loading', message: '🚀 WebGPU available — using GPU acceleration' });
-          }
-        } catch (_) { /* fall through to WASM */ }
-      }
-      if (device === 'wasm') {
-        self.postMessage({ type: 'status', status: 'loading', message: '⏳ Downloading Whisper Large V3 Turbo…' });
-      }
+      self.postMessage({ type: 'status', status: 'loading', message: '⏳ Downloading Whisper Large V3 Turbo (WASM)…' });
 
       const pipelineOpts = {
-        dtype,
-        device,
+        dtype: 'q8',
+        device: 'wasm',
         progress_callback: (progress) => {
           if (progress.status === 'progress') {
             self.postMessage({
@@ -75,8 +61,9 @@ self.addEventListener('message', async (e) => {
       self.postMessage({
         type: 'status',
         status: 'ready',
-        message: 'Model ready',
-        device: device === 'webgpu' ? 'GPU (WebGPU)' : 'CPU (WASM)',
+        message: 'Whisper ready',
+        device: 'CPU (WASM)',
+        model: 'Whisper V3 Turbo',
       });
     } catch (err) {
       self.postMessage({ type: 'error', message: err.message || String(err) });
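The worker above reports progress through `status` and `error` messages. A main-thread consumer could fold those into UI state with a small reducer like this — the reducer itself is a hypothetical sketch (not code from `speechToText.js`), but the message fields (`type`, `status`, `message`, `device`) match the `postMessage` calls in the diff.

```javascript
// Illustrative reducer for the worker's status/error messages shown above.
function reduceWorkerMessage(state, msg) {
  switch (msg.type) {
    case 'status':
      // e.g. { type: 'status', status: 'ready', message: 'Whisper ready', device: 'CPU (WASM)' }
      return { ...state, status: msg.status, label: msg.message, device: msg.device ?? state.device };
    case 'error':
      return { ...state, status: 'error', label: msg.message };
    default:
      return state;
  }
}
```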
