Commit f7ca256
feat: Voxtral Mini 3B as primary STT (WebGPU) + Whisper WASM fallback + download consent popup
- New `voxtral-worker.js`: `VoxtralForConditionalGeneration` + `VoxtralProcessor`, q4 quantized, streaming partial output
- `speech-worker.js` simplified to WASM-only Whisper fallback
- `speechToText.js`: WebGPU detection, dual-worker routing, dynamic engine labels
- Download consent popup: model name, size, device, privacy info before first use
- Model mirrored to `textagent/Voxtral-Mini-3B-2507-ONNX` with `onnx-community` fallback
- `ai-models.js`: `voxtral-stt` entry with `requiresWebGPU` flag
- `storage-keys.js`: `STT_CONSENTED` key
- README updated with Voxtral STT in features, models table, and release notes
1 parent ce6051d commit f7ca256

9 files changed

Lines changed: 511 additions & 45 deletions


README.md

Lines changed: 4 additions & 2 deletions
@@ -27,7 +27,7 @@
 | **Rendering** | GitHub-style Markdown, syntax highlighting (180+ languages), LaTeX math (MathJax), Mermaid diagrams (zoom/pan/export), PlantUML diagrams, callout blocks, footnotes, emoji, anchor links |
 | **🎬 Media Embedding** | Video playback via `![alt](video.mp4)` image syntax (`.mp4`, `.webm`, `.ogg`, `.mov`, `.m4v`); YouTube/Vimeo embeds auto-detected; `embed` code block for responsive media grids (`cols=1-4`, `height=N`); Video.js v10 lazy-loaded with native `<video>` fallback; website URLs render as rich link preview cards with favicon + "Open ↗" button |
 | **🤖 AI Assistant** | 3 local Qwen 3.5 sizes (0.8B / 2B / 4B via WebGPU/WASM), Gemini 3.1 Flash Lite, Groq Llama 3.3 70B, OpenRouter — summarize, expand, rephrase, grammar-fix, explain, simplify, auto-complete; AI writing tags (Polish, Formalize, Elaborate, Shorten, Image); enhanced context menu; per-card model selection; concurrent block generation; inline review with accept/reject/regenerate; AI-powered image generation |
-| **🎤 Voice Dictation** | Dual-engine speech-to-text (Web Speech API + Whisper Large V3 Turbo ONNX) with consensus scoring; WebGPU acceleration (fp16) with WASM fallback; 50+ Markdown-aware voice commands — natural phrases ("heading one", "bold…end bold", "add table", "undo"); auto-punctuation via AI refinement or built-in fallback; hallucination filtering; streaming partial results |
+| **🎤 Voice Dictation** | Dual-engine speech-to-text: **Voxtral Mini 3B** (WebGPU, primary, 13 languages, ~2.7 GB) or **Whisper Large V3 Turbo** (WASM fallback, ~800 MB) with consensus scoring; download consent popup with model info before first use; 50+ Markdown-aware voice commands — natural phrases ("heading one", "bold…end bold", "add table", "undo"); auto-punctuation via AI refinement or built-in fallback; streaming partial results |
 | **🔊 Text-to-Speech** | Hybrid Kokoro TTS engine — English/Chinese via [Kokoro 82M v1.1-zh ONNX](https://huggingface.co/onnx-community/Kokoro-82M-v1.1-zh-ONNX) (~80 MB, off-thread WebWorker), Japanese & 10+ languages via Web Speech API fallback; hover any preview text and click 🔊 to hear pronunciation; voice auto-selection by language |
 | **Import** | MD, DOCX, XLSX/XLS, CSV, HTML, JSON, XML, PDF — drag & drop or click to import |
 | **Export** | Markdown, self-contained styled HTML, PDF (smart page-breaks, shared rendering pipeline), LLM Memory (5 formats: XML, JSON, Compact JSON, Markdown, Plain Text + shareable link) |
@@ -62,6 +62,7 @@ TextAgent includes a built-in AI assistant panel with **three local model sizes*
 | **Llama 3.3 70B** | Groq (free tier) | ☁️ Cloud — ultra-low latency | ⚡ Ultra Fast |
 | **Auto · Best Free** | OpenRouter (free tier) | ☁️ Cloud — multi-model routing | 🧠 Powerful |
 | **Kokoro TTS (82M)** | Local (WebWorker) | 🔒 Private — English + Chinese · ~80 MB | 🔊 Speech |
+| **Voxtral STT (3B)** | Local (WebGPU) | 🔒 Private — 13 languages · ~2.7 GB | 🎤 Dictation |
 
 **AI Actions:** Summarize · Expand · Rephrase · Fix Grammar · Explain · Simplify · Auto-complete · Generate Markdown · Polish · Formalize · Elaborate · Shorten
 
@@ -247,7 +248,7 @@ Import files directly — they're auto-converted to Markdown client-side:
 <details open>
 <summary><strong>🎤 Voice Dictation — Speak Your Markdown</strong></summary>
 
-**Hands-free writing with Markdown awareness.** Dual-engine ASR combines Web Speech API and Whisper Large V3 Turbo (WER ~7.7%) with consensus scoring. WebGPU GPU acceleration with WASM fallback. 50+ voice commands with natural phrases — say "heading one" or "title" for H1, "bold text end bold" for **text**, "add table" for a markdown table, "undo" to take it back. Auto-punctuation adds capitalization and periods, with LLM refinement when a model is loaded.
+**Hands-free writing with Markdown awareness.** Dual-engine ASR pairs the Web Speech API with Voxtral Mini 3B (WebGPU, primary, 13 languages) or Whisper Large V3 Turbo (WASM fallback), combined via consensus scoring. A download consent popup shows model size and privacy info before first use. 50+ voice commands with natural phrases — say "heading one" or "title" for H1, "bold text end bold" for **text**, "add table" for a markdown table, "undo" to take it back. Auto-punctuation adds capitalization and periods, with LLM refinement when a model is loaded.
 
 <img src="public/assets/demos/14_voice_dictation.webp" alt="Voice Dictation — speech-to-text with Markdown-aware commands" width="100%">
 
@@ -455,6 +456,7 @@ TextAgent has undergone significant evolution since its inception. What started
 
 | Date | Commits | Feature / Update |
 |------|---------|-----------------|
+| **2026-03-12** || 🎤 **Voxtral STT** — [Voxtral Mini 3B](https://huggingface.co/textagent/Voxtral-Mini-3B-2507-ONNX) as primary speech-to-text engine on WebGPU (~2.7 GB, q4, 13 languages, streaming partial output via `TextStreamer`); Whisper Large V3 Turbo as WASM fallback (~800 MB, q8); `voxtral-worker.js` new WebWorker with `VoxtralForConditionalGeneration` + `VoxtralProcessor`; `speechToText.js` WebGPU detection + dual-worker routing; download consent popup (`showSttConsentPopup`) with model name/size/privacy info before first download; `STT_CONSENTED` localStorage key; model duplicated to `textagent/` HuggingFace org with `onnx-community/` fallback |
 | **2026-03-12** || 🛡️ **Code Audit Fixes** — sandboxed `jsAdapter` in `exec-sandbox.js` (was raw `eval()` on main thread, now iframe-sandboxed); `mirror-models.sh` model IDs updated to `textagent`, Kokoro v1.0→v1.1-zh, GitLab refs removed; Whisper speech worker forwarded user's language selection instead of hardcoded English; shared `ai-worker-common.js` module extracts `TOKEN_LIMITS` + `buildMessages()` from 3 workers; cloud workers load as ES modules |
 | **2026-03-12** || 🏠 **Model Hosting Migration** — all 7 ONNX models (Qwen 3.5 0.8B/2B/4B, Qwen 3 4B Thinking, Whisper Large V3 Turbo, Kokoro 82M v1.0/v1.1-zh) duplicated to self-owned [`textagent` HuggingFace org](https://huggingface.co/textagent); model IDs updated from `onnx-community/` to `textagent/` across all workers; automatic fallback to `onnx-community/` namespace if textagent models unavailable; GitLab mirror removed from runtime code |
 | **2026-03-12** || 🔊 **Kokoro TTS** — hybrid text-to-speech engine: English/Chinese via [Kokoro 82M v1.1-zh ONNX](https://huggingface.co/textagent/Kokoro-82M-v1.1-zh-ONNX) (~80 MB, off-thread WebWorker via `kokoro-js`), Japanese & 10+ languages via Web Speech API fallback; hover preview text → click 🔊 for pronunciation; voice auto-selection by language; `textToSpeech.js` main module + `tts-worker.js` WebWorker + `tts.css` styling; model-hosts.js for configurable hosting with auto-fallback |
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+# CHANGELOG — Voxtral STT Integration
+
+## Date: 2026-03-12
+
+### Summary
+
+Integrated Voxtral Mini 3B as the primary speech-to-text engine on WebGPU-capable devices, keeping Whisper Large V3 Turbo as the WASM fallback for non-WebGPU browsers. Added a download consent popup that informs users of model size, device, and privacy before initiating any model download.
+
+### Changes
+
+#### New Files
+- **`js/voxtral-worker.js`** — New WebWorker for Voxtral Mini 3B (WebGPU). Uses `VoxtralForConditionalGeneration` + `VoxtralProcessor` from `@huggingface/transformers` with q4 quantization. Supports streaming partial output via `TextStreamer` for real-time interim text. Primary source: `textagent/Voxtral-Mini-3B-2507-ONNX`, fallback: `onnx-community/`.
+
+#### Modified Files
+- **`js/speech-worker.js`** — Simplified to WASM-only Whisper fallback. Removed WebGPU detection logic (now handled by `speechToText.js`). Always uses `device: 'wasm'`, `dtype: 'q8'`.
+- **`js/speechToText.js`** — Added WebGPU detection at module load. Dual-worker routing: spawns `voxtral-worker.js` on WebGPU, `speech-worker.js` on WASM. Added download consent popup (`showSttConsentPopup`) that shows model name, size (~2.7 GB / ~800 MB), device (GPU/CPU), and privacy info before download. Consent remembered via `localStorage`. Dynamic engine labels throughout.
+- **`js/ai-models.js`** — Added `voxtral-stt` model entry for the models card UI with `requiresWebGPU: true`.
+- **`js/storage-keys.js`** — Added `STT_CONSENTED` key for tracking user consent to STT model download.
+- **`css/speech.css`** — Added polished consent popup CSS (glassmorphism overlay, info table, gradient download button, responsive mobile layout).
+- **`scripts/mirror-models.sh`** — Added `textagent/Voxtral-Mini-3B-2507-ONNX` to the self-hosted mirror list.
+
+### HuggingFace
+- Duplicated `onnx-community/Voxtral-Mini-3B-2507-ONNX` → `textagent/Voxtral-Mini-3B-2507-ONNX` on HuggingFace.
+
+### Architecture
+- WebGPU detection → Voxtral (q4 WebGPU, ~2.7 GB) as primary
+- Non-WebGPU → Whisper V3 Turbo (q8 WASM, ~800 MB) as fallback
+- Web Speech API runs immediately on both paths
+- Download consent popup shown on first mic click (remembered in localStorage)
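The routing rules in the Architecture notes can be sketched as a small decision function. This is an illustrative sketch only — the function name and return shape are hypothetical and do not appear in `speechToText.js`; the worker filenames, quantization levels, and sizes come from the changelog above.

```javascript
// Hypothetical sketch of the STT engine routing described in the changelog.
// Not the actual speechToText.js implementation.
function pickSttEngine(hasWebGPU, consented) {
  if (!consented) {
    // First mic click: show the consent popup before any download starts.
    return { engine: null, action: 'show-consent-popup' };
  }
  return hasWebGPU
    ? { engine: 'voxtral', worker: 'voxtral-worker.js', device: 'webgpu', dtype: 'q4', size: '~2.7 GB' }
    : { engine: 'whisper', worker: 'speech-worker.js', device: 'wasm', dtype: 'q8', size: '~800 MB' };
}
```

The Web Speech API path is not gated here because, per the notes above, it runs immediately on both branches and downloads nothing.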

css/speech.css

Lines changed: 165 additions & 0 deletions
@@ -362,4 +362,169 @@
   width: auto;
   max-height: 60vh;
 }
+}
+
+/* --- STT Model Download Consent Popup --- */
+.stt-consent-overlay {
+  position: fixed;
+  inset: 0;
+  background: rgba(0, 0, 0, 0.5);
+  backdrop-filter: blur(4px);
+  z-index: 10000;
+  display: flex;
+  align-items: center;
+  justify-content: center;
+  opacity: 0;
+  transition: opacity 0.25s ease;
+}
+
+.stt-consent-overlay.stt-consent-show {
+  opacity: 1;
+}
+
+.stt-consent-popup {
+  background: var(--bg-color, #fff);
+  border: 1px solid var(--border-color, #ddd);
+  border-radius: 16px;
+  box-shadow: 0 20px 60px rgba(0, 0, 0, 0.25);
+  max-width: 420px;
+  width: 90%;
+  overflow: hidden;
+  transform: scale(0.95);
+  transition: transform 0.25s ease;
+}
+
+.stt-consent-show .stt-consent-popup {
+  transform: scale(1);
+}
+
+[data-theme="dark"] .stt-consent-popup {
+  background: #1a1a2e;
+  border-color: #2d2d4a;
+  box-shadow: 0 20px 60px rgba(0, 0, 0, 0.6);
+}
+
+.stt-consent-header {
+  padding: 16px 20px;
+  font-size: 16px;
+  font-weight: 700;
+  color: var(--text-color, #333);
+  border-bottom: 1px solid var(--border-color, #eee);
+  display: flex;
+  align-items: center;
+  gap: 8px;
+}
+
+.stt-consent-header i {
+  color: #6366f1;
+  font-size: 18px;
+}
+
+[data-theme="dark"] .stt-consent-header {
+  border-bottom-color: #2d2d4a;
+}
+
+.stt-consent-body {
+  padding: 16px 20px;
+}
+
+.stt-consent-body p {
+  margin: 0 0 12px 0;
+  font-size: 13px;
+  color: var(--text-color, #555);
+  line-height: 1.5;
+}
+
+.stt-consent-info {
+  width: 100%;
+  border-collapse: collapse;
+  margin: 12px 0;
+}
+
+.stt-consent-info td {
+  padding: 6px 0;
+  font-size: 13px;
+  color: var(--text-color, #555);
+  border-bottom: 1px solid var(--border-color, #f0f0f0);
+}
+
+.stt-consent-info td:first-child {
+  width: 80px;
+  opacity: 0.6;
+  font-size: 12px;
+}
+
+[data-theme="dark"] .stt-consent-info td {
+  border-bottom-color: #2d2d4a;
+}
+
+.stt-consent-note {
+  font-size: 11px !important;
+  opacity: 0.6;
+  margin-top: 8px !important;
+}
+
+.stt-consent-actions {
+  padding: 12px 20px 16px;
+  display: flex;
+  justify-content: flex-end;
+  gap: 10px;
+}
+
+.stt-consent-cancel {
+  padding: 8px 18px;
+  border-radius: 8px;
+  border: 1px solid var(--border-color, #ddd);
+  background: transparent;
+  color: var(--text-color, #666);
+  font-size: 13px;
+  cursor: pointer;
+  transition: all 0.15s;
+}
+
+.stt-consent-cancel:hover {
+  background: var(--button-hover, #f5f5f5);
+}
+
+[data-theme="dark"] .stt-consent-cancel:hover {
+  background: #2d2d4a;
+}
+
+.stt-consent-download {
+  padding: 8px 20px;
+  border-radius: 8px;
+  border: none;
+  background: linear-gradient(135deg, #6366f1, #8b5cf6);
+  color: #fff;
+  font-size: 13px;
+  font-weight: 600;
+  cursor: pointer;
+  display: flex;
+  align-items: center;
+  gap: 6px;
+  transition: all 0.2s;
+  box-shadow: 0 2px 8px rgba(99, 102, 241, 0.3);
+}
+
+.stt-consent-download:hover {
+  background: linear-gradient(135deg, #4f46e5, #7c3aed);
+  box-shadow: 0 4px 12px rgba(99, 102, 241, 0.4);
+  transform: translateY(-1px);
+}
+
+@media (max-width: 480px) {
+  .stt-consent-popup {
+    width: 95%;
+    border-radius: 12px;
+  }
+
+  .stt-consent-actions {
+    flex-direction: column-reverse;
+  }
+
+  .stt-consent-download,
+  .stt-consent-cancel {
+    width: 100%;
+    justify-content: center;
+  }
 }
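The stylesheet implies a specific DOM shape for the popup. Here is a hypothetical markup builder showing that structure — the class names and the `bi bi-mic-fill` icon come from this commit's CSS and `ai-models.js`, but the builder function, its labels, and the privacy wording are illustrative, not code from the repo.

```javascript
// Hypothetical DOM structure targeted by the consent-popup CSS above.
// Class names are real; everything else is illustrative.
function buildConsentPopupHtml({ name, size, device }) {
  return [
    '<div class="stt-consent-overlay">',
    '  <div class="stt-consent-popup">',
    '    <div class="stt-consent-header"><i class="bi bi-mic-fill"></i>Download speech model?</div>',
    '    <div class="stt-consent-body">',
    `      <table class="stt-consent-info"><tr><td>Model</td><td>${name}</td></tr>` +
      `<tr><td>Size</td><td>${size}</td></tr><tr><td>Device</td><td>${device}</td></tr></table>`,
    '      <p class="stt-consent-note">Runs fully on-device after download (illustrative wording).</p>',
    '    </div>',
    '    <div class="stt-consent-actions">',
    '      <button class="stt-consent-cancel">Cancel</button>',
    '      <button class="stt-consent-download">Download</button>',
    '    </div>',
    '  </div>',
    '</div>',
  ].join('\n');
}
```

Toggling the `stt-consent-show` class on the overlay drives the fade/scale transition defined above.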

js/ai-models.js

Lines changed: 15 additions & 0 deletions
@@ -280,5 +280,20 @@
     downloadSize: '~80 MB',
   },
 
+  // ── Local: Voxtral Mini 3B STT (Speech-to-Text, WebGPU) ──
+  'voxtral-stt': {
+    label: 'Voxtral STT · Local',
+    badge: 'Voxtral STT · Local',
+    icon: 'bi bi-mic-fill',
+    statusReady: 'Voxtral Mini 3B STT · Local',
+    dropdownName: 'Voxtral STT (3B)',
+    dropdownDesc: 'Local · WebGPU · 13 Languages · ~2.7 GB',
+    isLocal: true,
+    isSttModel: true,
+    localModelId: 'textagent/Voxtral-Mini-3B-2507-ONNX',
+    downloadSize: '~2.7 GB',
+    requiresWebGPU: true,
+  },
+
 };
 })();
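A `requiresWebGPU: true` entry only makes sense if the UI filters it out on non-WebGPU devices. The sketch below shows one way such gating could work — the `availableModels` helper and the two-entry model map are hypothetical, not the actual `ai-models.js` code; only the field names (`dropdownName`, `isLocal`, `requiresWebGPU`) mirror the entry added in this commit.

```javascript
// Illustrative capability gating for model entries carrying requiresWebGPU.
const MODELS = {
  'kokoro-tts':  { dropdownName: 'Kokoro TTS (82M)', isLocal: true },
  'voxtral-stt': { dropdownName: 'Voxtral STT (3B)', isLocal: true, requiresWebGPU: true },
};

// Return the model ids that can run on the current device.
function availableModels(models, hasWebGPU) {
  return Object.entries(models)
    .filter(([, m]) => !m.requiresWebGPU || hasWebGPU)
    .map(([id]) => id);
}
```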

js/speech-worker.js

Lines changed: 9 additions & 22 deletions
@@ -1,8 +1,9 @@
 // ============================================
-// whisper-worker.js — Whisper Large V3 Turbo ASR WebWorker
+// speech-worker.js — Whisper Large V3 Turbo ASR WebWorker (WASM fallback)
+// Used when WebGPU is NOT available. WebGPU devices use voxtral-worker.js.
 // Runs textagent/whisper-large-v3-turbo via @huggingface/transformers
 // off the main thread for jank-free transcription.
-// WER ~7.7% (batched) — significant upgrade over Moonshine Base (~9.66%)
+// WER ~7.7% (batched)
 // ============================================
 import { pipeline, env } from '@huggingface/transformers';
 
@@ -18,26 +19,11 @@ self.addEventListener('message', async (e) => {
 
   if (type === 'init') {
     try {
-      // Detect best available device
-      let device = 'wasm';
-      let dtype = 'q8';
-      if (typeof navigator !== 'undefined' && navigator.gpu) {
-        try {
-          const adapter = await navigator.gpu.requestAdapter();
-          if (adapter) {
-            device = 'webgpu';
-            dtype = 'fp16'; // WebGPU works best with fp16
-            self.postMessage({ type: 'status', status: 'loading', message: '🚀 WebGPU available — using GPU acceleration' });
-          }
-        } catch (_) { /* fall through to WASM */ }
-      }
-      if (device === 'wasm') {
-        self.postMessage({ type: 'status', status: 'loading', message: '⏳ Downloading Whisper Large V3 Turbo…' });
-      }
+      self.postMessage({ type: 'status', status: 'loading', message: '⏳ Downloading Whisper Large V3 Turbo (WASM)…' });
 
       const pipelineOpts = {
-        dtype,
-        device,
+        dtype: 'q8',
+        device: 'wasm',
         progress_callback: (progress) => {
           if (progress.status === 'progress') {
             self.postMessage({
@@ -75,8 +61,9 @@ self.addEventListener('message', async (e) => {
       self.postMessage({
         type: 'status',
         status: 'ready',
-        message: 'Model ready',
-        device: device === 'webgpu' ? 'GPU (WebGPU)' : 'CPU (WASM)',
+        message: 'Whisper ready',
+        device: 'CPU (WASM)',
+        model: 'Whisper V3 Turbo',
       });
     } catch (err) {
       self.postMessage({ type: 'error', message: err.message || String(err) });
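The worker above reports progress through `status` and `error` messages. A main-thread consumer could fold those into UI state with a small reducer like this — the reducer itself is a hypothetical sketch (not code from `speechToText.js`), but the message fields (`type`, `status`, `message`, `device`) match the `postMessage` calls in the diff.

```javascript
// Illustrative reducer for the worker's status/error messages shown above.
function reduceWorkerMessage(state, msg) {
  switch (msg.type) {
    case 'status':
      // e.g. { type: 'status', status: 'ready', message: 'Whisper ready', device: 'CPU (WASM)' }
      return { ...state, status: msg.status, label: msg.message, device: msg.device ?? state.device };
    case 'error':
      return { ...state, status: 'error', label: msg.message };
    default:
      return state;
  }
}
```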
