diff --git a/content/docs/hark/architecture.mdx b/content/docs/hark/architecture.mdx
index 81c8b3a..898da73 100644
--- a/content/docs/hark/architecture.mdx
+++ b/content/docs/hark/architecture.mdx
@@ -59,9 +59,11 @@ Main (Dart) -> HarkMainApi (Kotlin) -> Overlay relay -> HarkOverlayFlu

 ## Wake word detection

-Hark supports hands-free activation via the wake phrase "Hey Hark". Wake word detection runs on-device using [openWakeWord](https://github.com/dscripka/openWakeWord) (Apache 2.0) with ONNX Runtime for inference. A custom-trained model (201KB) listens continuously for the trigger phrase and auto-starts the microphone when detected. The wake word engine is integrated through Pigeon bindings in `hark_platform`, consistent with the rest of the native bridge.
+Hark supports hands-free activation via the wake phrase "Hey Hark". Wake word detection runs on-device using [openWakeWord](https://github.com/dscripka/openWakeWord) (Apache 2.0) with ONNX Runtime for inference. A custom-trained 201KB model listens continuously for the trigger phrase. Detection runs inside a dedicated foreground `WakeWordService` (`FOREGROUND_SERVICE_TYPE_MICROPHONE`) with a persistent notification, so it survives app swipe-from-recents and OEM background kill behavior. The service is controlled through Pigeon bindings in `hark_platform`, consistent with the rest of the native bridge.

-AudioRecord is a shared resource on Android, so the wake word listener and STT cannot run simultaneously. When the wake phrase is detected, the wake word engine stops its audio stream before STT begins listening. Once the voice command completes, the wake word engine restarts. This mutual exclusion is handled automatically in the platform plugin. The wake word engine persists even when the app is backgrounded, so detection continues across app switches. Background service support (Phase 2) will allow detection when the app is fully closed.
+On detection, the service pauses the detector (releasing the mic for STT), then calls `HarkApplication.onWakeWordDetected()`, which invokes `HarkVoiceInteractionService.showSession()`. This is the system-sanctioned path for voice assistants to open their UI from the background — Android 12+'s normal background activity launch restrictions don't apply. That triggers `HarkSession.onShow()`, which starts `OverlayActivity`, and the overlay's `onOverlayOpened` callback auto-starts the mic. Say "Hey Hark" from any screen — including the lock screen — and the assistant panel appears with the mic already listening.
+
+AudioRecord is a shared resource on Android, so the wake word listener and STT cannot run simultaneously. After STT finishes, the Dart side sends a resume intent to the service, which restarts the detector. The notification has a **Stop** action that releases the mic without relaunching the app.

 ## Two-stage resolution

diff --git a/content/docs/hark/wake-word.mdx b/content/docs/hark/wake-word.mdx
index b138aff..b8687b0 100644
--- a/content/docs/hark/wake-word.mdx
+++ b/content/docs/hark/wake-word.mdx
@@ -1,58 +1,83 @@
 ---
 title: Wake Word
-last_edited: '2026-04-11T00:00:00.000Z'
+last_edited: '2026-04-12T00:00:00.000Z'
 tocIsHidden: false
 ---

-Hark supports "Hey Hark" wake word detection using openWakeWord, an open-source library (Apache 2.0) running entirely on-device. Say "Hey Hark" and the microphone activates automatically.
+Hark supports "Hey Hark" wake word detection using openWakeWord, an open-source library (Apache 2.0) running entirely on-device. Say "Hey Hark" from any screen and the assistant panel appears with the mic already listening — no need to open the app first.

 The name "Hark" means "to listen." It's also my name. The coincidence was too good to ignore.

 ## How it works

-The pipeline: AudioRecord (16kHz mono) feeds into Silero VAD (voice activity gate), then into openWakeWord (mel-spectrogram, speech embeddings, keyword detection), across the Pigeon bridge to ChatNotifier, which auto-starts the mic.
+A foreground Service (`WakeWordService`) owns the detection pipeline. AudioRecord (16kHz mono) feeds into Silero VAD (voice activity gate), then into openWakeWord (mel-spectrogram, speech embeddings, keyword detection). On detection, the service pauses the detector (releasing the mic for STT), then calls `VoiceInteractionService.showSession()` — the system-sanctioned path that lets voice assistants open their UI from the background without tripping Android 12+'s background activity launch restrictions. That triggers `HarkSession.onShow()`, which launches `OverlayActivity`, and the overlay's open callback auto-starts the mic.

 Components:

 - **openWakeWord**: ONNX-based wake word engine. Runs a mel-spectrogram model and an embedding model to detect the keyword.
 - **Silero VAD**: Voice Activity Detection pre-filter. Saves battery by only running wake word inference when speech is detected.
 - **Custom model**: `hey_harkh.onnx` (201 KB), trained via the openWakeWord Google Colab notebook.
+- **Foreground Service**: `WakeWordService` with `FOREGROUND_SERVICE_TYPE_MICROPHONE`, persistent notification, and `START_STICKY` so Android restarts it if the process is killed.
+- **VoiceInteractionService**: `HarkVoiceInteractionService.showSession()` is the cross-background overlay launcher. Works from any screen, including when the user has swiped Hark away from Recents.

 ## Architecture

 ```
-AudioRecord (16kHz)
+AudioRecord (16kHz, inside WakeWordService)
   → Silero VAD (is someone speaking?)
     → openWakeWord engine (is it "Hey Hark"?)
       → WakeWordDetector.kt (Kotlin wrapper)
-        → HarkPlatformPlugin (Pigeon bridge)
-          → HarkResultFlutterApi.onWakeWordDetected()
-            → ChatNotifier (auto-starts microphone)
+        → WakeWordService.onWakeWordDetected()
+          ├─→ detector.pause() (release mic for STT)
+          ├─→ HarkApplication.onWakeWordDetected()
+          │     → HarkVoiceInteractionService.showSession()
+          │       → HarkSession.onShow() → startActivity(OverlayActivity)
+          │         → OverlayBridgeService.onOverlayOpened() → ChatNotifier.onMicPressed()
+          └─→ HarkResultFlutterApi.onWakeWordDetected() (notify Dart for logging)
 ```

 Detection threshold: 0.3. Cooldown: 1500ms between detections.

 ## AudioRecord mutual exclusion

-Android only allows one AudioRecord at a time. Wake word detection and speech recognition (SpeechRecognizer) both need the mic. Hark handles this by:
+Android only allows one AudioRecord at a time. Wake word detection and speech recognition (SpeechRecognizer) both need the mic. The service handles this by:

-1. Wake word detected. Pause wake word engine (fully stops AudioRecord).
-2. STT starts. Speech is transcribed.
-3. STT finishes. Resume wake word engine (restarts AudioRecord).
+1. Wake word detected → pause wake word engine (fully stops AudioRecord).
+2. STT starts and takes the mic.
+3. STT finishes → Dart side calls `setWakeWordPaused(false)`, which sends an `ACTION_RESUME` intent to the service, which restarts the detector.

-The pause/resume cycle causes a roughly 25-second buffer rebuild delay. openWakeWord needs about 10 seconds of audio context in its embedding buffer before it can reliably detect again. This is a known limitation being addressed.
+The pause/resume cycle causes a roughly 25-second buffer rebuild delay. openWakeWord needs about 10 seconds of audio context in its embedding buffer before it can reliably detect again. This is a known limitation being investigated.

-## Current status: Phase 1 (shipped)
+## User controls

-In-app wake word detection is working. When Hark is open, "Hey Hark" activates the mic.
+The foreground notification is always visible while wake word is running. It shows:

-## Phase 2 (planned): Background service
+- **Small icon**: Monochrome Hark robot silhouette (required by Android's small-icon tinting rules).
+- **Title**: "Hark".
+- **Text**: "Listening for 'Hey Hark'".
+- **Tap**: Opens the main chat screen.
+- **Stop action**: Sends `ACTION_STOP` to the service, which releases the mic and removes the notification. Returns `START_NOT_STICKY` so Android does not auto-restart after an explicit stop.

-A foreground service running wake word detection even when Hark is closed. "Hey Hark" from any screen would launch the overlay and start listening.
+Permissions required on Android 13+:

-## Phase 3 (planned): Continuous listening session
+- `RECORD_AUDIO` (runtime, requested by the chat screen)
+- `POST_NOTIFICATIONS` (runtime, requested at init so the FG notification is visible)
+- `FOREGROUND_SERVICE` and `FOREGROUND_SERVICE_MICROPHONE` (manifest)

-After "Hey Hark" starts a session, the mic stays open for the entire conversation. No need to say the wake word again. The user can interrupt while Hark is speaking (barge-in). This requires acoustic echo cancellation to prevent Hark from hearing its own TTS output.
+## Lifecycle
+
+- The service starts from `ChatNotifier._initAsync()` via `startWakeWordService()` in the Pigeon plugin, which sends an `ACTION_START` intent.
+- Android's `VoiceInteractionService` system binding keeps the process alive indefinitely, even when the user swipes Hark from Recents.
+- `START_STICKY` is an additional safety net: if the OS does kill the service for some reason, it will be restarted with a null intent, which the service handles by re-starting detection.
+- When the user taps **Stop** on the notification, the service releases the detector, removes the notification, and returns `START_NOT_STICKY` so Android does not resurrect it.
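The lifecycle rules in this hunk (start, pause on detection, resume after STT, explicit stop, sticky restart with a null intent) can be modeled as a small state machine. What follows is a minimal sketch in plain Kotlin with no Android dependencies, not the code from this PR. `WakeWordServiceModel`, `Action`, and `DetectorState` are hypothetical names introduced for illustration. The two integer constants mirror the values of `android.app.Service.START_STICKY` and `START_NOT_STICKY`. Returning `START_STICKY` for a resume is an assumption, since the docs only state the return value for the stop path.

```kotlin
// Hypothetical model of the start-command handling described in the docs.
// Plain Kotlin, no Android imports, so it runs on any JVM.

// Mirror android.app.Service.START_STICKY / START_NOT_STICKY.
const val START_STICKY = 1
const val START_NOT_STICKY = 2

// Action names follow the docs (ACTION_START, ACTION_RESUME, ACTION_STOP);
// the enum itself is an assumption made for this sketch.
enum class Action { START, RESUME, STOP }

enum class DetectorState { STOPPED, LISTENING, PAUSED }

class WakeWordServiceModel {
    var state = DetectorState.STOPPED
        private set
    var notificationVisible = false
        private set

    // Plays the role of Service.onStartCommand(). `action` is null when
    // Android restarts a sticky service after killing its process.
    fun onStartCommand(action: Action?): Int = when (action) {
        null, Action.START, Action.RESUME -> {
            state = DetectorState.LISTENING   // (re)acquire the mic
            notificationVisible = true
            START_STICKY                      // assumption for RESUME
        }
        Action.STOP -> {
            state = DetectorState.STOPPED     // release the mic
            notificationVisible = false
            START_NOT_STICKY                  // no auto-restart after Stop
        }
    }

    // Called when the detector fires: free the mic so STT can record.
    fun onWakeWordDetected() {
        if (state == DetectorState.LISTENING) state = DetectorState.PAUSED
    }
}

fun main() {
    val service = WakeWordServiceModel()
    service.onStartCommand(Action.START)
    service.onWakeWordDetected()
    println(service.state)                    // PAUSED while STT owns the mic
    service.onStartCommand(Action.RESUME)
    println(service.state)                    // LISTENING again
    println(service.onStartCommand(Action.STOP) == START_NOT_STICKY) // true
}
```

Treating a null action the same as a start keeps the sticky-restart path and the normal start path on one branch, which matches the "restarted with a null intent" behavior the Lifecycle list describes.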
+
+## What's next
+
+- **Sensitivity slider** exposed via a Settings screen.
+- **Privacy indicator** in the app when the mic is hot.
+- **Battery impact measurement** on Moto G56 and one other mid-range device.
+- **Buffer-rebuild shortening** for the ~25s recovery after STT.
+- **Barge-in**: interrupt Hark's TTS mid-sentence. Requires acoustic echo cancellation research.

 ## Dependencies

diff --git a/content/docs/roadmap.mdx b/content/docs/roadmap.mdx
index d66d724..221d60b 100644
--- a/content/docs/roadmap.mdx
+++ b/content/docs/roadmap.mdx
@@ -1,6 +1,6 @@
 ---
 title: Roadmap
-last_edited: '2026-04-11T00:00:00.000Z'
+last_edited: '2026-04-12T00:00:00.000Z'
 tocIsHidden: false
 ---

@@ -25,26 +25,33 @@ Other SDKs are not documented as installable products until there is code to ins
 ### Shipped

 - On-device discovery via `ContentProvider` scanning
-- Two-stage NLU: EmbeddingGemma 308M for intent selection, Qwen3 0.6B for slot filling
+- Two-stage NLU: EmbeddingGemma 308M for intent selection, Qwen3 0.5B for slot filling
 - Foreground and broadcast dispatch paths with async result correlation
 - Registration as Android system assistant (`VoiceInteractionService`, `ROLE_ASSISTANT`)
 - Continuous listening mode after assistant-gesture activation
-- Lightweight overlay via FlutterEngineGroup with thin UI shell (zero model loading, instant startup)
-- hark_platform plugin with Pigeon for type-safe platform communication
-- Tested integrations with the OACP Test App demo, Breezy Weather, Binary Eye, Voice Recorder, Wikipedia, ArchiveTune
+- Lightweight overlay via `FlutterEngineGroup` with thin UI shell (zero model loading, instant startup)
+- `hark_platform` plugin with Pigeon for type-safe platform communication
+- **"Hey Hark" wake word** using openWakeWord with ONNX Runtime and a custom-trained 201 KB model
+- **Wake word foreground service** with persistent notification, Stop action, and `START_STICKY` restart — detection survives swipe-from-recents
+- **Overlay launch on detection** via `VoiceInteractionService.showSession()` — say "Hey Hark" from any screen and the assistant panel appears
+- **Lifecycle capability refresh** so uninstalled OACP apps drop out of the registry on app resume without a restart
+- Tested integrations with the OACP Test App demo, Breezy Weather, Binary Eye, Voice Recorder, Libre Camera, Wikipedia, ArchiveTune

 ### In progress

-- **Wake word** - "Hey Hark" on-device detection using openWakeWord with ONNX Runtime. Phase 1 (in-app listening, auto-mic activation) is working. Phase 2 (background service for detection when the app is fully closed) is next.
-- **Self-hosted inference** - connect Hark to Ollama or LM Studio on a local network for unlimited context and `OACP.md` consumption
-- **BYOK cloud fallback** - bring-your-own-key for OpenAI, Gemini, Anthropic when no local model is available
-- **Better STT** - evaluate whisper.cpp and sherpa-onnx for fully on-device speech recognition
-- **Disambiguation UI** - show top candidates as chips when scores are close and learn from user feedback
+- **Settings screen** — permissions, wake word toggle, model info, about
+- **Action chips and disambiguation** — tappable chips in chat bubbles, disambiguation buttons when top scores are close
+- **Wake word polish** — sensitivity slider, privacy indicator, battery measurement, buffer-rebuild shortening
+- **Better STT** — evaluate whisper.cpp and sherpa-onnx for fully on-device speech recognition
+- **Release packaging** — proper release signing, GitHub Releases, F-Droid submission

-### Planned
-- **Personalization** - assistant learns user preferences and per-app vocabulary over time
-- **Gemma 4 single-model pipeline** - once `flutter_gemma` supports it, collapse the two-stage pipeline into a single call
-- **iOS** - exploring feasibility
+### Deferred
+
+- **Self-hosted inference** (Ollama, LM Studio) — the two-stage local pipeline is good enough for current capabilities. Revisit once a real quality ceiling appears.
+- **BYOK cloud** — defer until self-hosted lands or users demand cloud.
+- **Gemma 4 single-model pipeline** — waiting on `flutter_gemma` support and a clear win over the two-stage stack.
+- **Barge-in** (interrupt Hark's TTS mid-sentence) — requires acoustic echo cancellation research.
+- **iOS** — exploring feasibility.

 ## Ecosystem