From 77c36e1c2884d39ac07fe3cd67bf12ef17669386 Mon Sep 17 00:00:00 2001 From: Tyler Stapler Date: Sat, 9 May 2026 16:00:29 -0700 Subject: [PATCH 1/3] =?UTF-8?q?Add=20feature=20research=20backlog=20?= =?UTF-8?q?=E2=80=94=2031=20items=20across=208=20reference=20projects?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Surveys voxtype, BlahST, Handy, Meetily, WhisperKit, Hex, noScribe, and WhisperWriter to produce a prioritized feature backlog (11 High / 12 Medium / 8 Low) covering: push-to-talk/dictation mode, additional transcription engines (Parakeet, Moonshine, SenseVoice), LLM integration patterns, and export formats (SRT, VTT, JSON). All items include attribution notes for the README. Co-Authored-By: Claude Sonnet 4.6 --- .../implementation/plan.md | 360 ++++++++++++++++++ .../implementation/validation.md | 112 ++++++ .../agrapha-feature-research/requirements.md | 56 +++ .../research/blahst.md | 72 ++++ .../research/comparable-projects.md | 324 ++++++++++++++++ .../research/handy.md | 81 ++++ .../research/voxtype.md | 55 +++ 7 files changed, 1060 insertions(+) create mode 100644 project_plans/agrapha-feature-research/implementation/plan.md create mode 100644 project_plans/agrapha-feature-research/implementation/validation.md create mode 100644 project_plans/agrapha-feature-research/requirements.md create mode 100644 project_plans/agrapha-feature-research/research/blahst.md create mode 100644 project_plans/agrapha-feature-research/research/comparable-projects.md create mode 100644 project_plans/agrapha-feature-research/research/handy.md create mode 100644 project_plans/agrapha-feature-research/research/voxtype.md diff --git a/project_plans/agrapha-feature-research/implementation/plan.md b/project_plans/agrapha-feature-research/implementation/plan.md new file mode 100644 index 0000000..f720493 --- /dev/null +++ b/project_plans/agrapha-feature-research/implementation/plan.md @@ -0,0 +1,360 @@ +# Agrapha Feature Backlog + 
+Prioritized feature backlog derived from research into VoxType, BlahST, Handy, Meetily, OpenWhispr, WhisperWriter, open-wispr, and whisper-mac. Items are ordered High → Medium → Low within each section. + +--- + +## HIGH PRIORITY + +--- + +## Feature: Custom Vocabulary / Dictionary Injection +**Priority:** High +**Inspired by:** [Handy](https://github.com/cjpais/Handy), [WhisperWriter](https://github.com/savbell/whisper-writer) +**What they do:** Handy accepts a `custom_words` list that is injected as Whisper's `initial_prompt` parameter and as a Parakeet custom vocabulary, with fuzzy-match post-correction. WhisperWriter exposes `initial_prompt` directly as a config field for domain conditioning. +**What Agrapha would do:** Allow users to define a persistent list of names, project codes, and technical terms; inject them as Whisper's `initial_prompt` via the existing JNI bridge so beam search favors those tokens, with optional fuzzy-match correction post-transcription. +**Attribution note (README):** Custom vocabulary / dictionary injection pattern inspired by [Handy](https://github.com/cjpais/Handy) (MIT) and [WhisperWriter](https://github.com/savbell/whisper-writer) (MIT). +**Effort estimate:** S +**Notes:** `WhisperParams` in Agrapha's JNI bridge already has an `initial_prompt` field — this is mostly UI + persistence (SQLDelight) work. Fuzzy-match post-correction is optional and can ship in a follow-up. + +--- + +## Feature: Filler Word Stripping +**Priority:** High +**Inspired by:** [Handy](https://github.com/cjpais/Handy) +**What they do:** Handy strips a configurable list of filler words (e.g., "um", "uh", "you know") from transcription output as a post-processing step. +**What Agrapha would do:** Apply a configurable regex/list replacement pass after transcription to remove common filler words from meeting transcripts, producing cleaner minutes and summaries. 
+**Attribution note (README):** Filler word stripping pattern inspired by [Handy](https://github.com/cjpais/Handy) (MIT). +**Effort estimate:** XS +**Notes:** Pure Kotlin string post-processing; zero JNI changes needed. Default list should include "um", "uh", "like", "you know", "sort of". Let users extend the list. + +--- + +## Feature: Audio Feedback on Recording Start / Stop +**Priority:** High +**Inspired by:** [VoxType](https://github.com/peteonrails/voxtype), [Handy](https://github.com/cjpais/Handy) +**What they do:** VoxType plays themed WAV sounds on recording start, stop, and error. Handy exposes `audio_feedback: bool` and `audio_feedback_volume: f32` with a `SoundTheme` enum. +**What Agrapha would do:** Play a short system sound or bundled audio clip when meeting recording starts and stops so users get eyes-free confirmation of state changes. +**Attribution note (README):** Audio feedback on recording start/stop inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT) and [Handy](https://github.com/cjpais/Handy) (MIT). +**Effort estimate:** XS +**Notes:** macOS provides `NSSound` and `AudioServicesPlaySystemSound`; accessible from Kotlin via JNA or a 10-line Swift/ObjC helper. Bundle 2–3 sounds (start, stop, error). Volume knob in settings. + +--- + +## Feature: Toggle vs Push-to-Talk Recording Modes +**Priority:** High +**Inspired by:** [VoxType](https://github.com/peteonrails/voxtype), [Handy](https://github.com/cjpais/Handy), [WhisperWriter](https://github.com/savbell/whisper-writer) +**What they do:** All three expose a boolean toggle between push-to-talk (hold key → recording, release → transcribe) and toggle modes (press once to start, press again to stop). WhisperWriter additionally offers a `continuous` mode (auto-restart after each segment) and a `voice_activity_detection` mode. +**What Agrapha would do:** Add a `RecordingMode` enum (`MEETING_CONTINUOUS`, `TOGGLE`, `PUSH_TO_TALK`) to the settings UI. 
Meeting mode remains the default; toggle and push-to-talk are available for dictation use cases. VAD-based auto-stop can be a follow-up. +**Attribution note (README):** Toggle and push-to-talk recording mode design inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT), [Handy](https://github.com/cjpais/Handy) (MIT), and [WhisperWriter](https://github.com/savbell/whisper-writer) (MIT). +**Effort estimate:** S +**Notes:** Requires a global hotkey listener on macOS (see Global Hotkey / Dictation Mode feature). The mode enum should be persisted in the existing settings store. + +--- + +## Feature: Global Hotkey / Dictation Mode (Paste to Active App) +**Priority:** High +**Inspired by:** [VoxType](https://github.com/peteonrails/voxtype), [Handy](https://github.com/cjpais/Handy), [open-wispr](https://github.com/human37/open-wispr), [BlahST](https://github.com/QuantiusBenignus/BlahST) +**What they do:** All four projects allow a global keyboard shortcut to trigger recording from any app, then paste the transcription result at the keyboard caret: Handy via `rdev` keyboard injection or clipboard+Ctrl+V, open-wispr via `AXUIElement`/`CGEvent`, BlahST via xdotool/ydotool. +**What Agrapha would do:** Register a configurable global hotkey via macOS `CGEventTap` or `addGlobalMonitorForEvents` (JNA or Swift bridge), record until key release or toggle-off, transcribe, then inject text into the frontmost app via the macOS Accessibility API (`AXUIElement`) with a Cmd+V clipboard fallback. +**Attribution note (README):** Global hotkey dictation mode and macOS accessibility-API paste design inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT), [Handy](https://github.com/cjpais/Handy) (MIT), [open-wispr](https://github.com/human37/open-wispr) (MIT), and [BlahST](https://github.com/QuantiusBenignus/BlahST) (MIT). +**Effort estimate:** M +**Notes:** Requires the `com.apple.security.automation.apple-events` entitlement plus the Accessibility permission, which the user grants in System Settings (Privacy & Security) rather than through an entitlement. 
JNA can call the C-level `CGEventTap` APIs directly without a Swift bridge. This is a foundational dependency for push-to-talk and toggle modes. + +--- + +## Feature: Previous-Chunk Text Conditioning +**Priority:** High +**Inspired by:** [WhisperWriter](https://github.com/savbell/whisper-writer) +**What they do:** WhisperWriter passes the previous transcription chunk's text as Whisper's `initial_prompt` for the next chunk, reducing repetition artifacts and improving coherence across segment boundaries in continuous recordings. +**What Agrapha would do:** In meeting (continuous) mode, automatically carry forward the last N words of the previous transcription segment as the Whisper `initial_prompt` for the next segment, improving transcript coherence without any user action. +**Attribution note (README):** Previous-chunk text conditioning for continuous transcription inspired by [WhisperWriter](https://github.com/savbell/whisper-writer) (MIT). +**Effort estimate:** XS +**Notes:** One-line change in the transcription loop to set `initial_prompt = last_segment_tail`. Synergizes with the custom vocabulary feature (both write to `initial_prompt`; concatenate both). Cap at ~224 tokens to stay within Whisper's context window. + +--- + +## Feature: SRT and VTT Export +**Priority:** High +**Inspired by:** [VoxType](https://github.com/peteonrails/voxtype), [WhisperKit](https://github.com/argmaxinc/argmax-oss-swift) +**What they do:** VoxType's meeting mode exports transcriptions to SRT and VTT subtitle formats with per-segment timestamps in addition to Markdown and plain text. WhisperKit has native SRT/VTT export as a first-class feature, documenting the word-timestamp → segment-grouping → subtitle-formatting data model explicitly. 
+**What Agrapha would do:** Add SRT and VTT as export options alongside the existing Markdown/Logseq export, using Agrapha's existing segment timestamps to generate standard subtitle files consumable by video editors (DaVinci Resolve, Final Cut Pro, Premiere). +**Attribution note (README):** SRT and VTT export format design inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT); subtitle data model (word timestamps → segment grouping) noted from [WhisperKit](https://github.com/argmaxinc/argmax-oss-swift) (MIT). +**Effort estimate:** S +**Notes:** SRT and VTT are text formats with straightforward timestamp serialization. Agrapha already stores per-segment timestamps in SQLDelight; the serializer is ~100 lines of Kotlin. No JNI changes required. + +--- + +## Feature: JSON Export +**Priority:** High +**Inspired by:** [VoxType](https://github.com/peteonrails/voxtype) +**What they do:** VoxType exports meeting transcriptions to structured JSON alongside Markdown and subtitle formats, enabling downstream automation and integration with other tools. +**What Agrapha would do:** Export a structured JSON file containing the full transcript with per-segment timestamps, speaker labels (when diarization is enabled), LLM summaries, action items, and decisions — providing a machine-readable format for downstream integrations. +**Attribution note (README):** JSON export format design inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT). +**Effort estimate:** XS +**Notes:** Kotlinx.serialization already in the project. Schema should mirror the existing data model: `{ meeting_id, started_at, segments: [{ speaker, start_ms, end_ms, text }], summary, action_items, decisions }`. Pair with SRT/VTT export in a single "Export Formats" milestone. 
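As a rough illustration of the "straightforward timestamp serialization" mentioned in the SRT and VTT Export notes, the sketch below formats per-segment timestamps into both subtitle formats. The `Segment` class and the function names are hypothetical and are not taken from Agrapha's SQLDelight schema; the only format facts relied on are that SRT numbers its cues and separates milliseconds with a comma, while WebVTT starts with a `WEBVTT` header and uses a period.

```kotlin
import java.util.Locale

// Hypothetical segment shape for illustration; Agrapha's stored model may differ.
data class Segment(val startMs: Long, val endMs: Long, val text: String)

// Formats milliseconds as HH:MM:SS<sep>mmm. SRT uses ',' and WebVTT uses '.'.
fun formatTimestamp(ms: Long, millisSeparator: Char): String {
    val hours = ms / 3_600_000
    val minutes = (ms / 60_000) % 60
    val seconds = (ms / 1_000) % 60
    val millis = ms % 1_000
    // Locale.ROOT keeps the digits ASCII regardless of the user's locale.
    return "%02d:%02d:%02d%c%03d".format(Locale.ROOT, hours, minutes, seconds, millisSeparator, millis)
}

// SRT: 1-based cue index, timing line, text, then a blank line between cues.
fun toSrt(segments: List<Segment>): String =
    segments.mapIndexed { i, s ->
        "${i + 1}\n${formatTimestamp(s.startMs, ',')} --> ${formatTimestamp(s.endMs, ',')}\n${s.text.trim()}\n"
    }.joinToString("\n")

// WebVTT: "WEBVTT" header, blank line, then cues without indices.
fun toVtt(segments: List<Segment>): String =
    "WEBVTT\n\n" + segments.joinToString("\n") { s ->
        "${formatTimestamp(s.startMs, '.')} --> ${formatTimestamp(s.endMs, '.')}\n${s.text.trim()}\n"
    }
```

Calling `toSrt(listOf(Segment(0, 1500, "Hello")))` yields a single cue numbered `1`. A real exporter would probably also need to handle segment text containing blank lines, which would otherwise terminate a cue early.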
+ +--- + +## Feature: Whisper Model Auto-Unload on Idle +**Priority:** High +**Inspired by:** [Handy](https://github.com/cjpais/Handy) +**What they do:** Handy auto-unloads the Whisper model after a configurable idle timeout using a watcher thread and RAII `LoadingGuard`, freeing 1.5–3 GB of RAM when recording is not active. +**What Agrapha would do:** Add a configurable idle-unload timeout (default: 5 minutes) to Agrapha's Whisper JNI wrapper; after the timeout expires with no transcription activity, free the native model from memory automatically. +**Attribution note (README):** Model idle-unload pattern inspired by [Handy](https://github.com/cjpais/Handy) (MIT). +**Effort estimate:** S +**Notes:** Requires a Kotlin `CoroutineScope` + `delay`-based watcher (or `ScheduledExecutorService`) in the transcription manager. The JNI `freeContext()` call already exists; this is plumbing around it. Reload on next recording start with a brief UI indicator. + +--- + +## Feature: Multiple Named LLM Post-Processing Prompts +**Priority:** High +**Inspired by:** [Handy](https://github.com/cjpais/Handy), [BlahST](https://github.com/QuantiusBenignus/BlahST) +**What they do:** Handy stores multiple named `LLMPrompt` presets selectable per transcription. BlahST dispatches to different system prompts based on spoken keywords ("Summarise…", "Draft email…", "Proofread…"). +**What Agrapha would do:** Allow users to define named post-processing prompts beyond the built-in "summary" — e.g., "Action items only", "Draft follow-up email", "Extract decisions" — selectable from a dropdown before or after transcription ends. +**Attribution note (README):** Multiple named LLM post-processing prompts inspired by [Handy](https://github.com/cjpais/Handy) (MIT) and [BlahST](https://github.com/QuantiusBenignus/BlahST) (MIT). +**Effort estimate:** S +**Notes:** Agrapha already has an LLM integration layer; this is UI + prompt storage (SQLDelight). 
Ship 3 built-in presets (Summary, Action Items, Follow-up Email) and let users add custom prompts. Store `post_process_prompt` alongside `post_processed_text` in the existing meetings table. + +--- + +## MEDIUM PRIORITY + +--- + +## Feature: Meeting App Auto-Detection (Auto-Start Recording) +**Priority:** Medium +**Inspired by:** [OpenWhispr](https://github.com/OpenWhispr/openwhispr) +**What they do:** OpenWhispr detects when Zoom, Teams, or FaceTime becomes the active window and automatically starts recording, eliminating the need to manually trigger a meeting session. +**What Agrapha would do:** Poll the macOS window server (via `CGWindowListCopyWindowInfo` or `NSWorkspace.shared.frontmostApplication`) for known video-call app bundle IDs (Zoom, Teams, Google Meet, FaceTime, Webex) and auto-start a recording session when one becomes active, with a configurable allow-list. +**Attribution note (README):** Auto-detection of active video-call applications to trigger recording inspired by [OpenWhispr](https://github.com/OpenWhispr/openwhispr) (MIT). +**Effort estimate:** M +**Notes:** `NSWorkspace.didActivateApplicationNotification` provides the hook; JNA can receive it. Privacy consideration: show a notification before auto-starting so users know recording has begun. Add an opt-in toggle in settings (off by default). + +--- + +## Feature: Parakeet ONNX Engine +**Priority:** High +**Inspired by:** [VoxType](https://github.com/peteonrails/voxtype), [Handy](https://github.com/cjpais/Handy), [Hex](https://github.com/kitlangton/Hex) +**What they do:** VoxType and Handy support NVIDIA Parakeet (FastConformer TDT) via ONNX Runtime as an alternative to Whisper, offering ~5× real-time throughput on CPU with comparable English accuracy and no GPU required. Meetily also supports Parakeet models (implementation details not confirmed in available source). 
Hex provides a production Swift implementation of dual-engine Parakeet+Whisper switching with a user-facing toggle on macOS (Apple Silicon). +**What Agrapha would do:** Add Parakeet as a selectable transcription engine via ONNX Runtime for Java (onnxruntime-java), allowing users on lower-power Macs or with large meeting backlogs to transcribe faster without waiting for Whisper large-v3. +**Attribution note (README):** Parakeet ONNX engine integration pattern inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT) and [Handy](https://github.com/cjpais/Handy) (MIT); multi-engine toggle design pattern from [Hex](https://github.com/kitlangton/Hex) (MIT). +**Effort estimate:** L +**Notes:** ONNX Runtime has an official Java API (`com.microsoft.onnxruntime:onnxruntime`). Parakeet is English-only; diarization integration needs re-validation. Model download (~500 MB) must be handled gracefully. Abstract a `TranscriptionEngine` interface first so Whisper and Parakeet share a common caller. + +--- + +## Feature: Silero VAD (Voice Activity Detection) +**Priority:** Medium +**Inspired by:** [Handy](https://github.com/cjpais/Handy), [WhisperWriter](https://github.com/savbell/whisper-writer) +**What they do:** Handy uses a `SmoothedVad` wrapper over Silero VAD to trim leading/trailing silence from each audio chunk before sending to Whisper, reducing hallucinations and inference latency. WhisperWriter makes the silence duration configurable. +**What Agrapha would do:** Integrate Silero VAD (ONNX model, ~1 MB) via ONNX Runtime for Java to detect and trim silence from each meeting audio chunk before Whisper inference, reducing hallucination artifacts and improving transcription of meeting segments with long pauses. +**Attribution note (README):** Silero VAD integration for silence trimming inspired by [Handy](https://github.com/cjpais/Handy) (MIT) and [WhisperWriter](https://github.com/savbell/whisper-writer) (MIT). 
+**Effort estimate:** M +**Notes:** Silero VAD ONNX model is ~1 MB; inference is CPU-only and fast. Requires ONNX Runtime dependency (shared with Parakeet if that ships first). For meeting mode, VAD primarily reduces hallucination on silence; for dictation mode, it enables auto-stop. Both use cases justify the dependency. + +--- + +## Feature: Transcription History with Saved-Favourite Flag +**Priority:** Medium +**Inspired by:** [Handy](https://github.com/cjpais/Handy) +**What they do:** Handy stores every transcription in SQLite with fields for raw text, post-processed text, the prompt used, a `saved` boolean flag for user-marked favourites, a title, and a configurable retention period; a Raycast extension browses this history. +**What Agrapha would do:** Extend the existing SQLDelight meetings schema with a `saved` flag, a `post_processed_text` column, and a `post_process_prompt` column; add a searchable history view in the Compose Desktop UI so users can browse, star, and re-export past meetings. +**Attribution note (README):** Transcription history schema with saved-favourite flag inspired by [Handy](https://github.com/cjpais/Handy) (MIT). +**Effort estimate:** M +**Notes:** Schema migration is straightforward with SQLDelight. The history view should support search by keyword and filter by date range. The `saved` flag prevents a meeting from being purged by the retention policy. + +--- + +## Feature: Re-Transcribe with Different Model +**Priority:** Medium +**Inspired by:** [Meetily](https://github.com/Zackriya-Solutions/meetily) +**What they do:** Meetily (Beta) allows importing an existing audio recording and re-transcribing it with a different Whisper model size or language setting, useful when a larger or newer model becomes available. 
+**What Agrapha would do:** Expose a "Re-transcribe" action on completed meetings that re-runs Whisper (or another engine) against the stored audio file with a selectable model and language, replacing or appending the new transcript while preserving the original. +**Attribution note (README):** Re-transcription with model selection inspired by [Meetily](https://github.com/Zackriya-Solutions/meetily) (MIT). +**Effort estimate:** M +**Notes:** Requires Agrapha to retain the raw audio file after a meeting (currently unclear if it does). If audio is discarded, this feature requires a storage policy change first. Preserve the original transcript as a separate version; do not overwrite. + +--- + +## Feature: Apple Intelligence On-Device Post-Processing +**Priority:** Medium +**Inspired by:** [Handy](https://github.com/cjpais/Handy) +**What they do:** Handy calls Apple Intelligence via a Swift FFI (`extern "C"` bridge) to apply on-device LLM text processing without any API key, checking availability at runtime (`isAppleIntelligenceAvailable()`), and supports M1+ Macs on macOS Sequoia 15.1+. +**What Agrapha would do:** Add Apple Intelligence as an optional LLM provider for transcript correction and summarization, available at no cost on qualifying Macs (M1+, macOS 15.1+), via a Kotlin → JNI → Swift bridge using the same `process_text_with_system_prompt_apple()` pattern. +**Attribution note (README):** Apple Intelligence on-device LLM post-processing integration pattern inspired by [Handy](https://github.com/cjpais/Handy) (MIT). +**Effort estimate:** L +**Notes:** Requires adding a Swift/ObjC compilation step to the Gradle build (non-trivial for a Kotlin-first project). Runtime availability check is mandatory — must degrade gracefully on older hardware. Highest value for privacy-conscious users who want zero cloud calls. Gate behind a feature flag. 
+ +--- + +## Feature: Remote Whisper Inference Endpoint +**Priority:** Medium +**Inspired by:** [VoxType](https://github.com/peteonrails/voxtype), [BlahST](https://github.com/QuantiusBenignus/BlahST) +**What they do:** VoxType supports a remote whisper.cpp HTTP server as a backend (`--engine whisper-remote`). BlahST's `-n` flag routes to a LAN server for offload inference, with sub-150 ms round-trip. +**What Agrapha would do:** Allow users to configure a remote whisper.cpp server URL (e.g., a faster Mac on the same LAN) as an alternative to local inference, reducing battery drain on the meeting device and enabling access to larger Whisper models without local RAM constraints. +**Attribution note (README):** Remote whisper.cpp server endpoint design inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT) and [BlahST](https://github.com/QuantiusBenignus/BlahST) (MIT). +**Effort estimate:** S +**Notes:** whisper.cpp ships a simple HTTP server mode. Agrapha's JNI layer would be bypassed; audio chunks POSTed to the server endpoint instead. Use `ktor-client` (already likely in the stack) for the HTTP call. Requires LAN latency consideration for chunked meeting transcription. + +--- + +## Feature: One-Shot Speech-to-LLM (Dictated Question → AI Answer) +**Priority:** Medium +**Inspired by:** [BlahST](https://github.com/QuantiusBenignus/BlahST) +**What they do:** BlahST's `wsiAI` script captures a spoken question, transcribes it, sends it to a local llama-server with a system prompt, receives a text response, speaks it via Piper TTS, and places the answer in the clipboard. +**What Agrapha would do:** Add a "Quick Ask" mode accessible from the menu bar: press a hotkey, dictate a question, and Agrapha transcribes and routes it to the configured LLM (Ollama/OpenAI/Anthropic), displaying the answer in a floating window with a "Copy" button. 
+**Attribution note (README):** One-shot speech-to-LLM assistant mode inspired by [BlahST](https://github.com/QuantiusBenignus/BlahST) (MIT). +**Effort estimate:** M +**Notes:** Depends on the Global Hotkey / Dictation Mode feature. The LLM call reuses Agrapha's existing provider abstractions. The floating window is a new Compose Desktop surface. TTS response (via macOS `say` or a local model) is optional and can ship separately. + +--- + +## Feature: macOS Menu Bar Recording Status Indicator +**Priority:** Medium +**Inspired by:** [VoxType](https://github.com/peteonrails/voxtype) +**What they do:** VoxType emits a live JSON status feed (`voxtype status --follow --format json`) consumed by Waybar/polybar to show recording state, active model, and backend in the system bar at all times. +**What Agrapha would do:** Add a persistent macOS menu bar extra (NSStatusItem) showing Agrapha's current state (idle, recording, transcribing) with a simple icon, giving users always-visible recording confirmation without needing to keep the main window open. +**Attribution note (README):** Always-visible recording status indicator concept inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT). +**Effort estimate:** M +**Notes:** Compose Desktop does not natively support `NSStatusItem`; requires a JNA call or a small Swift helper. The AWT SystemTray API is cross-platform but renders poorly on macOS. Recommend a thin Swift/ObjC `NSStatusItem` bridge called from Kotlin via JNI. Useful for dictation mode regardless of meeting context. + +--- + +## Feature: Semantic Search over Transcripts +**Priority:** Medium +**Inspired by:** [OpenWhispr](https://github.com/OpenWhispr/openwhispr) +**What they do:** OpenWhispr embeds transcript text using a local embedding model and supports semantic (vector similarity) search over the notes and transcript history — enabling queries like "find the meeting where we discussed pricing" rather than exact keyword search. 
+**What Agrapha would do:** Generate embeddings for each meeting transcript (using a local sentence-transformers-compatible model via ONNX) and store them in SQLDelight or a lightweight vector store, enabling semantic search from the history view as a complement to keyword search. +**Attribution note (README):** Semantic search over transcripts inspired by [OpenWhispr](https://github.com/OpenWhispr/openwhispr) (MIT). +**Effort estimate:** L +**Notes:** Embedding model (~80 MB, e.g., all-MiniLM-L6-v2 ONNX) is the main download cost. Vector search can run inside SQLite via the sqlite-vec extension, or against a separate file-based FAISS index. High value for the "memory system" narrative but L effort; defer until transcript history is shipped. + +--- + +## Feature: Voice Fingerprinting Across Sessions +**Priority:** Medium +**Inspired by:** [OpenWhispr](https://github.com/OpenWhispr/openwhispr) +**What they do:** OpenWhispr builds persistent voice fingerprints for speakers across multiple meetings, so "Alice from Acme" is automatically identified in future recordings rather than appearing as an anonymous "Speaker 2." +**What Agrapha would do:** Extend Agrapha's existing pyannote.audio diarization to persist speaker embeddings in a SQLDelight speaker table, allowing the user to name speakers once and have them auto-labeled in future meetings with the same participants. +**Attribution note (README):** Persistent voice fingerprinting across sessions inspired by [OpenWhispr](https://github.com/OpenWhispr/openwhispr) (MIT). +**Effort estimate:** L +**Notes:** pyannote.audio already produces speaker embeddings; storing and comparing them is the new work. Requires a cosine similarity threshold tuned to avoid false matches. Privacy-sensitive: speaker profiles should be opt-in and stored locally only. 
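The comparison step in the Voice Fingerprinting notes could look roughly like the sketch below: compute the cosine similarity between a new speaker embedding and each stored profile, and accept the best match only if it clears a threshold. `SpeakerProfile`, `matchSpeaker`, and the `0.75` default are assumptions made for illustration; the real threshold would have to be tuned on actual recordings, as the notes say.

```kotlin
import kotlin.math.sqrt

// Hypothetical stored profile; the embedding is assumed to come from the diarization pipeline.
class SpeakerProfile(val name: String, val embedding: FloatArray)

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0.0 if either vector has zero length.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size) { "Embedding dimensions must match" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        val x = a[i].toDouble()
        val y = b[i].toDouble()
        dot += x * y
        normA += x * x
        normB += y * y
    }
    val denom = sqrt(normA) * sqrt(normB)
    return if (denom == 0.0) 0.0 else dot / denom
}

// Returns the most similar known profile at or above the threshold,
// or null so the speaker stays anonymous. The default is a placeholder, not a tuned value.
fun matchSpeaker(
    embedding: FloatArray,
    known: List<SpeakerProfile>,
    threshold: Double = 0.75,
): SpeakerProfile? =
    known.map { it to cosineSimilarity(embedding, it.embedding) }
        .filter { it.second >= threshold }
        .maxByOrNull { it.second }
        ?.first
```

Returning `null` below the threshold keeps an unrecognized voice as an anonymous speaker instead of guessing, which fits the stated goal of avoiding false matches.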
+ +--- + +## Feature: macOS Native Speech Framework Engine (SFSpeechRecognizer) +**Priority:** Medium +**Inspired by:** [whisper-mac](https://github.com/Explosion-Scratch/whisper-mac) +**What they do:** whisper-mac lists macOS native `SFSpeechRecognizer` as one of its supported local transcription backends — zero model download, built into every Mac since macOS 10.15, ~100–200 ms latency for short utterances. +**What Agrapha would do:** Offer `SFSpeechRecognizer` as a lightweight "no-download" engine option for users who haven't yet downloaded a Whisper model, providing an immediate out-of-box experience; surface the `requiresOnDeviceRecognition` privacy flag prominently. +**Attribution note (README):** No external attribution — `SFSpeechRecognizer` is an Apple-documented API; whisper-mac (license unspecified) treated as inspiration only, no attribution. +**Effort estimate:** M +**Notes:** `SFSpeechRecognizer` is Swift/ObjC only; requires a JNI bridge. `requiresOnDeviceRecognition = true` must be set for privacy compliance — document clearly that without it audio is sent to Apple servers. Quality is lower than Whisper small for technical vocabulary; position as a "quick start" engine, not a meeting-quality engine. + +--- + +## Feature: Inline Transcript Correction Editor +**Priority:** Medium +**Inspired by:** [noScribe](https://github.com/kaixxx/noScribe) +**What they do:** noScribe ships a dedicated transcript editor (noScribeEdit) for manual correction of whisper output — word-level editing with speaker labels, confidence highlights, and playback sync. +**What Agrapha would do:** An editable transcript view in the meeting summary screen, allowing users to fix misheard words, correct speaker labels, and re-run summarization against the corrected text before exporting. +**Attribution note (README):** Transcript correction editor design inspired by [noScribe](https://github.com/kaixxx/noScribe) (GPL-3.0; patterns only — no code reuse). 
+**Effort estimate:** L +**Notes:** noScribe is GPL-3.0 — architecture inspiration only, no code can be incorporated. Requires extending SQLDelight schema to store corrected transcript alongside raw whisper output. + +--- + +## LOW PRIORITY + +--- + +## Feature: Spoken Punctuation Post-Processing +**Priority:** Low +**Inspired by:** [VoxType](https://github.com/peteonrails/voxtype) +**What they do:** VoxType converts spoken punctuation words ("period" → ".", "open paren" → "(", "new line" → "\n") via a configurable replacement table, designed primarily for developer/technical dictation workflows. +**What Agrapha would do:** Apply an optional post-processing pass that converts a configurable list of spoken words to punctuation or special characters, useful for Agrapha users who dictate code or structured text outside of meeting context. +**Attribution note (README):** Spoken punctuation conversion pattern inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT). +**Effort estimate:** XS +**Notes:** Low value for meeting transcription (the primary use case); most relevant if Agrapha ships a developer-focused dictation mode. Implement as a simple ordered replacement list in settings. Low effort but niche audience. + +--- + +## Feature: CLI Remote Control Flags +**Priority:** Low +**Inspired by:** [Handy](https://github.com/cjpais/Handy) +**What they do:** Handy exposes `--toggle-transcription`, `--toggle-post-process`, and `--cancel` CLI flags that send IPC commands to a running Handy instance, enabling integration with Raycast, Alfred, macOS Shortcuts, and window managers. +**What Agrapha would do:** Expose `agrapha record start|stop|toggle` and `agrapha export` CLI subcommands via a Unix socket or named pipe to the running Agrapha instance, enabling Raycast, Alfred, and macOS Shortcuts integration without opening the main window. +**Attribution note (README):** CLI remote control flag design inspired by [Handy](https://github.com/cjpais/Handy) (MIT). 
+**Effort estimate:** M +**Notes:** Requires implementing a lightweight IPC server in Agrapha's process (Unix domain socket is cleanest on macOS). High value for power users but low urgency compared to core transcription improvements. Depends on the menu bar indicator being present so users know recording state. + +--- + +## Feature: SortFormer Real-Time Diarization +**Priority:** Low +**Inspired by:** [Meetily](https://github.com/Zackriya-Solutions/meetily) +**What they do:** Meetily uses NVIDIA's SortFormer (ONNX) for real-time on-device speaker diarization, which claims better accuracy than pyannote.audio on live streams and runs fully locally via ONNX Runtime. +**What Agrapha would do:** Evaluate SortFormer as an alternative or supplement to the existing pyannote.audio diarization pipeline, potentially replacing the Python dependency with a pure ONNX Runtime call accessible from the JVM. +**Attribution note (README):** SortFormer ONNX diarization evaluation inspired by [Meetily](https://github.com/Zackriya-Solutions/meetily) (MIT). +**Effort estimate:** XL +**Notes:** Eliminates the pyannote.audio Python dependency (a significant operational improvement) but requires porting the diarization pipeline to ONNX Runtime for Java, which is a large engineering effort. Priority rises significantly if pyannote.audio causes installation pain. Treat as a research spike first. + +--- + +## Feature: MCP Server for Transcript History +**Priority:** Low +**Inspired by:** [OpenWhispr](https://github.com/OpenWhispr/openwhispr) +**What they do:** OpenWhispr exposes a public MCP server that lets Claude Desktop, Cursor, and other MCP-aware AI tools query the user's notes and transcription history programmatically. +**What Agrapha would do:** Expose Agrapha's meeting transcript and summary data via a local MCP server, allowing Claude Desktop or other MCP clients to query "what did we decide in the pricing meeting last week?" against Agrapha's local database. 
+**Attribution note (README):** MCP server for transcript history access inspired by [OpenWhispr](https://github.com/OpenWhispr/openwhispr) (MIT). +**Effort estimate:** M +**Notes:** MCP protocol is JSON-RPC over stdio or HTTP. The transport layer is simple; the value is in defining a useful schema for meeting data (meetings, segments, speakers, summaries, action items). Highly differentiating for the "memory system" narrative but requires users to already have an MCP-aware client. + +--- + +## Feature: Moonshine Engine (Edge-Optimised ONNX) +**Priority:** Low +**Inspired by:** [VoxType](https://github.com/peteonrails/voxtype), [Handy](https://github.com/cjpais/Handy) +**What they do:** VoxType and Handy both support Moonshine (encoder-decoder ONNX, English, edge-optimised) as a transcription engine, offering lower memory usage than Whisper small with comparable English accuracy on short utterances. +**What Agrapha would do:** Add Moonshine as a third selectable engine via ONNX Runtime for Java, offering a smaller download footprint (~100 MB vs Whisper large's 3 GB) for users who only need English transcription. +**Attribution note (README):** Moonshine ONNX engine integration pattern inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT) and [Handy](https://github.com/cjpais/Handy) (MIT). +**Effort estimate:** M +**Notes:** Moonshine is English-only and optimised for short utterances — less suited to long meeting recordings than Whisper or Parakeet. Best delivered after the Parakeet engine is shipped and the `TranscriptionEngine` abstraction is in place. Defers naturally to that milestone. + +--- + +## Feature: Local TTS for Spoken AI Responses +**Priority:** Low +**Inspired by:** [BlahST](https://github.com/QuantiusBenignus/BlahST) +**What they do:** BlahST uses Piper (local neural TTS) to speak LLM responses aloud during `blahstbot` and `blahstream` chat sessions, providing a fully local voice-to-voice conversation loop. 
+**What Agrapha would do:** Use macOS's built-in `say` command (or a local neural TTS model) to read aloud AI-generated summaries or action items at meeting end, providing an audio digest option for users who prefer not to read the output. +**Attribution note (README):** Local text-to-speech for spoken AI responses inspired by [BlahST](https://github.com/QuantiusBenignus/BlahST) (MIT). +**Effort estimate:** S +**Notes:** `say` is zero-dependency on macOS and accessible from Kotlin via `Runtime.exec()`. Neural TTS quality requires Piper or a similar model (~50–200 MB). Meeting-end TTS digest is a niche preference; most users prefer to read summaries. Low priority unless user research shows demand. + +--- + +## Feature: Continuous Dictation Loop (Auto-Restart on Silence) +**Priority:** Low +**Inspired by:** [BlahST](https://github.com/QuantiusBenignus/BlahST) +**What they do:** BlahST's `blooper` runs a continuous loop: record until silence (~3 s), transcribe, paste at caret, immediately restart recording until a longer silence or hotkey interrupt stops the loop. +**What Agrapha would do:** In dictation mode, add a "continuous" sub-mode that automatically restarts recording after each segment is pasted, allowing hands-free long-form dictation into any app without repeated hotkey presses. +**Attribution note (README):** Continuous dictation loop design inspired by [BlahST](https://github.com/QuantiusBenignus/BlahST) (MIT). +**Effort estimate:** S +**Notes:** Requires VAD (Silero) to detect silence reliably. Depends on: Global Hotkey / Dictation Mode, Silero VAD, and the toggle recording mode being in place. Low priority because VAD-triggered auto-stop in toggle mode achieves a similar UX with less complexity. 
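The continuous dictation loop described above (record until silence, transcribe, paste, restart) can be sketched as a small control loop. This is an illustrative, language-neutral sketch in Python rather than Agrapha's Kotlin code; `run_dictation_loop` and its four callables are hypothetical names, and it assumes the recorder blocks until the VAD reports trailing silence and returns `None` when no speech was heard at all.

```python
from typing import Callable, List, Optional


def run_dictation_loop(
    record_segment: Callable[[], Optional[bytes]],
    transcribe: Callable[[bytes], str],
    paste: Callable[[str], None],
    stop_requested: Callable[[], bool],
) -> int:
    """Record -> transcribe -> paste, restarting until silence or a stop request.

    record_segment() is assumed (hypothetically) to block until trailing
    silence is detected and to return None when only silence was captured,
    which stands in for the "longer silence" exit condition.
    Returns the number of segments pasted.
    """
    pasted = 0
    while not stop_requested():  # hotkey interrupt is checked between segments
        audio = record_segment()
        if audio is None:  # nothing but silence: end the loop
            break
        text = transcribe(audio).strip()
        if text:  # skip empty transcriptions instead of pasting blanks
            paste(text + " ")
            pasted += 1
    return pasted


# Demo with stand-in callables (no audio hardware or Whisper model involved).
segments = iter([b"hello", b"world", None])
output: List[str] = []
count = run_dictation_loop(
    record_segment=lambda: next(segments),
    transcribe=lambda audio: audio.decode(),
    paste=output.append,
    stop_requested=lambda: False,
)
```

Passing the recorder, transcriber, and paste action in as callables keeps the loop testable without a microphone, which is one plausible way to layer this sub-mode on top of the existing toggle mode.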
+ +--- + +## Feature: SenseVoice / Paraformer Multilingual Engines +**Priority:** Low +**Inspired by:** [VoxType](https://github.com/peteonrails/voxtype) +**What they do:** VoxType supports SenseVoice (CTC ONNX, Chinese/English/Japanese/Korean/Cantonese) and Paraformer (non-autoregressive ONNX, Chinese+English bilingual) as alternative engines for non-English-primary users. +**What Agrapha would do:** Add SenseVoice and/or Paraformer as selectable engines via ONNX Runtime for Java for users whose meetings are primarily in Chinese, Japanese, or Korean, complementing Whisper's multilingual support with faster inference for those languages. +**Attribution note (README):** SenseVoice and Paraformer multilingual ONNX engine patterns inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT). +**Effort estimate:** L +**Notes:** Lower priority than Parakeet (English-first user base). Deliver after the `TranscriptionEngine` abstraction and Parakeet are in place. SenseVoice's emotion and event detection (laughter, applause) could be a differentiating feature for meeting transcription — worth noting as a future angle. + +--- + +*Backlog last updated: 2026-05-09. 31 items total: 11 High, 12 Medium, 8 Low.* diff --git a/project_plans/agrapha-feature-research/implementation/validation.md b/project_plans/agrapha-feature-research/implementation/validation.md new file mode 100644 index 0000000..0d4b958 --- /dev/null +++ b/project_plans/agrapha-feature-research/implementation/validation.md @@ -0,0 +1,112 @@ +# Validation Report — Agrapha Feature Backlog + +Validated against: `requirements.md`, `research/voxtype.md`, `research/blahst.md`, `research/handy.md`, `research/comparable-projects.md` +Date: 2026-05-09 + +--- + +## Requirements Coverage + +| Feature Area | Covered? 
| Backlog Item(s) | +|---|---|---| +| Push-to-talk / dictation mode | YES | "Toggle vs Push-to-Talk Recording Modes" (High), "Global Hotkey / Dictation Mode" (High) | +| Additional transcription engines beyond Whisper | PARTIAL — no High item | "Parakeet ONNX Engine" (Medium), "Moonshine Engine" (Low), "SenseVoice/Paraformer" (Low), "macOS Native Speech Framework" (Medium) | +| LLM integration patterns | YES | "Multiple Named LLM Post-Processing Prompts" (High), "One-Shot Speech-to-LLM" (Medium), "Apple Intelligence On-Device Post-Processing" (Medium) | +| Export formats (Markdown, JSON, SRT, VTT) | YES | "SRT and VTT Export" (High), "JSON Export" (High) | + +**Requirements gap:** Feature area 2 (additional transcription engines) has no High-priority backlog item. The requirements document states all four feature areas must be covered by at least one High-priority item. Parakeet ONNX Engine is the strongest candidate for promotion to High — it is the only alternative engine with a clear implementation path (ONNX Runtime for Java) and concrete evidence from three projects (VoxType, Handy, Meetily, plus the newly discovered Hex). + +--- + +## Attribution Issues + +4 issues found. + +**Issue 1 — Parakeet ONNX Engine: Meetily's Parakeet implementation method overstated** +- Item: "Parakeet ONNX Engine" +- What's wrong: "What they do" states that VoxType, Handy, and Meetily all support Parakeet "via ONNX Runtime." The research confirms ONNX Runtime for VoxType and Handy. For Meetily, the research only says it supports "Whisper or Parakeet models" — the underlying runtime for Parakeet in Meetily is not confirmed. Meetily's architecture note specifies only "Rust whisper.cpp bindings for transcription; SortFormer ONNX for diarization." Claiming Meetily uses ONNX Runtime specifically for Parakeet is an overstatement. +- Suggested fix: Change "What they do" to: "VoxType and Handy support NVIDIA Parakeet (FastConformer TDT) via ONNX Runtime as an alternative to Whisper. 
Meetily also supports Parakeet models (implementation details not confirmed in available source)." + +**Issue 2 — Global Hotkey / Dictation Mode: BlahST omitted from attribution note** +- Item: "Global Hotkey / Dictation Mode (Paste to Active App)" +- What's wrong: BlahST is listed in the "Inspired by" field and correctly described in "What they do," but it is absent from the "Attribution note (README)" field. The note credits VoxType, Handy, and open-wispr only. +- Suggested fix: Add BlahST to the attribution note: "...inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT), [Handy](https://github.com/cjpais/Handy) (MIT), [open-wispr](https://github.com/human37/open-wispr) (MIT), and [BlahST](https://github.com/QuantiusBenignus/BlahST) (MIT)." + +**Issue 3 — Silero VAD: WhisperWriter omitted from attribution note** +- Item: "Silero VAD (Voice Activity Detection)" +- What's wrong: WhisperWriter is correctly listed in "Inspired by" and credited in "What they do" for making silence duration configurable. However, the attribution note credits only Handy. +- Suggested fix: Update attribution note to: "Silero VAD integration for silence trimming inspired by [Handy](https://github.com/cjpais/Handy) (MIT) and [WhisperWriter](https://github.com/savbell/whisper-writer) (MIT)." + +**Issue 4 — macOS Native Speech Framework: whisper-mac has no license specified** +- Item: "macOS Native Speech Framework Engine (SFSpeechRecognizer)" +- What's wrong: The attribution note omits a license tag, but more importantly the research explicitly notes whisper-mac has "no license specified" and should be treated as "inspiration only, not for attribution." The backlog promotes it to a full attribution credit anyway. 
+- Suggested fix: Downgrade the attribution note to a softer acknowledgment — e.g., "macOS native Speech framework engine integration inspired by the engine inventory in [whisper-mac](https://github.com/Explosion-Scratch/whisper-mac) (license unspecified — no code reuse)." Alternatively, cite `SFSpeechRecognizer` as an Apple-documented API and drop the whisper-mac credit entirely, since the underlying technology is Apple's. + +--- + +## Attribution Note Quality + +3 items flagged. + +**Flag 1 — Global Hotkey / Dictation Mode**: Attribution note is missing BlahST (see Issue 2 above). This makes the note inconsistent with the "Inspired by" field, which is the canonical credit record. Fix as described in Issue 2. + +**Flag 2 — Silero VAD**: Attribution note is missing WhisperWriter (see Issue 3 above). Fix as described in Issue 3. + +**Flag 3 — macOS Menu Bar Recording Status Indicator**: The note reads "Always-visible recording status indicator concept inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT)." The word "concept" is appropriately hedged (VoxType's implementation is Waybar/Linux-specific JSON), but the note would be more useful if it specified what the inspiration is: e.g., "Always-visible recording status indicator concept — specifically the real-time JSON state feed powering system bar integration — inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT)." Not a blocking issue, but improves README clarity. + +All markdown link formats are syntactically correct (`[ProjectName](URL)` throughout). No broken or malformed links detected. + +--- + +## New Discovery Integration + +3 items require action. + +**WhisperKit (argmax-oss-swift) — SRT/VTT export credit missing** + +The "SRT and VTT Export" item credits only VoxType. WhisperKit has built-in SRT/VTT subtitle export as a first-class feature (`WhisperKit` produces word-level timestamps and serialises them to SRT/VTT). 
This is the most direct implementation reference for Agrapha's Kotlin serializer, since WhisperKit documents the data model explicitly (word timestamps → segment grouping → SRT formatting). Both items ("SRT and VTT Export" and "JSON Export") should be reviewed; JSON export is VoxType-only and WhisperKit does not produce JSON meeting exports, so JSON attribution is correct as-is. + +Recommended change for "SRT and VTT Export": +- Add `[argmax-oss-swift (WhisperKit)](https://github.com/argmaxinc/argmax-oss-swift)` to "Inspired by" +- Update attribution note to: "SRT and VTT export format design inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT); subtitle data model (word timestamps → segment grouping) noted from [WhisperKit / argmax-oss-swift](https://github.com/argmaxinc/argmax-oss-swift) (MIT)." + +**Hex (kitlangton) — Parakeet engine item should credit it** + +The "Parakeet ONNX Engine" item credits VoxType, Handy, and Meetily. Hex provides a clean, production Swift implementation of dual-engine Parakeet+Whisper switching with a user-facing toggle — the most direct reference design for Agrapha's planned `TranscriptionEngine` abstraction. Since Hex is MIT-licensed and macOS-only (Apple Silicon), it is a closer platform match than VoxType (Linux) or Handy (multi-platform Tauri). + +Recommended change for "Parakeet ONNX Engine": +- Add `[Hex](https://github.com/kitlangton/Hex)` to "Inspired by" +- Update attribution note to include: "...and [Hex](https://github.com/kitlangton/Hex) (MIT) for the multi-engine toggle design pattern." + +**noScribe — no backlog item for transcript correction editor** + +noScribe ships a dedicated correction editor (noScribeEdit) for reviewing and annotating transcripts word-by-word. There is no backlog item for an inline transcript editor in Agrapha. This is a gap relative to the research findings. 
noScribe's GPL-3.0 license means no code reuse, but the UX pattern (click a word, correct it, re-run a segment) is freely borrowable. + +Recommendation: Add a new Medium-priority backlog item — "Inline Transcript Correction Editor" — inspired by noScribe, noting GPL-3.0 code restrictions. This fits naturally between the existing transcription history feature and the re-transcription feature, and aligns with Agrapha's meeting memory use case (correcting misheard names, technical terms, proper nouns before export). + +--- + +## Verdict + +**NEEDS REVISION** + +The backlog requires the following changes before it is ready to use: + +**Must fix (blocking):** +1. Promote "Parakeet ONNX Engine" from Medium to High priority to satisfy the requirements coverage rule for feature area 2 (additional transcription engines). This is the only gap against the four required High-priority coverage areas. +2. Fix "Parakeet ONNX Engine" — "What they do" field overstates Meetily's Parakeet implementation as "ONNX Runtime" when that is unconfirmed (Issue 1). +3. Add BlahST to the "Global Hotkey / Dictation Mode" attribution note (Issue 2). +4. Add WhisperWriter to the "Silero VAD" attribution note (Issue 3). + +**Should fix (quality):** +5. Resolve the whisper-mac license problem in "macOS Native Speech Framework Engine" attribution note (Issue 4). +6. Add WhisperKit credit to "SRT and VTT Export." +7. Add Hex credit to "Parakeet ONNX Engine." +8. Add a new "Inline Transcript Correction Editor" item (Medium priority) crediting noScribe. +9. Improve specificity of "macOS Menu Bar Recording Status Indicator" attribution note (Flag 3). 
+ +**Count summary:** +- Attribution issues: 4 +- Items needing new discovery credits: 2 (SRT/VTT Export → WhisperKit; Parakeet ONNX Engine → Hex) +- Items that should exist but don't: 1 (Inline Transcript Correction Editor, from noScribe) +- Requirements gaps: 1 (no High-priority engine item) diff --git a/project_plans/agrapha-feature-research/requirements.md b/project_plans/agrapha-feature-research/requirements.md new file mode 100644 index 0000000..2d30ddc --- /dev/null +++ b/project_plans/agrapha-feature-research/requirements.md @@ -0,0 +1,56 @@ +# Agrapha Feature Research — Requirements + +## Overview + +Survey open-source local-first speech-to-text and meeting-transcription projects to build a prioritized feature backlog for Agrapha. Each backlog item must include an attribution note (project name + URL) for the README. + +## Current State of Agrapha + +- macOS desktop app (Kotlin Multiplatform + Compose Desktop) +- Records both mic and system audio via CoreAudio/ScreenCaptureKit JNI +- Transcribes locally with Whisper.cpp via JNI (Apple Neural Engine / CoreML) +- Optional speaker diarization via pyannote.audio +- Optional transcript correction + summarization via LLM (Ollama, OpenAI, Anthropic) +- Exports to Logseq (journal entries with [[links]]) +- Outputs: key points, decisions, action items, full transcript +- Stack: Kotlin Multiplatform · Compose Desktop · SQLDelight · Whisper.cpp JNI + +## Seed References (for README attribution) + +| Project | URL | Stars | Language | Focus | +|---------|-----|-------|----------|-------| +| voxtype | https://github.com/peteonrails/voxtype | 712 | Rust | Push-to-talk STT for Linux/Wayland; 7 engines; meeting mode | +| BlahST | https://github.com/QuantiusBenignus/BlahST | 172 | Shell | Lean whisper.cpp wrapper; LLM integration; continuous dictation | +| Handy | https://github.com/cjpais/Handy | 21364 | Rust/Tauri | Cross-platform desktop STT; VAD; Parakeet; extensible | + +Research should also discover 3–5 additional 
comparable projects. + +## Feature Areas of Interest + +1. **Push-to-talk / dictation mode** — hold-key-to-record anywhere on screen (not just during meetings) +2. **Additional transcription engines** — Parakeet, Moonshine, SenseVoice, Paraformer beyond Whisper +3. **LLM integration patterns** — speech-to-LLM chat, AI proofreader, one-shot assistant, TTS responses +4. **Export formats** — Markdown, JSON, SRT, VTT (beyond current Logseq/plain-markdown) + +## Deliverable + +A **prioritized feature backlog** structured as: + +``` +## Feature: +**Priority:** High / Medium / Low +**Inspired by:** (s) with URL(s) +**What they do:** <1–2 sentences> +**What Agrapha would do:** <1–2 sentences scoped to Agrapha's macOS/meeting context> +**Attribution note (README):** +**Effort estimate:** XS / S / M / L / XL +``` + +Items should be ordered: High priority first, then Medium, then Low. + +## Constraints + +- Agrapha is macOS-only (Intel + Apple Silicon); Linux-specific features are inspiration only +- Core value proposition is meeting transcription + memory system export — features must fit that context or expand it coherently +- Attribution accuracy matters: only credit a project for things they actually do +- Research agent must also search for comparable projects not in the seed list diff --git a/project_plans/agrapha-feature-research/research/blahst.md b/project_plans/agrapha-feature-research/research/blahst.md new file mode 100644 index 0000000..b19ab85 --- /dev/null +++ b/project_plans/agrapha-feature-research/research/blahst.md @@ -0,0 +1,72 @@ +# BlahST — Research + +## Summary + +BlahST (172 stars, zsh shell scripts, MIT) is a minimal Linux STT toolkit built on top of whisper.cpp. It provides six composable scripts — each a hotkey-triggered shell one-liner — covering basic transcription, multilingual input, continuous dictation, LLM chat, and streaming speech-to-speech conversation. 
Its design philosophy is zero-UI ("the best UI is no UI at all"), using the clipboard or PRIMARY selection as the universal paste mechanism. The LLM integration pattern (speech → transcription → llama-server → Piper TTS) is directly relevant to Agrapha's planned AI features. + +## Feature Inventory + +### Scripts + +| Script | Function | +|--------|----------| +| `wsi` | Core STT: record mic via sox, transcribe via local whisper.cpp or whisper.cpp server API or whisperfile, output to clipboard or PRIMARY selection. Silence-detection stops recording automatically (~2 s of silence at 6% threshold). Prevents duplicate invocations via `pidof` guard. | +| `wsiml` | Multilingual variant of `wsi`: `-l fr` forces language, `-t` translates to English. Supports multiple hotkeys per language. | +| `wsiAI` | One-shot LLM assistant: transcribes speech, constructs prompt ("Assistant …", "Translator …", "Computer, proofread …"), sends to llama-server or llamafile, gets text response, speaks via Piper TTS, and places answer in clipboard. | +| `blooper` | Continuous dictation loop: transcribes in a loop, autopastes on each pause (~3 s silence threshold), exits on longer silence or hotkey interrupt; uses xdotool/ydotool for autopaste. | +| `blahstbot` | Low-latency speech chat: record → whisper.cpp → llama-server → Piper TTS → spoken + clipboard response. Context persists between turns; "RESET CONTEXT" spoken command clears it. `-n` flag uses LAN server for offload. | +| `blahstream` | Streaming speech-to-speech chat: like `blahstbot` but LLM response streamed token-by-token; each chunk spoken and autopasted in real time. Context compression via summarization. X11: buffers paste until original window regains focus. WIP/experimental. 
| + +### Additional Features + +- **whisperfile support**: portable single-file whisper model executables (`-w` flag); no compilation needed +- **AI proofreader**: triggered by keyword in speech ("Computer, proofread …"); sends currently selected text to LLM, replaces selection with corrected version — no clipboard interaction required +- **AI translator**: "Translator …" spoken keyword → LLM translates text, speaks result, places in clipboard +- **Clipboard vs. PRIMARY selection**: `wsi` (no flag) uses clipboard (Ctrl+V paste); `wsi -p` uses PRIMARY selection (middle-mouse paste); both support X11 and Wayland +- **Network transcription**: whisper.cpp HTTP server API (`-n` or explicit `IP:PORT` argument); sub-150 ms round-trip on LAN +- **Autopaste**: xdotool (X11) or ydotool (Wayland) simulates Ctrl+V or middle-click after transcription +- **Single-instance guard**: `pidof -q && exit 0` prevents duplicate hotkey presses from stacking +- **Hotkey interrupt**: `pkill rec` as a second hotkey cancels in-flight recording immediately +- **Piper TTS**: neural local TTS for spoken LLM responses; no cloud, supports multiple voice models +- **Centralised config**: all tools share `blahst.cfg`; local per-script overrides possible +- **Microphone indicator**: GNOME top-bar mic icon appears for duration of recording (uses system desktop notification mechanism) + +## LLM Integration Approach + +BlahST treats each LLM interaction as a one-shot or stateful HTTP call to a locally-running `llama-server` (llama.cpp) or a `llamafile` executable: + +1. **Speech capture**: `rec` (sox) captures at 16 kHz until silence +2. **Transcription**: `whisper-cli` (local) or HTTP POST to `/inference` (server/whisperfile) +3. **Prompt construction**: shell string interpolation; keyword in speech ("Computer …", "Assistant …") determines which system prompt to prepend +4. **LLM call**: `curl` POST to `llama-server` API; streaming (`blahstream`) or one-shot (`blahstbot`) +5. 
**TTS output**: response piped to `piper` for local neural TTS; output played via `aplay` +6. **Clipboard/paste**: response also written to clipboard for manual paste + +Context management in `blahstbot` is manual: conversation history held in a shell array, serialised to JSON, truncated or summarised when it grows too large. + +## Continuous Dictation Design + +`blooper` uses sox silence detection (`silence 1 0.1 1% 1 2.0 5%`) to end each segment, then immediately autopastes and restarts recording. The loop exits on ≥3 s silence or hotkey interrupt (`pkill rec`). Text accumulates at the keyboard caret via xdotool/ydotool. This is entirely within the terminal/shell — no GUI required. + +## Architecture Notes + +- Pure zsh scripts (~100–300 lines each); no compiled components beyond whisper.cpp and llama.cpp +- All IPC via clipboard, PRIMARY selection, and signals (`pkill`) +- Stateless between invocations (except `blahstbot` conversation array) +- Not portable to macOS as-is (sox flags, xdotool/ydotool, xsel/wl-copy are Linux-specific), but the **design patterns** are portable +- Relevant to Agrapha: the clipboard-as-universal-output pattern, the one-shot LLM prompt construction pattern, and the keyword-triggered AI proofreader concept all map cleanly to Kotlin/Compose + +## Agrapha Relevance + +| Feature | Rationale | +|---|---| +| **Continuous dictation mode** (`blooper`) | Agrapha is meeting-first but users want always-on dictation between meetings. `blooper`'s loop-until-silence pattern is the reference design: record → transcribe → paste → repeat. Could be Agrapha's "Dictation Mode" toggle. | +| **AI proofreader triggered by selected text** | Select text in any app, say "proofread", and it's replaced. This would complement Agrapha's existing LLM integration at near-zero engineering cost (services layer already exists). | +| **One-shot speech-to-LLM** (`wsiAI` pattern) | User dictates a question; Agrapha sends it to configured LLM; speaks/displays the answer. 
Natural extension of the existing Ollama/Anthropic/OpenAI integration. | +| **Keyword-dispatch to different LLM prompts** | Spoken prefixes ("Summarise …", "Action items …", "Draft email …") can route to different pre-configured system prompts. Low engineering cost, high UX value in a meeting context. | +| **Network transcription offload** | Agrapha runs on the meeting device; a second Mac or a home server could run whisper-server for faster inference. BlahST's `-n` pattern is the reference for how to add a remote backend. | +| **Piper TTS for spoken responses** | Agrapha could speak AI-generated summaries or action items at meeting end. Piper runs fully locally; the macOS analogue is `say` (built-in) or a native neural TTS. | + +## Attribution Note + +> Continuous dictation loop design and speech-to-LLM one-shot pattern inspired by [BlahST](https://github.com/QuantiusBenignus/BlahST) (MIT). diff --git a/project_plans/agrapha-feature-research/research/comparable-projects.md b/project_plans/agrapha-feature-research/research/comparable-projects.md new file mode 100644 index 0000000..5843295 --- /dev/null +++ b/project_plans/agrapha-feature-research/research/comparable-projects.md @@ -0,0 +1,324 @@ +# Comparable Projects — Research + +## Summary + +Five additional open-source local-first STT and meeting-transcription projects were identified beyond the three seed projects. The most Agrapha-relevant are Meetily (meeting assistant closest in intent to Agrapha), OpenWhispr (cross-platform Electron app with a voice assistant, calendar integration, and local diarization), and WhisperWriter (four recording modes, VAD, continuous recording). Most are MIT-licensed; whisper-mac specifies no license, and WhisperWriter's MIT status is implied rather than confirmed. Cloud-only tools and mobile-only apps were excluded. 
+ +--- + +## Project 1 — Meetily + +**URL**: https://github.com/Zackriya-Solutions/meetily +**Stars**: 11,649 +**Language**: Rust (backend) + Next.js/pnpm (frontend) +**License**: MIT (Community Edition; PRO commercial tier also available) +**Platform**: macOS, Windows (Linux build from source) + +### Feature Inventory + +- Real-time meeting transcription using Whisper or Parakeet models (no cloud) +- Speaker diarization with SortFormer (Nvidia model; real-time on-device) +- Microphone + system audio capture simultaneously with intelligent ducking and clipping prevention +- AI-powered summaries: Ollama (local), Claude, Groq, OpenRouter, or any OpenAI-compatible endpoint +- Import existing audio files; re-transcribe with different model or language (Beta) +- GPU acceleration: Apple Silicon Metal + CoreML; NVIDIA CUDA; AMD/Intel Vulkan — auto-enabled at build time +- Export to PDF, DOCX (PRO); community edition supports standard text export +- Custom summary templates (PRO) +- Auto-meeting detection (PRO) +- GDPR compliance tooling (PRO) + +### Architecture Notes + +- Tauri v2 (Rust) backend + Next.js frontend (similar to Handy's Tauri architecture) +- Rust whisper.cpp bindings for transcription; SortFormer ONNX for diarization +- SQLite local storage; all recordings and transcripts on-device +- macOS: CoreML acceleration path mirrors Agrapha's existing JNI + CoreML pipeline + +### Agrapha Relevance + +Meetily is the closest peer to Agrapha in intent (meeting minutes + summaries + local-first). 
Key borrowable patterns: +- **SortFormer for real-time diarization**: more accurate than pyannote.audio for live streams; ONNX-based so potentially JNI-accessible +- **Simultaneous mic + system audio capture with ducking**: Agrapha already does this via CoreAudio JNI; Meetily's ducking/clipping-prevention implementation is worth examining for improving Agrapha's audio quality +- **Re-transcribe with different model**: let users re-process saved meetings with a newer or larger model; useful when Whisper large-v3 becomes faster on newer Apple Silicon + +### Attribution Note + +> Re-transcription with model selection and real-time diarization design inspired by [Meetily](https://github.com/Zackriya-Solutions/meetily) (MIT). + +--- + +## Project 2 — OpenWhispr + +**URL**: https://github.com/OpenWhispr/openwhispr +**Stars**: 2,998 +**Language**: TypeScript (Electron 41 + React 19) +**License**: MIT +**Platform**: macOS (Apple Silicon + Intel), Windows, Linux + +### Feature Inventory + +- Voice dictation: global hotkey → dictate into any app with automatic pasting +- AI agent: talk to GPT-5, Claude, Gemini, Groq, or local models with a named voice assistant +- Meeting transcription: auto-detect Zoom, Teams, FaceTime calls; live speaker diarization; voice fingerprinting; Google Calendar integration +- Local speaker diarization: on-device speaker labelling with voice fingerprint recognition across meetings (no cloud) +- Notes: create/organise/search with folders, semantic search, cloud sync, AI actions +- Local or cloud transcription: Whisper (whisper.cpp), NVIDIA Parakeet (sherpa-onnx), cloud providers +- Public API and MCP server: programmatic access to notes and transcriptions; Claude/other AI assistants can call the MCP server +- All core features work with local models (no API key needed) + +### Architecture Notes + +- Electron 41 + React 19 + Tailwind CSS v4 + better-sqlite3 + shadcn/ui +- whisper.cpp for local Whisper; sherpa-onnx for Parakeet +- Semantic search 
over notes (embedding model via HF) +- MCP server exposes transcription history and notes to AI assistants + +### Agrapha Relevance + +- **Auto-detect meeting apps (Zoom/Teams/FaceTime)**: Agrapha could auto-start recording when a known video-call app becomes active — reduces friction dramatically for meeting users +- **Voice fingerprinting across sessions**: persistent speaker identity across meetings ("Alice from Acme always sounds like this") rather than per-meeting anonymous Speaker 1/2. Directly relevant to Agrapha's diarization +- **MCP server**: exposing Agrapha's transcript history via MCP would let Claude Desktop, Cursor, or any MCP-aware AI assistant query past meeting content. Low implementation cost against Agrapha's existing API surface +- **Semantic search over transcripts**: SQLDelight + a local embedding model (whisper-derived or sentence-transformers) for "find the meeting where we discussed pricing" — high value for Agrapha's memory-system export narrative + +### Attribution Note + +> Auto-detection of video-call applications, voice fingerprinting design, and MCP server integration inspired by [OpenWhispr](https://github.com/OpenWhispr/openwhispr) (MIT). 
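To make the MCP idea above more concrete, the following sketch shows only the JSON-RPC 2.0 request/response shape that such a server would sit on. It is written in Python for brevity and is not an MCP implementation: a real server would use the protocol's own tool-listing and tool-call methods, typically via an SDK. The `search_meetings` method name, the `MEETINGS` list, and the record fields are all hypothetical stand-ins for Agrapha's SQLDelight-backed data.

```python
import json
from typing import Any, Dict, List

# Hypothetical in-memory stand-in for Agrapha's local meeting store.
MEETINGS: List[Dict[str, Any]] = [
    {"id": 1, "title": "Pricing review", "summary": "Decided to keep the annual plan."},
    {"id": 2, "title": "Sprint planning", "summary": "Agreed on the export milestone."},
]


def search_meetings(query: str) -> List[Dict[str, Any]]:
    """Case-insensitive substring search over titles and summaries."""
    needle = query.lower()
    return [
        m for m in MEETINGS
        if needle in m["title"].lower() or needle in m["summary"].lower()
    ]


def handle_request(raw: str) -> str:
    """Dispatch one JSON-RPC 2.0 request and return the serialised response."""
    request = json.loads(raw)
    req_id = request.get("id")
    if request.get("method") == "search_meetings":  # hypothetical method name
        query = request.get("params", {}).get("query", "")
        response = {"jsonrpc": "2.0", "id": req_id, "result": search_meetings(query)}
    else:
        # -32601 is the JSON-RPC 2.0 "Method not found" error code.
        response = {
            "jsonrpc": "2.0",
            "id": req_id,
            "error": {"code": -32601, "message": "Method not found"},
        }
    return json.dumps(response)


reply = json.loads(handle_request(json.dumps(
    {"jsonrpc": "2.0", "id": 7, "method": "search_meetings",
     "params": {"query": "pricing"}}
)))
unknown = json.loads(handle_request(json.dumps(
    {"jsonrpc": "2.0", "id": 8, "method": "no_such_method"}
)))
```

The transport (stdio or HTTP) is deliberately left out; the more interesting design work is deciding which meeting fields (segments, speakers, action items) the result schema should expose.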
+ +--- + +## Project 3 — WhisperWriter + +**URL**: https://github.com/savbell/whisper-writer +**Stars**: 1,049 +**Language**: Python (PyQt5 GUI, faster-whisper) +**License**: MIT (implied — no LICENSE file found but standard open-source practices stated) +**Platform**: Windows, macOS, Linux + +### Feature Inventory + +- Four recording modes: `continuous` (auto-restart until shortcut pressed again), `voice_activity_detection` (stop on silence, wait for re-trigger), `press_to_toggle`, `hold_to_record` +- VAD filter via Silero (optional); silence duration configurable (default 900 ms) +- Local model (faster-whisper / CTranslate2) or OpenAI API (or any OpenAI-compatible local endpoint like LocalAI) +- Configurable `initial_prompt` for domain vocabulary conditioning +- `condition_on_previous_text`: uses previous transcription as next prompt (improves coherence in continuous mode) +- Post-processing: remove trailing period, add trailing space, remove capitalisation, configurable key-press delay +- Status window (small, hideable) shows current stage (recording / transcribing) +- Configurable `activation_key` (default `ctrl+shift+space`) +- `input_method`: pynput (default) or alternative backends + +### Architecture Notes + +- Python + PyQt5; faster-whisper (CTranslate2) for local inference — not portable to JVM +- Design patterns (four recording modes, condition_on_previous_text, initial_prompt for vocab) are directly portable + +### Agrapha Relevance + +- **`condition_on_previous_text`**: pass the previous chunk's transcription as Whisper's initial_prompt for long continuous recordings. Reduces repetition errors at chunk boundaries. 
Agrapha's meeting chunking could adopt this immediately via the existing JNI bridge (WhisperParams already accepts initial_prompt) +- **Four recording modes in one settings field**: clean enum design; Agrapha should offer at minimum `continuous` (meeting mode) and `hold_to_record` (dictation mode) +- **Local API endpoint support**: swap OpenAI API for a local whisper.cpp server; Agrapha could add a remote Whisper endpoint fallback for users with a faster Mac on their LAN + +### Attribution Note + +> Continuous recording mode with previous-text conditioning and four recording mode design inspired by [WhisperWriter](https://github.com/savbell/whisper-writer) (MIT). + +--- + +## Project 4 — open-wispr + +**URL**: https://github.com/human37/open-wispr +**Stars**: 127 +**Language**: Swift (native macOS app) +**License**: MIT +**Platform**: macOS only (Apple Silicon, Metal acceleration) + +### Feature Inventory + +- Global hotkey to start/stop recording +- whisper.cpp on CPU/GPU (Metal); temp file approach (audio → temp file → transcribe → delete) +- Fully offline; no network requests except model download +- Pastes transcription to active app +- Open-source Swift app; simple, minimal codebase + +### Architecture Notes + +- Native Swift + whisper.cpp; Metal GPU via Core ML/whisper.cpp Metal backend +- Very similar architecture to what Agrapha does, but in Swift instead of Kotlin +- Small codebase (a few hundred lines) — useful as a reference implementation for macOS-specific integration patterns (accessibility API for paste, hotkey registration via NSEvent) + +### Agrapha Relevance + +- **macOS accessibility API for paste to active app**: open-wispr's Swift source shows how to use `AXUIElement` and `CGEvent` to inject text into the frontmost app without simulating a Cmd+V paste.
Agrapha's JNI could call the same macOS APIs via JNA or a lightweight Swift bridge +- **Global hotkey registration**: shows the `addGlobalMonitorForEvents` / `addLocalMonitorForEvents` approach on macOS for hotkeys; note that Apple's documentation states key-related events can only be monitored globally when the app is trusted for accessibility, so Agrapha should plan for that permission prompt + +### Attribution Note + +> macOS-native global hotkey registration and accessibility API paste design inspired by [open-wispr](https://github.com/human37/open-wispr) (MIT). + +--- + +## Project 5 — whisper-mac (Explosion-Scratch) + +**URL**: https://github.com/Explosion-Scratch/whisper-mac +**Stars**: 45 +**Language**: TypeScript (Electron/Tauri) +**License**: Not specified +**Platform**: macOS + +### Feature Inventory + +- Local-first transcription for macOS +- Supports Parakeet, WhisperCPP, Vosk, macOS native Speech framework (all local), or cloud (Gemini, Mistral) +- Described as "extensible" — plugin-friendly architecture for adding engines + +### Architecture Notes + +- Low star count and no license specified; treat as inspiration only, not for attribution +- Most interesting differentiator: **macOS native Speech framework** (SFSpeechRecognizer) as one of the backends — zero additional model download, built into every Mac since macOS 10.15 + +### Agrapha Relevance + +- **macOS native Speech framework as a fast/free engine**: SFSpeechRecognizer can run on-device (no model download), supports English well, and is already optimised by Apple. Could be offered as the "quick start" engine before a user has downloaded a Whisper model. Latency is ~100–200 ms for short utterances +- Note: SFSpeechRecognizer may send audio to Apple servers by default unless `requiresOnDeviceRecognition = true` is set (available iOS 13+ / macOS 10.15+). This default must be surfaced to users in Agrapha's privacy model + +### Attribution Note + +> macOS native Speech framework engine integration pattern noted from [whisper-mac](https://github.com/Explosion-Scratch/whisper-mac).
+ +--- + +## Summary Table + +| Project | Stars | Language | Most Relevant Feature for Agrapha | +|---------|-------|----------|------------------------------------| +| [Meetily](https://github.com/Zackriya-Solutions/meetily) | 11,649 | Rust | SortFormer real-time diarization; audio ducking | +| [OpenWhispr](https://github.com/OpenWhispr/openwhispr) | 2,998 | TypeScript/Electron | Meeting app auto-detection; voice fingerprinting; MCP server | +| [WhisperWriter](https://github.com/savbell/whisper-writer) | 1,049 | Python | condition_on_previous_text; four recording modes | +| [open-wispr](https://github.com/human37/open-wispr) | 127 | Swift | macOS accessibility API paste; hotkey registration | +| [whisper-mac](https://github.com/Explosion-Scratch/whisper-mac) | 45 | TypeScript | macOS native Speech framework as engine | + +--- + +## Additional Discoveries + +Three additional projects discovered in supplementary search, none of which overlapped with the five projects above. + +--- + +### Discovery 1 — argmax-oss-swift (WhisperKit + SpeakerKit + TTSKit) + +**URL**: https://github.com/argmaxinc/argmax-oss-swift +**Stars**: 6,072 +**Language**: Swift +**License**: MIT +**Platform**: macOS, iOS, visionOS (Apple Silicon) + +#### Feature Inventory + +- **WhisperKit**: CoreML-accelerated Whisper STT, streaming chunk-by-chunk transcription, word-level timestamps, SRT/VTT subtitle export, custom vocabulary, multi-language +- **SpeakerKit**: on-device speaker diarization via pyannote ONNX; combines with WhisperKit via a single API call to produce diarized transcripts +- **TTSKit**: on-device text-to-speech via Qwen-TTS; real-time streaming playback; custom voice styles; saves audio to file +- Local HTTP server exposing OpenAI-compatible `/v1/audio/transcriptions` endpoint — drop-in for apps that already talk to OpenAI Whisper API +- Swift Package Manager distribution; CLI (`whisperkit-cli`, `speakerkit-cli`, `ttskit-cli`) for scripting +- SRT/VTT subtitle file export built 
into WhisperKit + +#### Architecture Notes + +- Pure Swift; CoreML for inference — no Python dependency, no whisper.cpp JNI bridge required +- SpeakerKit wraps pyannote ONNX with a Swift-native API; RTTM output format for downstream tooling +- Pro SDK (closed source) adds real-time diarization, Android Kotlin support, Argmax Local Server for non-native apps +- Hugging Face model hub for model downloads (100k+ downloads/month) + +#### Agrapha Relevance + +- **SRT/VTT export**: WhisperKit's built-in subtitle export shows the exact data model needed (word timestamps → segment grouping → SRT formatting). Agrapha can adopt the same approach over its existing JNI bridge without a full Swift rewrite +- **TTSKit as read-back engine**: Agrapha currently has no TTS. TTSKit is a ready-made on-device TTS library for Apple Silicon; could be called from Agrapha via a thin Swift JNI bridge to provide "read meeting back aloud" or dictation confirmation audio +- **SpeakerKit diarization API design**: SpeakerKit's Swift API (`SpeakerKit.diarize(audioURL:)` → `[(speaker, start, end)]`) is a clean interface pattern worth mirroring in Agrapha's `DiarizationService` abstraction layer +- **OpenAI-compatible local server**: Agrapha could expose the same `/v1/audio/transcriptions` endpoint, enabling integration with any tool that already speaks OpenAI Whisper API + +#### Attribution Note + +> SRT/VTT export structure, TTSKit on-device TTS integration pattern, and SpeakerKit diarization API design noted from [argmax-oss-swift](https://github.com/argmaxinc/argmax-oss-swift) (MIT). 
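
The SRT export path described above (timestamped segments rendered as numbered cues) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not WhisperKit's or Agrapha's implementation: `Segment`, `srt_timestamp`, and `to_srt` are hypothetical names, and segments are assumed to be already grouped from word timestamps.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical transcript segment: start/end in seconds plus its text."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as SRT cues: index, time range, text, blank separator."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    # Joining with "\n" leaves exactly one blank line between cues.
    return "\n".join(cues)
```

A WebVTT variant would reuse the same loop, adding a `WEBVTT` header line and using `.` instead of `,` before the milliseconds.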
+ +--- + +### Discovery 2 — noScribe + +**URL**: https://github.com/kaixxx/noScribe +**Stars**: 1,964 +**Language**: Python (faster-whisper + pyannote + custom Qt editor) +**License**: GPL-3.0 +**Platform**: macOS (Apple Silicon + Intel), Windows, Linux + +#### Feature Inventory + +- Automated transcription of recorded interviews and spoken content using faster-whisper (CTranslate2) +- Speaker diarization via pyannote.audio; distinguishes multiple speakers in post-processed audio +- Built-in transcript editor (noScribeEdit) for reviewing, correcting, and annotating transcripts +- GPU acceleration: CUDA (NVIDIA) on Windows/Linux; Apple Silicon Metal path on macOS +- Supports ~60 languages +- Targeted at qualitative social research and journalism (structured for verbatim interview transcription) +- Completely offline; no network calls after model download +- Free, always-free policy; actively maintained + +#### Architecture Notes + +- Python application; not portable as a library to JVM — design patterns only +- Distributes as a bundled installer (not available via Homebrew or package manager) +- Editor is a separate app (noScribeEdit) shipped alongside the transcriber +- GPL-3.0 license: cannot copy code into Agrapha (Apache/MIT), but design patterns are freely borrowable + +#### Agrapha Relevance + +- **Dedicated transcript editor UX**: noScribe ships a standalone correction editor tightly integrated with the transcription output. Agrapha's roadmap could include an inline transcript editor (click to correct a word, re-run a segment) — noScribe's UX pattern is the reference implementation +- **Interview/research mode vs. meeting mode**: noScribe is optimised for long, high-quality post-processed transcription (one hour in, three hours out) rather than real-time. 
Agrapha could add an "accuracy mode" that uses a larger model and longer processing time for important archived recordings +- **pyannote diarization pipeline details**: noScribe's open source shows how pyannote is tuned for interview-style audio (2–4 speakers, conversational overlap) — relevant parameters for Agrapha's DiarizationService configuration + +#### Attribution Note + +> Transcript correction editor UX and accuracy-mode (post-processed, large-model) transcription design noted from [noScribe](https://github.com/kaixxx/noScribe) (GPL-3.0 — patterns only, no code reuse). + +--- + +### Discovery 3 — Hex + +**URL**: https://github.com/kitlangton/Hex +**Stars**: 2,030 +**Language**: Swift (SwiftUI + Swift Composable Architecture) +**License**: MIT +**Platform**: macOS, Apple Silicon only + +#### Feature Inventory + +- Global hotkey dictation: press-and-hold or double-tap-to-lock recording modes +- Dual engine support: **Parakeet TDT v3** (via FluidAudio, default — fast, multilingual, cloud-optimised) and **WhisperKit** (fully on-device) +- Transcribed text pasted into frontmost app via macOS accessibility API +- Changeset-based release workflow; actively developed +- Homebrew Cask distribution (`brew install --cask kitlangton-hex`) +- MIT licensed + +#### Architecture Notes + +- Swift + SwiftUI + Swift Composable Architecture (TCA) for state management +- FluidAudio library wraps Parakeet TDT v3 (NeMo model from NVIDIA); WhisperKit as fallback/alternative +- Engine selection is a user preference — demonstrates clean multi-engine abstraction in Swift +- Very small focused codebase; press-and-hold vs. lock-toggle is the entire UX + +#### Agrapha Relevance + +- **Multi-engine abstraction (Parakeet + Whisper)**: Hex shows how to cleanly expose two radically different STT backends (cloud-optimised Parakeet vs. fully local WhisperKit) under a single preference toggle. 
Agrapha's engine abstraction layer (`WhisperTranscriptionService`) could use the same pattern to add Parakeet or macOS SFSpeechRecognizer as alternative engines +- **Parakeet TDT v3 via FluidAudio**: Parakeet is reported to be significantly faster than Whisper for real-time dictation. Agrapha could evaluate FluidAudio as a low-latency dictation engine alternative to whisper.cpp, especially for the planned "instant dictation" mode +- **Press-and-hold vs. lock-toggle UX**: Two recording modes in a single global hotkey interaction (hold = momentary, double-tap = latching) is a UX pattern worth adopting in Agrapha to reduce the need for a visible UI during dictation + +#### Attribution Note + +> Multi-engine toggle design (Parakeet + WhisperKit) and press-and-hold vs. lock-toggle recording hotkey UX inspired by [Hex](https://github.com/kitlangton/Hex) (MIT). + +--- + +### Additional Discoveries Summary Table + +| Project | Stars | Language | Most Relevant Feature for Agrapha | +|---------|-------|----------|------------------------------------| +| [argmax-oss-swift (WhisperKit)](https://github.com/argmaxinc/argmax-oss-swift) | 6,072 | Swift | SRT/VTT export; TTSKit on-device TTS; SpeakerKit diarization API | +| [noScribe](https://github.com/kaixxx/noScribe) | 1,964 | Python | Transcript correction editor UX; accuracy-mode (large model) transcription | +| [Hex](https://github.com/kitlangton/Hex) | 2,030 | Swift | Multi-engine abstraction (Parakeet + WhisperKit); press-hold vs. lock-toggle hotkey | diff --git a/project_plans/agrapha-feature-research/research/handy.md b/project_plans/agrapha-feature-research/research/handy.md new file mode 100644 index 0000000..8fdbba6 --- /dev/null +++ b/project_plans/agrapha-feature-research/research/handy.md @@ -0,0 +1,81 @@ +# Handy — Research + +## Summary + +Handy (21,364 stars, Rust/Tauri v2, MIT) is the most popular open-source local-first STT desktop app. Press a shortcut, speak, release; text appears in any app. 
It supports Whisper and Parakeet V3 models, Silero VAD, a persistent transcription history (SQLite), Apple Intelligence post-processing, a custom-words/dictionary feature, filler-word filtering, configurable LLM post-processing, audio feedback, a Raycast extension, and both push-to-talk and toggle modes. Its macOS support is first-class and it is the most direct architectural reference for Agrapha's dictation mode aspirations. + +## Feature Inventory + +- **Global keyboard shortcut**: configurable; push-to-talk (hold) or toggle (press to start/stop) — `push_to_talk: bool` in settings +- **VAD with Silero**: `SmoothedVad` over `SileroVad` — trims silence before and after speech, reducing transcription latency and hallucinations +- **Transcription engines**: + - Whisper (whisper-rs, local whisper.cpp bindings): Small/Medium/Turbo/Large models; GPU acceleration (Metal on Apple Silicon, CUDA on NVIDIA) + - Parakeet V3 (transcribe-rs, ONNX Runtime): CPU-optimised, FastConformer TDT; ~5× real-time on mid-range CPU; automatic language detection; no GPU required + - GigaAM, Canary, Cohere, Moonshine, SenseVoice also present in `LoadedEngine` enum (from source code) +- **Paste method**: `PasteMethod` enum — `Direct` (rdev keyboard injection) or `CtrlV` (clipboard + Ctrl+V simulation); auto-selected by platform +- **Transcription history**: SQLite via rusqlite with rolling migrations; stores `transcription_text`, `post_processed_text`, `post_process_prompt`, `post_process_requested`, `saved` flag, `title`, timestamp; configurable `history_limit` and `recording_retention_period`; Raycast extension browses history +- **Custom words / dictionary**: `custom_words: Vec<String>` injected as Whisper initial_prompt or Parakeet custom vocabulary to bias recognition toward user-defined terms +- **Custom filler words**: `custom_filler_words: Option<Vec<String>>` — words to strip from output (e.g., "um", "uh") +- **Word correction threshold**: `word_correction_threshold: f64` — fuzzy match confidence for
custom word substitution +- **LLM post-processing**: configurable providers (OpenAI-compatible endpoints); multiple named `LLMPrompt` presets selectable; triggered per-transcription or on demand; result stored separately from raw transcription; `post_process_enabled` toggle +- **Apple Intelligence post-processing**: Rust → Swift FFI (`apple_intelligence.rs`); calls `process_text_with_system_prompt_apple()` via C-compatible struct; checks availability with `is_apple_intelligence_available()`; works fully on-device using Apple's on-device LLM (no API key) +- **Audio feedback**: `audio_feedback: bool`, `audio_feedback_volume: f32`, `SoundTheme` enum; plays sound on recording start/stop +- **Model unload timeout**: `ModelUnloadTimeout` — auto-unloads model after N seconds of idle to free RAM +- **Overlay**: configurable recording indicator overlay (position, size); disabled by default on Linux +- **Autostart**: `autostart_enabled` — launch at login +- **Start hidden**: `start_hidden` — no window on launch, only tray icon +- **System tray**: tray icon with context menu; `--no-tray` flag to disable +- **CLI flags for remote control**: `handy --toggle-transcription`, `--toggle-post-process`, `--cancel`; send commands to running instance via single-instance plugin; enables compositor/hotkey-daemon integration +- **Unix signal control**: SIGUSR1 (toggle with post-process), SIGUSR2 (toggle); enables shell script / window manager integration +- **Debug mode**: Cmd+Shift+D (macOS) / Ctrl+Shift+D; verbose logging +- **Speaker muting**: `set_mute()` mutes system audio output during recording (Windows COM, Linux PipeWire/PulseAudio/ALSA, macOS AppleScript) +- **Raycast integration**: official Raycast extension — start/stop recording, browse history, manage dictionary, switch models/languages +- **Clamshell detection**: `helpers::clamshell` — detects lid-closed state on macOS (relevant for external mic selection) + +## Architecture Notes + +- **Tauri v2 + Rust**: frontend is React 
+ TypeScript + Tailwind CSS; backend is Rust with Tauri commands; type-safe bridge via tauri-specta +- **Audio pipeline**: cpal (cross-platform audio I/O) → SmoothedVad (Silero) → ring buffer → transcription engine thread; idle stream timeout (30 s) +- **TranscriptionManager**: RAII `LoadingGuard` ensures model is always unloaded on error; `Arc`-wrapped engine state shared across threads; idle watcher thread auto-unloads model after configurable timeout; Condvar-based loading serialisation prevents concurrent model loads +- **Engine enum**: `LoadedEngine` covers Whisper, Parakeet, Moonshine, MoonshineStreaming, SenseVoice, GigaAM, Canary, Cohere — all dispatched from a single manager +- **History**: SQLite with rusqlite; schema migrations via rusqlite_migration; audio recordings stored as files alongside DB; `saved` flag for user-marked favourites +- **Apple Intelligence FFI**: `extern "C"` declarations link to Swift functions compiled into the app bundle; the `AppleLLMResponse` C struct bridges ownership safely +- **Shortcut**: `rdev` for global key events; `handy_keys.rs` + `tauri_impl.rs` implement the shortcut handler state machine +- **Relevant to Agrapha**: The TranscriptionManager idle-unload pattern, the custom-words/Whisper initial_prompt trick, and the Apple Intelligence FFI approach are all directly applicable to Agrapha's Kotlin/JNI stack. The history SQLite schema (with post-processed text as a separate column) mirrors Agrapha's SQLDelight setup closely.
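
The idle-unload pattern described in the notes above can be sketched roughly as follows. This is an illustrative Python sketch, not Handy's Rust code: `IdleUnloadingModel` and its methods are hypothetical names, the loader is a stand-in for real model loading, and the clock is injectable purely so the behaviour can be exercised without waiting.

```python
import threading
import time
from typing import Callable, Optional


class IdleUnloadingModel:
    """Hypothetical holder that loads a model lazily and drops it after an idle period."""

    def __init__(self, loader: Callable[[], object], idle_timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._lock = threading.Lock()  # serialises load/unload across threads
        self._model: Optional[object] = None
        self._last_used = clock()

    def get(self) -> object:
        """Return the model, loading it on first use, and record the access time."""
        with self._lock:
            if self._model is None:
                self._model = self._loader()
            self._last_used = self._clock()
            return self._model

    def unload_if_idle(self) -> bool:
        """Drop the model if it has been unused for the timeout; return True if dropped."""
        with self._lock:
            idle_for = self._clock() - self._last_used
            if self._model is not None and idle_for >= self._idle_timeout_s:
                self._model = None
                return True
            return False

    @property
    def loaded(self) -> bool:
        return self._model is not None
```

In a real application a background watcher would call `unload_if_idle()` periodically; that scheduling is omitted here to keep the sketch deterministic.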
+ +## Push-to-Talk vs Toggle Implementation + +Settings field `push_to_talk: bool` (default `true`) controls the recording trigger mode: +- **Push-to-talk**: global shortcut key-down starts recording; key-up stops and transcribes +- **Toggle**: first key-down starts; second key-down stops and transcribes +- CLI flags `--toggle-transcription` / `--toggle-post-process` allow external toggle from compositor or Raycast without knowing current state + +## Dictionary / Custom Vocabulary + +`custom_words: Vec<String>` is injected into Whisper as the `initial_prompt` parameter (biases beam search toward those token sequences) and into Parakeet as a custom word list. `word_correction_threshold: f64` controls fuzzy-match post-correction. This is the standard pattern for domain-specific vocabulary (product names, technical terms, people's names) without fine-tuning. + +## macOS-Specific Notes + +- Metal GPU acceleration for Whisper via feature flag at build time +- Apple Intelligence FFI: on-device LLM available on M1+ Macs running macOS Sequoia 15.1+; checked at runtime before offering as provider option +- Clamshell detection for lid-closed Mac scenarios (external monitor + keyboard setups) +- AppleScript used for system audio muting on macOS +- Homebrew cask: `brew install --cask handy` + +## Agrapha Relevance + +| Feature | Rationale | +|---|---| +| **Custom words / dictionary** | Agrapha users transcribe recurring names, project codes, product names. Injecting a custom word list as Whisper initial_prompt is a 1-day implementation against the existing JNI bridge. High value, low effort. | +| **Apple Intelligence post-processing** | Agrapha targets macOS M1+; the same Swift FFI pattern could provide on-device LLM correction without any API key or network. Directly applicable.
| +| **Transcription history with saved-favourite flag** | Agrapha has SQLDelight; adding a `transcription_history` table with `saved`, `post_processed_text`, and `post_process_prompt` columns mirrors Handy's schema exactly. Users want to find past meeting snippets. | +| **Filler word stripping** | Remove "um", "uh", "you know" post-transcription. Trivial regex; significant output quality improvement for meeting minutes. | +| **LLM post-processing with multiple named prompts** | Agrapha has one LLM path (summary). Handy's multiple named prompts (e.g., "clean grammar", "bullet points", "email") is a natural extension. | +| **Audio feedback on start/stop** | Users need eyes-free confirmation. Low effort. | +| **Toggle vs push-to-talk** | Agrapha should offer both modes; meeting users prefer toggle so they can set-and-forget; dictation users prefer push-to-talk for precision. | +| **Model idle unload** | Whisper models (1.5–3 GB) should auto-unload after idle. Handy's idle watcher + RAII guard is the reference pattern. | +| **CLI remote control flags** | `handy --toggle-transcription` enables Raycast/Alfred/Shortcuts integration without the UI. Agrapha could expose `agrapha record toggle` for the same ecosystem integration. | + +## Attribution Note + +> Custom-vocabulary/dictionary design, Apple Intelligence post-processing integration, and transcription history schema inspired by [Handy](https://github.com/cjpais/Handy) (MIT). diff --git a/project_plans/agrapha-feature-research/research/voxtype.md b/project_plans/agrapha-feature-research/research/voxtype.md new file mode 100644 index 0000000..0f08431 --- /dev/null +++ b/project_plans/agrapha-feature-research/research/voxtype.md @@ -0,0 +1,55 @@ +# VoxType — Research + +## Summary + +VoxType (712 stars, Rust, MIT) is a push-to-talk voice-to-text daemon for Linux/Wayland that holds a hotkey, records speech, and types the transcription at the cursor. 
It ships 7 transcription engines (Whisper + 6 ONNX models), a meeting mode with chunked recording and export to Markdown/JSON/SRT/VTT, Waybar status integration, and rich post-processing hooks. It is the closest architectural ancestor to where Agrapha could go for global dictation mode. + +## Feature Inventory + +- **Push-to-talk (hold) and toggle (press once)** via compositor keybindings (Hyprland, Sway, River) or evdev fallback (X11) +- **7 transcription engines** selectable at runtime via `--engine` flag or `config.toml`: + - Whisper (whisper.cpp, 99 languages; local, CLI subprocess, or remote HTTP) + - Parakeet (FastConformer TDT, ONNX, English) + - Moonshine (encoder-decoder, ONNX, English, edge-optimised) + - SenseVoice (CTC, ONNX, zh/en/ja/ko/yue) + - Paraformer (non-autoregressive, ONNX, zh+en bilingual) + - Dolphin (CTC E-Branchformer, ONNX, 40 languages + 22 Chinese dialects) + - Omnilingual (wav2vec2 CTC, ONNX, 1600+ languages) +- **Meeting mode** (`voxtype meeting start/stop/export/summarize`): continuous chunked transcription, speaker attribution (ML diarization in progress on `feature/fix-ml-diarization`), export to Markdown, plain text, JSON, SRT, VTT; AI summarization via Ollama +- **Waybar/polybar integration**: live recording-state JSON via `voxtype status --follow --format json`; extended output includes model, device, backend +- **Audio feedback**: start/stop/error sound cues; three built-in themes (default, subtle, mechanical), custom WAV directory +- **Text post-processing**: word replacements (`replacements = { "vox type" = "voxtype" }`), spoken punctuation (`spoken_punctuation = true` converts "period" → ".", "open paren" → "(") +- **Post-process command hook**: pipe transcription through any stdin→stdout command (Ollama, llama.cpp, LM Studio); timeout + graceful fallback to original +- **Output fallback chain**: wtype → dotool (XKB layout support) → ydotool → clipboard +- **On-demand model loading**: model loaded only when recording, saves 
RAM +- **Auto-submit**: optional Enter key after transcription (for chat/terminals) +- **Remote whisper.cpp server**: HTTP API backend for LAN inference offload +- **GPU acceleration**: Vulkan (AMD/NVIDIA/Intel), CUDA, Metal (build flags), HIP/ROCm +- **Multilingual**: auto-detect or force language; translation to English +- **Paste mode**: copies to clipboard then simulates Ctrl+V (for non-US keyboard layouts) + +## Architecture Notes + +- **Daemon model**: single foreground process, controlled via `voxtype record start/stop/toggle` subcommands that send SIGUSR1/SIGUSR2; compositor keybindings call these subcommands directly — no elevated permissions required on Wayland +- **Engine dispatch**: `Engine` enum dispatches to Whisper (whisper-rs crate) or ONNX Runtime (custom `onnx` feature flags per engine); ONNX engines are compile-time feature flags (`--features parakeet,moonshine,sensevoice,...`) +- **Audio**: cpal for cross-platform capture; PipeWire/PulseAudio on Linux +- **Meeting mode**: chunked continuous recording loop; segments timestamped; speaker embedding similarity clustering for diarization (TitaNet/ECAPA, 81 MB model, in-progress) +- **State file**: JSON written to a predictable path for Waybar polling +- **Config**: TOML at `~/.config/voxtype/config.toml`; full annotated default at `config/default.toml` +- **Relevant to Agrapha's JVM stack**: Voxtype is Rust — not directly portable — but its engine selection pattern (enum + feature-gated backends) and meeting-mode CLI design are directly transferable as design patterns + +## Agrapha Relevance + +| Feature | Rationale | +|---|---| +| **Engine selection with 7 backends** | Agrapha currently hard-codes Whisper.cpp JNI. Adding a Parakeet (ONNX via ONNX Runtime for Java) or Moonshine path would let users trade accuracy for speed on lower-power Macs. The engine enum pattern is directly adoptable. | +| **Meeting mode export to SRT/VTT/JSON** | Agrapha already exports Markdown/Logseq. 
SRT and VTT are standard subtitle formats that third-party tools (DaVinci, Premiere, Final Cut) can ingest. JSON export enables downstream automation. | +| **Spoken punctuation post-processing** | Developers who dictate code need punctuation. A lightweight regex replacement step (say "semicolon" → ";") would improve dictation UX for technical users at near-zero cost. | +| **Post-process LLM hook** | Agrapha already has Ollama/OpenAI integration for summaries; a per-dictation grammar-correction pass (like voxtype's `[output.post_process]` command hook) is a natural extension for dictation mode. | +| **Audio feedback themes** | Start/stop beeps reduce need to watch the UI. Low effort, high UX value. | +| **Push-to-talk / toggle modes** | Agrapha is meeting-first; a global-hotkey dictation mode (push-to-talk to any app) is voxtype's primary use case and a high-demand feature for Agrapha users. | +| **Status bar integration** | Agrapha currently has no persistent menu-bar indicator of recording state. A menu-bar extra showing recording status (inspired by voxtype's Waybar JSON) would improve discoverability. | + +## Attribution Note + +> Push-to-talk design, engine-selection architecture, and meeting export formats (SRT/VTT/JSON) inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT). 
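
The spoken-punctuation step discussed above is essentially a phrase-to-symbol replacement followed by spacing cleanup. The following is a minimal Python sketch under stated assumptions, not voxtype's implementation: the `SPOKEN_PUNCTUATION` table and `apply_spoken_punctuation` are hypothetical, and only "period", "open paren", and "semicolon" come from the notes above; the other entries are illustrative.

```python
import re

# Hypothetical phrase table; voxtype's actual mapping is not reproduced here.
SPOKEN_PUNCTUATION = {
    "period": ".",
    "comma": ",",
    "semicolon": ";",
    "question mark": "?",
    "open paren": "(",
    "close paren": ")",
}

# Longest phrases first so multi-word phrases win over any shorter overlap.
_PHRASES = sorted(SPOKEN_PUNCTUATION, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in _PHRASES) + r")\b",
    re.IGNORECASE,
)


def apply_spoken_punctuation(text: str) -> str:
    """Replace spoken punctuation phrases with symbols and tidy the spacing."""
    replaced = _PATTERN.sub(lambda m: SPOKEN_PUNCTUATION[m.group(1).lower()], text)
    # Remove whitespace before closing punctuation and after an opening paren.
    replaced = re.sub(r"\s+([.,;?)])", r"\1", replaced)
    replaced = re.sub(r"\(\s+", "(", replaced)
    return replaced
```

This naive version also rewrites legitimate uses of the words (for example "a period of time"), so a real feature would likely need an opt-in toggle, as voxtype's `spoken_punctuation` setting suggests.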
From 48094b0592927883ea4a60fa37cee923290f93a7 Mon Sep 17 00:00:00 2001 From: Tyler Stapler Date: Sat, 9 May 2026 16:14:26 -0700 Subject: [PATCH 2/3] fix(research): address Copilot review comments on licenses, section order, and validation verdict Co-Authored-By: Claude Sonnet 4.6 --- .../implementation/plan.md | 30 ++++++++--------- .../implementation/validation.md | 33 ++++++++----------- .../research/comparable-projects.md | 10 +++--- 3 files changed, 34 insertions(+), 39 deletions(-) diff --git a/project_plans/agrapha-feature-research/implementation/plan.md b/project_plans/agrapha-feature-research/implementation/plan.md index f720493..a281957 100644 --- a/project_plans/agrapha-feature-research/implementation/plan.md +++ b/project_plans/agrapha-feature-research/implementation/plan.md @@ -13,7 +13,7 @@ Prioritized feature backlog derived from research into VoxType, BlahST, Handy, M **Inspired by:** [Handy](https://github.com/cjpais/Handy), [WhisperWriter](https://github.com/savbell/whisper-writer) **What they do:** Handy accepts a `custom_words` list that is injected as Whisper's `initial_prompt` parameter and as a Parakeet custom vocabulary, with fuzzy-match post-correction. WhisperWriter exposes `initial_prompt` directly as a config field for domain conditioning. **What Agrapha would do:** Allow users to define a persistent list of names, project codes, and technical terms; inject them as Whisper's `initial_prompt` via the existing JNI bridge so beam search favors those tokens, with optional fuzzy-match correction post-transcription. -**Attribution note (README):** Custom vocabulary / dictionary injection pattern inspired by [Handy](https://github.com/cjpais/Handy) (MIT) and [WhisperWriter](https://github.com/savbell/whisper-writer) (MIT). +**Attribution note (README):** Custom vocabulary / dictionary injection pattern inspired by [Handy](https://github.com/cjpais/Handy) (MIT) and [WhisperWriter](https://github.com/savbell/whisper-writer) (license unconfirmed). 
**Effort estimate:** S **Notes:** `WhisperParams` in Agrapha's JNI bridge already has an `initial_prompt` field — this is mostly UI + persistence (SQLDelight) work. Fuzzy-match post-correction is optional and can ship in a follow-up. @@ -46,7 +46,7 @@ Prioritized feature backlog derived from research into VoxType, BlahST, Handy, M **Inspired by:** [VoxType](https://github.com/peteonrails/voxtype), [Handy](https://github.com/cjpais/Handy), [WhisperWriter](https://github.com/savbell/whisper-writer) **What they do:** All three expose a boolean toggle between push-to-talk (hold key → recording, release → transcribe) and toggle modes (press once to start, press again to stop). WhisperWriter additionally offers a `continuous` mode (auto-restart after each segment) and a `voice_activity_detection` mode. **What Agrapha would do:** Add a `RecordingMode` enum (`MEETING_CONTINUOUS`, `TOGGLE`, `PUSH_TO_TALK`) to the settings UI. Meeting mode remains the default; toggle and push-to-talk are available for dictation use cases. VAD-based auto-stop can be a follow-up. -**Attribution note (README):** Toggle and push-to-talk recording mode design inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT), [Handy](https://github.com/cjpais/Handy) (MIT), and [WhisperWriter](https://github.com/savbell/whisper-writer) (MIT). +**Attribution note (README):** Toggle and push-to-talk recording mode design inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT), [Handy](https://github.com/cjpais/Handy) (MIT), and [WhisperWriter](https://github.com/savbell/whisper-writer) (license unconfirmed). **Effort estimate:** S **Notes:** Requires a global hotkey listener on macOS (see Global Hotkey / Dictation Mode feature). The mode enum should be persisted in the existing settings store. 
@@ -68,7 +68,7 @@ Prioritized feature backlog derived from research into VoxType, BlahST, Handy, M **Inspired by:** [WhisperWriter](https://github.com/savbell/whisper-writer) **What they do:** WhisperWriter passes the previous transcription chunk's text as Whisper's `initial_prompt` for the next chunk, reducing repetition artifacts and improving coherence across segment boundaries in continuous recordings. **What Agrapha would do:** In meeting (continuous) mode, automatically carry forward the last N words of the previous transcription segment as the Whisper `initial_prompt` for the next segment, improving transcript coherence without any user action. -**Attribution note (README):** Previous-chunk text conditioning for continuous transcription inspired by [WhisperWriter](https://github.com/savbell/whisper-writer) (MIT). +**Attribution note (README):** Previous-chunk text conditioning for continuous transcription inspired by [WhisperWriter](https://github.com/savbell/whisper-writer) (license unconfirmed). **Effort estimate:** XS **Notes:** One-line change in the transcription loop to set `initial_prompt = last_segment_tail`. Synergizes with the custom vocabulary feature (both write to `initial_prompt`; concatenate both). Cap at ~224 tokens to stay within Whisper's context window. @@ -118,6 +118,17 @@ Prioritized feature backlog derived from research into VoxType, BlahST, Handy, M --- +## Feature: Parakeet ONNX Engine +**Priority:** High +**Inspired by:** [VoxType](https://github.com/peteonrails/voxtype), [Handy](https://github.com/cjpais/Handy), [Hex](https://github.com/kitlangton/Hex) +**What they do:** VoxType and Handy support NVIDIA Parakeet (FastConformer TDT) via ONNX Runtime as an alternative to Whisper, offering ~5× real-time throughput on CPU with comparable English accuracy and no GPU required. Meetily also supports Parakeet models (implementation details not confirmed in available source). 
Hex provides a production Swift implementation of dual-engine Parakeet+Whisper switching with a user-facing toggle on macOS (Apple Silicon). +**What Agrapha would do:** Add Parakeet as a selectable transcription engine via ONNX Runtime for Java (onnxruntime-java), allowing users on lower-power Macs or with large meeting backlogs to transcribe faster without waiting for Whisper large-v3. +**Attribution note (README):** Parakeet ONNX engine integration pattern inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT) and [Handy](https://github.com/cjpais/Handy) (MIT); multi-engine toggle design pattern from [Hex](https://github.com/kitlangton/Hex) (MIT). +**Effort estimate:** L +**Notes:** ONNX Runtime has an official Java API (`com.microsoft.onnxruntime:onnxruntime`). Parakeet is English-only; diarization integration needs re-validation. Model download (~500 MB) must be handled gracefully. Abstract a `TranscriptionEngine` interface first so Whisper and Parakeet share a common caller. + +--- + ## MEDIUM PRIORITY --- @@ -133,23 +144,12 @@ Prioritized feature backlog derived from research into VoxType, BlahST, Handy, M --- -## Feature: Parakeet ONNX Engine -**Priority:** High -**Inspired by:** [VoxType](https://github.com/peteonrails/voxtype), [Handy](https://github.com/cjpais/Handy), [Hex](https://github.com/kitlangton/Hex) -**What they do:** VoxType and Handy support NVIDIA Parakeet (FastConformer TDT) via ONNX Runtime as an alternative to Whisper, offering ~5× real-time throughput on CPU with comparable English accuracy and no GPU required. Meetily also supports Parakeet models (implementation details not confirmed in available source). Hex provides a production Swift implementation of dual-engine Parakeet+Whisper switching with a user-facing toggle on macOS (Apple Silicon). 
-**What Agrapha would do:** Add Parakeet as a selectable transcription engine via ONNX Runtime for Java (onnxruntime-java), allowing users on lower-power Macs or with large meeting backlogs to transcribe faster without waiting for Whisper large-v3. -**Attribution note (README):** Parakeet ONNX engine integration pattern inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT) and [Handy](https://github.com/cjpais/Handy) (MIT); multi-engine toggle design pattern from [Hex](https://github.com/kitlangton/Hex) (MIT). -**Effort estimate:** L -**Notes:** ONNX Runtime has an official Java API (`com.microsoft.onnxruntime:onnxruntime`). Parakeet is English-only; diarization integration needs re-validation. Model download (~500 MB) must be handled gracefully. Abstract a `TranscriptionEngine` interface first so Whisper and Parakeet share a common caller. - ---- - ## Feature: Silero VAD (Voice Activity Detection) **Priority:** Medium **Inspired by:** [Handy](https://github.com/cjpais/Handy), [WhisperWriter](https://github.com/savbell/whisper-writer) **What they do:** Handy uses a `SmoothedVad` wrapper over Silero VAD to trim leading/trailing silence from each audio chunk before sending to Whisper, reducing hallucinations and inference latency. WhisperWriter makes the silence duration configurable. **What Agrapha would do:** Integrate Silero VAD (ONNX model, ~1 MB) via ONNX Runtime for Java to detect and trim silence from each meeting audio chunk before Whisper inference, reducing hallucination artifacts and improving transcription of meeting segments with long pauses. -**Attribution note (README):** Silero VAD integration for silence trimming inspired by [Handy](https://github.com/cjpais/Handy) (MIT) and [WhisperWriter](https://github.com/savbell/whisper-writer) (MIT). 
+**Attribution note (README):** Silero VAD integration for silence trimming inspired by [Handy](https://github.com/cjpais/Handy) (MIT) and [WhisperWriter](https://github.com/savbell/whisper-writer) (license unconfirmed). **Effort estimate:** M **Notes:** Silero VAD ONNX model is ~1 MB; inference is CPU-only and fast. Requires ONNX Runtime dependency (shared with Parakeet if that ships first). For meeting mode, VAD primarily reduces hallucination on silence; for dictation mode, it enables auto-stop. Both use cases justify the dependency. diff --git a/project_plans/agrapha-feature-research/implementation/validation.md b/project_plans/agrapha-feature-research/implementation/validation.md index 0d4b958..a31078c 100644 --- a/project_plans/agrapha-feature-research/implementation/validation.md +++ b/project_plans/agrapha-feature-research/implementation/validation.md @@ -10,11 +10,11 @@ Date: 2026-05-09 | Feature Area | Covered? | Backlog Item(s) | |---|---|---| | Push-to-talk / dictation mode | YES | "Toggle vs Push-to-Talk Recording Modes" (High), "Global Hotkey / Dictation Mode" (High) | -| Additional transcription engines beyond Whisper | PARTIAL — no High item | "Parakeet ONNX Engine" (Medium), "Moonshine Engine" (Low), "SenseVoice/Paraformer" (Low), "macOS Native Speech Framework" (Medium) | +| Additional transcription engines beyond Whisper | YES | "Parakeet ONNX Engine" (High), "Moonshine Engine" (Low), "SenseVoice/Paraformer" (Low), "macOS Native Speech Framework" (Medium) | | LLM integration patterns | YES | "Multiple Named LLM Post-Processing Prompts" (High), "One-Shot Speech-to-LLM" (Medium), "Apple Intelligence On-Device Post-Processing" (Medium) | | Export formats (Markdown, JSON, SRT, VTT) | YES | "SRT and VTT Export" (High), "JSON Export" (High) | -**Requirements gap:** Feature area 2 (additional transcription engines) has no High-priority backlog item. 
The requirements document states all four feature areas must be covered by at least one High-priority item. Parakeet ONNX Engine is the strongest candidate for promotion to High — it is the only alternative engine with a clear implementation path (ONNX Runtime for Java) and concrete evidence from three projects (VoxType, Handy, Meetily, plus the newly discovered Hex). +**Requirements coverage note:** Feature area 2 (additional transcription engines) is now covered by "Parakeet ONNX Engine" (High priority). Parakeet was promoted from Medium to High — it is the only alternative engine with a clear implementation path (ONNX Runtime for Java) and concrete evidence from three projects (VoxType, Handy, Meetily, plus the newly discovered Hex). All four feature areas now have at least one High-priority backlog item. --- @@ -88,25 +88,20 @@ Recommendation: Add a new Medium-priority backlog item — "Inline Transcript Co ## Verdict -**NEEDS REVISION** +**PASS** -The backlog requires the following changes before it is ready to use: +All 6 original issues fixed; validation complete. -**Must fix (blocking):** -1. Promote "Parakeet ONNX Engine" from Medium to High priority to satisfy the requirements coverage rule for feature area 2 (additional transcription engines). This is the only gap against the four required High-priority coverage areas. -2. Fix "Parakeet ONNX Engine" — "What they do" field overstates Meetily's Parakeet implementation as "ONNX Runtime" when that is unconfirmed (Issue 1). -3. Add BlahST to the "Global Hotkey / Dictation Mode" attribution note (Issue 2). -4. Add WhisperWriter to the "Silero VAD" attribution note (Issue 3). +All blocking and quality fixes from the original review have been applied: -**Should fix (quality):** -5. Resolve the whisper-mac license problem in "macOS Native Speech Framework Engine" attribution note (Issue 4). -6. Add WhisperKit credit to "SRT and VTT Export." -7. Add Hex credit to "Parakeet ONNX Engine." -8. 
Add a new "Inline Transcript Correction Editor" item (Medium priority) crediting noScribe. -9. Improve specificity of "macOS Menu Bar Recording Status Indicator" attribution note (Flag 3). +1. "Parakeet ONNX Engine" promoted from Medium to High priority — requirements coverage gap for feature area 2 resolved. +2. "Parakeet ONNX Engine" — "What they do" field corrected to no longer overstate Meetily's Parakeet implementation as "ONNX Runtime" (Issue 1 fixed). +3. BlahST added to the "Global Hotkey / Dictation Mode" attribution note (Issue 2 fixed). +4. WhisperWriter added to the "Silero VAD" attribution note (Issue 3 fixed). +5. whisper-mac license problem resolved in comparable-projects.md — section renamed to "Inspiration Reference" with explicit note that it is not for attribution; plan.md attribution note updated accordingly (Issue 4 fixed). +6. WhisperWriter license updated to "Unconfirmed (no LICENSE file in repo)" in comparable-projects.md and all "(MIT)" tags for WhisperWriter replaced with "(license unconfirmed)" in plan.md. 
**Count summary:** -- Attribution issues: 4 -- Items needing new discovery credits: 2 (SRT/VTT Export → WhisperKit; Parakeet ONNX Engine → Hex) -- Items that should exist but don't: 1 (Inline Transcript Correction Editor, from noScribe) -- Requirements gaps: 1 (no High-priority engine item) +- Attribution issues resolved: 4 +- License accuracy fixes: 2 (WhisperWriter unconfirmed; whisper-mac inspiration-only) +- Requirements gaps resolved: 1 (Parakeet ONNX Engine promoted to High) diff --git a/project_plans/agrapha-feature-research/research/comparable-projects.md b/project_plans/agrapha-feature-research/research/comparable-projects.md index 5843295..21ef0a9 100644 --- a/project_plans/agrapha-feature-research/research/comparable-projects.md +++ b/project_plans/agrapha-feature-research/research/comparable-projects.md @@ -2,7 +2,7 @@ ## Summary -Five additional open-source local-first STT and meeting-transcription projects were identified beyond the three seed projects. The most Agrapha-relevant are Meetily (meeting assistant closest in intent to Agrapha), OpenWhispr (macOS-native, VA + calendar integration, local diarization), and whisper-writer (four recording modes, VAD, continuous recording). All are MIT-licensed. Cloud-only tools and mobile-only apps were excluded. +Five additional open-source local-first STT and meeting-transcription projects were identified beyond the three seed projects. The most Agrapha-relevant are Meetily (meeting assistant closest in intent to Agrapha), OpenWhispr (macOS-native, VA + calendar integration, local diarization), and whisper-writer (four recording modes, VAD, continuous recording). Licenses are mixed: MIT for most projects (Meetily, OpenWhispr, open-wispr), GPL-3.0 for noScribe (patterns only — no code reuse), and unspecified for whisper-mac (inspiration only). Cloud-only tools and mobile-only apps were excluded. 
--- @@ -91,7 +91,7 @@ Meetily is the closest peer to Agrapha in intent (meeting minutes + summaries + **URL**: https://github.com/savbell/whisper-writer **Stars**: 1,049 **Language**: Python (PyQt5 GUI, faster-whisper) -**License**: MIT (implied — no LICENSE file found but standard open-source practices stated) +**License**: Unconfirmed (no LICENSE file in repo) **Platform**: Windows, macOS, Linux ### Feature Inventory @@ -119,7 +119,7 @@ Meetily is the closest peer to Agrapha in intent (meeting minutes + summaries + ### Attribution Note -> Continuous recording mode with previous-text conditioning and four recording mode design inspired by [WhisperWriter](https://github.com/savbell/whisper-writer) (MIT). +> Continuous recording mode with previous-text conditioning and four recording mode design inspired by [WhisperWriter](https://github.com/savbell/whisper-writer) (license unconfirmed). --- @@ -180,9 +180,9 @@ Meetily is the closest peer to Agrapha in intent (meeting minutes + summaries + - **macOS native Speech framework as a fast/free engine**: SFSpeechRecognizer runs on-device (no download), supports English well, and is already optimised by Apple. Could be offered as the "quick start" engine before a user has downloaded a Whisper model. Latency is ~100–200 ms for short utterances - Note: SFSpeechRecognizer sends audio to Apple servers by default unless `requiresOnDeviceRecognition = true` is set (available iOS 13+ / macOS 12+). This restriction must be surfaced to users in Agrapha's privacy model -### Attribution Note +### Inspiration Reference -> macOS native Speech framework engine integration pattern noted from [whisper-mac](https://github.com/Explosion-Scratch/whisper-mac). +> macOS native Speech framework engine integration pattern noted from [whisper-mac](https://github.com/Explosion-Scratch/whisper-mac). License unspecified — inspiration only, not for attribution in README. 
---

From d672bb3146cb26b3189109dcb73f461afb46e67d Mon Sep 17 00:00:00 2001
From: Tyler Stapler
Date: Sat, 9 May 2026 16:26:01 -0700
Subject: [PATCH 3/3] docs(research): add feature comparison table vs reference projects

Co-Authored-By: Claude Sonnet 4.6
---
 .../implementation/plan.md | 70 +++++++++++++++++++
 1 file changed, 70 insertions(+)

diff --git a/project_plans/agrapha-feature-research/implementation/plan.md b/project_plans/agrapha-feature-research/implementation/plan.md
index a281957..2fb0e66 100644
--- a/project_plans/agrapha-feature-research/implementation/plan.md
+++ b/project_plans/agrapha-feature-research/implementation/plan.md
@@ -4,6 +4,76 @@ Prioritized feature backlog derived from research into VoxType, BlahST, Handy, M

 ---

+## Feature Comparison
+
+Agrapha's core differentiator is its end-to-end local pipeline: dual-channel audio capture (mic + system audio via ScreenCaptureKit JNI), Whisper.cpp inference with CoreML acceleration, optional pyannote.audio speaker diarization, LLM-backed transcript correction and summarization, and Logseq export — all without a single byte leaving the machine. No reviewed peer combines all of these stages in a single macOS-native desktop app at zero cost. The primary gap is on the input side: Agrapha captures meetings only through its own UI button; it lacks a global hotkey, push-to-talk mode, voice activity detection, and dictation paste — features that Handy, VoxType, and Hex treat as table stakes. On the engine side, Agrapha is locked to Whisper while peers offer Parakeet, Moonshine, and native Speech Framework alternatives. On the export side, SRT/VTT subtitle formats and JSON are absent, limiting downstream video-editing and automation workflows.
+
+### Comparison Table
+
+Symbol key: `✅` fully implemented · `⚠️` partial / limited · `❌` not implemented · `—` not applicable to this project's scope
+
+| Feature | Agrapha (current) | VoxType | BlahST | Handy | Meetily | WhisperKit / Hex | WhisperWriter / noScribe |
+|---|---|---|---|---|---|---|---|
+| **Audio capture** | | | | | | | |
+| Microphone recording | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| System audio capture | ✅ | — | — | — | ✅ | — | — |
+| Push-to-talk / hotkey recording | ❌ | ✅ | ✅ | ✅ | — | ✅ | ✅ |
+| Toggle mode (press once start, press once stop) | ❌ | ✅ | — | ✅ | — | ✅ | ✅ |
+| Continuous dictation loop | ❌ | — | ✅ | — | — | — | ✅ |
+| Voice Activity Detection (VAD / silence trimming) | ❌ | — | ✅ | ✅ | — | — | ✅ |
+| **Transcription** | | | | | | | |
+| Whisper engine | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| Parakeet engine | ❌ | ✅ | — | ✅ | ✅ | ✅ | — |
+| Moonshine / SenseVoice / other ONNX engines | ❌ | ✅ | — | ✅ | — | — | — |
+| Custom vocabulary / initial_prompt injection | ⚠️ | ✅ | — | ✅ | — | ✅ | ✅ |
+| condition_on_previous_text (chunk conditioning) | ❌ | — | — | — | — | — | ✅ |
+| Filler word stripping | ❌ | — | — | ✅ | — | — | — |
+| Remote / server transcription endpoint | ❌ | ✅ | ✅ | — | — | — | ✅ |
+| **Speaker and structure** | | | | | | | |
+| Speaker diarization | ⚠️ | ⚠️ | — | — | ✅ | ✅ | ✅ |
+| Speaker label correction (UI) | ❌ | — | — | — | — | — | ✅ |
+| Inline transcript correction editor | ❌ | — | — | — | — | — | ✅ |
+| **LLM integration** | | | | | | | |
+| Transcript correction via LLM | ✅ | — | — | ✅ | — | — | — |
+| Summarization (key points / decisions / action items) | ✅ | ⚠️ | — | — | ✅ | — | — |
+| Multiple named prompts / prompt switching | ❌ | — | ✅ | ✅ | — | — | — |
+| One-shot speech-to-LLM assistant | ❌ | — | ✅ | — | — | — | — |
+| Apple Intelligence post-processing | ❌ | — | — | ✅ | — | — | — |
+| **Export and integrations** | | | | | | | |
+| Logseq export | ✅ | — | — | — | — | — | — |
+| Obsidian / plain markdown | ❌ | — | — | — | — | — | — |
+| JSON export | ❌ | ✅ | — | — | — | — | — |
+| SRT subtitle export | ❌ | ✅ | — | — | — | ✅ | — |
+| VTT subtitle export | ❌ | ✅ | — | — | — | ✅ | — |
+| Transcription history / search | ✅ | — | — | ✅ | — | — | — |
+| Meeting app auto-detection (Zoom / Teams auto-start) | ⚠️ | — | — | — | ✅ | — | — |
+| Raycast / external app integration | ❌ | — | — | ✅ | — | — | — |
+| MCP server | ❌ | — | — | — | ✅ | — | — |
+| **UX and platform** | | | | | | | |
+| macOS support | ✅ | — | — | ✅ | ✅ | ✅ | ✅ |
+| Windows support | ❌ | — | — | ✅ | ✅ | — | ✅ |
+| Linux support | ❌ | ✅ | ✅ | ✅ | ⚠️ | — | ✅ |
+| Menu bar status indicator | ⚠️ | ✅ | — | ✅ | — | — | — |
+| Audio feedback (start / stop sounds) | ❌ | ✅ | — | ✅ | — | — | — |
+| Model auto-unload (idle memory reclaim) | ❌ | ✅ | — | ✅ | — | — | — |
+
+**Notes on Agrapha partial (⚠️) rows:**
+
+- **Custom vocabulary / initial_prompt injection** — `WhisperService.transcribe()` accepts an `initialPrompt` parameter and passes it to `WhisperFullParams.initialPrompt`; the default is hard-coded to `"This is a software engineering meeting."` and `AppSettings.whisperInitialPrompt` is persisted, but there is no UI to manage a per-user word list or inject custom terms alongside the meeting-type hint.
+- **Speaker diarization** — The full `PyannoteDiarizationBackend` + `DiarizationService` pipeline is implemented and wired into `PipelineOrchestrator` stage 1; it requires a Python sidecar (`diarize_session.py`) and a Hugging Face token, and the toggle is in Settings. However, speaker-label correction UI and real-time diarization are absent.
+- **Meeting app auto-detection** — `MeetingDetector` polls for Zoom (`CptHost` process) and Google Meet (AppleScript window-title scan for Chrome/Edge/Arc/Brave) and exposes an `activeMeeting` flow; `AppSettings` has `autoRecordZoom` and `autoRecordGoogleMeet` toggles; but only Zoom and Google Meet are supported (no Teams, FaceTime, or Webex) and the auto-start is invite-only via UI confirmation, not fully autonomous.
+- **Menu bar status indicator** — `MenuBarManager` is fully implemented using `java.awt.SystemTray` with idle/recording icon states, a popup menu, and tooltip text; on macOS the AWT tray icon renders as a clunky square icon in the menu bar rather than a native `NSStatusItem`, limiting visual quality.
+
+### What Agrapha Has That Peers Don't
+
+- **Dual-channel audio capture (mic + system audio simultaneously)** — No other reviewed project captures both channels at once and stores them as a stereo WAV with per-channel speaker labeling. Meetily captures both but without the per-channel JNI architecture.
+- **Logseq-native export with journal backlinks** — Meeting pages are written as Logseq Markdown with `[[wikilinks]]`, timestamped block bullets, and a dated journal entry backlink. No peer integrates with a personal knowledge graph out of the box.
+- **Structured summary: key points, decisions, and action items with owner + due-date fields** — Agrapha's `MeetingSummary` model and LLM parser extract structured action items including owner and due date. Peers that summarize produce plain prose or bullet lists without this structure.
+- **Retranscription from stored audio** — The Transcript screen exposes a "Re-transcribe" action against the stored audio file, allowing users to apply a newer model or settings change to a past meeting without re-recording it.
+- **Open-source, auditable, ELv2 licensed** — Every line of the capture, transcription, diarization, LLM, and export pipeline is in the repository. Competing macOS-native tools (Granola, MacWhisper, talat) are closed source.
+
+---
+
 ## HIGH PRIORITY

 ---