430 changes: 430 additions & 0 deletions project_plans/agrapha-feature-research/implementation/plan.md

Large diffs are not rendered by default.

107 changes: 107 additions & 0 deletions project_plans/agrapha-feature-research/implementation/validation.md
@@ -0,0 +1,107 @@
# Validation Report — Agrapha Feature Backlog

Validated against: `requirements.md`, `research/voxtype.md`, `research/blahst.md`, `research/handy.md`, `research/comparable-projects.md`
Date: 2026-05-09

---

## Requirements Coverage

| Feature Area | Covered? | Backlog Item(s) |
|---|---|---|
| Push-to-talk / dictation mode | YES | "Toggle vs Push-to-Talk Recording Modes" (High), "Global Hotkey / Dictation Mode" (High) |
| Additional transcription engines beyond Whisper | YES | "Parakeet ONNX Engine" (High), "Moonshine Engine" (Low), "SenseVoice/Paraformer" (Low), "macOS Native Speech Framework" (Medium) |
| LLM integration patterns | YES | "Multiple Named LLM Post-Processing Prompts" (High), "One-Shot Speech-to-LLM" (Medium), "Apple Intelligence On-Device Post-Processing" (Medium) |
| Export formats (Markdown, JSON, SRT, VTT) | YES | "SRT and VTT Export" (High), "JSON Export" (High) |

**Requirements coverage note:** Feature area 2 (additional transcription engines) is now covered by "Parakeet ONNX Engine" (High priority). Parakeet was promoted from Medium to High because it is the only alternative engine with a clear implementation path (ONNX Runtime for Java) and concrete evidence from three surveyed projects (VoxType, Handy, and Meetily), plus the newly discovered Hex. All four feature areas now have at least one High-priority backlog item.

---

## Attribution Issues

4 issues found.

**Issue 1 — Parakeet ONNX Engine: Meetily's Parakeet implementation method overstated**
- Item: "Parakeet ONNX Engine"
- What's wrong: "What they do" states that VoxType, Handy, and Meetily all support Parakeet "via ONNX Runtime." The research confirms ONNX Runtime for VoxType and Handy. For Meetily, the research only says it supports "Whisper or Parakeet models" — the underlying runtime for Parakeet in Meetily is not confirmed. Meetily's architecture note specifies only "Rust whisper.cpp bindings for transcription; SortFormer ONNX for diarization." Claiming Meetily uses ONNX Runtime specifically for Parakeet is an overstatement.
- Suggested fix: Change "What they do" to: "VoxType and Handy support NVIDIA Parakeet (FastConformer TDT) via ONNX Runtime as an alternative to Whisper. Meetily also supports Parakeet models (implementation details not confirmed in available source)."

**Issue 2 — Global Hotkey / Dictation Mode: BlahST omitted from attribution note**
- Item: "Global Hotkey / Dictation Mode (Paste to Active App)"
- What's wrong: BlahST is listed in the "Inspired by" field and correctly described in "What they do," but it is absent from the "Attribution note (README)" field. The note credits VoxType, Handy, and open-wispr only.
- Suggested fix: Add BlahST to the attribution note: "...inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT), [Handy](https://github.com/cjpais/Handy) (MIT), [open-wispr](https://github.com/human37/open-wispr) (MIT), and [BlahST](https://github.com/QuantiusBenignus/BlahST) (MIT)."

**Issue 3 — Silero VAD: WhisperWriter omitted from attribution note**
- Item: "Silero VAD (Voice Activity Detection)"
- What's wrong: WhisperWriter is correctly listed in "Inspired by" and credited in "What they do" for making silence duration configurable. However, the attribution note credits only Handy.
- Suggested fix: Update attribution note to: "Silero VAD integration for silence trimming inspired by [Handy](https://github.com/cjpais/Handy) (MIT) and [WhisperWriter](https://github.com/savbell/whisper-writer) (MIT)."

**Issue 4 — macOS Native Speech Framework: whisper-mac has no license specified**
- Item: "macOS Native Speech Framework Engine (SFSpeechRecognizer)"
- What's wrong: The attribution note omits a license tag, but more importantly the research explicitly notes whisper-mac has "no license specified" and should be treated as "inspiration only, not for attribution." The backlog promotes it to a full attribution credit anyway.
- Suggested fix: Downgrade the attribution note to a softer acknowledgment — e.g., "macOS native Speech framework engine integration inspired by the engine inventory in [whisper-mac](https://github.com/Explosion-Scratch/whisper-mac) (license unspecified — no code reuse)." Alternatively, cite `SFSpeechRecognizer` as an Apple-documented API and drop the whisper-mac credit entirely, since the underlying technology is Apple's.

---

## Attribution Note Quality

3 items flagged.

**Flag 1 — Global Hotkey / Dictation Mode**: Attribution note is missing BlahST (see Issue 2 above). This makes the note inconsistent with the "Inspired by" field, which is the canonical credit record. Fix as described in Issue 2.

**Flag 2 — Silero VAD**: Attribution note is missing WhisperWriter (see Issue 3 above). Fix as described in Issue 3.

**Flag 3 — macOS Menu Bar Recording Status Indicator**: The note reads "Always-visible recording status indicator concept inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT)." The word "concept" is appropriately hedged (VoxType's implementation is Waybar/Linux-specific JSON), but the note would be more useful if it specified what the inspiration is: e.g., "Always-visible recording status indicator concept — specifically the real-time JSON state feed powering system bar integration — inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT)." Not a blocking issue, but improves README clarity.

All markdown link formats are syntactically correct (`[ProjectName](URL)` throughout). No broken or malformed links detected.
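
The consistency rule behind Flags 1 and 2 (every project linked in "Inspired by" should also be linked in the attribution note) can be checked mechanically. The following is a minimal Python sketch, assuming each backlog field is available as a plain string containing `[ProjectName](URL)` links; `linked_urls` and `missing_from_note` are hypothetical helper names, not part of any existing tooling.

```python
import re

# Matches markdown links of the form [ProjectName](URL) and captures the URL.
LINK_RE = re.compile(r"\[[^\]]+\]\((https?://[^)\s]+)\)")


def linked_urls(text: str) -> set:
    """Return the set of URLs that appear as markdown links in text."""
    return set(LINK_RE.findall(text))


def missing_from_note(inspired_by: str, attribution_note: str) -> set:
    """URLs credited in "Inspired by" but absent from the attribution note."""
    return linked_urls(inspired_by) - linked_urls(attribution_note)
```

An empty result means the two fields credit the same projects. A non-empty result lists the URLs to add to the note, which is the kind of gap reported in Issues 2 and 3.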

---

## New Discovery Integration

3 items require action.

**WhisperKit (argmax-oss-swift) — SRT/VTT export credit missing**

The "SRT and VTT Export" item credits only VoxType. WhisperKit has built-in SRT/VTT subtitle export as a first-class feature: it produces word-level timestamps and serialises them to SRT/VTT. This is the most direct implementation reference for Agrapha's Kotlin serializer, since WhisperKit documents the data model explicitly (word timestamps → segment grouping → SRT formatting). Both export items ("SRT and VTT Export" and "JSON Export") were checked against this discovery. JSON export is VoxType-only and WhisperKit does not produce JSON meeting exports, so the JSON attribution is correct as-is; only "SRT and VTT Export" needs a change.

Recommended change for "SRT and VTT Export":
- Add `[argmax-oss-swift (WhisperKit)](https://github.com/argmaxinc/argmax-oss-swift)` to "Inspired by"
- Update attribution note to: "SRT and VTT export format design inspired by [VoxType](https://github.com/peteonrails/voxtype) (MIT); subtitle data model (word timestamps → segment grouping) noted from [WhisperKit / argmax-oss-swift](https://github.com/argmaxinc/argmax-oss-swift) (MIT)."
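
The data model named above (word timestamps → segment grouping → SRT formatting) can be sketched in a few lines. This is an illustration in Python rather than Agrapha's Kotlin, and it is not WhisperKit's or VoxType's code: the `Word` type, the `group_words` helper, and the `max_gap` / `max_words` thresholds are all assumptions chosen for the example. Only the SRT cue layout (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) is fixed by the format.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording


def group_words(words, max_gap=0.8, max_words=10):
    """Start a new segment after a pause longer than max_gap or after max_words words."""
    segments = []
    for word in words:
        current = segments[-1] if segments else None
        if current is None or len(current) >= max_words or word.start - current[-1].end > max_gap:
            segments.append([word])
        else:
            current.append(word)
    return segments


def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SRT puts a comma before the milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(words):
    """Serialise word-level timestamps into numbered SRT cues."""
    cues = []
    for index, segment in enumerate(group_words(words), start=1):
        start = srt_timestamp(segment[0].start)
        end = srt_timestamp(segment[-1].end)
        text = " ".join(word.text for word in segment)
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```

A VTT variant would reuse the same grouping step, emit a `WEBVTT` header line, and use a period instead of a comma in the timestamps.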

**Hex (kitlangton) — Parakeet engine item should credit it**

The "Parakeet ONNX Engine" item credits VoxType, Handy, and Meetily. Hex provides a clean, production Swift implementation of dual-engine Parakeet+Whisper switching with a user-facing toggle — the most direct reference design for Agrapha's planned `TranscriptionEngine` abstraction. Since Hex is MIT-licensed and macOS-only (Apple Silicon), it is a closer platform match than VoxType (Linux) or Handy (multi-platform Tauri).

Recommended change for "Parakeet ONNX Engine":
- Add `[Hex](https://github.com/kitlangton/Hex)` to "Inspired by"
- Update attribution note to include: "...and [Hex](https://github.com/kitlangton/Hex) (MIT) for the multi-engine toggle design pattern."
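
The multi-engine toggle pattern can be illustrated with a small sketch. It is written in Python for brevity and does not reflect Hex's Swift code or Agrapha's actual Kotlin interface: the method signature on `TranscriptionEngine`, the `StubEngine` placeholder, and the `EngineSwitcher` class are all hypothetical.

```python
from typing import List, Protocol


class TranscriptionEngine(Protocol):
    """Hypothetical engine interface; the real abstraction may differ."""

    name: str

    def transcribe(self, samples: List[float]) -> str: ...


class StubEngine:
    """Placeholder engine that only reports which backend handled the audio."""

    def __init__(self, name: str) -> None:
        self.name = name

    def transcribe(self, samples: List[float]) -> str:
        return f"[{self.name}] {len(samples)} samples"


class EngineSwitcher:
    """Holds the available engines and the one currently selected by the user."""

    def __init__(self, engines: List[TranscriptionEngine], default: str) -> None:
        self._engines = {engine.name: engine for engine in engines}
        self.select(default)

    def select(self, name: str) -> None:
        # Reject unknown names so a stale setting cannot select a missing engine.
        if name not in self._engines:
            raise ValueError(f"unknown engine: {name}")
        self._active = self._engines[name]

    def transcribe(self, samples: List[float]) -> str:
        return self._active.transcribe(samples)
```

Callers only ever talk to the switcher, so a settings toggle reduces to one `select` call and the recording pipeline does not need to know which engine is active.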

**noScribe — no backlog item for transcript correction editor**

noScribe ships a dedicated correction editor (noScribeEdit) for reviewing and annotating transcripts word-by-word. There is no backlog item for an inline transcript editor in Agrapha. This is a gap relative to the research findings. noScribe's GPL-3.0 license means no code reuse, but the UX pattern (click a word, correct it, re-run a segment) is freely borrowable.

Recommendation: Add a new Medium-priority backlog item — "Inline Transcript Correction Editor" — inspired by noScribe, noting GPL-3.0 code restrictions. This fits naturally between the existing transcription history feature and the re-transcription feature, and aligns with Agrapha's meeting memory use case (correcting misheard names, technical terms, proper nouns before export).

---

## Verdict

**PASS**

All 6 blocking and quality fixes from the original review have been applied; validation is complete:

1. "Parakeet ONNX Engine" promoted from Medium to High priority — requirements coverage gap for feature area 2 resolved.
2. "Parakeet ONNX Engine" — "What they do" field corrected to no longer overstate Meetily's Parakeet implementation as "ONNX Runtime" (Issue 1 fixed).
3. BlahST added to the "Global Hotkey / Dictation Mode" attribution note (Issue 2 fixed).
4. WhisperWriter added to the "Silero VAD" attribution note (Issue 3 fixed).
5. whisper-mac license problem resolved in comparable-projects.md — section renamed to "Inspiration Reference" with explicit note that it is not for attribution; plan.md attribution note updated accordingly (Issue 4 fixed).
6. WhisperWriter license updated to "Unconfirmed (no LICENSE file in repo)" in comparable-projects.md and all "(MIT)" tags for WhisperWriter replaced with "(license unconfirmed)" in plan.md.

**Count summary:**
- Attribution issues resolved: 4
- License accuracy fixes: 2 (WhisperWriter unconfirmed; whisper-mac inspiration-only, which is also counted above as Issue 4)
- Requirements gaps resolved: 1 (Parakeet ONNX Engine promoted to High)
56 changes: 56 additions & 0 deletions project_plans/agrapha-feature-research/requirements.md
@@ -0,0 +1,56 @@
# Agrapha Feature Research — Requirements

## Overview

Survey open-source local-first speech-to-text and meeting-transcription projects to build a prioritized feature backlog for Agrapha. Each backlog item must include an attribution note (project name + URL) for the README.

## Current State of Agrapha

- macOS desktop app (Kotlin Multiplatform + Compose Desktop)
- Records both mic and system audio via CoreAudio/ScreenCaptureKit JNI
- Transcribes locally with Whisper.cpp via JNI (Apple Neural Engine / CoreML)
- Optional speaker diarization via pyannote.audio
- Optional transcript correction + summarization via LLM (Ollama, OpenAI, Anthropic)
- Exports to Logseq (journal entries with [[links]])
- Outputs: key points, decisions, action items, full transcript
- Stack: Kotlin Multiplatform · Compose Desktop · SQLDelight · Whisper.cpp JNI

## Seed References (for README attribution)

| Project | URL | Stars | Language | Focus |
|---------|-----|-------|----------|-------|
| voxtype | https://github.com/peteonrails/voxtype | 712 | Rust | Push-to-talk STT for Linux/Wayland; 7 engines; meeting mode |
| BlahST | https://github.com/QuantiusBenignus/BlahST | 172 | Shell | Lean whisper.cpp wrapper; LLM integration; continuous dictation |
| Handy | https://github.com/cjpais/Handy | 21364 | Rust/Tauri | Cross-platform desktop STT; VAD; Parakeet; extensible |

Research should also discover 3–5 additional comparable projects.

## Feature Areas of Interest

1. **Push-to-talk / dictation mode** — hold-key-to-record anywhere on screen (not just during meetings)
2. **Additional transcription engines** — Parakeet, Moonshine, SenseVoice, Paraformer beyond Whisper
3. **LLM integration patterns** — speech-to-LLM chat, AI proofreader, one-shot assistant, TTS responses
4. **Export formats** — Markdown, JSON, SRT, VTT (beyond current Logseq/plain-markdown)

## Deliverable

A **prioritized feature backlog** structured as:

```
## Feature: <name>
**Priority:** High / Medium / Low
**Inspired by:** <Project>(s) with URL(s)
**What they do:** <1–2 sentences>
**What Agrapha would do:** <1–2 sentences scoped to Agrapha's macOS/meeting context>
**Attribution note (README):** <exact text to add>
**Effort estimate:** XS / S / M / L / XL
```

Items should be ordered: High priority first, then Medium, then Low.
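
The ordering rule can be expressed as a stable sort on a priority rank. This is a minimal Python sketch, assuming backlog items have already been parsed into records; `BacklogItem` and `ordered` are hypothetical names, and the sample items in the usage note reuse priorities quoted in the validation report.

```python
from dataclasses import dataclass

# Rank used for sorting: lower rank is listed first.
PRIORITY_ORDER = {"High": 0, "Medium": 1, "Low": 2}


@dataclass
class BacklogItem:
    name: str
    priority: str  # "High", "Medium", or "Low"


def ordered(items):
    """Return items sorted High, Medium, Low; ties keep their original order."""
    return sorted(items, key=lambda item: PRIORITY_ORDER[item.priority])
```

For example, `ordered([BacklogItem("Moonshine Engine", "Low"), BacklogItem("JSON Export", "High")])` places "JSON Export" first. Because `sorted` is stable, items that share a priority stay in the order the author wrote them.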

## Constraints

- Agrapha is macOS-only (Intel + Apple Silicon); Linux-specific features are inspiration only
- Core value proposition is meeting transcription + memory system export — features must fit that context or expand it coherently
- Attribution accuracy matters: only credit a project for things they actually do
- Research agent must also search for comparable projects not in the seed list
72 changes: 72 additions & 0 deletions project_plans/agrapha-feature-research/research/blahst.md
@@ -0,0 +1,72 @@
# BlahST — Research

## Summary

BlahST (172 stars, zsh shell scripts, MIT) is a minimal Linux STT toolkit built on top of whisper.cpp. It provides six composable scripts, each invoked from a hotkey, covering basic transcription, multilingual input, one-shot LLM assistance, continuous dictation, LLM chat, and streaming speech-to-speech conversation. Its design philosophy is zero-UI ("the best UI is no UI at all"), using the clipboard or PRIMARY selection as the universal paste mechanism. The LLM integration pattern (speech → transcription → llama-server → Piper TTS) is directly relevant to Agrapha's planned AI features.

## Feature Inventory

### Scripts

| Script | Function |
|--------|----------|
| `wsi` | Core STT: record mic via sox, transcribe via local whisper.cpp or whisper.cpp server API or whisperfile, output to clipboard or PRIMARY selection. Silence-detection stops recording automatically (~2 s of silence at 6% threshold). Prevents duplicate invocations via `pidof` guard. |
| `wsiml` | Multilingual variant of `wsi`: `-l fr` forces language, `-t` translates to English. Supports multiple hotkeys per language. |
| `wsiAI` | One-shot LLM assistant: transcribes speech, constructs prompt ("Assistant …", "Translator …", "Computer, proofread …"), sends to llama-server or llamafile, gets text response, speaks via Piper TTS, and places answer in clipboard. |
| `blooper` | Continuous dictation loop: transcribes in a loop, autopastes on each pause (~3 s silence threshold), exits on longer silence or hotkey interrupt; uses xdotool/ydotool for autopaste. |
| `blahstbot` | Low-latency speech chat: record → whisper.cpp → llama-server → Piper TTS → spoken + clipboard response. Context persists between turns; "RESET CONTEXT" spoken command clears it. `-n` flag uses LAN server for offload. |
| `blahstream` | Streaming speech-to-speech chat: like `blahstbot` but LLM response streamed token-by-token; each chunk spoken and autopasted in real time. Context compression via summarization. X11: buffers paste until original window regains focus. WIP/experimental. |

### Additional Features

- **whisperfile support**: portable single-file whisper model executables (`-w` flag); no compilation needed
- **AI proofreader**: triggered by keyword in speech ("Computer, proofread …"); sends currently selected text to LLM, replaces selection with corrected version — no clipboard interaction required
- **AI translator**: "Translator …" spoken keyword → LLM translates text, speaks result, places in clipboard
- **Clipboard vs. PRIMARY selection**: `wsi` (no flag) uses clipboard (Ctrl+V paste); `wsi -p` uses PRIMARY selection (middle-mouse paste); both support X11 and Wayland
- **Network transcription**: whisper.cpp HTTP server API (`-n` or explicit `IP:PORT` argument); sub-150 ms round-trip on LAN
- **Autopaste**: xdotool (X11) or ydotool (Wayland) simulates Ctrl+V or middle-click after transcription
- **Single-instance guard**: `pidof -q <scriptname> && exit 0` prevents duplicate hotkey presses from stacking
- **Hotkey interrupt**: `pkill rec` as a second hotkey cancels in-flight recording immediately
- **Piper TTS**: neural local TTS for spoken LLM responses; no cloud, supports multiple voice models
- **Centralised config**: all tools share `blahst.cfg`; local per-script overrides possible
- **Microphone indicator**: GNOME top-bar mic icon appears for duration of recording (uses system desktop notification mechanism)

## LLM Integration Approach

BlahST treats each LLM interaction as a one-shot or stateful HTTP call to a locally-running `llama-server` (llama.cpp) or a `llamafile` executable:

1. **Speech capture**: `rec` (sox) captures at 16 kHz until silence
2. **Transcription**: `whisper-cli` (local) or HTTP POST to `/inference` (server/whisperfile)
3. **Prompt construction**: shell string interpolation; keyword in speech ("Computer …", "Assistant …") determines which system prompt to prepend
4. **LLM call**: `curl` POST to `llama-server` API; streaming (`blahstream`) or one-shot (`blahstbot`)
5. **TTS output**: response piped to `piper` for local neural TTS; output played via `aplay`
6. **Clipboard/paste**: response also written to clipboard for manual paste
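
Step 3 (keyword-driven prompt construction) is the part most easily carried over to Agrapha. The sketch below shows the idea in Python rather than BlahST's zsh. The spoken keywords come from the script descriptions above, while the system prompt wording, the `SYSTEM_PROMPTS` table, and `build_prompt` are assumptions made for illustration.

```python
from typing import Optional, Tuple

# Hypothetical prompt table: spoken keyword (lowercase) -> system prompt to prepend.
SYSTEM_PROMPTS = {
    "computer, proofread": "Proofread the user's text and return only the corrected version.",
    "translator": "Translate the user's text into English.",
    "assistant": "You are a helpful assistant. Answer concisely.",
}


def build_prompt(transcript: str) -> Optional[Tuple[str, str]]:
    """Return (system_prompt, user_text) if the transcript starts with a known keyword."""
    spoken = transcript.strip()
    for keyword, system_prompt in SYSTEM_PROMPTS.items():
        # Compare only the leading characters, ignoring case.
        if spoken[: len(keyword)].lower() == keyword:
            user_text = spoken[len(keyword):].lstrip(" ,.:")
            return system_prompt, user_text
    return None  # No keyword: treat the transcript as plain dictation.
```

Returning `None` for unmatched transcripts lets the caller fall back to ordinary transcription output, so the dispatch layer stays optional.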

Context management in `blahstbot` is manual: conversation history held in a shell array, serialised to JSON, truncated or summarised when it grows too large.
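
A simple way to picture the truncation step is a newest-first budget over the message list. The sketch below is an assumption-laden illustration in Python, not BlahST's shell logic: the `role` / `content` message shape, the character budget, and the `truncate_history` name are hypothetical, and the summarisation path is not shown.

```python
import json


def truncate_history(messages, max_chars):
    """Keep the newest messages whose combined content fits within max_chars.

    The most recent message is always kept, even if it alone exceeds the budget.
    """
    kept = []
    used = 0
    for message in reversed(messages):
        used += len(message["content"])
        if used > max_chars and kept:
            break
        kept.append(message)
    kept.reverse()  # Restore chronological order.
    return kept


def to_payload(messages, max_chars):
    """Serialise the truncated history to JSON for the next LLM request."""
    return json.dumps({"messages": truncate_history(messages, max_chars)})
```

A character budget is only a rough stand-in for a token limit; a real implementation would measure tokens with the model's tokenizer.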

## Continuous Dictation Design

`blooper` uses sox silence detection (`silence 1 0.1 1% 1 2.0 5%`) to end each segment, then immediately autopastes and restarts recording. The loop exits on ≥3 s silence or hotkey interrupt (`pkill rec`). Text accumulates at the keyboard caret via xdotool/ydotool. This is entirely within the terminal/shell — no GUI required.
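
The control flow of this loop (record → transcribe → paste → repeat) can be sketched independently of sox and xdotool. In the Python illustration below, the three callables are hypothetical stand-ins supplied by the caller: `record_segment` would wrap silence-terminated capture and return `None` when the loop should stop (long silence or hotkey interrupt), `transcribe` would wrap the STT engine, and `paste` would wrap whatever autopaste mechanism the platform offers.

```python
from typing import Callable, Optional


def dictation_loop(
    record_segment: Callable[[], Optional[bytes]],
    transcribe: Callable[[bytes], str],
    paste: Callable[[str], None],
) -> int:
    """Record, transcribe, paste, repeat. Returns the number of segments pasted."""
    pasted = 0
    while True:
        audio = record_segment()
        if audio is None:
            # Stop condition reported by the recorder (long silence or interrupt).
            break
        text = transcribe(audio).strip()
        if text:
            # Trailing space so consecutive segments do not run together at the caret.
            paste(text + " ")
            pasted += 1
    return pasted
```

Injecting the three steps keeps the loop testable without audio hardware and leaves the silence thresholds entirely to the recorder implementation.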

## Architecture Notes

- Pure zsh scripts (~100–300 lines each); no compiled components beyond whisper.cpp and llama.cpp
- All IPC via clipboard, PRIMARY selection, and signals (`pkill`)
- Stateless between invocations (except `blahstbot` conversation array)
- Not portable to macOS as-is (sox flags, xdotool/ydotool, xsel/wl-copy are Linux-specific), but the **design patterns** are portable
- Relevant to Agrapha: the clipboard-as-universal-output pattern, the one-shot LLM prompt construction pattern, and the keyword-triggered AI proofreader concept all map cleanly to Kotlin/Compose

## Agrapha Relevance

| Feature | Rationale |
|---|---|
| **Continuous dictation mode** (`blooper`) | Agrapha is meeting-first, but users may also want dictation between meetings. `blooper`'s loop-until-silence pattern is the reference design: record → transcribe → paste → repeat. Could be Agrapha's "Dictation Mode" toggle. |
| **AI proofreader triggered by selected text** | Select text in any app, say "proofread", and it's replaced. This would complement Agrapha's existing LLM integration at near-zero engineering cost (services layer already exists). |
| **One-shot speech-to-LLM** (`wsiAI` pattern) | User dictates a question; Agrapha sends it to configured LLM; speaks/displays the answer. Natural extension of the existing Ollama/Anthropic/OpenAI integration. |
| **Keyword-dispatch to different LLM prompts** | Spoken prefixes ("Summarise …", "Action items …", "Draft email …") can route to different pre-configured system prompts. Low engineering cost, high UX value in a meeting context. |
| **Network transcription offload** | Agrapha runs on the meeting device; a second Mac or a home server could run whisper-server for faster inference. BlahST's `-n` pattern is the reference for how to add a remote backend. |
| **Piper TTS for spoken responses** | Agrapha could speak AI-generated summaries or action items at meeting end. Piper runs fully locally; the macOS analogue is `say` (built-in) or a native neural TTS. |

## Attribution Note

> Continuous dictation loop design and speech-to-LLM one-shot pattern inspired by [BlahST](https://github.com/QuantiusBenignus/BlahST) (MIT).