feat(stt): local Whisper transcription backend + transcribe_audio worker tool #177

Marenz wants to merge 4 commits into spacedriveapp:main
Conversation
Force-pushed from 139bb52 to d9b8764 (Compare)
When `routing.voice = "whisper-local://<spec>"`, audio attachments are transcribed locally instead of via the LLM provider HTTP path. `<spec>` is either:

- A known size name (tiny/base/small/medium/large) — fetched from ggerganov/whisper.cpp on HuggingFace via hf-hub, using the existing HF cache if already present
- An absolute path to a GGML model file

The `WhisperContext` is loaded once and cached in a `OnceLock` for the process lifetime. Audio decoding (ogg, opus, mp3, flac, wav, m4a) is handled by symphonia with linear resampling to 16 kHz mono f32. All three deps (whisper-rs, hf-hub, symphonia) are optional behind the `stt-whisper` feature flag.

Workers can now call `transcribe_audio(path)` to transcribe local audio files. The tool uses whatever is configured in `routing.voice` — local Whisper (`whisper-local://<spec>`) or any OpenAI-compatible HTTP provider. The transcription logic is extracted from `channel.rs` into `stt.rs` as `transcribe_bytes()`, shared by both the channel attachment handler and the new tool. The `stt` module is now always compiled (not gated on `stt-whisper`) since it handles all provider paths.
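A minimal sketch of how the `whisper-local://<spec>` route described above might be parsed. The `ModelSpec` enum, `parse_voice_route` function, and `KNOWN_SIZES` list are illustrative assumptions, not the PR's actual code; the real resolver additionally downloads named sizes from HuggingFace via hf-hub.

```rust
// Hypothetical parser for a `routing.voice` value of the form
// `whisper-local://<spec>`, mirroring the two cases from the PR description.
#[derive(Debug, PartialEq)]
enum ModelSpec {
    /// A known size name, fetched from ggerganov/whisper.cpp on HuggingFace.
    Named(String),
    /// An absolute path to a local GGML model file.
    Path(String),
}

fn parse_voice_route(route: &str) -> Option<ModelSpec> {
    let spec = route.strip_prefix("whisper-local://")?;
    const KNOWN_SIZES: &[&str] = &["tiny", "base", "small", "medium", "large"];
    if KNOWN_SIZES.contains(&spec) {
        Some(ModelSpec::Named(spec.to_owned()))
    } else if std::path::Path::new(spec).is_absolute() {
        Some(ModelSpec::Path(spec.to_owned()))
    } else {
        // Neither a known size name nor an absolute path.
        None
    }
}

fn main() {
    assert_eq!(
        parse_voice_route("whisper-local://base"),
        Some(ModelSpec::Named("base".to_owned()))
    );
    // Non-whisper routes fall through to the HTTP provider path.
    assert_eq!(parse_voice_route("whisper-1"), None);
}
```

Routes that don't match the `whisper-local://` prefix would be handled by the OpenAI-compatible HTTP path instead.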
Force-pushed from 37d9e0d to 89b57ea (Compare)
Walkthrough

Introduces Speech-to-Text (STT) transcription feature supporting local Whisper and OpenAI-compatible providers, along with extensive agent channel enhancements including worker/branch spawning APIs, concurrent worker limits, attachment download/transcription, history management, and user message formatting with attribution context.

Changes
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ Passed checks (3 passed)
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/agent/channel.rs (1)
Lines 3-9: ⚠️ Potential issue | 🔴 Critical

Remove the duplicated helper block and fix stale ProcessEvent payloads in channel.rs.

This file imports from the `channel_attachments`, `channel_dispatch`, and `channel_history` modules (lines 3–9) but also redefines those same helpers locally: `spawn_memory_persistence_branch` (1830), `extract_reply_from_tool_syntax` (2223), `format_user_message` (2277), `event_is_for_channel` (2343), `download_attachments` (2381), and `apply_history_after_turn` (2585). Additionally, ProcessEvent payloads at lines 2051 and 2149 are missing the required `worker_type` field, and line 2208 is missing the required `success` field — these will fail to compile. Delete the entire duplicated helper block (1800–2843) and keep the canonical implementations in the separate modules.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/channel.rs` around lines 3 - 9, Remove the entire duplicated helper block that reimplements spawn_memory_persistence_branch, extract_reply_from_tool_syntax, format_user_message (and format_batched_user_message if present), event_is_for_channel, download_attachments, and apply_history_after_turn so the file uses the canonical implementations imported at the top; after deleting that block, verify there are no remaining local definitions or duplicate symbols and update any call sites to use the imported functions. Also scan this file for any ProcessEvent payload constructions and ensure they include the required fields (add worker_type and add success where missing) so payload shapes match the canonical ProcessEvent type, and run a compile to catch any unresolved references from the deletion.
🧹 Nitpick comments (2)
src/stt.rs (2)
Lines 310-313: Use `Path::is_absolute()` for absolute-path detection.

`spec.starts_with('/')` is Unix-specific. `std::path::Path::is_absolute()` is the correct path check across platforms.

♻️ Suggested fix

```diff
-    if spec.starts_with('/') {
-        if std::path::Path::new(spec).exists() {
+    if std::path::Path::new(spec).is_absolute() {
+        if std::path::Path::new(spec).exists() {
             return Ok(spec.to_owned());
         }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/stt.rs` around lines 310 - 313, Replace the Unix-specific check spec.starts_with('/') with the cross-platform Path::is_absolute() idiom: use std::path::Path::new(spec).is_absolute() in the same conditional that also calls Path::new(spec).exists(), updating the branch in the function/block that handles the spec variable so it returns Ok(spec.to_owned()) when the path is absolute and exists (look for the spec.starts_with('/') check near the Path::new(spec).exists() call in src/stt.rs).
Lines 251-523: Expand abbreviated variable names in local decoding/transcription paths.

Names like `ctx`, `n`, `dec`, and `mss` reduce readability in already dense logic. Prefer explicit names in this module.

As per coding guidelines, "Do not abbreviate variable names; use `queue` not `q`, `message` not `msg`, `channel` not `ch`; common abbreviations like `config` are acceptable".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/stt.rs` around lines 251 - 523, The review asks to replace abbreviated local variable names with explicit, descriptive names to improve readability: in transcribe_blocking rename ctx -> context and n -> segment_count (and any other short names in that function); in get_or_load_context keep variable names explicit (e.g., ctx -> context) when caching/loading; in decode_to_f32 rename mss -> media_source_stream, decoded -> decoded_packet_or_buffer (or decoded_buffer), and track-related locals to descriptive names (e.g., track_id, sample_rate, channel_count are fine); in decode_ogg_opus rename dec -> opus_decoder and pcm -> pcm_buffer and samples -> mono_samples; and in resample keep names clear (e.g., ratio, output_len, output). Update all usages consistently within each function (transcribe_blocking, get_or_load_context, resolve_model_path, decode_to_f32, is_ogg_opus, decode_ogg_opus, resample) and run cargo check to ensure no missing renames.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)

- `Cargo.lock` is excluded by `!**/*.lock`
- `Cargo.toml` is excluded by `!**/*.toml`

📒 Files selected for processing (9)

- README.md
- prompts/en/tools/transcribe_audio_description.md.j2
- src/agent/channel.rs
- src/agent/worker.rs
- src/lib.rs
- src/prompts/text.rs
- src/stt.rs
- src/tools.rs
- src/tools/transcribe_audio.rs
```rust
#[cfg(feature = "stt-whisper")]
pub mod stt;
```
🧩 Analysis chain

🏁 Scripts executed:

```shell
# Check for the stt module declaration and feature gates in src/lib.rs
rg -n 'pub mod stt|cfg\(feature = "stt-whisper"\)' src/lib.rs

# Find all references to crate::stt throughout the codebase
rg -n 'crate::stt' --type rust -A 2

# Also check for any use statements that might reference stt
rg -n 'use.*stt' --type rust

# Check the context around the call sites to see if they're gated
sed -n '85,100p' src/tools/transcribe_audio.rs

# Check if the files containing call sites have any feature gating
head -20 src/tools/transcribe_audio.rs
head -30 src/agent/channel.rs

# Verify if there are conditional compilations around those call sites
sed -n '2490,2515p' src/agent/channel.rs

# Check if the files with call sites are themselves feature-gated
head -1 src/tools/transcribe_audio.rs
grep -n '#\[cfg' src/tools/transcribe_audio.rs | head -5

# Check if transcribe_audio.rs is conditionally compiled
grep -B5 'pub struct TranscribeAudioTool' src/tools/transcribe_audio.rs

# Check the tools.rs file to see how transcribe_audio is included
cat src/tools.rs | head -50

# Check if the containing functions/methods are feature-gated
sed -n '2480,2520p' src/agent/channel.rs | grep -B20 'crate::stt'

# Check tools.rs to see if transcribe_audio is included unconditionally
grep -n 'transcribe_audio' src/tools.rs

# Check if there's a Cargo.toml to understand the stt-whisper feature
cat Cargo.toml | head -100

# Check if transcribe_audio tool usage is conditionally compiled elsewhere
grep -rn 'TranscribeAudioTool' --type rust src/

# Check if the stt module itself is structured to handle the feature gate gracefully
head -30 src/stt.rs 2>/dev/null || echo "stt.rs not found, checking for stt/"

# Check if there's a stt/ directory (module)
ls -la src/ | grep -i stt

# Check the full transcribe_bytes function signature in stt.rs
sed -n '30,80p' src/stt.rs

# Check how transcribe_bytes handles the case when stt-whisper is not enabled
grep -n 'pub.*fn transcribe_bytes' src/stt.rs -A 20

# Verify that the HTTP provider path is unconditionally available
grep -n 'http\|Http\|provider' src/stt.rs | head -20

# Verify that stt.rs itself is NOT feature-gated at the module level
head -5 src/stt.rs

# Check if there's a mod declaration for stt in src/lib.rs and if it matches the cfg gate
sed -n '20,35p' src/lib.rs

# Confirm that transcribe_audio.rs has no feature gates around the call site
sed -n '1,15p' src/tools/transcribe_audio.rs && sed -n '80,100p' src/tools/transcribe_audio.rs

# Double check whether tools.rs or tools/transcribe_audio.rs is itself feature-gated
grep -n '#\[cfg' src/tools.rs
grep -n '#\[cfg' src/tools/transcribe_audio.rs

# Check whether agent/channel.rs is feature-gated
head -50 src/agent/channel.rs | grep -n '#\[cfg'

# Final verification: confirm build would fail without stt-whisper
# by checking if crate::stt is definitely called in non-gated contexts
echo "=== Checking transcribe_audio.rs context ===" && \
sed -n '1,20p' src/tools/transcribe_audio.rs | grep -c 'cfg' && \
echo "=== Checking if TranscribeAudioTool is exposed in tools.rs ===" && \
sed -n '50,110p' src/tools.rs | grep -i transcribe
```
Remove the feature gate from the stt module declaration at the crate root.

The stt module is gated with `#[cfg(feature = "stt-whisper")]`, but call sites in src/tools/transcribe_audio.rs and src/agent/channel.rs reference `crate::stt` unconditionally. Since the stt module itself supports builds without the feature (dispatching to HTTP providers), the module should be unconditionally available.

🔧 Proposed fix

```diff
-#[cfg(feature = "stt-whisper")]
 pub mod stt;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib.rs` around lines 27 - 28, The crate root currently conditionally
declares the stt module with #[cfg(feature = "stt-whisper")], but other files
(e.g., references to crate::stt in transcribe_audio.rs and channel.rs) import it
unconditionally; remove the feature gate so the module is always declared
(change #[cfg(feature = "stt-whisper")] pub mod stt; to pub mod stt;), and if
needed keep internal cfgs inside the stt module implementation to select
whisper-specific code paths while preserving the unconditional module export.
```rust
let ctx = WhisperContext::new_with_params(&model_path, params)
    .map_err(|e| WhisperError::Load(e.to_string()))?;

let _ = CONTEXT.set((model_spec.to_owned(), ctx));
```
🧩 Analysis chain

🏁 Scripts executed:

```shell
# Find and examine the context around line 300 in src/stt.rs
head -n 310 src/stt.rs | tail -n 30

# Search for CONTEXT definition in the file to understand its type
rg "CONTEXT" src/stt.rs -B 2 -A 2

# Check for other uses of OnceLock or similar patterns in the codebase
rg "OnceLock" src/ -A 2 -B 2
```
Explicitly handle OnceLock::set failure instead of silently discarding it.
Line 300 silently discards the result of CONTEXT.set() via let _ =. Per coding guidelines, errors should be handled, logged, or propagated—not silently discarded. While the code handles the failure implicitly by retrieving the already-set value, the race condition should be acknowledged explicitly at debug level.
🔧 Suggested fix

```diff
-    let _ = CONTEXT.set((model_spec.to_owned(), ctx));
+    if CONTEXT.set((model_spec.to_owned(), ctx)).is_err() {
+        tracing::debug!("Whisper context was initialized concurrently; using existing context");
+    }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/stt.rs` at line 300, The call to CONTEXT.set((model_spec.to_owned(),
ctx)) currently discards the Result with let _ =; instead explicitly handle the
failure from OnceLock::set by checking its return (or catching the Err) and
logging a debug-level message that a race occurred and the context was already
initialized (include model_spec info), or propagate the error if appropriate;
reference CONTEXT, OnceLock::set, model_spec and ctx to locate the site and
ensure the branch still retrieves the existing value when set fails.
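A self-contained sketch of the `OnceLock` race the comment describes: only the first `set` wins, so the losing caller should acknowledge the race and fall back to the already-stored value. The cached value is simplified to a `String` here instead of a `WhisperContext`, and the logging is `eprintln!` rather than `tracing`.

```rust
use std::sync::OnceLock;

// (model_spec, context) pair cached for the process lifetime,
// mirroring the CONTEXT static from the review.
static CONTEXT: OnceLock<(String, String)> = OnceLock::new();

fn cache_context(model_spec: &str, context: String) -> &'static (String, String) {
    if CONTEXT.set((model_spec.to_owned(), context)).is_err() {
        // A concurrent caller initialized the cache first; note the race
        // explicitly instead of discarding the Result with `let _ =`.
        eprintln!("whisper context for {model_spec} already initialized; reusing it");
    }
    // Both branches end with a value present, so `get` always succeeds.
    CONTEXT.get().expect("CONTEXT is set on both branches")
}

fn main() {
    let first = cache_context("base", "ctx-a".to_owned());
    let second = cache_context("base", "ctx-b".to_owned()); // loses the race
    assert_eq!(first.1, "ctx-a");
    assert_eq!(second.1, "ctx-a"); // the first value is retained
}
```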
```rust
while let Ok(Some(packet)) = reader.read_packet() {
    if header_packets < 2 {
        if header_packets == 0 {
            // Parse OpusHead to get channel count and pre-skip.
            if packet.data.len() >= 11 && &packet.data[0..8] == b"OpusHead" {
                channels = packet.data[9] as usize;
                // Output sample rate is always 48000 for libopus.
                sample_rate = 48000;
            }
            decoder = Some(
                opus::Decoder::new(
                    sample_rate,
                    if channels == 2 {
                        opus::Channels::Stereo
                    } else {
                        opus::Channels::Mono
                    },
                )
                .map_err(|e| WhisperError::Decode(e.to_string()))?,
            );
        }
        header_packets += 1;
        continue;
    }

    let dec = decoder.as_mut().unwrap();
    // Max Opus frame: 120ms at 48kHz = 5760 samples per channel.
    let max_samples = 5760 * channels;
    let mut pcm = vec![0f32; max_samples];
    let n = dec
        .decode_float(&packet.data, &mut pcm, false)
        .map_err(|e| WhisperError::Decode(e.to_string()))?;

    // Mix down to mono.
    if channels == 1 {
        samples.extend_from_slice(&pcm[..n]);
    } else {
        for frame in 0..n {
            let mut sum = 0f32;
            for ch in 0..channels {
                sum += pcm[frame * channels + ch];
            }
            samples.push(sum / channels as f32);
        }
    }
}
```
🧩 Analysis chain

🏁 Scripts executed:

```shell
# First, let's look at the actual code around lines 454-499 in src/stt.rs
wc -l src/stt.rs

# Read the specific section mentioned in the review
sed -n '440,510p' src/stt.rs

# Also check the OnceLock issue mentioned at line 300
sed -n '295,305p' src/stt.rs

# Check for abbreviated variable names throughout the file
rg '\b(ctx|dec|n|mss)\b' src/stt.rs -n
```
Propagate Ogg packet read failures and remove panic path in decoder access.

Line 454 silently drops `read_packet()` errors by exiting the loop, and line 479 can panic on malformed headers via `unwrap()`. Both violate error-handling requirements and should return `WhisperError::Decode` instead. Additionally, variable names `dec` and `n` use disallowed abbreviations.

🔧 Suggested fix

```diff
-    while let Ok(Some(packet)) = reader.read_packet() {
+    loop {
+        let packet = match reader.read_packet() {
+            Ok(Some(packet)) => packet,
+            Ok(None) => break,
+            Err(error) => return Err(WhisperError::Decode(error.to_string())),
+        };
         if header_packets < 2 {
             if header_packets == 0 {
                 // Parse OpusHead to get channel count and pre-skip.
                 if packet.data.len() >= 11 && &packet.data[0..8] == b"OpusHead" {
                     channels = packet.data[9] as usize;
                     // Output sample rate is always 48000 for libopus.
                     sample_rate = 48000;
+                } else {
+                    return Err(WhisperError::Decode("invalid Opus header".into()));
                 }
                 decoder = Some(
                     opus::Decoder::new(
                         sample_rate,
                         if channels == 2 {
                             opus::Channels::Stereo
                         } else {
                             opus::Channels::Mono
                         },
                     )
                     .map_err(|e| WhisperError::Decode(e.to_string()))?,
                 );
             }
             header_packets += 1;
             continue;
         }
-        let dec = decoder.as_mut().unwrap();
+        let decoder = decoder
+            .as_mut()
+            .ok_or_else(|| WhisperError::Decode("Opus decoder not initialized".into()))?;
         // Max Opus frame: 120ms at 48kHz = 5760 samples per channel.
         let max_samples = 5760 * channels;
         let mut pcm = vec![0f32; max_samples];
-        let n = dec
+        let samples_decoded = decoder
             .decode_float(&packet.data, &mut pcm, false)
             .map_err(|e| WhisperError::Decode(e.to_string()))?;
```

Per coding guidelines: "Don't silently discard errors; use `let _ =` only on channel sends where the receiver may be dropped; handle, log, or propagate all other errors" and "Do not abbreviate variable names; use `queue` not `q`, `message` not `msg`, `channel` not `ch`."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/stt.rs` around lines 454 - 499, The loop currently swallows
reader.read_packet() errors and uses decoder.as_mut().unwrap() and abbreviated
names; change the packet loop to fully match reader.read_packet() and on Err(e)
return Err(WhisperError::Decode(e.to_string())), on Ok(None) break; replace
decoder.as_mut().unwrap() with a checked access that returns
Err(WhisperError::Decode("missing decoder".into())) if decoder is None; rename
local variables: dec -> decoder_mut or decoder_ref and n -> num_samples, and
rename inner ch -> channel to avoid abbreviations; ensure the decode_float error
is propagated as WhisperError::Decode and update subsequent uses to the new
variable names when extending samples or mixing down to mono.
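The mono mixdown in the loop above (average each interleaved frame across its channels) can be isolated as a small runnable sketch. The function name and test values are illustrative, not the PR's code.

```rust
// Average interleaved f32 PCM frames down to mono, as done after Opus
// decoding before feeding samples to Whisper.
fn mix_to_mono(pcm: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    pcm.chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

fn main() {
    // Stereo frames (L, R) are averaged per frame.
    let interleaved = [0.25f32, 0.75, -1.0, 1.0];
    assert_eq!(mix_to_mono(&interleaved, 2), vec![0.5, 0.0]);
    // Mono input passes through unchanged.
    assert_eq!(mix_to_mono(&[0.5f32, 0.25], 1), vec![0.5, 0.25]);
}
```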
```rust
if !voice_model.is_empty() {
    server = server.tool(TranscribeAudioTool::new(
        voice_model,
        llm_manager,
        http,
    ));
}
```
Trim voice_model before deciding tool registration.
Whitespace-only values currently register transcribe_audio and then fail at call time.
🔧 Proposed fix

```diff
-    if !voice_model.is_empty() {
+    let trimmed_voice_model = voice_model.trim().to_string();
+    if !trimmed_voice_model.is_empty() {
         server = server.tool(TranscribeAudioTool::new(
-            voice_model,
+            trimmed_voice_model,
             llm_manager,
             http,
         ));
     }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/tools.rs` around lines 434 - 440, Whitespace-only voice_model values
cause TranscribeAudioTool to be registered then fail; trim the voice_model
before the emptiness check and use the trimmed value when constructing
TranscribeAudioTool. Specifically, replace the current check of
voice_model.is_empty() with a check of a trimmed string (e.g., let trimmed =
voice_model.trim(); if trimmed.is_empty() { ... }) and pass that trimmed/owned
value into TranscribeAudioTool::new (the
server.tool(TranscribeAudioTool::new(...)) call) so only non-empty,
non-whitespace model names register.
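The guard the comment proposes can be shown in isolation: a whitespace-only model string passes an `is_empty()` check but fails a trimmed one. The helper name is hypothetical.

```rust
// Register the transcribe_audio tool only for non-empty, non-whitespace
// voice model specs.
fn should_register_transcribe_tool(voice_model: &str) -> bool {
    !voice_model.trim().is_empty()
}

fn main() {
    assert!(should_register_transcribe_tool("whisper-local://base"));
    // Whitespace-only values previously registered the tool, then failed at
    // call time; the trimmed check rejects them up front.
    assert!(!should_register_transcribe_tool("   "));
    assert!(!should_register_transcribe_tool(""));
}
```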
```rust
impl TranscribeAudioTool {
    /// Create a new transcribe audio tool.
    pub fn new(
        voice_model: impl Into<String>,
        llm_manager: Arc<LlmManager>,
        http: reqwest::Client,
    ) -> Self {
        Self {
            voice_model: voice_model.into(),
            llm_manager,
            http,
        }
```
Enforce workspace path boundaries before reading files.
transcribe_audio currently reads any host path supplied by the model. This bypasses workspace-scoped file safety.
🔒 Proposed fix

```diff
 use std::sync::Arc;
+use std::path::PathBuf;

 pub struct TranscribeAudioTool {
     /// The configured voice model spec (full `routing.voice` value).
     voice_model: String,
+    workspace: PathBuf,
     llm_manager: Arc<LlmManager>,
     http: reqwest::Client,
 }

 impl TranscribeAudioTool {
     pub fn new(
         voice_model: impl Into<String>,
+        workspace: PathBuf,
         llm_manager: Arc<LlmManager>,
         http: reqwest::Client,
     ) -> Self {
         Self {
             voice_model: voice_model.into(),
+            workspace,
             llm_manager,
             http,
         }
     }
 }
@@
     async fn call(&self, args: Self::Args) -> Result<Self::Output, Self::Error> {
-        let audio = tokio::fs::read(&args.path)
+        let requested = PathBuf::from(&args.path);
+        let candidate = if requested.is_absolute() {
+            requested
+        } else {
+            self.workspace.join(requested)
+        };
+
+        let workspace = std::fs::canonicalize(&self.workspace)
+            .map_err(|e| TranscribeAudioError(format!("invalid workspace: {e}")))?;
+        let canonical = std::fs::canonicalize(&candidate)
+            .map_err(|e| TranscribeAudioError(format!("failed to resolve {}: {e}", args.path)))?;
+        if !canonical.starts_with(&workspace) {
+            return Err(TranscribeAudioError(format!(
+                "path escapes workspace: {}",
+                args.path
+            )));
+        }
+
+        let audio = tokio::fs::read(&canonical)
             .await
             .map_err(|e| TranscribeAudioError(format!("failed to read {}: {}", args.path, e)))?;
```

Also applies to: 83-87
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/tools/transcribe_audio.rs` around lines 24 - 35, TranscribeAudioTool
currently allows the model to supply arbitrary host paths; update the
file-reading logic in the transcribe_audio method (and the similar code at the
other site referenced around lines 83-87) to enforce workspace boundaries by
canonicalizing the requested path and the workspace root (resolve symlinks),
then ensure the canonicalized requested path is inside the canonicalized
workspace root (starts_with); if not, return an error/deny access and do not
read the file. Use the existing TranscribeAudioTool method and struct members
(or add a workspace_root field if needed) to perform this check before any
filesystem access.
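The containment check at the heart of this fix can be demonstrated standalone: canonicalize both the workspace root and the requested path (resolving symlinks), then require `starts_with`. The demo paths under the OS temp dir are purely illustrative; the real tool would use its configured workspace root.

```rust
use std::fs;
use std::path::Path;

// Returns Ok(true) only when `requested` resolves to a location inside the
// canonicalized workspace root; symlinks are resolved by canonicalize().
fn is_inside_workspace(workspace: &Path, requested: &Path) -> std::io::Result<bool> {
    let workspace = fs::canonicalize(workspace)?;
    let requested = fs::canonicalize(requested)?;
    Ok(requested.starts_with(&workspace))
}

fn main() -> std::io::Result<()> {
    let root = std::env::temp_dir().join("transcribe_demo_workspace");
    fs::create_dir_all(root.join("audio"))?;
    fs::write(root.join("audio/voice.ogg"), b"fake audio bytes")?;

    // A file inside the workspace is allowed.
    assert!(is_inside_workspace(&root, &root.join("audio/voice.ogg"))?);
    // The temp dir itself lies outside the workspace root and is rejected.
    assert!(!is_inside_workspace(&root, &std::env::temp_dir())?);
    Ok(())
}
```

Note that both sides must be canonicalized before comparing: a raw `starts_with` on un-resolved paths can be defeated by `..` segments or symlinks.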
```rust
let audio = tokio::fs::read(&args.path)
    .await
    .map_err(|e| TranscribeAudioError(format!("failed to read {}: {}", args.path, e)))?;
```
Add a file-size cap before reading audio into memory.
Reading arbitrary-size files into memory can cause avoidable memory pressure.
🔧 Proposed fix

```diff
+const MAX_AUDIO_BYTES: u64 = 25 * 1024 * 1024; // 25MB
@@
-        let audio = tokio::fs::read(&canonical)
+        let metadata = tokio::fs::metadata(&canonical)
+            .await
+            .map_err(|e| TranscribeAudioError(format!("failed to stat {}: {}", args.path, e)))?;
+        if metadata.len() > MAX_AUDIO_BYTES {
+            return Err(TranscribeAudioError(format!(
+                "file too large: {} bytes (max {})",
+                metadata.len(),
+                MAX_AUDIO_BYTES
+            )));
+        }
+
+        let audio = tokio::fs::read(&canonical)
             .await
             .map_err(|e| TranscribeAudioError(format!("failed to read {}: {}", args.path, e)))?;

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/tools/transcribe_audio.rs` around lines 84 - 87, Before reading the audio
file into memory in transcribe_audio.rs, check the file size via
tokio::fs::metadata(&args.path) (or equivalent) and enforce a configurable size
cap (e.g., MAX_AUDIO_BYTES constant); if metadata.len() exceeds the cap return a
TranscribeAudioError indicating the file is too large instead of calling
tokio::fs::read. Update the code paths around args.path and the
TranscribeAudioError usage so you only read into memory after the size check and
ensure the error message includes the actual file size and the allowed limit.
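A synchronous sketch of the stat-before-read pattern, using `std::fs` so the demo is self-contained (the tool itself would use `tokio::fs`). `MAX_AUDIO_BYTES` and the helper name are assumptions.

```rust
use std::fs;
use std::path::Path;

const MAX_AUDIO_BYTES: u64 = 25 * 1024 * 1024; // assumed 25MB cap

// Stat the file first; only read it into memory if it fits under the cap.
fn read_audio_capped(path: &Path, max_bytes: u64) -> Result<Vec<u8>, String> {
    let metadata = fs::metadata(path).map_err(|e| format!("failed to stat: {e}"))?;
    if metadata.len() > max_bytes {
        return Err(format!(
            "file too large: {} bytes (max {})",
            metadata.len(),
            max_bytes
        ));
    }
    fs::read(path).map_err(|e| format!("failed to read: {e}"))
}

fn main() {
    let path = std::env::temp_dir().join("size_cap_demo.wav");
    fs::write(&path, vec![0u8; 16]).unwrap();
    assert_eq!(read_audio_capped(&path, MAX_AUDIO_BYTES).unwrap().len(), 16);
    // A cap smaller than the file triggers the error path with no read.
    assert!(read_audio_capped(&path, 8).is_err());
}
```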
Re-submission of #105, which was merged into the `voice` branch before that branch landed in `main` — the commits never reached `main`.

Local Whisper backend

Set `routing.voice = "whisper-local://<spec>"` in config. When the channel processes an audio attachment, it bypasses the HTTP provider path and runs inference locally via whisper-rs. `<spec>` is a known size name (`tiny`, `base`, `small`, `medium`, `large`, `large-v3`) — downloaded from `ggerganov/whisper.cpp` on HuggingFace — or an absolute path to a GGML model file. The `WhisperContext` is cached for the process lifetime; switching models requires a restart.

Audio decoding: Ogg/Opus (Telegram voice messages) is handled via the `ogg` + `opus` crates. All other formats fall through to symphonia. Both paths resample to 16 kHz mono f32 before Whisper.

GPU acceleration via the `vulkan` feature (CUDA excluded due to GCC 14+/nvcc incompatibility on modern distros).

Everything is behind the `stt-whisper` cargo feature. Builds without it are unaffected.

STT unification + transcribe_audio worker tool

The transcription logic is extracted from `channel.rs` into a shared `stt::transcribe_bytes()` function that handles both the local Whisper path and any OpenAI-compatible HTTP provider. Workers now get a `transcribe_audio` tool that reads a local audio file and transcribes it using whatever is configured in `routing.voice` — no need to shell out to the whisper CLI.