
feat(plugins-soniox): surface per-run language segments#1602

Open
rosetta-livekit-bot[bot] wants to merge 4 commits into main from manuring-hurling-eloped

rosetta-livekit-bot[bot] commented May 25, 2026

Summary

Fixes #5685 (and the follow-up source-side symptom raised in the comment thread, which @chenghao-mou approved bundling into the same PR).

Both halves are the same plugin bug: _TokenAccumulator._lang_segments is built per-run by the existing coalescing logic but then dropped in send_endpoint_transcript (and the interim path). The fix surfaces it through new SpeechData fields on the target side, and stops dropping it on the source side in non-translation mode.

Changes

  • stt.SpeechData: add target_languages / target_texts (symmetric to existing source_languages / source_texts). Same LanguageCode coercion in __post_init__. Default None, so the addition is strictly additive for every other plugin.
  • Soniox plugin, translation mode: populate target_* from final._lang_segments on FINAL_TRANSCRIPT and INTERIM_TRANSCRIPT / PREFLIGHT_TRANSCRIPT. Consumers now see the per-run target breakdown for code-switched two-way translation, e.g. target_languages=["en", "es"] / target_texts=["Hello, how are you?", " Estoy bien, gracias."] for the translation of "Hello, ¿cómo estás? I'm doing fine, gracias.".
  • Soniox plugin, non-translation mode: populate source_* from the same accumulator (previously None). A code-switched ja + en utterance now surfaces source_languages=["ja", "en"] / source_texts=["こんにちは、私の名はサムです。", " My name is Sam."] -- matches what the SpeechData docstring already promised for "multi-language detection services".
  • Refactor: extract a _lang_segments_to_fields helper to DRY the conversion across both modes and both event paths; the four duplicated inline list comprehensions collapse to one named operation. The predicate that distinguishes source from target became data-presence-based (final_original._lang_segments) rather than config-based (is_translation_mode is not None), which is what unified both halves cleanly.

SpeechData.text and SpeechData.language are unchanged for back-compat (still the full concatenation and the first translated/detected language, respectively).
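
The per-run coalescing and the field conversion described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: the `Token` and `LangSegment` shapes and the function names `coalesceRuns` and `langSegmentsToFields` are hypothetical stand-ins, and `null` stands in for the "empty → (None, None)" case.

```typescript
// Hypothetical shapes for illustration; the plugin's real types may differ.
interface Token {
  text: string;
  language: string;
}

interface LangSegment {
  language: string;
  text: string;
}

// Coalesce consecutive tokens that share a language into one run.
function coalesceRuns(tokens: Token[]): LangSegment[] {
  const runs: LangSegment[] = [];
  for (const tok of tokens) {
    const last = runs[runs.length - 1];
    if (last !== undefined && last.language === tok.language) {
      last.text += tok.text; // same language: extend the current run
    } else {
      runs.push({ language: tok.language, text: tok.text }); // language switch: new run
    }
  }
  return runs;
}

// Convert runs into parallel lists; an empty input yields null for both fields.
function langSegmentsToFields(runs: LangSegment[]): {
  languages: string[] | null;
  texts: string[] | null;
} {
  if (runs.length === 0) {
    return { languages: null, texts: null };
  }
  return {
    languages: runs.map((r) => r.language),
    texts: runs.map((r) => r.text),
  };
}

const exampleTokens: Token[] = [
  { text: 'Hello,', language: 'en' },
  { text: ' how are you?', language: 'en' },
  { text: ' Estoy bien,', language: 'es' },
  { text: ' gracias.', language: 'es' },
];
const exampleFields = langSegmentsToFields(coalesceRuns(exampleTokens));
```

Under these assumptions, joining `exampleFields.texts` reproduces the full concatenation that `text` keeps carrying, and the first entry of `exampleFields.languages` matches what `language` keeps reporting.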

Test plan

  • 14 new unit tests in tests/test_plugin_soniox_stt.py covering:
    • SpeechData.__post_init__ target_languages coercion (strings → LanguageCode, None stays None, existing LanguageCode passthrough)
    • _TokenAccumulator._lang_segments per-run coalescing
    • _lang_segments_to_fields helper edge cases (empty → (None, None), non-empty → parallel lists with LanguageCode coercion)
    • Two-way translation, code-switched (the issue's canonical example)
    • One-way translation (single target run)
    • "none" untranslated chunk inside a translated utterance (asymmetric per-run list lengths)
    • Interim path: translation mode merging final + non-final per run on both sides
    • Interim path: non-translation mode populates source_* from final + non-final merged
    • Non-translation single-language: source_* populated, target_* None
    • Non-translation code-switched JA+EN: source_* carries the per-run breakdown
  • Live-verified end-to-end against the real Soniox WebSocket API in console mode:
    • Translation mode, code-switched "Hello, ¿cómo estás? I'm doing fine, gracias." → text="Hello, how are you? Estoy bien, gracias.", target_languages=["en", "es"], target_texts=["Hello, how are you?", " Estoy bien, gracias."], "".join(target_texts) == text. Source side unchanged.
    • Non-translation mode, code-switched " こんにちは、私の名はサムです。 My name is Sam." → text=" こんにちは、私の名はサムです。 My name is Sam.", source_languages=["ja", "en"], source_texts=[" こんにちは、私の名はサムです。", " My name is Sam."], target_* correctly None. Interim events also surface the multi-language source breakdown progressively as the user code-switches.
  • ruff format clean, ruff check clean, no new mypy --strict errors introduced in changed files.
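
The interim-path cases in the test plan merge final and non-final runs. One plausible way to do that merge is sketched below. The `LangSegment` shape and `mergeRuns` are hypothetical names, and the rule that a boundary run is extended when both sides share a language is an assumption, not something the PR text spells out.

```typescript
// Hypothetical shape for illustration only.
interface LangSegment {
  language: string;
  text: string;
}

// Append non-final runs to final runs, extending the last run when the
// language matches so the boundary does not produce a spurious split.
function mergeRuns(finalRuns: LangSegment[], nonFinalRuns: LangSegment[]): LangSegment[] {
  // Copy the inputs so neither accumulator is mutated by the merge.
  const merged: LangSegment[] = finalRuns.map((r) => ({ ...r }));
  for (const run of nonFinalRuns) {
    const last = merged[merged.length - 1];
    if (last !== undefined && last.language === run.language) {
      last.text += run.text;
    } else {
      merged.push({ ...run });
    }
  }
  return merged;
}

const finalRuns: LangSegment[] = [{ language: 'ja', text: 'こんにちは、' }];
const nonFinalRuns: LangSegment[] = [
  { language: 'ja', text: '私の名は' },
  { language: 'en', text: ' My name' },
];
const interimRuns = mergeRuns(finalRuns, nonFinalRuns);
```

Copying the runs before merging keeps the final accumulator untouched, so a later interim event can repeat the merge against fresh non-final tokens.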

Follow-ups (intentionally not in this PR)

  • The final / final_original accumulator names are honest about routing today but the new target_* fields make their two-mode roles more glaring (final is "primary user-facing accumulator", final_original is "source-side accumulator that's empty in non-translation mode"). Worth a separate behavior-preserving rename PR to final_primary / final_source.
  • The new target_* fields are wired in Soniox only; other translation-capable plugins (Gladia, Deepgram v2, AWS) can adopt them in follow-up PRs.

changeset-bot[bot] commented May 25, 2026

🦋 Changeset detected

Latest commit: 836f146

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 34 packages
Name Type
@livekit/agents Patch
@livekit/agents-plugin-soniox Patch
@livekit/agents-plugin-anam Patch
@livekit/agents-plugin-assemblyai Patch
@livekit/agents-plugin-baseten Patch
@livekit/agents-plugin-bey Patch
@livekit/agents-plugin-cartesia Patch
@livekit/agents-plugin-cerebras Patch
@livekit/agents-plugin-deepgram Patch
@livekit/agents-plugin-elevenlabs Patch
@livekit/agents-plugin-fishaudio Patch
@livekit/agents-plugin-google Patch
@livekit/agents-plugin-hedra Patch
@livekit/agents-plugin-hume Patch
@livekit/agents-plugin-inworld Patch
@livekit/agents-plugin-lemonslice Patch
@livekit/agents-plugin-liveavatar Patch
@livekit/agents-plugin-livekit Patch
@livekit/agents-plugin-minimax Patch
@livekit/agents-plugin-mistral Patch
@livekit/agents-plugin-mistralai Patch
@livekit/agents-plugin-neuphonic Patch
@livekit/agents-plugin-openai Patch
@livekit/agents-plugin-perplexity Patch
@livekit/agents-plugin-phonic Patch
@livekit/agents-plugin-resemble Patch
@livekit/agents-plugin-rime Patch
@livekit/agents-plugin-runway Patch
@livekit/agents-plugin-sarvam Patch
@livekit/agents-plugin-silero Patch
@livekit/agents-plugin-tavus Patch
@livekit/agents-plugins-test Patch
@livekit/agents-plugin-trugen Patch
@livekit/agents-plugin-xai Patch

devin-ai-integration[bot]

This comment was marked as resolved.

chenghao-mou and others added 3 commits May 25, 2026 20:12
Co-authored-by: rosetta-livekit-bot[bot] <282703043+rosetta-livekit-bot[bot]@users.noreply.github.com>

devin-ai-integration[bot] left a comment

Devin Review found 2 new potential issues.

View 9 additional findings in Devin Review.

Comment thread plugins/soniox/src/stt.ts
if (data === SpeechStream.FLUSH_SENTINEL) {
  continue;
}
ws.send(data.data.buffer);

🔴 Sending data.data.buffer may transmit incorrect bytes when AudioFrame's typed array is a view into a larger ArrayBuffer

In #sendAudio, ws.send(data.data.buffer) sends the entire underlying ArrayBuffer of the Int16Array. If the AudioFrame.data typed array is a view with a non-zero byteOffset or doesn't span the full buffer (e.g., after resampling or slicing), this sends more/wrong bytes than intended. Other plugins (e.g., Deepgram, ElevenLabs) typically send the typed array directly or use Buffer.from(data.buffer, data.byteOffset, data.byteLength) to handle this correctly.

Affected code in stt.ts

ws.send(data.data.buffer) should be ws.send(Buffer.from(data.data.buffer, data.data.byteOffset, data.data.byteLength)) or simply ws.send(data.data) which the ws library handles correctly for typed arrays.

Suggested change
- ws.send(data.data.buffer);
+ ws.send(Buffer.from(data.data.buffer, data.data.byteOffset, data.data.byteLength));
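
The byte-offset problem is easy to reproduce in isolation. The sketch below is not taken from the plugin: `frameBytes` is a hypothetical helper, and it uses a plain `Uint8Array` view so it runs anywhere. In Node, `Buffer.from(arrayBuffer, byteOffset, length)` expresses the same windowing.

```typescript
// A backing buffer of four samples; the "frame" views only the middle two.
const backing = new Int16Array([10, 20, 30, 40]);
const frame = backing.subarray(1, 3); // byteOffset 2, byteLength 4

// Problematic: the underlying ArrayBuffer ignores the view's offset and length,
// so all 8 bytes (four samples) would be sent.
const wholeBuffer = new Uint8Array(frame.buffer);

// Hypothetical helper: expose exactly the bytes the typed array covers.
function frameBytes(samples: Int16Array): Uint8Array {
  return new Uint8Array(samples.buffer, samples.byteOffset, samples.byteLength);
}

const exactBytes = frameBytes(frame);

// Reinterpret the exact bytes as samples again to confirm the window is right.
const roundTrip = new Int16Array(
  exactBytes.buffer,
  exactBytes.byteOffset,
  exactBytes.byteLength / Int16Array.BYTES_PER_ELEMENT,
);
```

Here `wholeBuffer` covers 8 bytes while `exactBytes` covers only the 4 bytes of the two samples in the view, which is the difference the review comment is pointing at.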

Comment thread plugins/soniox/src/stt.ts
});

try {
  await Promise.race([sendTask, listenTask, waitForAbort(this.abortSignal)]);

🔴 Using Promise.race instead of waiting for both send and listen tasks causes premature WebSocket closure and lost final transcripts

In #runWS, await Promise.race([sendTask, listenTask, waitForAbort(this.abortSignal)]) means that when the audio input ends and sendTask resolves, the code immediately enters the finally block and closes the WebSocket — without waiting for the server to send its final transcription and finished message. Unlike Deepgram which uses Promise.all([sendTask(), listenTask.result, ...]) to wait for both sides to complete, this plugin closes the connection before receiving the server's final response. While message handlers are technically still attached during the WebSocket closing handshake, this relies on fragile timing behavior and the server being fast enough to flush before the close completes.

Prompt for agents
In plugins/soniox/src/stt.ts in the #runWS method, the Promise.race on line 262 causes the WebSocket to close as soon as sendTask resolves (audio input ends), without waiting for the server to send back its final transcription and 'finished' message. The fix should restructure this so that after sendTask completes, the code waits for listenTask to resolve (i.e., the server sends finished or error), while still respecting the abort signal. A common pattern (used by the Deepgram plugin) is to use Promise.all for sendTask+listenTask and then race that against the abort signal. Something like: await Promise.race([Promise.all([sendTask, listenTask]), waitForAbort(this.abortSignal)]). This ensures the server has time to flush remaining transcriptions after audio input ends.
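
The difference between the two awaiting strategies can be shown with a small simulation. Everything here is a hypothetical stand-in: `runSession`, `sleep`, the timings, and the `closed` flag that models closing the WebSocket. It also simplifies by treating any message arriving after close as lost, whereas the review notes that real behavior depends on timing during the closing handshake.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// A promise that never settles, standing in for an abort signal that never fires.
const neverAborted = new Promise<void>(() => {});

async function runSession(
  strategy: 'race' | 'all',
  aborted: Promise<void>,
): Promise<string[]> {
  const received: string[] = [];
  let closed = false; // models the WebSocket having been closed

  // Audio input ends quickly.
  const sendTask = sleep(5);

  // The server needs longer to flush its final transcript.
  const listenTask = (async () => {
    await sleep(20);
    if (!closed) received.push('final transcript');
  })();

  try {
    if (strategy === 'race') {
      // Settles as soon as sendTask finishes, before the final transcript arrives.
      await Promise.race([sendTask, listenTask, aborted]);
    } else {
      // Waits for both tasks, while still letting an abort cut the wait short.
      await Promise.race([Promise.all([sendTask, listenTask]), aborted]);
    }
  } finally {
    closed = true; // stands in for closing the socket
  }
  return received;
}
```

With these timings, the `'race'` strategy returns an empty list because the socket is marked closed at about 5 ms, while the `'all'` strategy returns the final transcript because it waits for the listener to finish.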