
Conversation

@ChrisQlasty
Contributor

Problem:
The input transcript text from Live Gemini models (tested on gemini-live-2.5-flash-preview-native-audio-09-2025) is not properly stitched. Words are often glued together because new chunks are not guaranteed to include whitespace at their start or end.

Solution:
Added logic that properly handles word stitching for the final transcript, which arrives with the finished=True flag.
It is not the most elegant approach, but it solves the most common issues with the final input transcript.
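
In essence, the stitching rule works like this (a condensed sketch; PUNCTUATION_CHARS here is an illustrative stand-in for the set defined in the actual change):

PUNCTUATION_CHARS = {'.', ',', '!', '?', ';', ':'}

def stitch(existing: str, chunk: str) -> str:
    if not chunk:
        return existing
    # Join with a space, unless the new chunk opens with punctuation that
    # should attach to the previous token, or the existing text ends with
    # an apostrophe (e.g. "don'" + "t" -> "don't").
    needs_space = existing and not (
        chunk[0] in PUNCTUATION_CHARS or existing.endswith("'")
    )
    return f"{existing}{' ' if needs_space else ''}{chunk.strip()}".strip()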

Testing Plan

Added a unit test with several cases that were observed to fail with ADK 1.20.

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.

Manual End-to-End (E2E) Tests:
Configure a live Gemini model to return input_transcription. Talk with the model and print event.input_transcription.text when event.input_transcription.finished=True. Talk at length and, without this fix, observe badly stitched fragments.
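
A rough sketch of that manual check (the runner setup and run_live arguments here are assumed placeholders, not the exact ADK API):

# Assumes an ADK Runner, session, and LiveRequestQueue created elsewhere,
# with a run config that enables input audio transcription; treat the names
# and arguments below as placeholders.
async for event in runner.run_live(
    user_id='user', session_id=session.id, live_request_queue=queue
):
    t = event.input_transcription
    if t and t.finished:
        print(t.text)  # without the fix, fragments come out badly stitched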

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules.

@gemini-code-assist
Contributor

Summary of Changes

Hello @ChrisQlasty, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an issue where transcription text from Live Gemini models was not being correctly assembled, leading to improperly stitched words. The changes introduce a refined algorithm for concatenating transcription fragments, ensuring that spaces are inserted appropriately while respecting punctuation and apostrophes. This significantly improves the readability and accuracy of the final transcribed text.

Highlights

  • Improved Transcription Stitching: Implemented logic to correctly join transcription fragments, preventing words from being glued together due to missing or incorrect spacing, especially around punctuation and apostrophes, for both input and output transcripts.
  • Punctuation Handling: Introduced a PUNCTUATION_CHARS set to intelligently decide whether to insert a space when concatenating transcription chunks based on leading punctuation or trailing apostrophes.
  • Comprehensive Unit Tests: Added a new parameterized unit test to validate the transcription stitching logic for both input and output, covering various scenarios with punctuation and apostrophes to ensure robustness (see the illustrative cases below).
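
For instance, cases along these lines follow from the stitching rule; these are illustrative, not necessarily the exact parametrized cases in the PR's test:

# Fragments on the left, expected stitched transcript on the right.
cases = [
    (['Hello', ', world', '!'], 'Hello, world!'),  # punctuation attaches left
    (["don'", 't', 'stop'], "don't stop"),         # trailing apostrophe joins
    (['one', 'two'], 'one two'),                   # plain words get a space
]
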
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in your IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

adk-bot added the live [Component: live, voice and video chat] label on Dec 8, 2025
@adk-bot
Collaborator

adk-bot commented Dec 8, 2025

Response from ADK Triaging Agent

Hello @ChrisQlasty, thank you for creating this PR!

This PR is a bug fix. Could you please associate a GitHub issue with this PR? If there is no existing issue, could you please create one?

This information will help reviewers to review your PR more efficiently. Thanks!

Contributor

gemini-code-assist bot left a comment


Code Review

This pull request addresses an issue with stitching transcription fragments from Live Gemini models by introducing logic to correctly handle whitespace and punctuation. The solution is effective and is supported by a comprehensive set of new unit tests that cover various edge cases. My review focuses on improving code maintainability by reducing duplication. I've suggested refactoring the duplicated transcription stitching logic into a helper method in gemini_llm_connection.py and simplifying the test setup in test_gemini_llm_connection.py with a helper function. These changes will make the code cleaner and easier to manage in the future.

Comment on lines +189 to +203
existing = self._input_transcription_text
# Insert a space when joining fragments except when the new
# chunk starts with a punctuation character that should attach
# to the previous token, or the existing text ends with an
# apostrophe.
conditional_space = (
    ' '
    if existing
    and not (
        new_input_transcription_chunk[0] in PUNCTUATION_CHARS
        or existing.endswith("'")
    )
    else ''
)
self._input_transcription_text = f'{existing}{conditional_space}{new_input_transcription_chunk.strip()}'.strip()
Contributor


Severity: medium

The logic for stitching transcription chunks is duplicated for both input_transcription (here) and output_transcription (lines 226-240). To improve maintainability and adhere to the Don't Repeat Yourself (DRY) principle, this logic should be extracted into a private helper method.

For example, you could create a method like this:

def _stitch_transcription_chunk(self, existing_text: str, new_chunk: str) -> str:
    if not new_chunk:
        return existing_text

    # Insert a space when joining fragments except when the new
    # chunk starts with a punctuation character that should attach
    # to the previous token, or the existing text ends with an
    # apostrophe.
    conditional_space = (
        ' '
        if existing_text
        and not (
            new_chunk[0] in PUNCTUATION_CHARS
            or existing_text.endswith("'")
        )
        else ''
    )
    return f'{existing_text}{conditional_space}{new_chunk.strip()}'.strip()

Then you could call it like so:
self._input_transcription_text = self._stitch_transcription_chunk(
    self._input_transcription_text, new_input_transcription_chunk
)
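
Presumably the output path (lines 226-240) then collapses to the symmetric call; the attribute names below are inferred by analogy and may differ:

# Hypothetical mirror of the input-path call; names inferred by analogy.
self._output_transcription_text = self._stitch_transcription_chunk(
    self._output_transcription_text, new_output_transcription_chunk
)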

Comment on lines +622 to +680
message1 = mock.Mock()
message1.usage_metadata = None
message1.server_content = mock.Mock()
message1.server_content.model_turn = None
message1.server_content.interrupted = False
message1.server_content.turn_complete = False
message1.server_content.generation_complete = False
message1.tool_call = None
message1.session_resumption_update = None
message1.server_content.input_transcription = (
    types.Transcription(text=fragment1, finished=False)
    if tx_direction == 'input'
    else None
)
message1.server_content.output_transcription = (
    types.Transcription(text=fragment1, finished=False)
    if tx_direction == 'output'
    else None
)

message2 = mock.Mock()
message2.usage_metadata = None
message2.server_content = mock.Mock()
message2.server_content.model_turn = None
message2.server_content.interrupted = False
message2.server_content.turn_complete = False
message2.server_content.generation_complete = False
message2.tool_call = None
message2.session_resumption_update = None
message2.server_content.input_transcription = (
    types.Transcription(text=fragment2, finished=False)
    if tx_direction == 'input'
    else None
)
message2.server_content.output_transcription = (
    types.Transcription(text=fragment2, finished=False)
    if tx_direction == 'output'
    else None
)

message3 = mock.Mock()
message3.usage_metadata = None
message3.server_content = mock.Mock()
message3.server_content.model_turn = None
message3.server_content.interrupted = False
message3.server_content.turn_complete = False
message3.server_content.generation_complete = False
message3.tool_call = None
message3.session_resumption_update = None
message3.server_content.input_transcription = (
    types.Transcription(text=None, finished=True)
    if tx_direction == 'input'
    else None
)
message3.server_content.output_transcription = (
    types.Transcription(text=None, finished=True)
    if tx_direction == 'output'
    else None
)
Contributor


Severity: medium

The setup for message1, message2, and message3 is quite verbose and contains a lot of repeated code. To improve readability and maintainability, consider extracting the message creation into a helper function.

For example, you could define a helper within the test file:

def _create_mock_transcription_message(text: str | None, finished: bool, direction: str) -> mock.Mock:
    msg = mock.Mock()
    msg.usage_metadata = None
    msg.server_content = mock.Mock()
    msg.server_content.model_turn = None
    msg.server_content.interrupted = False
    msg.server_content.turn_complete = False
    msg.server_content.generation_complete = False
    msg.tool_call = None
    msg.session_resumption_update = None

    transcription = types.Transcription(text=text, finished=finished)
    if direction == 'input':
        msg.server_content.input_transcription = transcription
        msg.server_content.output_transcription = None
    else:
        msg.server_content.input_transcription = None
        msg.server_content.output_transcription = transcription
    return msg

Then, you could simplify the test setup significantly:

message1 = _create_mock_transcription_message(fragment1, False, tx_direction)
message2 = _create_mock_transcription_message(fragment2, False, tx_direction)
message3 = _create_mock_transcription_message(None, True, tx_direction)

ryanaiagent self-assigned this on Dec 9, 2025