fix: prevent IndexError in Whisper timestamp decode on trailing replacement char #45006

Open
Krishnachaitanyakc wants to merge 2 commits into huggingface:main from Krishnachaitanyakc:fix/whisper-timestamp-replacement-char

@Krishnachaitanyakc
Contributor

Summary

Fixes #44869

Adds a bounds check in _split_tokens_on_unicode() in tokenization_whisper.py so that a trailing Unicode replacement character (U+FFFD) at the end of the decoded token stream no longer raises an IndexError.

Problem

When the decoded token stream ends with a dangling replacement character, the computed index unicode_offset + decoded.index(replacement_char) can equal len(decoded_full), causing an out-of-bounds string access.
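
The index arithmetic can be reproduced in isolation with plain strings. The snippet below is a minimal sketch, not the tokenizer code: the contents of `decoded_full` and `decoded` are invented stand-ins, and only the index expression and the 298/298 figures come from this description.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD

# Hypothetical stand-ins for the tokenizer's state (contents are invented).
decoded_full = "x" * 298      # full decode, 298 characters long
decoded = REPLACEMENT_CHAR    # decode of the dangling trailing fragment
unicode_offset = 298          # characters already consumed by earlier words

# The index expression described above.
target_index = unicode_offset + decoded.index(REPLACEMENT_CHAR)

try:
    decoded_full[target_index]  # unguarded access, one past the last valid index
    raised = False
except IndexError:
    raised = True
```

Because `target_index` equals `len(decoded_full)`, the lookup is one past the last valid position and Python raises `IndexError`.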

Fix

Pre-compute target_index and add a target_index >= len(decoded_full) guard that short-circuits before the out-of-bounds access. When triggered, the trailing fragment is treated as a word boundary.
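
A rough sketch of the guard described here, assuming the boundary decision is isolated in a standalone function. `is_word_boundary` is a hypothetical helper written for illustration; it is not the code in this PR and not a function in tokenization_whisper.py.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD


def is_word_boundary(decoded: str, decoded_full: str, unicode_offset: int) -> bool:
    """Hypothetical helper: decide whether `decoded` closes a word."""
    if REPLACEMENT_CHAR not in decoded:
        # The fragment decoded to valid text, so it can end a word.
        return True
    # Pre-compute the position to inspect in the full decode.
    target_index = unicode_offset + decoded.index(REPLACEMENT_CHAR)
    if target_index >= len(decoded_full):
        # Guard: out-of-range index, treat the trailing fragment as a boundary.
        return True
    # In range: a boundary only if the full decode also has U+FFFD here.
    return decoded_full[target_index] == REPLACEMENT_CHAR


# Edge case from the test plan: offset and length are both 298.
edge_case = is_word_boundary(REPLACEMENT_CHAR, "x" * 298, 298)
# Incomplete multi-byte sequence: the full decode has a real character here.
partial = is_word_boundary(REPLACEMENT_CHAR, "é", 0)
# Genuine U+FFFD in the full decode at the same position.
genuine = is_word_boundary(REPLACEMENT_CHAR, REPLACEMENT_CHAR, 0)
```

Placing the length check before the lookup means the comparison against `decoded_full` only runs when the index is valid, so the in-range behavior is unchanged.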

AI Assistance Disclosure

This PR was developed with AI assistance. The fix has been manually reviewed, verified for correctness, and tested against the reported edge case.

Test Plan

  • Verified that the bounds check correctly handles the unicode_offset=298, len(decoded_full)=298 edge case
  • Confirmed Python ternary precedence is correct for the target_index computation
  • Ran ruff check with no issues

…cement char

When the decoded token stream ends with a dangling Unicode replacement
character (U+FFFD), the computed index in _split_tokens_on_unicode()
could equal len(decoded_full), causing an IndexError. Add a bounds check
so that an out-of-range index is treated as a word boundary instead of
crashing.

Fixes huggingface#44869
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: whisper

@Rocketknight1
Member

cc @ebezzam @eustlb for audio maybe?

Krishnachaitanyakc marked this pull request as ready for review March 26, 2026 14:03
github-actions bot requested a review from ArthurZucker March 26, 2026 14:03

Development

Successfully merging this pull request may close these issues.

Whisper word timestamp decode crashes on trailing replacement character at end of decoded token stream

2 participants