fix: prevent IndexError in Whisper timestamp decode on trailing replacement char by Krishnachaitanyakc · Pull Request #45006 · huggingface/transformers

Krishnachaitanyakc · 2026-03-25T23:03:00Z

Summary

Adds a bounds check in _split_tokens_on_unicode() in tokenization_whisper.py so that a Unicode replacement character (U+FFFD) at the end of a decoded token stream no longer raises an IndexError.

Problem

When the decoded token stream ends with a dangling replacement character, the computed index unicode_offset + decoded.index(replacement_char) can equal len(decoded_full), causing an out-of-bounds string access.
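A minimal reproduction of the failure mode, assuming only the expression quoted above. The helper name `unguarded_lookup` and the sample strings are hypothetical; the values 298/298 are borrowed from the test plan, and nothing here calls the real tokenizer.

```python
REPLACEMENT_CHAR = "\ufffd"


def unguarded_lookup(decoded_full: str, unicode_offset: int, decoded: str) -> bool:
    """Hypothetical stand-in for the unguarded comparison described in the PR."""
    # No bounds check: the computed index may equal len(decoded_full).
    return decoded_full[unicode_offset + decoded.index(REPLACEMENT_CHAR)] == REPLACEMENT_CHAR


# Assumed inputs: the full decode is 298 characters long, everything before the
# current fragment has already been consumed (offset 298), and the fragment
# itself decodes to a lone replacement character.
decoded_full = "a" * 298
unicode_offset = 298
decoded = REPLACEMENT_CHAR

try:
    unguarded_lookup(decoded_full, unicode_offset, decoded)
    raised = False
except IndexError:
    # Index 298 is one past the last valid position (297).
    raised = True
```

With these inputs the computed index is 298 + 0 = 298, which is exactly len(decoded_full), so the subscript is out of range.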

Fix

Pre-compute target_index and add a target_index >= len(decoded_full) guard that short-circuits before the out-of-bounds access. When triggered, the trailing fragment is treated as a word boundary.
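The guard can be sketched roughly as follows. This is an illustration of the described logic, not the PR's actual diff: the function name `is_word_boundary` is hypothetical, and the real change lives inside the loop of _split_tokens_on_unicode().

```python
REPLACEMENT_CHAR = "\ufffd"


def is_word_boundary(decoded_full: str, unicode_offset: int, decoded: str) -> bool:
    """Hypothetical helper: decide whether the current fragment ends a word."""
    if REPLACEMENT_CHAR not in decoded:
        # The fragment decoded cleanly, so it is a complete unit.
        return True
    # Pre-compute the index so it can be range-checked before it is used.
    target_index = unicode_offset + decoded.index(REPLACEMENT_CHAR)
    if target_index >= len(decoded_full):
        # Out of range: treat the trailing fragment as a word boundary.
        return True
    # In range: a boundary only if the full decode also has U+FFFD there.
    return decoded_full[target_index] == REPLACEMENT_CHAR


# The edge case from the test plan no longer raises.
trailing_case = is_word_boundary("a" * 298, 298, REPLACEMENT_CHAR)
```

Returning True for the out-of-range case matches the behavior stated above: the dangling fragment is flushed as its own word instead of being held back for more tokens that will never arrive.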

AI Assistance Disclosure

This PR was developed with AI assistance. The fix has been manually reviewed, verified for correctness, and tested against the reported edge case.

Test Plan

Verified that the bounds check correctly handles the unicode_offset=298, len(decoded_full)=298 edge case
Confirmed Python ternary precedence is correct for the target_index computation
Ran ruff check with no issues
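For the precedence point, a small check under stated assumptions: the exact expression in the PR is not shown here, so `target_index_or_none` below is a hypothetical example of the same shape. Python's conditional expression binds more loosely than `+`, so `a + b if cond else c` parses as `(a + b) if cond else c`.

```python
REPLACEMENT_CHAR = "\ufffd"


def target_index_or_none(unicode_offset: int, decoded: str):
    """Hypothetical expression shaped like a ternary-based index computation."""
    # Parsed as: (unicode_offset + decoded.index(...)) if ... else None
    # so decoded.index() is never called when the character is absent.
    return unicode_offset + decoded.index(REPLACEMENT_CHAR) if REPLACEMENT_CHAR in decoded else None


absent = target_index_or_none(298, "ab")
present = target_index_or_none(298, "x" + REPLACEMENT_CHAR)
```

If the expression instead parsed as `unicode_offset + (... if ... else None)`, the absent case would raise a TypeError rather than yield None.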

…cement char

When the decoded token stream ends with a dangling Unicode replacement character (U+FFFD), the computed index in _split_tokens_on_unicode() could equal len(decoded_full), causing an IndexError. Add a bounds check so that an out-of-range index is treated as a word boundary instead of crashing.

Fixes huggingface#44869

github-actions · 2026-03-25T23:04:10Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: whisper

Rocketknight1 · 2026-03-26T13:35:47Z

cc @ebezzam @eustlb for audio maybe?

Merge branch 'main' into fix/whisper-timestamp-replacement-char

e371364

Krishnachaitanyakc marked this pull request as ready for review March 26, 2026 14:03

github-actions bot requested a review from ArthurZucker March 26, 2026 14:03

Krishnachaitanyakc wants to merge 2 commits into huggingface:main from Krishnachaitanyakc:fix/whisper-timestamp-replacement-char

2 participants
