feat: add offset mapping support for WordPiece tokenizers #17

Open
dkhokhlov wants to merge 1 commit into huggingface:main from dkhokhlov:add_offset_mapping_support_for_wordpiece_tokenizers

@dkhokhlov

  • Add offset_mapping field to Encoding interface
  • Implement calculateWordPieceOffsets utility for character position tracking
  • Add return_offsets_mapping option to Tokenizer.encode() method
  • Handle WordPiece subword tokens (## prefix) correctly
  • Map special tokens to [0, 0] offsets, following the Hugging Face convention
  • Add error handling for unsupported model types (BPE, Unigram)
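
For reviewers who want a concrete picture of the offset logic described in the list above, here is a minimal sketch of how WordPiece offsets might be computed. It is not the code in this PR: the function name `sketchWordPieceOffsets`, its parameters, and the default special-token set are all hypothetical, and it assumes `text` has already been normalized so that each piece appears in it verbatim. Tokens that cannot be located (for example `[UNK]`) fall back to [0, 0], which is a simplification of this sketch.

```typescript
type Offset = [number, number];

// Hypothetical helper for illustration only; not the PR's calculateWordPieceOffsets.
// Assumes `text` is already normalized (e.g. lowercased) to match the tokens.
function sketchWordPieceOffsets(
  text: string,
  tokens: string[],
  specialTokens: Set<string> = new Set(["[CLS]", "[SEP]", "[PAD]"]),
  prefix: string = "##",
): Offset[] {
  const offsets: Offset[] = [];
  let cursor = 0; // end position of the last token that was located

  for (const token of tokens) {
    // Special tokens do not correspond to any span of the input.
    if (specialTokens.has(token)) {
      offsets.push([0, 0]);
      continue;
    }

    const isContinuation = token.startsWith(prefix);
    const piece = isContinuation ? token.slice(prefix.length) : token;

    // A continuation piece must start exactly where the previous piece ended;
    // a word-initial piece may be preceded by whitespace, so search forward.
    const start = isContinuation ? cursor : text.indexOf(piece, cursor);

    if (start < 0 || !text.startsWith(piece, start)) {
      // Simplification: anything that cannot be located gets [0, 0].
      offsets.push([0, 0]);
      continue;
    }

    const end = start + piece.length;
    offsets.push([start, end]);
    cursor = end;
  }
  return offsets;
}

const example = sketchWordPieceOffsets(
  "unaffable world",
  ["[CLS]", "un", "##aff", "##able", "world", "[SEP]"],
);
console.log(JSON.stringify(example));
// [[0,0],[0,2],[2,5],[5,9],[10,15],[0,0]]
```

Offsets here are end-exclusive UTF-16 code unit indices, so `text.slice(start, end)` recovers each piece. Whether the PR counts code units or code points is worth checking against the Rust library, since the two differ for characters outside the Basic Multilingual Plane.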

@dkhokhlov dkhokhlov marked this pull request as ready for review December 11, 2025 02:27
@dkhokhlov dkhokhlov force-pushed the add_offset_mapping_support_for_wordpiece_tokenizers branch from 7da8fc1 to 0a53799 Compare December 11, 2025 03:02
@dkhokhlov dkhokhlov force-pushed the add_offset_mapping_support_for_wordpiece_tokenizers branch from 0a53799 to ad96967 Compare December 11, 2025 03:16
@xenova xenova linked an issue Jan 20, 2026 that may be closed by this pull request
@xenova
Collaborator

xenova commented Jan 20, 2026

Thanks for the PR! Could you add a comprehensive set of unit tests to make sure the output matches the Rust library? Also, did you reference the Rust library when implementing this? It's usually easier to follow their implementation closely so the results line up.

Development

Successfully merging this pull request may close these issues.

Add offset mapping support for character-level token alignment

2 participants