feat: add offset mapping support for WordPiece tokenizers by dkhokhlov · Pull Request #17 · huggingface/tokenizers.js

dkhokhlov · 2025-12-11T02:26:26Z

- Add `offset_mapping` field to the `Encoding` interface
- Implement `calculateWordPieceOffsets` utility for character position tracking
- Add `return_offsets_mapping` option to the `Tokenizer.encode()` method
- Handle WordPiece subword tokens (`##` prefix) correctly
- Map special tokens to `[0, 0]` offsets, as per the HF standard
- Add error handling for unsupported model types (BPE, Unigram)
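
To illustrate the idea behind the offset calculation described above, here is a minimal sketch. It is not the code from this PR: the function name `sketchWordPieceOffsets`, its signature, and the default special-token set are all hypothetical. It assumes the normalizer only lowercases the text (no accent stripping or other length-changing steps), and it simplifies unknown tokens by mapping them to `[0, 0]`, which may differ from the Rust library's behavior.

```typescript
// Hypothetical sketch, not the PR's calculateWordPieceOffsets implementation.
type Offset = [number, number];

function sketchWordPieceOffsets(
  text: string,
  tokens: string[],
  // Assumed special tokens; a real tokenizer would read these from its config.
  specialTokens: Set<string> = new Set(["[CLS]", "[SEP]", "[PAD]", "[MASK]"]),
  continuingPrefix: string = "##",
): Offset[] {
  // Assumption: lowercasing preserves string length, so indices into the
  // lowercased text are also valid indices into the original text.
  const haystack = text.toLowerCase();
  const offsets: Offset[] = [];
  let cursor = 0; // search position; only moves forward

  for (const token of tokens) {
    // Special tokens do not correspond to any span of the input.
    if (specialTokens.has(token)) {
      offsets.push([0, 0]);
      continue;
    }

    // Strip the "##" marker from continuation pieces before searching.
    const piece = token.startsWith(continuingPrefix)
      ? token.slice(continuingPrefix.length)
      : token;

    const start = piece.length > 0 ? haystack.indexOf(piece.toLowerCase(), cursor) : -1;
    if (start === -1) {
      // Simplification: tokens that cannot be located (e.g. [UNK]) get [0, 0].
      offsets.push([0, 0]);
      continue;
    }

    const end = start + piece.length;
    offsets.push([start, end]); // half-open range [start, end)
    cursor = end;
  }
  return offsets;
}

const exampleOffsets = sketchWordPieceOffsets(
  "Hello tokenization",
  ["[CLS]", "hello", "token", "##ization", "[SEP]"],
);
console.log(exampleOffsets);
// [[0, 0], [0, 5], [6, 11], [11, 18], [0, 0]]
```

Searching forward from a moving cursor keeps repeated substrings from matching an earlier occurrence. Aligning with the Rust library, as the review below suggests, would instead carry offsets through normalization and pre-tokenization rather than recovering them by string search.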

xenova · 2026-01-20T18:42:45Z

Thanks for the PR! Could you add a lengthy set of unit tests to make sure this matches the Rust library? Also, did you reference the Rust library when implementing the PR? It's usually easier to align with their implementation to make sure it matches.

dkhokhlov marked this pull request as ready for review December 11, 2025 02:27

dkhokhlov force-pushed the add_offset_mapping_support_for_wordpiece_tokenizers branch from 7da8fc1 to 0a53799 Compare December 11, 2025 03:02

feat: add offset mapping support for WordPiece tokenization

ad96967

dkhokhlov force-pushed the add_offset_mapping_support_for_wordpiece_tokenizers branch from 0a53799 to ad96967 Compare December 11, 2025 03:16

xenova mentioned this pull request Jan 20, 2026

Add offset mapping support for character-level token alignment #16

Open

xenova linked an issue Jan 20, 2026 that may be closed by this pull request

Add offset mapping support for character-level token alignment #16

Open


feat: add offset mapping support for WordPiece tokenizers #17
dkhokhlov wants to merge 1 commit into huggingface:main from dkhokhlov:add_offset_mapping_support_for_wordpiece_tokenizers
