Summary
Currently, the tokenizers.js library's Encoding interface only exposes token IDs, tokens, and attention masks. It omits the character-level position information (offset_mapping) that is available in the HuggingFace tokenizers Python/Rust library.
Use Case
Character offset mapping is essential for:
- NLP applications that need to map tokens back to original text positions
- PII detection models (like NeuroBERT) where precise character locations are required
- Text highlighting/annotation based on tokenization results
- Model interpretability and token attribution
- Compatibility with HuggingFace tokenizers Python API
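The first three use cases all reduce to the same operation: slicing the original text with each token's offsets. A minimal sketch (the function name `tokensToSpans` is illustrative, not part of any library):

```typescript
// Recover the original substring for each token from its character offsets.
// Tokens with [0, 0] offsets (special tokens like [CLS]/[SEP]) are skipped.
function tokensToSpans(
  text: string,
  offsets: Array<[number, number]>
): string[] {
  return offsets
    .filter(([start, end]) => end > start) // drop [0, 0] special tokens
    .map(([start, end]) => text.slice(start, end));
}

const text = "Hello world!";
const offsets: Array<[number, number]> = [[0, 0], [0, 5], [6, 11], [11, 12], [0, 0]];
console.log(tokensToSpans(text, offsets)); // ["Hello", "world", "!"]
```

Note the spans preserve the original casing ("Hello") even though a lowercasing tokenizer emits "hello" as the token, which is exactly why offsets, not tokens, are needed for highlighting and PII redaction.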
Current Behavior
const result = tokenizer.encode("Hello world!");
console.log(result);
// Output: {
// ids: [101, 7592, 2088, 999, 102],
// tokens: ["[CLS]", "hello", "world", "!", "[SEP]"],
// attention_mask: [1, 1, 1, 1, 1]
// // Missing: offset_mapping
// }
Expected Behavior
const result = tokenizer.encode("Hello world!", { return_offsets_mapping: true });
console.log(result.offset_mapping);
// Output: [[0,0], [0,5], [6,11], [11,12], [0,0]]
// [CLS] Hello world ! [SEP]
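One possible shape for the extended types, matching the output above. This is a sketch of the proposal, not the library's actual declarations:

```typescript
// Proposed Encoding extension: existing fields are unchanged, so current
// callers keep working (backward compatibility).
interface Encoding {
  ids: number[];
  tokens: string[];
  attention_mask: number[];
  // Present only when requested. Each entry is a [start, end) character
  // range into the input string; special tokens such as [CLS] and [SEP]
  // get [0, 0], following the HuggingFace convention.
  offset_mapping?: Array<[number, number]>;
}

interface EncodeOptions {
  return_offsets_mapping?: boolean;
}
```

Making offset_mapping optional (rather than always computed) keeps the default encode path allocation-free for callers that do not need offsets.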
Requirements
- Add offset_mapping?: Array<[number, number]> to Encoding interface
- Add return_offsets_mapping?: boolean option to encode() method
- Implement character position tracking for WordPiece tokenizers
- Handle special tokens with [0,0] offsets (HF standard)
- Support subword tokens (## prefix) correctly
- Provide clear error messages for unsupported model types (BPE, Unigram)
- Maintain backward compatibility
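As a rough sketch of the WordPiece tracking requirement, offsets can be recovered by scanning the input left to right, assuming lowercasing is the only normalization applied and each token piece appears verbatim in the text. A real implementation would instead thread offsets through normalization and pre-tokenization; the function and fallback behavior here are illustrative only:

```typescript
// Sketch: align WordPiece tokens to character offsets by scanning the
// lowercased input. Special tokens get [0, 0] per the HF convention, and
// subword tokens have their "##" continuation prefix stripped before matching.
function computeOffsets(
  text: string,
  tokens: string[],
  specialTokens: Set<string> = new Set(["[CLS]", "[SEP]", "[UNK]", "[PAD]"])
): Array<[number, number]> {
  const haystack = text.toLowerCase(); // assumes a lowercasing normalizer
  let cursor = 0; // never match earlier than the previous token's end
  return tokens.map((token) => {
    if (specialTokens.has(token)) return [0, 0] as [number, number];
    const piece = token.startsWith("##") ? token.slice(2) : token;
    const start = haystack.indexOf(piece, cursor);
    if (start === -1) return [0, 0] as [number, number]; // fallback: unalignable
    cursor = start + piece.length;
    return [start, cursor] as [number, number];
  });
}

const demo = computeOffsets("Hello world!", ["[CLS]", "hello", "world", "!", "[SEP]"]);
console.log(demo); // [[0,0],[0,5],[6,11],[11,12],[0,0]]
```

Keeping a forward-only cursor is what makes repeated pieces (e.g. "the ... the") resolve to distinct positions instead of all matching the first occurrence.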
Models to Support
Priority 1 (WordPiece):
- ✅ BERT, DistilBERT, ALBERT
- ✅ Medical/domain BERT variants (e.g., NeuroBERT, BioBERT)
Future (out of scope for initial implementation):
- BPE models (GPT, RoBERTa) - complex merge operations
- Unigram models (T5, XLM-R) - probabilistic segmentation
Additional Context
This feature would bring tokenizers.js closer to feature parity with the official HuggingFace tokenizers library and enable advanced NLP use cases that require precise character-level alignment.