Add offset mapping support for character-level token alignment #16

@dkhokhlov

Description

Summary

Currently, the Encoding interface in tokenizers.js exposes only token IDs, tokens, and attention masks. It lacks the character-level position information (offset_mapping) that is available in the HuggingFace tokenizers Python/Rust library.

Use Case

Character offset mapping is essential for:

  • NLP applications that need to map tokens back to original text positions
  • PII detection models (like NeuroBERT) where precise character locations are required
  • Text highlighting/annotation based on tokenization results
  • Model interpretability and token attribution
  • Compatibility with HuggingFace tokenizers Python API

Current Behavior

const result = tokenizer.encode("Hello world!");
console.log(result);
// Output: {
//   ids: [101, 7592, 2088, 999, 102],
//   tokens: ["[CLS]", "hello", "world", "!", "[SEP]"],
//   attention_mask: [1, 1, 1, 1, 1]
//   // Missing: offset_mapping
// }

Expected Behavior

const result = tokenizer.encode("Hello world!", { return_offsets_mapping: true });
console.log(result.offset_mapping);
// Output: [[0,0], [0,5], [6,11], [11,12], [0,0]]
//         [CLS]  Hello  world    !      [SEP]

Requirements

- Add offset_mapping?: Array<[number, number]> to Encoding interface
- Add return_offsets_mapping?: boolean option to encode() method
- Implement character position tracking for WordPiece tokenizers
- Handle special tokens with [0,0] offsets (HF standard)
- Support subword tokens (## prefix) correctly
- Provide clear error messages for unsupported model types (BPE, Unigram)
- Maintain backward compatibility
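
The first two requirements can be sketched as TypeScript type additions. This is only an illustration of the proposed shape; the surrounding `Encoding` and `EncodeOptions` definitions are assumptions about the library's internals, not its actual source:

```typescript
// Sketch of the proposed additions. The existing fields mirror the
// "Current Behavior" example above; offset_mapping is the new field.
interface Encoding {
  ids: number[];
  tokens: string[];
  attention_mask: number[];
  // New: one [start, end) character span per token, present only
  // when return_offsets_mapping was requested.
  offset_mapping?: Array<[number, number]>;
}

// Hypothetical options bag for encode(); the option name matches the
// HF Python API for compatibility.
interface EncodeOptions {
  return_offsets_mapping?: boolean;
}
```

Making `offset_mapping` optional (and off by default) keeps existing callers working unchanged, satisfying the backward-compatibility requirement.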

Models to Support

Priority 1 (WordPiece):
- BERT, DistilBERT, ALBERT
- Medical/domain BERT variants (e.g., NeuroBERT, BioBERT)

Future (out of scope for initial implementation):
- BPE models (GPT, RoBERTa) - complex merge operations
- Unigram models (T5, XLM-R) - probabilistic segmentation
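
WordPiece is tractable first because it never reorders or merges characters: each non-special token (after stripping the `##` continuation prefix) appears verbatim in the normalized text, so offsets can be recovered by a forward scan. A minimal sketch, assuming an uncased model and whitespace-preserving normalization (all names here are hypothetical, not the library's API):

```typescript
// Special tokens get [0, 0] offsets, per the HF convention noted above.
const SPECIAL_TOKENS = new Set(["[CLS]", "[SEP]", "[PAD]", "[UNK]", "[MASK]"]);

function computeWordPieceOffsets(
  text: string,
  tokens: string[]
): Array<[number, number]> {
  const haystack = text.toLowerCase(); // uncased-model assumption
  const offsets: Array<[number, number]> = [];
  let cursor = 0; // scan position; tokens appear left-to-right
  for (const token of tokens) {
    if (SPECIAL_TOKENS.has(token)) {
      offsets.push([0, 0]);
      continue;
    }
    // "##" marks a subword continuation and never occurs in the source text.
    const piece = token.startsWith("##") ? token.slice(2) : token;
    const start = haystack.indexOf(piece, cursor);
    if (start === -1) {
      // Piece not locatable (e.g. lossy normalization); fall back to [0, 0].
      offsets.push([0, 0]);
      continue;
    }
    offsets.push([start, start + piece.length]);
    cursor = start + piece.length;
  }
  return offsets;
}
```

For `"Hello world!"` with tokens `["[CLS]", "hello", "world", "!", "[SEP]"]` this yields the `[[0,0], [0,5], [6,11], [11,12], [0,0]]` shown in the Expected Behavior example. A production implementation would instead thread offsets through the normalizer and pre-tokenizer (as the Rust library does), since re-searching the text breaks down under accent stripping or other lossy normalization.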

Additional Context

This feature would bring tokenizers.js closer to feature parity with the official HuggingFace tokenizers library and enable advanced NLP use cases that require precise character-level alignment.
