Add offset mapping support for character-level token alignment #16

@dkhokhlov

Description

Summary

Currently, the Encoding interface in tokenizers.js exposes only token IDs, tokens, and attention masks. It lacks the character-level position information (offset_mapping) that is available in the HuggingFace tokenizers Python/Rust library.

Use Case

Character offset mapping is essential for:

  • NLP applications that need to map tokens back to original text positions
  • PII detection models (like NeuroBERT) where precise character locations are required
  • Text highlighting/annotation based on tokenization results
  • Model interpretability and token attribution
  • Compatibility with HuggingFace tokenizers Python API

Current Behavior

const result = tokenizer.encode("Hello world!");
console.log(result);
// Output: {
//   ids: [101, 7592, 2088, 999, 102],
//   tokens: ["[CLS]", "hello", "world", "!", "[SEP]"],
//   attention_mask: [1, 1, 1, 1, 1]
//   // Missing: offset_mapping
// }

Expected Behavior

const result = tokenizer.encode("Hello world!", { return_offsets_mapping: true });
console.log(result.offset_mapping);
// Output: [[0,0], [0,5], [6,11], [11,12], [0,0]]
//         [CLS]  Hello  world    !      [SEP]

Requirements

- Add offset_mapping?: Array<[number, number]> to Encoding interface
- Add return_offsets_mapping?: boolean option to encode() method
- Implement character position tracking for WordPiece tokenizers
- Handle special tokens with [0,0] offsets (HF standard)
- Support subword tokens (## prefix) correctly
- Provide clear error messages for unsupported model types (BPE, Unigram)
- Maintain backward compatibility
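
The first two requirements can be sketched as TypeScript type additions. This is only an illustration of the proposed shape; the surrounding `Encoding` and `EncodeOptions` definitions are assumptions about the library's internals, not its actual source:

```typescript
// Sketch of the proposed additions. The existing fields mirror the
// "Current Behavior" example above; offset_mapping is the new field.
interface Encoding {
  ids: number[];
  tokens: string[];
  attention_mask: number[];
  // New: one [start, end) character span per token, present only
  // when return_offsets_mapping was requested.
  offset_mapping?: Array<[number, number]>;
}

// Hypothetical options bag for encode(); the option name matches the
// HF Python API for compatibility.
interface EncodeOptions {
  return_offsets_mapping?: boolean;
}
```

Making `offset_mapping` optional (and off by default) keeps existing callers working unchanged, satisfying the backward-compatibility requirement.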

Models to Support

Priority 1 (WordPiece):
- BERT, DistilBERT, ALBERT
- Medical/domain BERT variants (e.g., NeuroBERT, BioBERT)

Future (out of scope for initial implementation):
- BPE models (GPT, RoBERTa) - complex merge operations
- Unigram models (T5, XLM-R) - probabilistic segmentation
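
WordPiece is tractable first because it never reorders or merges characters: each non-special token (after stripping the `##` continuation prefix) appears verbatim in the normalized text, so offsets can be recovered by a forward scan. A minimal sketch, assuming an uncased model and whitespace-preserving normalization (all names here are hypothetical, not the library's API):

```typescript
// Special tokens get [0, 0] offsets, per the HF convention noted above.
const SPECIAL_TOKENS = new Set(["[CLS]", "[SEP]", "[PAD]", "[UNK]", "[MASK]"]);

function computeWordPieceOffsets(
  text: string,
  tokens: string[]
): Array<[number, number]> {
  const haystack = text.toLowerCase(); // uncased-model assumption
  const offsets: Array<[number, number]> = [];
  let cursor = 0; // scan position; tokens appear left-to-right
  for (const token of tokens) {
    if (SPECIAL_TOKENS.has(token)) {
      offsets.push([0, 0]);
      continue;
    }
    // "##" marks a subword continuation and never occurs in the source text.
    const piece = token.startsWith("##") ? token.slice(2) : token;
    const start = haystack.indexOf(piece, cursor);
    if (start === -1) {
      // Piece not locatable (e.g. lossy normalization); fall back to [0, 0].
      offsets.push([0, 0]);
      continue;
    }
    offsets.push([start, start + piece.length]);
    cursor = start + piece.length;
  }
  return offsets;
}
```

For `"Hello world!"` with tokens `["[CLS]", "hello", "world", "!", "[SEP]"]` this yields the `[[0,0], [0,5], [6,11], [11,12], [0,0]]` shown in the Expected Behavior example. A production implementation would instead thread offsets through the normalizer and pre-tokenizer (as the Rust library does), since re-searching the text breaks down under accent stripping or other lossy normalization.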

Additional Context

This feature would bring tokenizers.js closer to feature parity with the official HuggingFace tokenizers library and enable advanced NLP use cases that require precise character-level alignment.
