# subword-tokenization

Here are 10 public repositories matching this topic...

Custom BPE tokenizer built from scratch on WikiText-2 (30k vocab). Covers data cleaning, deduplication, training with the Hugging Face tokenizers library, evaluation (compression ratio, UNK-free coverage, consistency), and save/reload as a PreTrainedTokenizerFast.

  • Updated Apr 14, 2026
  • Jupyter Notebook
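The merge loop at the heart of BPE training can be sketched in plain Python. This is a minimal, self-contained illustration on a toy corpus, not code from the repository above; all function names here are illustrative, and a real 30k-vocab build would use the Hugging Face tokenizers library as the description notes.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the word-frequency vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

def train_bpe(corpus, num_merges):
    """Learn a list of BPE merges from a list of words."""
    # Represent each word as space-separated characters plus an end marker.
    vocab = Counter(" ".join(word) + " </w>" for word in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

merges = train_bpe(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)  # first merges fuse the shared prefix "l" + "o", then "lo" + "w"
```

Each iteration greedily fuses the most frequent adjacent pair, so frequent substrings like the shared prefix "low" become single vocabulary symbols after a few merges.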
