A friendly, profile‑driven Byte‑Pair Encoding (BPE) tokenizer for LLMs.
Toksmith is the successor to my earlier educational repo minbpe and was heavily inspired by the tokenization lectures in Stanford’s CS336 Language Modeling from Scratch course. The goal: keep the clarity of a naive reference implementation while squeezing out every last bit of Python‑only performance.
- Baseline code is great for learning, but painfully slow at scale.
- Careful profiling shows two pieces of low-hanging fruit:
  - Parallel pre-tokenization – slice the corpus into chunks, feed them to multiple processes, then merge counters. This brings an immediate wall-clock speed-up (see the first sketch below).
  - Smart merging – during training, only the tokens that actually contain the current top pair need their statistics updated. The `FastMerger` keeps adjacency sets plus a max-heap, turning each merge into an O(k log n) blip instead of a full sweep. ~2.9× faster in practice (see the second sketch below).
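
As a rough illustration of the first idea, here is a minimal sketch of chunked pre-token counting. It is not the repo's actual implementation: the GPT-2-style split pattern and the naive fixed-size chunking are assumptions made for the example.

```python
from collections import Counter
from multiprocessing import Pool

import regex

# GPT-2-style pre-tokenization pattern, used here only as an example;
# the repo's actual pattern may differ.
PAT = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

def count_chunk(chunk: str) -> Counter:
    """Count pre-token occurrences within one chunk."""
    return Counter(PAT.findall(chunk))

def parallel_pretokenize(text: str, n_procs: int = 4) -> Counter:
    # Naive fixed-size slicing; a real implementation should cut on safe
    # boundaries (e.g. special tokens) so no pre-token straddles two chunks.
    step = max(1, len(text) // n_procs)
    chunks = [text[i:i + step] for i in range(0, len(text), step)]
    with Pool(n_procs) as pool:
        per_chunk = pool.map(count_chunk, chunks)
    total = Counter()
    for counts in per_chunk:
        total.update(counts)  # merge the per-process counters
    return total

# Call parallel_pretokenize() from under `if __name__ == "__main__":` so
# multiprocessing's spawn start method (macOS/Windows) can re-import this
# module safely.
```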
Put together, these optimisations spotlight how a profiling‑first mindset can turn a plain‑Python prototype into a snappy, memory‑frugal workhorse—no native extensions required.
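
For the smart-merging side, here is a toy reconstruction of the bookkeeping under the same caveat: this is not the actual `FastMerger` code, tie-breaking is simplified, and the max-heap uses lazy invalidation (stale entries are popped and skipped rather than updated in place). The point is that each merge re-counts pairs only inside words indexed under the winning pair.

```python
import heapq
from collections import Counter, defaultdict

def train_sketch(word_counts: dict[tuple[int, ...], int], num_merges: int,
                 next_id: int = 256) -> list[tuple[int, int]]:
    """Toy BPE trainer illustrating the adjacency-index + lazy max-heap trick."""
    # words: id -> [token list, multiplicity of this word in the corpus]
    words = {i: [list(w), c] for i, (w, c) in enumerate(word_counts.items())}
    pair_counts: Counter = Counter()
    index = defaultdict(set)  # pair -> ids of words currently containing it
    for i, (w, c) in words.items():
        for pair in zip(w, w[1:]):
            pair_counts[pair] += c
            index[pair].add(i)

    heap = [(-n, pair) for pair, n in pair_counts.items()]
    heapq.heapify(heap)
    merges = []

    for _ in range(num_merges):
        top = None
        while heap:  # lazy invalidation: discard entries with stale counts
            neg, pair = heapq.heappop(heap)
            if -neg == pair_counts[pair] > 0:
                top = pair
                break
        if top is None:
            break
        merges.append(top)

        for i in list(index[top]):  # touch only words that contain `top`
            w, c = words[i]
            old = Counter(zip(w, w[1:]))
            merged, j = [], 0
            while j < len(w):  # greedy left-to-right merge of `top`
                if j + 1 < len(w) and (w[j], w[j + 1]) == top:
                    merged.append(next_id)
                    j += 2
                else:
                    merged.append(w[j])
                    j += 1
            words[i][0] = merged
            new = Counter(zip(merged, merged[1:]))
            for pair in set(old) | set(new):  # update only the changed pairs
                delta = (new[pair] - old[pair]) * c
                if delta:
                    pair_counts[pair] += delta
                    if pair_counts[pair] > 0:
                        heapq.heappush(heap, (-pair_counts[pair], pair))
                if new[pair]:
                    index[pair].add(i)
                else:
                    index[pair].discard(i)
        next_id += 1
    return merges
```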
- Two training modes out of the box:
  - `Tokenizer.train(...)` – baseline, easy to follow, perfect for profiling demos.
  - `Tokenizer.train_from_file(...)` – fast path using multiprocessing + `FastMerger`.
- JSON save / load for reproducible vocab/merge tables.
- Complete unit-test coverage (run `pytest`).
- Zero external deps beyond `regex`.
```python
from toksmith.tokenizer import Tokenizer

# Initialise
special = ["<|endoftext|>"]
trainer = Tokenizer()

# 1️⃣ Learning-oriented baseline
text = open("tiny.txt", "r", encoding="utf-8").read()
trainer.train(text, vocab_size=8192, special_tokens=special)

# 2️⃣ Production-minded fast path (multiprocess + FastMerger)
trainer.train_from_file(
    "wiki.txt",
    vocab_size=8192,
    special_tokens=special,
)

# Persist to disk
trainer.save_state(prefix="wiki", folder="./artifacts")
```

```bash
# Train with default settings (quiet)
python src/toksmith/train.py -i wiki.txt -N 8192

# With special tokens and verbose output
python src/toksmith/train.py -i wiki.txt -N 8192 \
    -s "<|endoftext|>" \
    -vv
```

Use `-v`/`-vv` to get progress logs, or `-q` to silence everything except errors.
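
Since the vocab/merge tables round-trip through JSON, restoring a trained tokenizer should be the mirror image of `save_state`. The loader below is only a guess at the API: `load_state` is an assumed method name, so check the source for the real one.

```python
from toksmith.tokenizer import Tokenizer

# Hypothetical: `load_state` is an assumed counterpart to `save_state`;
# the actual method name may differ -- consult src/toksmith/tokenizer.py.
trainer = Tokenizer()
trainer.load_state(prefix="wiki", folder="./artifacts")
```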
Heads-up: `encode()` and `decode()` aren't wired up yet. They're next on the roadmap.
| Benchmark | `bench_train_base` | `bench_train_merge` (2.92× faster) | `bench_train_full` (7.13× faster) |
|---|---|---|---|
| pyperf | 953 s | 327 s | 134 s |
Benchmarks generated with `pyperf compare_to`. See `benchmarks/` for reproducible scripts.
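
For example, a comparison run might look like this (the JSON file names are illustrative; see `benchmarks/` for the real ones):

```bash
# Compare two saved pyperf result files side by side
python -m pyperf compare_to bench_train_base.json bench_train_full.json --table
```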
- Add `encode`/`decode` functionality
- Maybe compare with the Hugging Face tokenizer?
Contributions & ideas very welcome!
MIT © 2025 Alexey Demyanchuk