
deps(rust): bump tokenizers from 0.22.2 to 0.23.1#48

Closed
dependabot[bot] wants to merge 1 commit into main from dependabot/cargo/tokenizers-0.22.2

Conversation


@dependabot dependabot Bot commented on behalf of github Apr 25, 2026

⚠️ Dependabot is rebasing this PR ⚠️

Rebasing might not happen immediately, so don't worry if this takes some time.

Note: if you make any changes to this PR yourself, they will take precedence over the rebase.


Bumps tokenizers from 0.22.2 to 0.23.1.

Release notes

Sourced from tokenizers's releases.

Release v0.23.1

TL;DR

tokenizers 0.23.1 is the first proper stable release in the 0.23 line. 0.23.0 only ever shipped as rc0 because the release pipeline itself was broken: the Node side hadn't shipped multi-platform binaries since 2023, and the Python side was on pyo3 0.27 without free-threaded support. 0.23.1 is the version where everything actually goes out the door together: full Node multi-platform binaries for the first time in years, Python 3.14 support (regular and free-threaded 3.14t), full type hints for every Python class, and a stack of measurable perf wins on the BPE / added-vocab hot paths.

There is no functional 0.23.0 published; we tagged 0.23.1 directly so users don't accidentally pull a never-shipped version.


🚨 Breaking changes

  • Drop Python 3.9 (#1952) — requires-python = ">=3.10"; 3.9 users stay on 0.22.x.
  • add_tokens normalizes content at insertion (#1995) — re-saved tokenizer.json may differ in the added_tokens block. Existing files load unchanged.
  • Type stubs are precise (#1928, #1997) — methods that returned Any now return real types; mypy --strict may surface previously-hidden errors. Stub layout also moved from tokenizers/<sub>/__init__.pyi to tokenizers/<sub>.pyi, which changes the import surface for some processors, e.g. RobertaProcessing's __init__.
  • 3.14t-only: setters/getters return PyResult<T> because of Arc<RwLock<Tokenizer>>; a poisoned lock surfaces as PyException instead of a panic.
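The add_tokens normalize-on-insert change can be illustrated outside the library. This is a stand-alone Python sketch of the idea (not tokenizers' own code), showing why a re-saved added_tokens block may differ when the token content contains un-normalized Unicode:

```python
import unicodedata

def add_token_pre_0_23(vocab: dict, content: str) -> None:
    # 0.22.x-style behavior: store the content exactly as given.
    vocab[content] = len(vocab)

def add_token_0_23(vocab: dict, content: str) -> None:
    # 0.23.x-style behavior (sketch): normalize at insertion time, so the
    # serialized added_tokens block would hold the normalized form.
    vocab[unicodedata.normalize("NFKC", content)] = len(vocab)

old, new = {}, {}
token = "\ufb01nal"  # "final" spelled with the U+FB01 "fi" ligature
add_token_pre_0_23(old, token)
add_token_0_23(new, token)

print(sorted(old))  # ['ﬁnal'] – ligature preserved on disk
print(sorted(new))  # ['final'] – ligature folded by NFKC
```

Existing tokenizer.json files still load unchanged; only the form written back out on save differs.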

⚡ Performance — measured locally on this Mac, not lifted from PRs

Run with `cargo bench --bench <name> -- --save-baseline v0_22_2` on v0.22.2, then `-- --baseline v0_22_2` on v0.23.1. Numbers are point-in-time wall clock on a single laptop; the relative deltas are what matter, and absolute numbers will differ on CI hardware.

Added-vocabulary deserialize — the headline win (#1995, #1999)

bench: improve added_vocab_deserialize to reflect real-world workloads (#2000) is now representative of how transformers actually loads tokenizer.json files. The combined effect of daachorse for the matching automaton plus the normalize-on-insert refactor is enormous on this workload:

| benchmark | v0.22.2 | v0.23.1 | change |
| --- | --- | --- | --- |
| 100k tokens, special, no norm | ~410 ms | 248 ms | −40% |
| 100k tokens, non-special, no norm | ~7.1 s | 273 ms | −96% |
| 100k tokens, special, NFKC | ~395 ms | 235 ms | −40% |
| 100k tokens, non-special, NFKC | ~7.4 s | 290 ms | −96% |
| 400k tokens, special, no norm | ~15 s | 980 ms | −94% |

Real-world impact: loading a Llama-3-style tokenizer with a large set of added tokens dropped from "noticeable pause" to "instant".
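Since the v0.22.2 column carries ~ approximations, the change column can be sanity-checked by recomputing the relative deltas. A quick sketch with the timings (in milliseconds) read off the added-vocabulary table above:

```python
# (v0.22.2 ms, v0.23.1 ms) pairs from the added-vocabulary table; the
# v0.22.2 side is approximate, so recomputed deltas land within a point
# or two of the table's change column.
rows = {
    "100k special, no norm":     (410, 248),
    "100k non-special, no norm": (7100, 273),
    "100k special, NFKC":        (395, 235),
    "100k non-special, NFKC":    (7400, 290),
    "400k special, no norm":     (15000, 980),
}

deltas = {
    name: round(100 * (after - before) / before)
    for name, (before, after) in rows.items()
}
for name, pct in deltas.items():
    print(f"{name}: {pct}%")
```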

BPE encode

| benchmark | v0.22.2 | v0.23.1 | change |
| --- | --- | --- | --- |
| BPE GPT2 encode batch, no cache | 530 ms | 446 ms | −16% |
| BPE GPT2 encode batch (cached) | 690 ms | 685 ms | noise |
| BPE GPT2 encode (single) | 1.95 s | 1.94 s | noise |
| BPE Train (small) | 32.6 ms | 31.5 ms | −3% |
| BPE Train (big) | 1.01 s | 988 ms | −2% |

The BPE per-thread cache PR (#2028) shows much larger wins on highly-parallel workloads (+47–62% at 88+ threads on a server box, per the PR's own measurements on Vera). Single-thread batch numbers above are flat or slightly improved because cache-hit overhead was already low without contention.
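The per-thread cache idea behind #2028 can be sketched in a few lines. This is an illustrative Python analogue (thread-local memoization with no shared lock), not the PR's Rust code; `fake_bpe_encode` is a hypothetical stand-in for the merge loop:

```python
import threading

class PerThreadCache:
    """Each thread gets its own dict, so lookups never contend on a lock.

    This mirrors the shape of a per-thread BPE merge cache: the trade-off
    is duplicated entries across threads in exchange for zero contention.
    """

    def __init__(self):
        self._local = threading.local()

    def get_or_compute(self, key, compute):
        cache = getattr(self._local, "cache", None)
        if cache is None:
            cache = self._local.cache = {}
        if key not in cache:
            cache[key] = compute(key)
        return cache[key]

cache = PerThreadCache()

def fake_bpe_encode(word: str) -> list:
    # Stand-in for an expensive BPE merge loop.
    return [word[i:i + 2] for i in range(0, len(word), 2)]

results = []
def worker():
    results.append(cache.get_or_compute("hello", fake_bpe_encode))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # four identical encodings, one computed per thread
```

Under heavy parallelism the shared-lock version serializes on cache lookups, which is why the PR's wins only show up at high thread counts.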

Llama-3 encode

... (truncated)

Commits


dependabot Bot commented on behalf of github Apr 25, 2026

Labels

The following labels could not be found: dependencies, rust. Please create them before Dependabot can add them to a pull request.

Please fix the above issues or remove invalid values from dependabot.yml.
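The fix is either to create the two labels in the repository or to drop them from the Dependabot config. For reference, a minimal sketch of where they live in dependabot.yml (the schedule and directory values are assumptions, not copied from this repo):

```yaml
version: 2
updates:
  - package-ecosystem: "cargo"
    directory: "/"
    schedule:
      interval: "weekly"
    labels:
      - "dependencies"
      - "rust"
```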

Bumps [tokenizers](https://github.com/huggingface/tokenizers) from 0.22.2 to 0.23.1.
- [Release notes](https://github.com/huggingface/tokenizers/releases)
- [Changelog](https://github.com/huggingface/tokenizers/blob/main/RELEASE.md)
- [Commits](huggingface/tokenizers@v0.22.2...v0.23.1)

---
updated-dependencies:
- dependency-name: tokenizers
  dependency-version: 0.22.2
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot Bot changed the title deps(rust): bump tokenizers from 0.20.4 to 0.22.2 deps(rust): bump tokenizers from 0.22.2 to 0.23.1 May 12, 2026
@dependabot dependabot Bot force-pushed the dependabot/cargo/tokenizers-0.22.2 branch from 9a0e403 to 30c245c on May 12, 2026 15:02
pacphi added a commit that referenced this pull request May 12, 2026
…ignore

- Workspace and finima-llm: tokenizers "0.22" → "0.23" (resolves to 0.23.1).
  Breaking changes in the 0.23 release are Python-binding-only; the Rust
  surface is unchanged. Cargo.lock now carries 0.21.4 (mistralrs internal),
  0.22.2 (ruvector-sona transitive), and 0.23.1 (our workspace constraint).

- audit-ignore: add RUSTSEC-2026-0002 (lru 0.12.5 IterMut unsoundness).
  The advisory is a transitive dep through aws-sdk-s3; we never call
  IterMut on the cache directly. No patched lru 0.12.x exists upstream —
  remove the entry once aws-sdk-s3 updates its lru dependency.

Closes the CI failure that was blocking PR #48.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
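The two changes in that commit can be sketched as TOML fragments (file paths and surrounding keys are assumptions, not copied from the repo):

```toml
# Cargo.toml (workspace): widen the tokenizers constraint so the
# workspace resolves to 0.23.1 while transitive pins stay untouched.
[workspace.dependencies]
tokenizers = "0.23"

# .cargo/audit.toml: suppress the lru advisory that only reaches us
# transitively via aws-sdk-s3; drop this once upstream bumps lru.
[advisories]
ignore = ["RUSTSEC-2026-0002"]
```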

pacphi commented May 12, 2026

Superseded — tokenizers was bumped to 0.23 (resolves to 0.23.1) in commit 68c017a ("deps(rust): bump tokenizers 0.22 → 0.23; add RUSTSEC-2026-0002 audit ignore"). The CI failure on this PR was caused by a missing audit-ignore entry for RUSTSEC-2026-0002 (lru 0.12.5 via aws-sdk-s3), which is now also addressed on main. Closing as no further action is needed.

@pacphi pacphi closed this May 12, 2026

dependabot Bot commented on behalf of github May 12, 2026

OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version, let me know by commenting @dependabot ignore this major version or @dependabot ignore this minor version. You can also ignore all major, minor, or patch releases for a dependency by adding an ignore condition with the desired update_types to your config file.

If you change your mind, just re-open this PR and I'll resolve any conflicts on it.

@dependabot dependabot Bot deleted the dependabot/cargo/tokenizers-0.22.2 branch May 12, 2026 15:14