Use dual replace to improve performance of normalize_name #254

Open
hugovk wants to merge 3 commits into pypa:master from hugovk:speedup-canonicalize_name

@hugovk hugovk commented Jan 10, 2026

We can apply @henryiii's improvement to packaging in pypa/packaging#1030 (see also https://iscinumpy.dev/post/packaging-faster/) to improve the performance of normalize_name and make it ~3.4 times faster.
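For readers who have not seen the packaging change, here is a minimal sketch of the idea, not the exact diff in this PR. It assumes `normalize_name` follows PEP 503 (lowercase, with runs of `-`, `_` and `.` collapsed to a single `-`); the function names `normalize_name_regex` and `normalize_name_replace` are hypothetical and only serve to compare the two approaches.

```python
import re

# Regex-based normalization as described in PEP 503.
_SEPARATOR_RUN = re.compile(r"[-_.]+")


def normalize_name_regex(name: str) -> str:
    """Collapse runs of '-', '_' and '.' into one '-' and lowercase."""
    return _SEPARATOR_RUN.sub("-", name).lower()


def normalize_name_replace(name: str) -> str:
    """Same result using only str methods (hypothetical sketch)."""
    # Two replace calls map '_' and '.' onto '-'.
    value = name.lower().replace("_", "-").replace(".", "-")
    # Collapse any remaining runs of '-' down to a single '-'.
    while "--" in value:
        value = value.replace("--", "-")
    return value


if __name__ == "__main__":
    for sample in ["Foo__Bar..baz", "a-_.b", "plain", "X---Y"]:
        assert normalize_name_regex(sample) == normalize_name_replace(sample)
        print(sample, "->", normalize_name_replace(sample))
```

Most project names contain no repeated separators, so the `while` loop usually does not run at all and the common path is just `lower()` plus two `replace()` calls.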

Benchmark

Run normalize_name(n) on every name in PyPI:

# benchmark_names_distlib.py
import sqlite3
import timeit
from distlib.util import normalize_name

# Get data with:
# curl -L https://github.com/pypi-data/pypi-json-data/releases/download/latest/pypi-data.sqlite.gz | gzip -d > pypi-data.sqlite
# Or use pre-cached files from:
# https://gist.github.com/hugovk/efdbee0620cc64df7b405b52cf0b6e42

CACHE_FILE = "/tmp/bench/names.txt"
DB_FILE = "/tmp/bench/pypi-data.sqlite"

try:
    with open(CACHE_FILE) as f:
        TEST_ALL_NAMES = [line.rstrip("\n") for line in f]
except FileNotFoundError:
    TEST_ALL_NAMES = []
    with sqlite3.connect(DB_FILE) as conn:
        with open(CACHE_FILE, "w") as cache:
            for (name,) in conn.execute("SELECT name FROM projects"):
                if name:
                    TEST_ALL_NAMES.append(name)
                    cache.write(name + "\n")


def bench():
    for n in TEST_ALL_NAMES:
        normalize_name(n)


if __name__ == "__main__":
    print(f"Loaded {len(TEST_ALL_NAMES):,} names")
    t = timeit.timeit("bench()", globals=globals(), number=1)
    print(f"Time: {t:.4f} seconds")

Benchmark data can be found at https://gist.github.com/hugovk/efdbee0620cc64df7b405b52cf0b6e42

Before

With Python 3.14 on macOS:

python benchmark_names_distlib.py
Loaded 8,344,947 names
Time: 4.6224 seconds

After

python benchmark_names_distlib.py
Loaded 8,344,947 names
Time: 1.3598 seconds

3.4 times faster.

hugovk and others added 2 commits January 10, 2026 16:21
Co-Authored-By: Henry Schreiner <henryschreineriii@gmail.com>

hugovk commented Jan 19, 2026

Following on from pypa/packaging#1064, this is slower with 3.12 and 3.13, so marking as draft for now.

Testing on Python 3.8 to 3.14 on macOS (python.org versions) using hyperfine:

| Python | master (s) | PR (s) | Result |
|--------|------------|--------|--------|
| 3.8 | 7.699 ± 0.620 | 4.620 ± 0.040 | PR 1.67x faster |
| 3.9 | 7.463 ± 0.480 | 4.775 ± 0.131 | PR 1.56x faster |
| 3.10 | 6.042 ± 0.019 | 3.947 ± 0.213 | PR 1.53x faster |
| 3.11 | 5.437 ± 0.144 | 3.598 ± 0.144 | PR 1.51x faster |
| 3.12 | 5.707 ± 0.358 | 6.907 ± 0.059 | ⚠️ master 1.21x faster |
| 3.13 | 5.248 ± 0.067 | 6.479 ± 0.163 | ⚠️ master 1.23x faster |
| 3.14 | 5.784 ± 0.605 | 2.391 ± 0.061 | PR 2.42x faster |

Details
hyperfine --warmup 1 -r 3 \
    -n master --prepare 'git checkout master' 'python3.8 benchmark_names_distlib.py' \
    -n PR --prepare 'git checkout speedup-canonicalize_name -q' 'python3.8 benchmark_names_distlib.py'
Benchmark 1: master
  Time (mean ± σ):      7.699 s ±  0.620 s    [User: 6.703 s, System: 0.455 s]
  Range (min … max):    7.054 s …  8.290 s    3 runs

Benchmark 2: PR
  Time (mean ± σ):      4.620 s ±  0.040 s    [User: 4.381 s, System: 0.199 s]
  Range (min … max):    4.574 s …  4.650 s    3 runs

Summary
  PR ran
    1.67 ± 0.14 times faster than master

hyperfine --warmup 1 -r 3 \
    -n master --prepare 'git checkout master' 'python3.9 benchmark_names_distlib.py' \
    -n PR --prepare 'git checkout speedup-canonicalize_name -q' 'python3.9 benchmark_names_distlib.py'
Benchmark 1: master
  Time (mean ± σ):      7.463 s ±  0.480 s    [User: 6.627 s, System: 0.416 s]
  Range (min … max):    7.111 s …  8.009 s    3 runs

Benchmark 2: PR
  Time (mean ± σ):      4.775 s ±  0.131 s    [User: 4.379 s, System: 0.296 s]
  Range (min … max):    4.679 s …  4.924 s    3 runs

Summary
  PR ran
    1.56 ± 0.11 times faster than master

hyperfine --warmup 1 -r 3 \
    -n master --prepare 'git checkout master' 'python3.10 benchmark_names_distlib.py' \
    -n PR --prepare 'git checkout speedup-canonicalize_name -q' 'python3.10 benchmark_names_distlib.py'
Benchmark 1: master
  Time (mean ± σ):      6.042 s ±  0.019 s    [User: 5.561 s, System: 0.329 s]
  Range (min … max):    6.021 s …  6.054 s    3 runs

Benchmark 2: PR
  Time (mean ± σ):      3.947 s ±  0.213 s    [User: 3.503 s, System: 0.311 s]
  Range (min … max):    3.751 s …  4.174 s    3 runs

Summary
  PR ran
    1.53 ± 0.08 times faster than master

hyperfine --warmup 1 -r 3 \
    -n master --prepare 'git checkout master' 'python3.11 benchmark_names_distlib.py' \
    -n PR --prepare 'git checkout speedup-canonicalize_name -q' 'python3.11 benchmark_names_distlib.py'
Benchmark 1: master
  Time (mean ± σ):      5.437 s ±  0.144 s    [User: 4.972 s, System: 0.274 s]
  Range (min … max):    5.279 s …  5.563 s    3 runs

Benchmark 2: PR
  Time (mean ± σ):      3.598 s ±  0.144 s    [User: 3.198 s, System: 0.265 s]
  Range (min … max):    3.463 s …  3.750 s    3 runs

Summary
  PR ran
    1.51 ± 0.07 times faster than master

hyperfine --warmup 1 -r 3 \
    -n master --prepare 'git checkout master' 'python3.12 benchmark_names_distlib.py' \
    -n PR --prepare 'git checkout speedup-canonicalize_name -q' 'python3.12 benchmark_names_distlib.py'
Benchmark 1: master
  Time (mean ± σ):      5.707 s ±  0.358 s    [User: 5.005 s, System: 0.369 s]
  Range (min … max):    5.439 s …  6.113 s    3 runs

Benchmark 2: PR
  Time (mean ± σ):      6.907 s ±  0.059 s    [User: 6.439 s, System: 0.331 s]
  Range (min … max):    6.846 s …  6.963 s    3 runs

Summary
  master ran
    1.21 ± 0.08 times faster than PR

hyperfine --warmup 1 -r 3 \
    -n master --prepare 'git checkout master' 'python3.13 benchmark_names_distlib.py' \
    -n PR --prepare 'git checkout speedup-canonicalize_name -q' 'python3.13 benchmark_names_distlib.py'
Benchmark 1: master
  Time (mean ± σ):      5.248 s ±  0.067 s    [User: 4.906 s, System: 0.226 s]
  Range (min … max):    5.183 s …  5.316 s    3 runs

Benchmark 2: PR
  Time (mean ± σ):      6.479 s ±  0.163 s    [User: 6.002 s, System: 0.315 s]
  Range (min … max):    6.328 s …  6.652 s    3 runs

Summary
  master ran
    1.23 ± 0.03 times faster than PR

hyperfine --warmup 1 -r 3 \
    -n master --prepare 'git checkout master' 'python3.14 benchmark_names_distlib.py' \
    -n PR --prepare 'git checkout speedup-canonicalize_name -q' 'python3.14 benchmark_names_distlib.py'
Benchmark 1: master
  Time (mean ± σ):      5.784 s ±  0.605 s    [User: 5.174 s, System: 0.275 s]
  Range (min … max):    5.370 s …  6.478 s    3 runs

Benchmark 2: PR
  Time (mean ± σ):      2.391 s ±  0.061 s    [User: 2.067 s, System: 0.223 s]
  Range (min … max):    2.321 s …  2.431 s    3 runs

Summary
  PR ran
    2.42 ± 0.26 times faster than master
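
As a cross-check, the Result column of the summary table can be recomputed from the hyperfine mean times alone. The sketch below copies the means from the runs above; the `MEANS` dict and the `summarize` helper are names made up for this illustration.

```python
# Mean wall-clock times in seconds (master, PR), copied from the hyperfine runs.
MEANS = {
    "3.8": (7.699, 4.620),
    "3.9": (7.463, 4.775),
    "3.10": (6.042, 3.947),
    "3.11": (5.437, 3.598),
    "3.12": (5.707, 6.907),
    "3.13": (5.248, 6.479),
    "3.14": (5.784, 2.391),
}


def summarize(master: float, pr: float) -> str:
    """Report which branch is faster and by what ratio of mean times."""
    if pr <= master:
        return f"PR {master / pr:.2f}x faster"
    return f"master {pr / master:.2f}x faster"


if __name__ == "__main__":
    for version, (master, pr) in MEANS.items():
        print(f"{version:>4}  {summarize(master, pr)}")
```

With only three runs per configuration, the ± spread on several rows is large relative to the differences, so the ratios are best read as approximate.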

@hugovk hugovk marked this pull request as draft January 19, 2026 18:34

hugovk commented Mar 22, 2026

Updated to use the faster version from pypa/packaging#1064.

@hugovk hugovk marked this pull request as ready for review March 22, 2026 21:07
@hugovk hugovk changed the title from "Use str.translate to improve performance of normalize_name" to "Use dual replace to improve performance of normalize_name" Mar 22, 2026

codecov Bot commented Mar 26, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.22%. Comparing base (674a491) to head (8e613c3).
⚠️ Report is 33 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| distlib/util.py | 50.00% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #254      +/-   ##
==========================================
- Coverage   81.49%   81.22%   -0.28%     
==========================================
  Files          24       24              
  Lines        8885     8963      +78     
  Branches     1747     1538     -209     
==========================================
+ Hits         7241     7280      +39     
- Misses       1300     1344      +44     
+ Partials      344      339       -5     

| Flag | Coverage Δ |
|------|------------|
| unittests | 80.37% <50.00%> (-0.25%) ⬇️ |
