⚡️ Speed up method PSBaseParser._parse_number by 47% #80
Open
codeflash-ai[bot] wants to merge 1 commit into master from
The optimized code achieves a **46% speedup** by eliminating Python overhead in two hot paths within `_parse_number` and `_parse_float`:

**Key Optimizations:**

1. **Replaced `contextlib.suppress(ValueError)` with explicit try/except/else blocks**
   - The `contextlib.suppress` context manager adds significant overhead (creating a context manager object, entering/exiting it) on every invocation
   - Direct try/except/else is Python's native control flow and much faster
   - Line profiler shows the `with contextlib.suppress(ValueError):` line consumed **28.6%** of `_parse_number` runtime in the original (1.14ms), while the optimized try/except structure only takes **4.2%** (126μs) for the try statement itself
   - The actual conversion and token addition happens in the `else` branch only on success, avoiding unnecessary stack unwinding

2. **Avoided single-byte slice creation for dot comparison**
   - Changed `c = s[j : j + 1]; if c == b"."` to `if s[j] == 46` (where 46 is `ord(b".")`)
   - Indexing bytes directly returns an integer, avoiding the allocation of a one-byte bytes object
   - While individually small (~7% reduction in that specific check's time from 410ns to 366ns per hit), this adds up over 391 hits per test run

**Why This Works:**

- PDF parsing involves tokenizing thousands of numbers in typical documents. The parser calls `_parse_number` repeatedly in tight loops
- Context managers in Python have non-trivial overhead: they require `__enter__`/`__exit__` method calls and exception handling setup even when no exception occurs
- The explicit try/except/else pattern lets CPython's bytecode optimizer handle the common "no exception" path more efficiently
- Single-byte slice allocation is wasteful when we can directly access the byte value as an integer

**Performance Profile:**

The annotated tests show consistent 28-51% speedups across diverse numeric inputs:

- Simple integers: 34-49% faster
- Large integers (150+ digits): 28-32% faster
- Sequential parsing of multiple numbers: 47-51% faster
- Edge cases (lone signs, invalid conversions): 28-44% faster

The optimization particularly shines in the "large scale" test with 300 sequential numbers: **51.8% faster** (311μs → 205μs), demonstrating the cumulative benefit when `_parse_number` is called repeatedly, exactly the real-world PDF parsing scenario.
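The two changes described above can be illustrated in isolation. The following is a minimal sketch, not the code from `pdfminer/psparser.py`: the helper names (`parse_int_suppress`, `parse_int_try`, `is_dot_slice`, `is_dot_index`, `time_both`) and the `out` list standing in for the parser's token list are assumptions made for illustration only.

```python
import contextlib
import timeit

DOT = 46  # ord(b"."): the integer that indexing a bytes object yields for "."


def parse_int_suppress(token: bytes, out: list) -> None:
    """Original style: swallow ValueError through a context manager."""
    with contextlib.suppress(ValueError):
        out.append(int(token))


def parse_int_try(token: bytes, out: list) -> None:
    """Optimized style: native try/except/else, appending only on success."""
    try:
        value = int(token)
    except ValueError:
        pass
    else:
        out.append(value)


def is_dot_slice(s: bytes, j: int) -> bool:
    """Original style: build a one-byte slice and compare it with b'.'."""
    return s[j : j + 1] == b"."


def is_dot_index(s: bytes, j: int) -> bool:
    """Optimized style: indexing returns an int, so compare with 46."""
    return s[j] == DOT


def time_both(number: int = 100_000) -> dict:
    """Time both integer-parsing styles on a small, hypothetical token mix."""
    tokens = [b"12", b"-7", b"+"]  # two valid integers and one lone sign

    def run(fn):
        out = []
        for tok in tokens:
            fn(tok, out)

    return {
        "suppress": timeit.timeit(lambda: run(parse_int_suppress), number=number),
        "try_except": timeit.timeit(lambda: run(parse_int_try), number=number),
    }
```

Calling `time_both()` returns the two timings in seconds; the actual ratio depends on the Python version and the machine, so it may not match the figures reported in this PR. One behavioral difference is worth keeping in mind: `s[j : j + 1]` returns `b""` when `j` is past the end of the buffer, while `s[j]` raises `IndexError`, so the indexed form is only equivalent where `j` is known to be in range.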
📄 47% (0.47x) speedup for `PSBaseParser._parse_number` in `pdfminer/psparser.py`

⏱️ Runtime: 488 microseconds → 332 microseconds (best of 250 runs)
To edit these changes, `git checkout codeflash/optimize-PSBaseParser._parse_number-mkqzioxm` and push.