perf: misc allocation/branch reductions (round 2) #335
Three small, independent wins on top of main:

* **encode**: `x_google_ignoreList` integers no longer go through `u32::to_string()` per element. Inline a stack-buffer u32 → bytes conversion so the rare ignoreList encode path does zero allocations.
* **concat builder**: `add_sourcemap` now extends `sources` / `source_contents` / `names` by iterating the input `Vec<Cow>`s directly. The previous `get_*()` accessors return `impl Iterator<Item = &str>` which hides the `ExactSizeIterator` impl from `extend`, forcing geometric growth of the output vecs. Going through `.iter().map(...)` preserves the exact-size hint, so each `extend` pre-reserves in one shot.
* **builder**: drop the explicit `self.tokens.shrink_to_fit()` from `into_sourcemap`. `Vec::into_boxed_slice` already drops any excess capacity in a single allocation+copy; the standalone shrink was duplicate work on the same Vec.
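A minimal sketch of the stack-buffer idea from the **encode** item above, not the code in this PR. The helper names `push_u32_decimal` and `encode_ignore_list` are hypothetical, and the output is assumed to be a plain JSON array of integers. The point is only that each integer is formatted into a fixed 10-byte stack buffer instead of a freshly allocated `String`.

```rust
/// Hypothetical helper: appends the decimal digits of `n` to `out`
/// without allocating an intermediate `String`.
fn push_u32_decimal(out: &mut Vec<u8>, mut n: u32) {
    // u32::MAX is 4294967295, which has 10 decimal digits.
    let mut buf = [0u8; 10];
    let mut pos = buf.len();
    // Fill the buffer from the right, least significant digit first.
    loop {
        pos -= 1;
        buf[pos] = b'0' + (n % 10) as u8;
        n /= 10;
        if n == 0 {
            break;
        }
    }
    out.extend_from_slice(&buf[pos..]);
}

/// Hypothetical helper: encodes ids as a JSON array such as `[0,7,42]`.
fn encode_ignore_list(ids: &[u32]) -> Vec<u8> {
    // Worst case per id is 10 digits plus one comma, plus the two brackets,
    // so the output buffer is sized once and never regrows.
    let mut out = Vec::with_capacity(2 + ids.len() * 11);
    out.push(b'[');
    for (i, &id) in ids.iter().enumerate() {
        if i > 0 {
            out.push(b',');
        }
        push_u32_decimal(&mut out, id);
    }
    out.push(b']');
    out
}
```

In the real encoder the bytes would presumably be appended to the existing output buffer, so the only allocation in this sketch (the `Vec` returned by `encode_ignore_list`) would not exist there.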
Two more small wins:

* **encode**: lookup into the 64-entry `B64_CHARS` table now uses `get_unchecked`. The optimizer doesn't reliably elide the bounds check across the loop-break boundary, even though `digit & 0b11111` is provably in `0..=31`. Worth ~3-4% on small/medium serialize.
* **decode**: tokens are now constructed via `Token::new_raw`, a `pub(crate)` constructor that takes raw u32 ids using `INVALID_ID` as the absent-sentinel. The decoder already tracks ids that way, so this skips the previous `u32 → Option<u32> → u32` roundtrip through `Token::new`. Also marks Token's small getters `#[inline]` so accessor calls in hot loops reliably collapse to direct field reads. Worth ~1-4% across the parse sizes.
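The unchecked table lookup described in the **encode** item can be sketched as follows. This is an illustration under assumptions, not the crate's actual loop: `encode_vlq` is a hypothetical name, the input is assumed to be an `i32`, and the loop shape is the textbook base64 VLQ encoding. The index is a 5-bit value, optionally OR-ed with the continuation bit, so it always stays below the table length of 64.

```rust
/// The standard base64 alphabet used by source map VLQ segments.
const B64_CHARS: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Hypothetical encoder: appends the base64 VLQ form of `value` to `out`.
fn encode_vlq(out: &mut Vec<u8>, value: i32) {
    // The sign goes in the lowest bit, the magnitude in the bits above it.
    let mut vlq: u64 = if value < 0 {
        ((value.unsigned_abs() as u64) << 1) | 1
    } else {
        (value as u64) << 1
    };
    loop {
        let mut digit = (vlq & 0b11111) as usize; // always 0..=31
        vlq >>= 5;
        if vlq != 0 {
            digit |= 0b100000; // continuation bit, digit is now 32..=63
        }
        // SAFETY: `digit` is a 5-bit value optionally OR-ed with bit 5,
        // so it is always < 64, the length of `B64_CHARS`.
        out.push(unsafe { *B64_CHARS.get_unchecked(digit) });
        if vlq == 0 {
            break;
        }
    }
}
```

Replacing the `unsafe` line with `B64_CHARS[digit]` gives the checked variant; whether the compiler removes that check on its own depends on the surrounding loop, which is the uncertainty the item above refers to.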
Merging this PR will degrade performance by 1.35%

| | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| ❌ | parse[real_medium] | 14.9 µs | 15.1 µs | -1.35% |
| ❌ | parse[real_small] | 11.6 µs | 11.8 µs | -1.48% |
| ❌ | from_json_string_inline | 14.1 µs | 14.3 µs | -1.22% |
Comparing perf/round2-misc-allocations (25593e5) with main (db883f9)
Footnotes

1. 5 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.
Closing per investigation: no sub-change in this PR touches the parse path algorithmically, yet CodSpeed reports a ~1.35% parse regression and itself warns of "different runtime environments". The effect appears to be binary-layout / i-cache sensitivity from reshuffling unrelated functions, not a real perf issue. Not worth the noise for the tiny wins on the existing fixtures.
Summary
Five small, independent perf wins on top of main. Each removes one allocation, one branch, or one bounds check from a hot path.

* **encode**: `x_google_ignoreList` integers now go through an inline stack-buffer u32 → bytes converter instead of `u32::to_string()`. Zero allocations on the ignoreList encode path.
* **encode**: the `B64_CHARS` table is now indexed via `get_unchecked`. `digit & 0b11111` is provably 0..=31, but the optimizer doesn't reliably elide the bounds check across the loop break. Worth ~3-4% on small/medium `serialize`.
* **decode**: tokens are now constructed via `Token::new_raw(...)`, which takes raw u32 ids with `INVALID_ID` as the absent-sentinel. The decoder already tracks ids that way, so this skips the previous `u32 → Option<u32> → u32` roundtrip through `Token::new`. Also marks the small Token getters `#[inline]`. Worth ~1-4% on small/medium/large `parse`.
* **concat builder**: `add_sourcemap` extends `sources` / `source_contents` / `names` by iterating the input `Vec<Cow>`s directly. Going through the `get_*` accessors returned `impl Iterator<Item = &str>`, which hid the `ExactSizeIterator` impl from `extend` and forced geometric growth. Direct field iteration preserves the exact-size hint so each `extend` pre-reserves once.
* **builder**: drop `self.tokens.shrink_to_fit()` from `into_sourcemap`; `Vec::into_boxed_slice` below already drops excess capacity in one allocation+copy.

Benchmarks
Wall-clock differences are mostly inside the criterion noise floor (±2-3%) on the existing perf fixtures. The wins are most visible on workloads with many sourcemaps being concatenated (where `extend` without a pre-reserve grew the output vecs geometrically) and on workloads with thousands of tokens (where the per-token bounds-check + Option roundtrip cost adds up). Composes additively with #330 and #331.
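The per-token Option roundtrip mentioned above can be illustrated with a small sketch. The field names, the value of `INVALID_ID`, and the constructor signatures are all assumptions made for illustration; the PR names `Token::new`, `Token::new_raw` and `INVALID_ID` but does not show their definitions.

```rust
/// Assumed sentinel for "no id"; the real value is not given in the PR.
const INVALID_ID: u32 = u32::MAX;

/// Hypothetical token layout that stores ids in sentinel form.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Token {
    dst_line: u32,
    dst_col: u32,
    source_id: u32,
    name_id: u32,
}

impl Token {
    /// Public-style constructor: converts each `Option<u32>` to sentinel form.
    fn new(dst_line: u32, dst_col: u32, source_id: Option<u32>, name_id: Option<u32>) -> Self {
        Self {
            dst_line,
            dst_col,
            source_id: source_id.unwrap_or(INVALID_ID),
            name_id: name_id.unwrap_or(INVALID_ID),
        }
    }

    /// Crate-internal constructor: the ids are already in sentinel form,
    /// so no `Option` wrapping or unwrapping happens per token.
    pub(crate) fn new_raw(dst_line: u32, dst_col: u32, source_id: u32, name_id: u32) -> Self {
        Self { dst_line, dst_col, source_id, name_id }
    }

    /// Small getter that maps the sentinel back to `None`.
    #[inline]
    fn source_id(&self) -> Option<u32> {
        if self.source_id == INVALID_ID { None } else { Some(self.source_id) }
    }

    /// Small getter that maps the sentinel back to `None`.
    #[inline]
    fn name_id(&self) -> Option<u32> {
        if self.name_id == INVALID_ID { None } else { Some(self.name_id) }
    }
}
```

Under these assumptions, a decoder that already holds ids as raw `u32` values with the sentinel can call `Token::new_raw` directly and produce the same token that `Token::new` would build from the equivalent `Option` values.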