perf(decode): inline VLQ segment parser and skip capacity pre-pass #330

Open
Boshen wants to merge 1 commit into main from perf/decode-inline-vlq
@Boshen Boshen commented May 25, 2026

Summary

Two micro-optimizations to `decode_mapping`:

  1. Skip the upfront `,` / `;` counting pass. It was used to pre-size the token `Vec` to an exact upper bound; it is replaced with a `mapping.len() / 3` heuristic. Realistic sourcemap segments are 3-5 bytes, so this slightly over-allocates without paying for an extra full scan of the input.

  2. Inline `parse_vlq_segment_into` into the outer match arm and use a local `local_cursor` that's copied to/from the surrounding cursor at segment boundaries. With cursor state and the `nums` array visible at the function scope the optimizer can keep them in registers across iterations, and the function-call + reborrow overhead per segment is gone.
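
To make the trade-off in item 1 concrete, here is a minimal sketch of the two sizing strategies. The function names `exact_capacity` and `heuristic_capacity` are hypothetical and are not taken from the crate; only the separator-counting idea and the `mapping.len() / 3` expression come from the description above.

```rust
/// Exact upper bound on the number of segments: one per `,` or `;`
/// separator, plus one. This requires a full extra scan of the input.
/// (Hypothetical helper, for illustration only.)
fn exact_capacity(mapping: &[u8]) -> usize {
    mapping.iter().filter(|&&b| b == b',' || b == b';').count() + 1
}

/// Heuristic from the PR description: assume roughly 3 bytes per segment.
/// This is O(1) and needs no scan of the input.
/// (Hypothetical helper, for illustration only.)
fn heuristic_capacity(mapping: &[u8]) -> usize {
    mapping.len() / 3
}
```

For a 19-byte input such as `AAAA,CAAC;AACA,EAAE`, the exact count is 4 while the heuristic reserves 6 slots. On inputs dominated by very short segments or empty lines the heuristic can under-estimate; assuming the result is only passed to `Vec::with_capacity`, that costs a reallocation rather than affecting correctness.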

`get_unchecked` is used for the `mapping[cursor]` reads — every load is preceded by an explicit length check, so the bounds checks are already provably redundant.
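
The pattern described above (a local cursor, a fixed `nums` array, and `get_unchecked` reads guarded by an explicit length check) might look roughly like the sketch below. This is not the PR's code: `decode_segment` and `b64` are hypothetical names, the five-element `nums` array is an assumption based on a sourcemap segment having at most five fields, and the PR inlines this logic into the outer match arm rather than keeping it as a separate function.

```rust
/// Map one base64 character to its 6-bit value, or `None` if invalid.
/// (Hypothetical helper.)
fn b64(byte: u8) -> Option<u32> {
    match byte {
        b'A'..=b'Z' => Some((byte - b'A') as u32),
        b'a'..=b'z' => Some((byte - b'a') as u32 + 26),
        b'0'..=b'9' => Some((byte - b'0') as u32 + 52),
        b'+' => Some(62),
        b'/' => Some(63),
        _ => None,
    }
}

/// Decode one segment starting at `*cursor`, stopping at `,`, `;`, or the
/// end of input. Returns the number of fields written to `nums`, or `None`
/// on malformed input. (Hypothetical function, for illustration only.)
fn decode_segment(mapping: &[u8], cursor: &mut usize, nums: &mut [i64; 5]) -> Option<usize> {
    // Work on a local copy of the cursor; write it back once at the end.
    let mut local_cursor = *cursor;
    let mut count = 0;
    while local_cursor < mapping.len() {
        // SAFETY: the loop condition just checked `local_cursor < mapping.len()`.
        let first = unsafe { *mapping.get_unchecked(local_cursor) };
        if first == b',' || first == b';' {
            break;
        }
        if count == nums.len() {
            return None; // more than five fields in one segment
        }
        let mut value: u64 = 0;
        let mut shift = 0u32;
        loop {
            if local_cursor >= mapping.len() {
                return None; // input ended in the middle of a VLQ value
            }
            // SAFETY: the length check directly above guarantees the index is in bounds.
            let byte = unsafe { *mapping.get_unchecked(local_cursor) };
            local_cursor += 1;
            let digit = b64(byte)?;
            if shift > 60 {
                return None; // value too large to represent
            }
            value |= u64::from(digit & 0x1f) << shift; // low 5 bits carry data
            shift += 5;
            if digit & 0x20 == 0 {
                break; // continuation bit clear: last digit of this value
            }
        }
        // The lowest bit of the assembled value is the sign.
        let magnitude = (value >> 1) as i64;
        nums[count] = if value & 1 == 1 { -magnitude } else { magnitude };
        count += 1;
    }
    *cursor = local_cursor;
    Some(count)
}
```

Each `get_unchecked` call sits directly after a comparison against `mapping.len()`, which is the property the description relies on to call the bounds checks redundant.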

Behavior, error variants, and the public API are unchanged.

Numbers (standalone vs main, no other PRs applied)

Small improvement, mostly under the noise floor for the on-disk 3 KB fixtures. The win shows up best on the synthesized large fixture in #328:

| benchmark | before | after | Δ |
| --- | --- | --- | --- |
| parse/real_large | 3.41 µs | 3.34 µs | ~−2% |

Most parse savings come from #329 (borrowed deserialization); this PR is a small additional improvement that's primarily about removing redundant work.

codspeed-hq Bot commented May 25, 2026

Merging this PR will not alter performance

⚡ 3 improved benchmarks
❌ 4 regressed benchmarks
✅ 9 untouched benchmarks
⏩ 5 skipped benchmarks (see footnote 1)

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- |
| build_single | 7 µs | 7.1 µs | -1.64% |
| lookup_table[real_small] | 1.3 µs | 1.4 µs | -2.13% |
| parse[real_large] | 50.5 µs | 48 µs | +5.18% |
| parse[real_xlarge] | 1.4 ms | 1.3 ms | +6.3% |
| serialize[real_medium] | 5.1 µs | 5 µs | +1.16% |
| from_json_string_inline | 14.1 µs | 14.4 µs | -1.66% |
| parse[real_small] | 11.6 µs | 12.1 µs | -3.8% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing perf/decode-inline-vlq (eff0ff4) with main (db883f9)

Footnotes

  1. 5 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@Boshen Boshen force-pushed the perf/decode-inline-vlq branch from af3c7a8 to a901059 Compare May 25, 2026 05:42
Two micro-optimizations to `decode_mapping`:

1. Drop the upfront `,` / `;` counting pass that pre-sized the token
   `Vec` to an exact upper bound. Replace it with a `mapping.len() / 3`
   heuristic: real-world segments are 3-5 bytes, so this slightly
   over-allocates without paying for the extra full scan of the input.

2. Inline `parse_vlq_segment_into` into the outer match arm and use a
   local `local_cursor` (copied to/from the surrounding cursor at
   segment boundaries). With cursor state and the `nums` array
   visible at the function scope the optimizer can keep them in
   registers across iterations, and the function-call + reborrow
   overhead per segment is gone.

`get_unchecked` is used for the `mapping[cursor]` reads — every load
is preceded by an explicit length check, so the bounds checks are
already provably redundant.

Behavior, error variants, and the public API are unchanged.