Merged
121 changes: 82 additions & 39 deletions README.md
@@ -2,10 +2,11 @@

Fast Base85 (RFC 1924 / Z85-style) encoder and decoder for Rust.

On `aarch64` the encode path uses NEON intrinsics to process 16 input bytes
per iteration; the decode path and the fallback for other architectures use
a portable scalar implementation. Output is byte-for-byte compatible with
the [`base85`](https://crates.io/crates/base85) crate.
SIMD-accelerated on aarch64 (NEON, 4 blocks per iteration) and x86_64
(AVX2, 8 blocks per iteration), with a portable scalar fallback for
everything else and for x86_64 hosts lacking AVX2 (rare on server
hardware after ~2013). Output is byte-for-byte compatible with the
[`base85`](https://crates.io/crates/base85) crate.
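
The backend selection described above can be sketched roughly as follows. This is an illustration only, not the crate's code: `Backend` and `select_backend` are hypothetical names, and the sketch assumes the choice is made once per call with the standard library's `is_x86_feature_detected!` macro.

```rust
/// Hypothetical enum naming the three code paths the README describes.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Neon,
    Avx2,
    Scalar,
}

/// NEON is a mandatory part of aarch64, so no runtime check is needed.
#[cfg(target_arch = "aarch64")]
fn select_backend() -> Backend {
    Backend::Neon
}

/// AVX2 is optional on x86_64, so it is probed at runtime and the
/// scalar path is used when the CPU does not report it.
#[cfg(target_arch = "x86_64")]
fn select_backend() -> Backend {
    if std::arch::is_x86_feature_detected!("avx2") {
        Backend::Avx2
    } else {
        Backend::Scalar
    }
}

/// Every other architecture uses the portable scalar path.
#[cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))]
fn select_backend() -> Backend {
    Backend::Scalar
}
```

The standard library caches the result of the feature probe, so calling a function like `select_backend()` at each public API entry point should be cheap.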

## Usage

@@ -33,7 +34,10 @@ for decoding.
## Status

- Public API: `encode(&[u8]) -> String`, `decode(&str) -> Result<Vec<u8>, DecodeError>`.
- Encode and decode are both NEON-accelerated on `aarch64`, scalar elsewhere.
- aarch64: NEON-accelerated (4 blocks at a time, always available on aarch64).
- x86_64 with AVX2: AVX2-accelerated (8 blocks at a time). Runtime feature
detection at the public API entry — hosts without AVX2 fall back to scalar.
- Other architectures: portable scalar implementation.
- The decode path validates char range and detects `u32` overflow lane-wise;
any invalid input falls back to the scalar path so the resulting
`DecodeError` carries a precise byte position.
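
The overflow check matters because five base-85 digits can represent values up to 85^5 − 1 = 4,437,053,124, which exceeds `u32::MAX` (4,294,967,295). The following is a minimal scalar sketch of that validation for a single 5-digit group. It works on digit values (0..=84) rather than ASCII so that it does not assume a particular alphabet. `GroupError` and `decode_group` are hypothetical names rather than the crate's API, and big-endian output order is an assumption.

```rust
/// Hypothetical error type for this sketch; the crate's real `DecodeError`
/// is not shown in this README excerpt.
#[derive(Debug, PartialEq, Eq)]
enum GroupError {
    /// A digit value was 85 or larger; `index` is its position in the group.
    DigitOutOfRange { index: usize },
    /// The five digits encode a value larger than `u32::MAX`.
    Overflow,
}

/// Decode one group of five base-85 digit values into four bytes.
/// Assumes the most significant digit comes first and the output is
/// big-endian.
fn decode_group(digits: [u8; 5]) -> Result<[u8; 4], GroupError> {
    let mut acc: u32 = 0;
    for (index, &d) in digits.iter().enumerate() {
        if d >= 85 {
            return Err(GroupError::DigitOutOfRange { index });
        }
        // Checked arithmetic turns a silent wrap-around into an error.
        acc = acc
            .checked_mul(85)
            .and_then(|v| v.checked_add(u32::from(d)))
            .ok_or(GroupError::Overflow)?;
    }
    Ok(acc.to_be_bytes())
}
```

For example, the digit values `[82, 23, 54, 12, 0]` encode exactly `u32::MAX`, so incrementing the last digit to `1` is the smallest input that overflows.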
@@ -74,43 +78,82 @@ strictly more diagnostic information. Code that pattern-matches on

## Benchmarks

Apple M-series (aarch64), `cargo bench --bench encode`, criterion, release
profile, single-threaded. Times are the criterion-reported median; throughput
is computed from the median.

### Encode

| size | `base85` time | `base85-simd` time | speedup | `base85-simd` throughput |
|-------|--------------:|-------------------:|--------:|-------------------------:|
| 16 B | 17.4 ns | 16.3 ns | 1.07× | ~940 MiB/s |
| 64 B | 33.0 ns | 22.6 ns | 1.46× | 2.71 GiB/s |
| 256 B | 115.7 ns | 73.2 ns | 1.58× | 3.26 GiB/s |
| 1 KiB | 378.3 ns | 225.1 ns | 1.68× | 4.24 GiB/s |
| 16 KiB| 5.56 µs | 3.60 µs | 1.55× | 4.24 GiB/s |
| 256 KiB| 89.3 µs | 55.4 µs | 1.61× | 4.41 GiB/s |
| 1 MiB | 356 µs | 222 µs | 1.61× | 4.40 GiB/s |

### Decode

| size | `base85` time | `base85-simd` time | speedup | `base85-simd` throughput |
|-------|--------------:|-------------------:|--------:|-------------------------:|
| 16 B | 32.4 ns | 14.8 ns | 2.18× | ~1.0 GiB/s |
| 64 B | 123.5 ns | 24.6 ns | 5.02× | 2.42 GiB/s |
| 256 B | 579 ns | 57.6 ns | 10.05× | 4.14 GiB/s |
| 1 KiB | 2.27 µs | 226 ns | 10.06× | 4.22 GiB/s |
| 16 KiB| 36.8 µs | 3.49 µs | 10.55× | 4.38 GiB/s |
| 256 KiB| 591 µs | 54.3 µs | 10.89× | 4.50 GiB/s |
| 1 MiB | 2.28 ms | 217.6 µs | 10.49× | 4.49 GiB/s |
`cargo bench --bench encode`, criterion, release profile, single-threaded.
Times are the criterion-reported median; throughput computed from it.
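
As a worked example of how throughput follows from the median time, the helper below divides a byte count by a duration and converts to GiB/s. `throughput_gib_per_s` is a hypothetical name used only for illustration. The sketch assumes the byte count is the size of the raw (unencoded) data for both directions, which is consistent with the 1 MiB aarch64 rows.

```rust
/// Convert a byte count and an elapsed time in seconds to GiB/s.
fn throughput_gib_per_s(bytes: u64, seconds: f64) -> f64 {
    const GIB: f64 = (1u64 << 30) as f64;
    bytes as f64 / seconds / GIB
}
```

With 1 MiB of raw data, a 222 µs median gives about 4.40 GiB/s (the aarch64 encode row) and a 217.6 µs median gives about 4.49 GiB/s (the aarch64 decode row).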

### aarch64 (Apple M-series)

#### Encode

| size | `base85` | `base85-simd` (NEON) | speedup | throughput |
|-------|---------:|---------------------:|--------:|-----------:|
| 16 B | 17.4 ns | 16.3 ns | 1.07× | ~940 MiB/s |
| 64 B | 33.0 ns | 22.6 ns | 1.46× | 2.71 GiB/s |
| 256 B | 115.7 ns | 73.2 ns | 1.58× | 3.26 GiB/s |
| 1 KiB | 378.3 ns | 225.1 ns | 1.68× | 4.24 GiB/s |
| 16 KiB| 5.56 µs | 3.60 µs | 1.55× | 4.24 GiB/s |
| 256 KiB| 89.3 µs | 55.4 µs | 1.61× | 4.41 GiB/s |
| 1 MiB | 356 µs | 222 µs | 1.61× | 4.40 GiB/s |

#### Decode

| size | `base85` | `base85-simd` (NEON) | speedup | throughput |
|-------|---------:|---------------------:|--------:|-----------:|
| 16 B | 32.4 ns | 14.8 ns | 2.18× | ~1.0 GiB/s |
| 64 B | 123.5 ns | 24.6 ns | 5.02× | 2.42 GiB/s |
| 256 B | 579 ns | 57.6 ns | 10.05× | 4.14 GiB/s |
| 1 KiB | 2.27 µs | 226 ns | 10.06× | 4.22 GiB/s |
| 16 KiB| 36.8 µs | 3.49 µs | 10.55× | 4.38 GiB/s |
| 256 KiB| 591 µs | 54.3 µs | 10.89× | 4.50 GiB/s |
| 1 MiB | 2.28 ms | 217.6 µs | 10.49× | 4.49 GiB/s |

### x86_64 (AMD EPYC 7763, Zen 3)

Numbers are from a GitHub-hosted Ubuntu runner on GitHub Actions. The
hardware is shared and virtualised, so noise is higher than on the
aarch64 machine (~5–15% variance), but the relative speedups are stable.

#### Encode

| size | `base85` | `base85-simd` (AVX2) | speedup | throughput |
|-------|---------:|---------------------:|--------:|-----------:|
| 16 B | 41.0 ns | 57.8 ns | 0.71× | (scalar fallback; chunk doesn't fit) |
| 64 B | 100.5 ns | 61.7 ns | 1.63× | 989 MiB/s |
| 256 B | 341.9 ns | 165.9 ns | 2.06× | 1.44 GiB/s |
| 1 KiB | 1.32 µs | 507.0 ns | 2.61× | 1.88 GiB/s |
| 16 KiB| 20.4 µs | 7.48 µs | 2.72× | 2.04 GiB/s |
| 256 KiB| 323.9 µs| 118.9 µs | 2.72× | 2.05 GiB/s |
| 1 MiB | 1.31 ms | 482.9 µs | 2.71× | 2.07 GiB/s |

#### Decode

| size | `base85` | `base85-simd` (AVX2) | speedup | throughput |
|-------|---------:|---------------------:|--------:|-----------:|
| 16 B | 70.2 ns | 61.2 ns | 1.15× | (scalar fallback) |
| 64 B | 244.8 ns | 67.6 ns | 3.62× | 903 MiB/s |
| 256 B | 1.046 µs | 164.4 ns | 6.36× | 1.45 GiB/s |
| 1 KiB | 4.19 µs | 466.5 ns | 8.98× | 2.04 GiB/s |
| 16 KiB| 65.7 µs | 6.75 µs | 9.73× | 2.26 GiB/s |
| 256 KiB| 1.058 ms| 107.2 µs | 9.86× | 2.28 GiB/s |
| 1 MiB | 4.14 ms | 431.6 µs | 9.59× | 2.32 GiB/s |

### Steady-state summary

At sizes large enough to amortise loop setup (≥ 256 B), `base85-simd`
sustains **~4.4 GiB/s** for both encode and decode on Apple M-series,
roughly **1.6× faster** than the reference for encode and **~10×
faster** for decode. The decode advantage comes from `TBL`-based ASCII
→ digit lookup that replaces the reference's per-character branchy
match; the encode advantage comes from 4-lane parallel `divmod 85`
plus `TBX`-based digit → ASCII / output permutation.
At sizes large enough to amortise the SIMD loop setup (≥ 256 B):

| arch / ISA | encode throughput | encode speedup | decode throughput | decode speedup |
|-------------|------------------:|---------------:|------------------:|---------------:|
| aarch64 NEON| 4.40 GiB/s | 1.61× | 4.49 GiB/s | 10.49× |
| x86_64 AVX2 | 2.07 GiB/s | 2.71× | 2.32 GiB/s | 9.59× |

The decode speedup ratio is roughly the same on both architectures
(~10×), driven by SIMD-accelerated ASCII → digit table lookup
replacing the reference's per-character branchy match. NEON sustains
roughly 2× the absolute throughput of AVX2 because its `vqtbl4q_u8`
does a 64-entry lookup in a single instruction, where x86 PSHUFB is
limited to 16 entries (so the lookup expands to ~6 PSHUFB+OR per
chunk on x86). AVX-512 VBMI's `vpermb` would close that gap but
isn't available on the AMD silicon used by GitHub's runner fleet.
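
To make the table-size limitation concrete, here is a scalar model of how a lookup wider than 16 entries can be assembled from 16-entry shuffles combined with OR. It is a sketch of the general technique, not the crate's AVX2 code: `pshufb_lane`, `split_into_subtables`, and `wide_lookup` are hypothetical names, each call models a single byte lane, and real SIMD code would compute the range mask with vector compares rather than a branch.

```rust
/// Scalar model of one byte lane of x86 `PSHUFB`: an index with its high
/// bit set yields 0, otherwise the low 4 bits select one of 16 entries.
fn pshufb_lane(table: &[u8; 16], index: u8) -> u8 {
    if index & 0x80 != 0 {
        0
    } else {
        table[(index & 0x0F) as usize]
    }
}

/// Split a flat lookup table into 16-entry sub-tables, zero-padding the last.
fn split_into_subtables(flat: &[u8]) -> Vec<[u8; 16]> {
    flat.chunks(16)
        .map(|chunk| {
            let mut sub = [0u8; 16];
            sub[..chunk.len()].copy_from_slice(chunk);
            sub
        })
        .collect()
}

/// Look up `index` in the concatenation of `subtables` using one modelled
/// shuffle per sub-table. Sub-tables that do not own the index are given
/// an index with the high bit set, so they contribute 0 to the OR.
fn wide_lookup(subtables: &[[u8; 16]], index: u8) -> u8 {
    let mut out = 0u8;
    for (k, sub) in subtables.iter().enumerate() {
        let owns_index = (index >> 4) as usize == k;
        let lane_index = if owns_index { index & 0x0F } else { 0x80 };
        out |= pshufb_lane(sub, lane_index);
    }
    out
}
```

A 96-entry table splits into six sub-tables, which would line up with the "~6 PSHUFB+OR" figure above, though whether the crate partitions its table exactly this way is an assumption. On NEON, a single `vqtbl4q_u8` covers 64 entries and also returns 0 for out-of-range indices.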

Reproduce with:
