cipherstash · coderdan · May 2, 2026 · May 1, 2026
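
Note for reviewers: the README text in this diff describes per-architecture dispatch (NEON always on aarch64, AVX2 probed at runtime on x86_64, scalar otherwise) with 4 and 8 blocks per iteration. A minimal sketch of that shape follows. It is an illustration under stated assumptions, not the crate's code: `Backend`, `select_backend`, `avx2_available`, `blocks_per_iter` and `simd_prefix_len` are hypothetical names, and the 4-byte input block is the usual Base85 encode block size, assumed here. The only real API used is the standard library's `is_x86_feature_detected!` macro.

```rust
// Hypothetical sketch; none of these names come from the crate itself.

/// Input bytes per Base85 block on the encode side (4 bytes -> 5 chars).
const BLOCK_BYTES: usize = 4;

/// Code path a call would take on the current host.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Neon,
    Avx2,
    Scalar,
}

/// AVX2 has to be probed at runtime on x86_64.
#[cfg(target_arch = "x86_64")]
fn avx2_available() -> bool {
    std::arch::is_x86_feature_detected!("avx2")
}

/// On every other architecture there is nothing to probe.
#[cfg(not(target_arch = "x86_64"))]
fn avx2_available() -> bool {
    false
}

/// NEON is baseline on aarch64, AVX2 is used when detected, else scalar.
fn select_backend() -> Backend {
    if cfg!(target_arch = "aarch64") {
        Backend::Neon
    } else if avx2_available() {
        Backend::Avx2
    } else {
        Backend::Scalar
    }
}

/// Blocks consumed per SIMD iteration, as stated in the README.
fn blocks_per_iter(backend: Backend) -> usize {
    match backend {
        Backend::Neon => 4,
        Backend::Avx2 => 8,
        Backend::Scalar => 0,
    }
}

/// Bytes of an encode input covered by whole SIMD iterations, assuming
/// the remainder is handed to the scalar path.
fn simd_prefix_len(input_len: usize, backend: Backend) -> usize {
    let chunk = blocks_per_iter(backend) * BLOCK_BYTES;
    if chunk == 0 {
        0
    } else {
        input_len / chunk * chunk
    }
}
```

Under these assumptions a 16-byte input fills one NEON iteration but is shorter than a single 32-byte AVX2 iteration, which is consistent with the "scalar fallback" entries in the 16 B rows of the x86_64 tables.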
diff --git a/README.md b/README.md
@@ -2,10 +2,11 @@
 
 Fast Base85 (RFC 1924 / Z85-style) encoder and decoder for Rust.
 
-On `aarch64` the encode path uses NEON intrinsics to process 16 input bytes
-per iteration; the decode path and the fallback for other architectures use
-a portable scalar implementation. Output is byte-for-byte compatible with
-the [`base85`](https://crates.io/crates/base85) crate.
+SIMD-accelerated on aarch64 (NEON, 4 blocks per iteration) and x86_64
+(AVX2, 8 blocks per iteration), with a portable scalar fallback for
+everything else and for x86_64 hosts lacking AVX2 (rare on server
+hardware after ~2013). Output is byte-for-byte compatible with the
+[`base85`](https://crates.io/crates/base85) crate.
 
 ## Usage
 
@@ -33,7 +34,10 @@ for decoding.
 ## Status
 
 - Public API: `encode(&[u8]) -> String`, `decode(&str) -> Result<Vec<u8>, DecodeError>`.
-- Encode and decode are both NEON-accelerated on `aarch64`, scalar elsewhere.
+- aarch64: NEON-accelerated (4 blocks at a time, always available on aarch64).
+- x86_64 with AVX2: AVX2-accelerated (8 blocks at a time). The feature is
+  detected at runtime at the public API entry; hosts without AVX2 use scalar.
+- Other architectures: portable scalar implementation.
 - The decode path validates char range and detects `u32` overflow lane-wise;
   any invalid input falls back to the scalar path so the resulting
   `DecodeError` carries a precise byte position.
@@ -74,43 +78,82 @@ strictly more diagnostic information. Code that pattern-matches on
 
 ## Benchmarks
 
-Apple M-series (aarch64), `cargo bench --bench encode`, criterion, release
-profile, single-threaded. Times are the criterion-reported median; throughput
-is computed from the median.
-
-### Encode
-
-| size  | `base85` time | `base85-simd` time | speedup | `base85-simd` throughput |
-|-------|--------------:|-------------------:|--------:|-------------------------:|
-| 16 B  | 17.4 ns       | 16.3 ns            | 1.07×   | ~940 MiB/s               |
-| 64 B  | 33.0 ns       | 22.6 ns            | 1.46×   | 2.71 GiB/s               |
-| 256 B | 115.7 ns      | 73.2 ns            | 1.58×   | 3.26 GiB/s               |
-| 1 KiB | 378.3 ns      | 225.1 ns           | 1.68×   | 4.24 GiB/s               |
-| 16 KiB| 5.56 µs       | 3.60 µs            | 1.55×   | 4.24 GiB/s               |
-| 256 KiB| 89.3 µs      | 55.4 µs            | 1.61×   | 4.41 GiB/s               |
-| 1 MiB | 356 µs        | 222 µs             | 1.61×   | 4.40 GiB/s               |
-
-### Decode
-
-| size  | `base85` time | `base85-simd` time | speedup | `base85-simd` throughput |
-|-------|--------------:|-------------------:|--------:|-------------------------:|
-| 16 B  | 32.4 ns       | 14.8 ns            | 2.18×   | ~1.0 GiB/s               |
-| 64 B  | 123.5 ns      | 24.6 ns            | 5.02×   | 2.42 GiB/s               |
-| 256 B | 579 ns        | 57.6 ns            | 10.05×  | 4.14 GiB/s               |
-| 1 KiB | 2.27 µs       | 226 ns             | 10.06×  | 4.22 GiB/s               |
-| 16 KiB| 36.8 µs       | 3.49 µs            | 10.55×  | 4.38 GiB/s               |
-| 256 KiB| 591 µs       | 54.3 µs            | 10.89×  | 4.50 GiB/s               |
-| 1 MiB | 2.28 ms       | 217.6 µs           | 10.49×  | 4.49 GiB/s               |
+`cargo bench --bench encode`, criterion, release profile, single-threaded.
+Times are the criterion-reported median; throughput is computed from that median.
+
+### aarch64 (Apple M-series)
+
+#### Encode
+
+| size  | `base85` | `base85-simd` (NEON) | speedup | throughput |
+|-------|---------:|---------------------:|--------:|-----------:|
+| 16 B  | 17.4 ns  | 16.3 ns              | 1.07×   | ~940 MiB/s |
+| 64 B  | 33.0 ns  | 22.6 ns              | 1.46×   | 2.71 GiB/s |
+| 256 B | 115.7 ns | 73.2 ns              | 1.58×   | 3.26 GiB/s |
+| 1 KiB | 378.3 ns | 225.1 ns             | 1.68×   | 4.24 GiB/s |
+| 16 KiB| 5.56 µs  | 3.60 µs              | 1.55×   | 4.24 GiB/s |
+| 256 KiB| 89.3 µs | 55.4 µs              | 1.61×   | 4.41 GiB/s |
+| 1 MiB | 356 µs   | 222 µs               | 1.61×   | 4.40 GiB/s |
+
+#### Decode
+
+| size  | `base85` | `base85-simd` (NEON) | speedup | throughput |
+|-------|---------:|---------------------:|--------:|-----------:|
+| 16 B  | 32.4 ns  | 14.8 ns              | 2.18×   | ~1.0 GiB/s |
+| 64 B  | 123.5 ns | 24.6 ns              | 5.02×   | 2.42 GiB/s |
+| 256 B | 579 ns   | 57.6 ns              | 10.05×  | 4.14 GiB/s |
+| 1 KiB | 2.27 µs  | 226 ns               | 10.06×  | 4.22 GiB/s |
+| 16 KiB| 36.8 µs  | 3.49 µs              | 10.55×  | 4.38 GiB/s |
+| 256 KiB| 591 µs  | 54.3 µs              | 10.89×  | 4.50 GiB/s |
+| 1 MiB | 2.28 ms  | 217.6 µs             | 10.49×  | 4.49 GiB/s |
+
+### x86_64 (AMD EPYC 7763, Zen 3)
+
+Numbers come from a GitHub Actions hosted Ubuntu runner. That is shared,
+virtualised hardware, so noise is higher than in the aarch64 runs
+(~5–15% variance), but the relative speedups are stable.
+
+#### Encode
+
+| size  | `base85` | `base85-simd` (AVX2) | speedup | throughput |
+|-------|---------:|---------------------:|--------:|-----------:|
+| 16 B  | 41.0 ns  | 57.8 ns              | 0.71×   | (scalar fallback; chunk doesn't fit) |
+| 64 B  | 100.5 ns | 61.7 ns              | 1.63×   | 989 MiB/s  |
+| 256 B | 341.9 ns | 165.9 ns             | 2.06×   | 1.44 GiB/s |
+| 1 KiB | 1.32 µs  | 507.0 ns             | 2.61×   | 1.88 GiB/s |
+| 16 KiB| 20.4 µs  | 7.48 µs              | 2.72×   | 2.04 GiB/s |
+| 256 KiB| 323.9 µs| 118.9 µs             | 2.72×   | 2.05 GiB/s |
+| 1 MiB | 1.31 ms  | 482.9 µs             | 2.71×   | 2.07 GiB/s |
+
+#### Decode
+
+| size  | `base85` | `base85-simd` (AVX2) | speedup | throughput |
+|-------|---------:|---------------------:|--------:|-----------:|
+| 16 B  | 70.2 ns  | 61.2 ns              | 1.15×   | (scalar fallback) |
+| 64 B  | 244.8 ns | 67.6 ns              | 3.62×   | 903 MiB/s  |
+| 256 B | 1.046 µs | 164.4 ns             | 6.36×   | 1.45 GiB/s |
+| 1 KiB | 4.19 µs  | 466.5 ns             | 8.98×   | 2.04 GiB/s |
+| 16 KiB| 65.7 µs  | 6.75 µs              | 9.73×   | 2.26 GiB/s |
+| 256 KiB| 1.058 ms| 107.2 µs             | 9.86×   | 2.28 GiB/s |
+| 1 MiB | 4.14 ms  | 431.6 µs             | 9.59×   | 2.32 GiB/s |
 
 ### Steady-state summary
 
-At sizes large enough to amortise loop setup (≥ 256 B), `base85-simd`
-sustains **~4.4 GiB/s** for both encode and decode on Apple M-series,
-roughly **1.6× faster** than the reference for encode and **~10×
-faster** for decode. The decode advantage comes from `TBL`-based ASCII
-→ digit lookup that replaces the reference's per-character branchy
-match; the encode advantage comes from 4-lane parallel `divmod 85`
-plus `TBX`-based digit → ASCII / output permutation.
+At sizes large enough to amortise the SIMD loop setup (≥ 256 B); figures are the 1 MiB rows:
+
+| arch / ISA  | encode throughput | encode speedup | decode throughput | decode speedup |
+|-------------|------------------:|---------------:|------------------:|---------------:|
+| aarch64 NEON| 4.40 GiB/s        | 1.61×          | 4.49 GiB/s        | 10.49×         |
+| x86_64 AVX2 | 2.07 GiB/s        | 2.71×          | 2.32 GiB/s        | 9.59×          |
+
+The decode speedup is roughly the same on both architectures (~10×),
+driven by a SIMD ASCII → digit table lookup that replaces the
+reference's branchy per-character match. The NEON numbers are roughly
+2× the AVX2 numbers in absolute throughput. The two machines differ, so
+this is not a like-for-like comparison, but the lookup contributes:
+`vqtbl4q_u8` does a 64-entry lookup in one instruction, while x86 PSHUFB
+handles 16 entries (~6 PSHUFB+OR per chunk). AVX-512 VBMI's `vpermb` would
+close that gap but isn't available on the Zen 3 EPYC 7763 used for these runs.
 
 Reproduce with: