
x86_64 AVX2 encode + decode #6

Merged: coderdan merged 7 commits into main from x86-simd, May 2, 2026
Conversation

@coderdan (Collaborator) commented May 1, 2026

Summary

Adds an AVX2-accelerated SIMD path for x86_64, parallel to the existing aarch64 NEON path. Each AVX2 iteration processes 8 base85 blocks (32 input bytes → 40 output chars on encode; 40 chars → 32 output bytes on decode) by running the SSE4.1 algorithm in parallel across both 128-bit lanes of an `__m256i`.

Hosts without AVX2 fall back to the scalar implementation via a one-shot `is_x86_feature_detected!("avx2")` check at the public API entry. The branch predicts perfectly across calls, so the cost is negligible.
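
A minimal sketch of that gate (helper names are hypothetical; the crate's real entry points are `encode_into` / `decode_into`):

```rust
pub fn encode(src: &[u8], dst: &mut [u8]) {
    #[cfg(target_arch = "x86_64")]
    if std::is_x86_feature_detected!("avx2") {
        // SAFETY: AVX2 availability was verified on this host just above.
        return unsafe { encode_avx2(src, dst) };
    }
    encode_scalar(src, dst) // portable fallback for non-AVX2 hosts
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn encode_avx2(_src: &[u8], _dst: &mut [u8]) { /* AVX2 body */ }

fn encode_scalar(_src: &[u8], _dst: &mut [u8]) { /* scalar body */ }
```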

Algorithm

Lane-restricted AVX2 byte/word ops (PSHUFB, BLEND, SHUFFLE_EPI32) work fine here because every step of the algorithm is per-lane anyway — each 128-bit half does an independent 4-block computation, equivalent to running the SSE4.1 algorithm twice in parallel within one register.

| Stage | NEON | x86 AVX2 |
| --- | --- | --- |
| Load 16 BE u32s per chunk | `vld1q_u8` + `vrev32q_u8` | `_mm256_loadu_si256` + `_mm256_shuffle_epi8` |
| Parallel-magic divides (q1, q2, q3 from n; q4 from q2) | `vmull_n_u32` chain | `_mm256_mul_epu32` chain |
| Five digits via mul-sub | `vmlsq_n_u32` | `_mm256_sub_epi32(..., _mm256_mullo_epi32(...))` |
| Splice 5 digits into output | `vqtbx1q_u8` chain | `_mm256_shuffle_epi8` + OR chain |
| Digit → ASCII (alphabet lookup) | one `vqtbl4q_u8` (64-entry) | six `_mm256_shuffle_epi8` (16-entry each) + chunk-index masks |
| Decode: char → digit | one `vqtbl4q_u8` | same 6-PSHUFB chunk pattern |
| Decode: overflow detect | unsigned `vcltq_u32` | bias + signed `_mm256_cmpgt_epi32` |
| Store output | `vst1q_u8` | per-lane `_mm_storeu_si128` (encode: non-contiguous between lanes) / `_mm256_storeu_si256` (decode: contiguous) |
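
For the load row, the x86 side is one unaligned load plus a per-lane byte-reversal PSHUFB. A minimal sketch (the index pattern below is the standard reverse-each-u32 shuffle, assumed to match the crate's `REV32_IDX`):

```rust
use core::arch::x86_64::*;

#[target_feature(enable = "avx2")]
unsafe fn load_be_u32s(src: *const u8) -> __m256i {
    unsafe {
        // PSHUFB shuffles within each 128-bit lane, so the 16-byte pattern
        // that reverses the bytes of each u32 is repeated for both halves.
        let rev32 = _mm256_setr_epi8(
            3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, // lane 0
            3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, // lane 1
        );
        _mm256_shuffle_epi8(_mm256_loadu_si256(src.cast()), rev32)
    }
}
```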

Performance (GHA Ubuntu, AMD EPYC 7763)

Steady-state at ≥ 256 B:

| op | base85 ref | base85-simd (AVX2) | speedup |
| --- | --- | --- | --- |
| encode | 0.79 GiB/s | 2.07 GiB/s | 2.71× |
| decode | 0.25 GiB/s | 2.32 GiB/s | 9.59× |

Full per-size tables in the README. The decode ratio (~10×) matches what NEON gets on aarch64. Absolute throughput is roughly half of NEON (4.4 GiB/s) because PSHUFB is a 16-entry shuffle vs NEON's 64-entry TBL — AVX-512 VBMI's `vpermb` would close the gap but isn't on the runner fleet.

Small-input note: 16 B encode is slightly slower than the reference because AVX2's 32-byte chunk doesn't fit, so we hit the scalar fallback with one extra branch of overhead. Crossover happens around 64 B.

Notes

  • MSRV stays at Rust 1.85.
  • Inner `unsafe { ... }` blocks inside each `#[target_feature(enable = "avx2")]` `unsafe fn` are required on 1.85 (`unsafe_op_in_unsafe_fn` still applies to target-feature-gated intrinsics there). Newer toolchains may flag those blocks as redundant — module-level `#![allow(unused_unsafe)]` covers it.

Test plan

  • `cargo test --target x86_64-apple-darwin` (47 tests; Rosetta on this Mac doesn't expose AVX2 so this only validates the scalar fallback locally)
  • `cargo test` on aarch64 native (47 tests, NEON path)
  • `cargo +1.85 test --target x86_64-apple-darwin` (47 tests, MSRV)
  • `cargo clippy --all-targets -- -D warnings` (clean)
  • `cargo clippy --target x86_64-apple-darwin --all-targets -- -D warnings` (clean)
  • `cargo +1.85 clippy --target x86_64-apple-darwin --all-targets -- -D warnings` (clean)
  • `cargo check --target x86_64-unknown-linux-gnu --all-targets` (clean)
  • `cargo fmt --all -- --check` (clean)
  • CI matrix on this PR — all 9 checks green (this is the first real AVX2 validation on Ubuntu x86 hardware)
  • `gh workflow run bench --ref main -f ref=x86-simd` — round-trip + parity tests run inside criterion's bench setup

coderdan added 3 commits May 1, 2026 22:55
Stands up the cfg-gated module structure for an x86_64 SIMD
implementation, wired into encode_into / decode_into the same way
the aarch64 NEON path is. Bodies are scalar-fallback stubs for now
— the structural contract (function signatures, magic constants,
test harness) matches aarch64 so swapping in intrinsics one helper
at a time is safe.

# What's here

  src/ops/x86_64.rs       — div_85, div_85_sq, div_85_cube
                            (same magic constants as aarch64.rs;
                            currently scalar-implemented stubs).
                            Same quickcheck tests as the NEON path.

  src/ops/mod.rs          — a `cfg` arm exporting div_* on x86_64.

  src/block.rs::sse       — `SseEncoder` + `try_decode_block_x4`
                            stubs that delegate to the scalar
                            block functions. Each is annotated
                            with a comment block sketching the
                            target intrinsic chain (load, divmod,
                            shuffle, store).

  src/lib.rs              — encode_into / decode_into gain a
                            #[cfg(target_arch = "x86_64")] arm
                            mirroring the aarch64 one. The
                            scalar-fallback gate becomes
                            cfg(not(any(aarch64, x86_64))).
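
The cfg shape in src/lib.rs, sketched (the signature is hypothetical;
only the gating pattern is the point):

  #[cfg(target_arch = "aarch64")]
  pub fn encode_into(_src: &[u8], _out: &mut Vec<u8>) { /* NEON path */ }

  #[cfg(target_arch = "x86_64")]
  pub fn encode_into(_src: &[u8], _out: &mut Vec<u8>) { /* new x86 arm */ }

  #[cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))]
  pub fn encode_into(_src: &[u8], _out: &mut Vec<u8>) { /* scalar path */ }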

# How to develop x86 SIMD locally on Apple Silicon

Two checked-in workflows:

  rustup target add x86_64-apple-darwin
  cargo test   --target x86_64-apple-darwin    # runs under Rosetta
  cargo build  --target x86_64-apple-darwin    # cross-compile only
  cargo clippy --target x86_64-apple-darwin --all-targets -- -D warnings

Rosetta translates SSE/AVX→NEON dynamically, so it's good for
correctness but not for benchmarking. AVX-512 is unsupported.
For real x86 perf numbers, run `cargo bench` on the GitHub
Actions Ubuntu x86_64 runner (already in CI) or a cloud VM.

# How to fill in the SIMD bodies

Each stub has a numbered sketch of the intrinsic chain in its
docstring. The aarch64 → x86 mapping cheat-sheet lives at the top
of `src/ops/x86_64.rs`. The order I'd take is:

  1. `div_magic` in src/ops/x86_64.rs (the simplest function,
     just `_mm_mul_epu32` + `_mm_srli_epi32`). Quickcheck against
     the existing tests in that file.
  2. `SseEncoder::encode_block_x4` next, since the encoder's
     parallel-magic structure is straightforward to port.
  3. `try_decode_block_x4` last, with the same overflow-detection
     trick the NEON version uses.

# Verified locally

  cargo test                        — 47 tests pass on aarch64
  cargo test --target x86_64-apple-darwin  — 47 tests pass via Rosetta
  cargo clippy --all-targets -- -D warnings  — clean (aarch64)
  cargo clippy --target x86_64-apple-darwin --all-targets -- -D warnings  — clean
  cargo check --target x86_64-unknown-linux-gnu --all-targets  — clean
  cargo fmt --all -- --check        — clean

Replaces the scalar-fallback stubs in `src/block.rs::sse` and
`src/ops/x86_64.rs` with real SIMD bodies. Algorithm shape is
unchanged from the NEON path; only the intrinsics differ.

# div_magic (src/ops/x86_64.rs)

Lane-wise libdivide:

  m = _mm_set1_epi32(MAGIC)
  even = _mm_mul_epu32(input, m)             // touches lanes 0, 2
  odd  = _mm_mul_epu32(_mm_srli_epi64::<32>(input), m)  // lanes 1, 3
  pick the high u32 of each u64 product via _mm_shuffle_epi32::<0xF5>
  blend the two halves with _mm_blend_epi16::<0xCC>
  finally _mm_srli_epi32::<SHIFT>
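
As a Rust sketch of that sequence (the magic pair below is the
standard round-up constant for u32 division by 85, M = ceil(2^38 / 85)
= 0xC0C0_C0C1 with a post-shift of 6; it is assumed here rather than
copied from the crate):

  use core::arch::x86_64::*;

  const MAGIC: i32 = 0xC0C0_C0C1u32 as i32;

  #[target_feature(enable = "sse4.1,ssse3")]
  unsafe fn div_85(n: __m128i) -> __m128i {
      unsafe {
          let m = _mm_set1_epi32(MAGIC);
          let even = _mm_mul_epu32(n, m);                      // products for lanes 0, 2
          let odd = _mm_mul_epu32(_mm_srli_epi64::<32>(n), m); // products for lanes 1, 3
          let even_hi = _mm_shuffle_epi32::<0xF5>(even);       // high u32s into lanes 0, 2
          let hi = _mm_blend_epi16::<0xCC>(even_hi, odd);      // odd's highs already in lanes 1, 3
          _mm_srli_epi32::<6>(hi)                              // q = mulhi(n, M) >> 6
      }
  }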

# encode_block_x4 (src/block.rs::sse)

Same parallel-magic structure as NEON: q1, q2, q3 in parallel from n,
then q4 = q2 / 85². Five digits via lane-wise multiply-subtract
(_mm_mullo_epi32 + _mm_sub_epi32). Splice digits into out_1 (16 bytes)
+ out_2 (4 bytes) via PSHUFB+OR chains using the same byte-position
indexes as NEON. Finally byte_to_char85_x86 (digit → ASCII).
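
The multiply-subtract step in isolation (helper name hypothetical;
the lowest digit uses q = n / 85, higher digits apply the same shape
to successive quotients):

  use core::arch::x86_64::*;

  // digit_i = n_i - 85 * q_i in each u32 lane.
  #[target_feature(enable = "sse4.1,ssse3")]
  unsafe fn digit_mul_sub(n: __m128i, q: __m128i) -> __m128i {
      unsafe { _mm_sub_epi32(n, _mm_mullo_epi32(q, _mm_set1_epi32(85))) }
  }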

# byte_to_char85_x86 — corrected approach

Hit a real x86 gotcha here. PSHUFB only zeros its output when bit 7
of the index byte is set; indices in 16..127 produce
`table[index & 0x0F]`. The naive "subtract chunk-base, PSHUFB, OR"
pattern that works on NEON's TBX (which has merge semantics)
silently produces wrong bytes for most inputs (e.g. digit 84
returns t0[4]='4' from chunk 0 instead of 0).

Fix: split each digit into a 4-bit high-nibble (chunk index 0..5)
and a 4-bit low-nibble (entry within chunk). Run all 6 chunk
PSHUFBs unconditionally with the low nibble; mask each result with
`high_nib == N` (cmpeq) and OR. This costs ~22 ops vs NEON's
single-instruction vqtbl4q_u8, a cost inherent to PSHUFB having
only one 16-byte table source.
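
A minimal sketch of the fix (names and table layout are assumptions:
`chunks` holds six 16-byte slices of the 85-entry alphabet, the last
one padded):

  use core::arch::x86_64::*;

  #[target_feature(enable = "sse4.1,ssse3")]
  unsafe fn byte_to_char85(digits: __m128i, chunks: &[[u8; 16]; 6]) -> __m128i {
      unsafe {
          let lo = _mm_and_si128(digits, _mm_set1_epi8(0x0F)); // entry within chunk
          // No 8-bit shift exists; shift as u16 and re-mask to get digit >> 4.
          let hi = _mm_and_si128(_mm_srli_epi16::<4>(digits), _mm_set1_epi8(0x0F));
          let mut out = _mm_setzero_si128();
          for n in 0..6usize {
              let table = _mm_loadu_si128(chunks[n].as_ptr().cast());
              let sel = _mm_cmpeq_epi8(hi, _mm_set1_epi8(n as i8)); // 0xFF iff digit is in chunk n
              out = _mm_or_si128(out, _mm_and_si128(_mm_shuffle_epi8(table, lo), sel));
          }
          out
      }
  }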

# try_decode_block_x4 (src/block.rs::sse)

Mirrors NEON: load 16+4 chars, range-validate [33, 126] via cmpgt,
ASCII→digit via the same chunk_idx-selection trick (96-byte
CHAR_TO_85_PADDED table), permute to per-position vectors via two
PSHUFBs (main + tail) ORed together, Horner evaluation via mullo+add,
overflow detection by biasing and using signed cmpgt to emulate the
unsigned compare (SSE has no native unsigned cmplt), and a big-endian
store via PSHUFB + storeu.
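
The unsigned-compare emulation in isolation (a sketch; `limit` stands
in for the decoder's actual per-lane bound):

  use core::arch::x86_64::*;

  #[target_feature(enable = "sse4.1,ssse3")]
  unsafe fn any_lane_gt(values: __m128i, limit: u32) -> bool {
      unsafe {
          // XOR with 0x8000_0000 maps unsigned order onto signed order.
          let bias = _mm_set1_epi32(i32::MIN);
          let gt = _mm_cmpgt_epi32(
              _mm_xor_si128(values, bias),
              _mm_xor_si128(_mm_set1_epi32(limit as i32), bias),
          );
          _mm_movemask_epi8(gt) != 0
      }
  }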

# Target-feature gating

Every SIMD entry point carries
`#[target_feature(enable = "sse4.1,ssse3")]` so the file compiles
on any `x86_64-*-*` target regardless of default target-features
(`x86_64-unknown-linux-gnu` defaults to just SSE2). The callers in
`encode_into` / `decode_into` do a one-shot
`std::is_x86_feature_detected!` check at the entry point and route
to a scalar fallback (`encode_into_scalar` / `decode_into_scalar`)
for hosts without SSE4.1+SSSE3. The branch predicts perfectly so
per-call cost is negligible.

# Verified

  cargo test                                  — 47 tests pass (aarch64 native)
  cargo test --target x86_64-apple-darwin     — 47 tests pass (Rosetta)
  cargo clippy --all-targets -- -D warnings   — clean (aarch64)
  cargo clippy --target x86_64-apple-darwin
    --all-targets -- -D warnings              — clean (x86 mac)
  cargo check --target x86_64-unknown-linux-gnu --all-targets — clean
  cargo fmt --all -- --check                  — clean

Rosetta dynamically retargets SSE→NEON on Apple Silicon, so it
validates correctness only — the perf numbers it produces are not
representative of native x86. CI's Ubuntu x86_64 runner will give
the first real perf data point.

Rust 1.85 still applies `unsafe_op_in_unsafe_fn` to target-feature-
gated intrinsics inside an `unsafe fn`, so calls to `_mm_set1_epi32`
etc. need explicit `unsafe { }` blocks even though the surrounding
function is itself `unsafe fn` with matching `target_feature`. A
later toolchain change relaxes that requirement and instead flags
the same blocks as `unused_unsafe`.

The fix:
  - Re-add the inner `unsafe { ... }` blocks around each SSE function
    body so 1.85 accepts them.
  - Add `#![allow(unused_unsafe)]` at the module level for both
    `src/ops/x86_64.rs` and the `src/block.rs::sse` submodule so
    newer toolchains don't break CI on the spurious warning.
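
The resulting pattern, sketched (function and constant are
illustrative, not the crate's actual code):

  // At the top of src/ops/x86_64.rs and the sse module:
  #![allow(unused_unsafe)] // newer toolchains see the inner block as redundant

  use core::arch::x86_64::*;

  #[target_feature(enable = "sse4.1,ssse3")]
  unsafe fn splat_85() -> __m128i {
      // Rust 1.85 still applies unsafe_op_in_unsafe_fn to this intrinsic
      // call, so the inner block is required there.
      unsafe { _mm_set1_epi32(85) }
  }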

Verified locally:
  cargo +1.85 test --target x86_64-apple-darwin   — 47 tests pass
  cargo +1.85 clippy --target x86_64-apple-darwin
    --all-targets -- -D warnings                  — clean
  cargo clippy --target x86_64-apple-darwin
    --all-targets -- -D warnings                  — clean (stable)
  cargo test                                      — 47 tests pass (native)
coderdan added 2 commits May 2, 2026 14:16
The x86 SIMD path now uses AVX2 throughout, processing 8 blocks per
call (vs the previous SSE4.1 path's 4). Both 128-bit lanes of each
__m256i register independently run the algorithm we developed for
SSE4.1, doubling per-call throughput. AVX2's lane-restricted
byte/word ops (PSHUFB, BLEND, SHUFFLE_EPI32) work fine because every
step of the algorithm is per-lane anyway.

# Why drop SSE4.1 entirely

User direction: "if a target doesn't have avx2 it can fall back to
the soft impl". AVX2 has been universal on x86 server hardware since
Haswell (2013) and the bench workflow now guards on its presence
(PR #8). Maintaining a third tier (AVX2 → SSE4.1 → scalar) wasn't
worth the duplicated code.

# What changed

src/ops/x86_64.rs:
  - div_85, div_85_sq, div_85_cube, div_magic now take/return __m256i.
  - target_feature attribute changed from "sse4.1,ssse3" to "avx2".

src/block.rs::sse → src/block.rs::avx2:
  - SseEncoder → Avx2Encoder; encode_block_x4 → encode_block_x8
    (16-byte → 32-byte input, 20-byte → 40-byte output).
  - try_decode_block_x4 → try_decode_block_x8 (20-char → 40-char in,
    16-byte → 32-byte out).
  - Static index tables stay 16 bytes; loaded via _mm_loadu_si128 +
    _mm256_broadcastsi128_si256 to populate both 128-bit lanes.
  - Encoder output is non-contiguous between lanes (lane 0 → out[0..20],
    lane 1 → out[20..40]) so we extract each 128-bit lane and the
    two 4-byte tails separately. Decoder output is 32 contiguous
    bytes so a single _mm256_storeu_si256 works.
  - Decoder input is non-contiguous (chars 0..20 + 20..40 with a gap
    at the tail boundary of each side); two _mm_loadu_si128s are
    combined with _mm256_set_m128i. Both load shapes are sketched
    after this list.

src/lib.rs:
  - encode_into / decode_into runtime gate switches from
    is_x86_feature_detected!("sse4.1") && ...("ssse3") to just
    ...("avx2"). Falls back to the existing scalar paths otherwise.
  - Loop strides updated for the new chunk sizes (32→40 encode,
    40→32 decode).
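
The two load shapes from the block.rs changes, sketched with
hypothetical buffer names (`table16` is one static 16-byte index
table, `chars` the 40-char encoded input):

  use core::arch::x86_64::*;

  #[target_feature(enable = "avx2")]
  unsafe fn load_shapes(table16: &[u8; 16], chars: &[u8; 40]) -> (__m256i, __m256i) {
      unsafe {
          // One 16-byte table replicated into both 128-bit lanes:
          let idx = _mm256_broadcastsi128_si256(_mm_loadu_si128(table16.as_ptr().cast()));
          // Two non-contiguous 16-byte char groups fused into one register:
          let lo = _mm_loadu_si128(chars.as_ptr().cast());         // chars 0..16
          let hi = _mm_loadu_si128(chars.as_ptr().add(20).cast()); // chars 20..36
          (idx, _mm256_set_m128i(hi, lo))
      }
  }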

# Verified locally

  cargo test --target x86_64-apple-darwin   — 47 tests pass (Rosetta;
                                              hits the scalar fallback
                                              because Rosetta on this
                                              Mac doesn't expose AVX2)
  cargo test                                — 47 tests (aarch64 native)
  cargo +1.85 test --target x86_64-apple-darwin — 47 tests
  cargo clippy --target x86_64-apple-darwin
    --all-targets -- -D warnings            — clean
  cargo +1.85 clippy --target x86_64-apple-darwin
    --all-targets -- -D warnings            — clean
  cargo check --target x86_64-unknown-linux-gnu --all-targets — clean
  cargo fmt --all -- --check                — clean

CI on Ubuntu x86_64 (AMD EPYC 7763, has AVX2) will be the first run
that actually exercises the new AVX2 code paths end-to-end.

Splits the benchmarks section into per-architecture tables (aarch64
NEON / x86_64 AVX2) with the steady-state numbers from the GHA
Ubuntu runner. Also rewords the top-of-README description so it
mentions both SIMD targets up front instead of just NEON.

Steady-state at ≥ 256 B:

  aarch64 NEON: 4.40 GiB/s encode (1.61× ref), 4.49 GiB/s decode (10.5× ref)
  x86_64 AVX2:  2.07 GiB/s encode (2.71× ref), 2.32 GiB/s decode (9.6× ref)

NEON sustains roughly 2× the absolute throughput of AVX2 because its
`vqtbl4q_u8` does a 64-entry table lookup in one instruction, where
PSHUFB is limited to 16 entries (so the alphabet lookup expands to
~6 PSHUFB+OR per chunk on x86). AVX-512 VBMI's `vpermb` would close
that gap but isn't on the GHA fleet.
coderdan changed the title from "Scaffold x86_64 SSE encode/decode (WIP)" to "x86_64 AVX2 encode + decode" May 2, 2026
coderdan marked this pull request as ready for review May 2, 2026 04:39
coderdan added 2 commits May 2, 2026 18:49
…olidate scalar paths, hoist shared tables

- AVX2 module + ops::x86_64 use `#![allow(unsafe_op_in_unsafe_fn)]` at
  module scope, dropping every inner `unsafe { … }` block. Same code
  compiles on Rust 1.85 (MSRV) and on toolchains that would otherwise
  flag the wrappers as unused. Mirrors the comment thread on PR #6.
- Delete duplicated quickcheck tests in `ops::aarch64` and `ops::x86_64`.
  `tests/base85_parity.rs` round-trips through these on every iteration,
  and the dedicated tests added zero unique coverage (and required a
  runtime AVX2 skip that we'd rather not have).
- Collapse `encode_into_scalar` / `decode_into_scalar` (x86 fallback) and
  the `not(any(aarch64, x86_64))` `encode_into` / `decode_into` into a
  single `encode_scalar` / `decode_scalar` gated on `not(aarch64)`. The
  non-aarch64-non-x86_64 arm becomes a one-line wrapper.
- Hoist `IDX_OUT_1`, `IDX_OUT_2`, and `TAIL_VALID_MASK` to file scope in
  `block.rs` so NEON and AVX2 share one definition each. `REV32_IDX`
  and `IDX_DEC` stay in `mod avx2` — their shape is x86-specific (NEON
  uses `vrev32q_u8` and a 32-byte `vqtbl2q_u8` source instead).

No behavioural change. Verified: cargo fmt, cargo clippy --all-targets,
cargo test (24 lib + 18 parity + 1 doctest) on stable + 1.85, both
aarch64-apple-darwin and x86_64-apple-darwin.

- Shorten the `#![allow(unsafe_op_in_unsafe_fn)]` rationale in mod avx2
  and in src/ops/x86_64.rs to one sentence — the original prose read as
  change-history narration.
- Drop the "tests removed because…" tombstones at the bottom of
  src/ops/{aarch64,x86_64}.rs.
- Drop the explanatory comment on REV32_IDX in mod avx2; the location
  already says "x86 only".
- Drop redundant doc comments on encode_scalar / decode_scalar that
  just restated the cfg gate.
- Fix stale `crate::encode_into_scalar / crate::decode_into_scalar`
  cross-reference in block.rs (now `encode_scalar / decode_scalar`).
coderdan merged commit 4138aa8 into main May 2, 2026
9 checks passed