
x86_64 AVX2 encode + decode #6

Merged: coderdan merged 7 commits into main from x86-simd, May 2, 2026
Conversation

@coderdan (Collaborator) commented May 1, 2026

Summary

Adds an AVX2-accelerated SIMD path for x86_64, parallel to the existing aarch64 NEON path. Each AVX2 iteration processes 8 base85 blocks (32 input bytes → 40 output chars on encode; 40 chars → 32 output bytes on decode) by running the SSE4.1 algorithm in parallel across both 128-bit lanes of an `__m256i`.

Hosts without AVX2 fall back to the scalar implementation via a one-shot `is_x86_feature_detected!("avx2")` check at the public API entry. The branch predicts perfectly across calls, so the cost is negligible.
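
A minimal sketch of that gate (helper names are hypothetical; the crate's real entry points are `encode_into` / `decode_into`):

```rust
pub fn encode(src: &[u8], dst: &mut [u8]) {
    #[cfg(target_arch = "x86_64")]
    if std::is_x86_feature_detected!("avx2") {
        // SAFETY: AVX2 availability was verified on this host just above.
        return unsafe { encode_avx2(src, dst) };
    }
    encode_scalar(src, dst) // portable fallback for non-AVX2 hosts
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn encode_avx2(_src: &[u8], _dst: &mut [u8]) { /* AVX2 body */ }

fn encode_scalar(_src: &[u8], _dst: &mut [u8]) { /* scalar body */ }
```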

Algorithm

Lane-restricted AVX2 byte/word ops (PSHUFB, BLEND, SHUFFLE_EPI32) work fine here because every step of the algorithm is per-lane anyway — each 128-bit half does an independent 4-block computation, equivalent to running the SSE4.1 algorithm twice in parallel within one register.

| Stage | NEON | x86 AVX2 |
| --- | --- | --- |
| Load 16 BE u32s per chunk | `vld1q_u8` + `vrev32q_u8` | `_mm256_loadu_si256` + `_mm256_shuffle_epi8` |
| Parallel-magic divides (q1, q2, q3 from n; q4 from q2) | `vmull_n_u32` chain | `_mm256_mul_epu32` chain |
| Five digits via mul-sub | `vmlsq_n_u32` | `_mm256_sub_epi32(..., _mm256_mullo_epi32(...))` |
| Splice 5 digits into output | `vqtbx1q_u8` chain | `_mm256_shuffle_epi8` + OR chain |
| Digit → ASCII (alphabet lookup) | one `vqtbl4q_u8` (64-entry) | six `_mm256_shuffle_epi8` (16-entry each) + chunk-index masks |
| Decode: char → digit | one `vqtbl4q_u8` | same 6-PSHUFB chunk pattern |
| Decode: overflow detect | unsigned `vcltq_u32` | bias + signed `_mm256_cmpgt_epi32` |
| Store output | `vst1q_u8` | per-lane `_mm_storeu_si128` (encode: non-contiguous between lanes) / `_mm256_storeu_si256` (decode: contiguous) |
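
For the load row, the x86 side is one unaligned load plus a per-lane byte-reversal PSHUFB. A minimal sketch (the index pattern below is the standard reverse-each-u32 shuffle, assumed to match the crate's `REV32_IDX`):

```rust
use core::arch::x86_64::*;

#[target_feature(enable = "avx2")]
unsafe fn load_be_u32s(src: *const u8) -> __m256i {
    unsafe {
        // PSHUFB shuffles within each 128-bit lane, so the 16-byte pattern
        // that reverses the bytes of each u32 is repeated for both halves.
        let rev32 = _mm256_setr_epi8(
            3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, // lane 0
            3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12, // lane 1
        );
        _mm256_shuffle_epi8(_mm256_loadu_si256(src.cast()), rev32)
    }
}
```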

Performance (GHA Ubuntu, AMD EPYC 7763)

Steady-state at ≥ 256 B:

| op | base85 ref | base85-simd (AVX2) | speedup |
| --- | --- | --- | --- |
| encode | 0.79 GiB/s | 2.07 GiB/s | 2.71× |
| decode | 0.25 GiB/s | 2.32 GiB/s | 9.59× |

Full per-size tables in the README. The decode ratio (~10×) matches what NEON gets on aarch64. Absolute throughput is roughly half of NEON (4.4 GiB/s) because PSHUFB is a 16-entry shuffle vs NEON's 64-entry TBL — AVX-512 VBMI's `vpermb` would close the gap but isn't on the runner fleet.

Small-input note: 16 B encode is slightly slower than the reference because AVX2's 32-byte chunk doesn't fit, so we hit the scalar fallback with one extra branch of overhead. Crossover happens around 64 B.

Notes

  • MSRV stays at Rust 1.85.
  • Inner `unsafe { ... }` blocks inside each `#[target_feature(enable = "avx2")]` `unsafe fn` are required on 1.85 (`unsafe_op_in_unsafe_fn` still applies to target-feature-gated intrinsics there). Newer toolchains may flag those blocks as redundant — module-level `#![allow(unused_unsafe)]` covers it.

Test plan

  • `cargo test --target x86_64-apple-darwin` (47 tests; Rosetta on this Mac doesn't expose AVX2 so this only validates the scalar fallback locally)
  • `cargo test` on aarch64 native (47 tests, NEON path)
  • `cargo +1.85 test --target x86_64-apple-darwin` (47 tests, MSRV)
  • `cargo clippy --all-targets -- -D warnings` (clean)
  • `cargo clippy --target x86_64-apple-darwin --all-targets -- -D warnings` (clean)
  • `cargo +1.85 clippy --target x86_64-apple-darwin --all-targets -- -D warnings` (clean)
  • `cargo check --target x86_64-unknown-linux-gnu --all-targets` (clean)
  • `cargo fmt --all -- --check` (clean)
  • CI matrix on this PR — all 9 checks green (this is the first real AVX2 validation on Ubuntu x86 hardware)
  • `gh workflow run bench --ref main -f ref=x86-simd` — round-trip + parity tests run inside criterion's bench setup

coderdan added 3 commits May 1, 2026 22:55
Stands up the cfg-gated module structure for an x86_64 SIMD
implementation, wired into encode_into / decode_into the same way
the aarch64 NEON path is. Bodies are scalar-fallback stubs for now
— the structural contract (function signatures, magic constants,
test harness) matches aarch64 so swapping in intrinsics one helper
at a time is safe.

# What's here

  src/ops/x86_64.rs       — div_85, div_85_sq, div_85_cube
                            (same magic constants as aarch64.rs;
                            currently scalar-implemented stubs).
                            Same quickcheck tests as the NEON path.

  src/ops/mod.rs          — a `cfg` arm exporting div_* on x86_64.

  src/block.rs::sse       — `SseEncoder` + `try_decode_block_x4`
                            stubs that delegate to the scalar
                            block functions. Each is annotated
                            with a comment block sketching the
                            target intrinsic chain (load, divmod,
                            shuffle, store).

  src/lib.rs              — encode_into / decode_into gain a
                            #[cfg(target_arch = "x86_64")] arm
                            mirroring the aarch64 one. The
                            scalar-fallback gate becomes
                            cfg(not(any(aarch64, x86_64))).
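
The cfg shape in src/lib.rs, sketched (the signature is hypothetical;
only the gating pattern is the point):

  #[cfg(target_arch = "aarch64")]
  pub fn encode_into(_src: &[u8], _out: &mut Vec<u8>) { /* NEON path */ }

  #[cfg(target_arch = "x86_64")]
  pub fn encode_into(_src: &[u8], _out: &mut Vec<u8>) { /* new x86 arm */ }

  #[cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))]
  pub fn encode_into(_src: &[u8], _out: &mut Vec<u8>) { /* scalar path */ }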

# How to develop x86 SIMD locally on Apple Silicon

Two checked-in workflows:

  rustup target add x86_64-apple-darwin
  cargo test   --target x86_64-apple-darwin    # runs under Rosetta
  cargo build  --target x86_64-apple-darwin    # cross-compile only
  cargo clippy --target x86_64-apple-darwin --all-targets -- -D warnings

Rosetta translates SSE/AVX→NEON dynamically, so it's good for
correctness but not for benchmarking. AVX-512 is unsupported.
For real x86 perf numbers, run `cargo bench` on the GitHub
Actions Ubuntu x86_64 runner (already in CI) or a cloud VM.

# How to fill in the SIMD bodies

Each stub has a numbered sketch of the intrinsic chain in its
docstring. The aarch64 → x86 mapping cheat-sheet lives at the top
of `src/ops/x86_64.rs`. The order I'd take is:

  1. `div_magic` in src/ops/x86_64.rs (the simplest function,
     just `_mm_mul_epu32` + `_mm_srli_epi32`). Quickcheck against
     the existing tests in that file.
  2. `SseEncoder::encode_block_x4` next, since the encoder's
     parallel-magic structure is straightforward to port.
  3. `try_decode_block_x4` last, with the same overflow-detection
     trick the NEON version uses.

# Verified locally

  cargo test                        — 47 tests pass on aarch64
  cargo test --target x86_64-apple-darwin  — 47 tests pass via Rosetta
  cargo clippy --all-targets -- -D warnings  — clean (aarch64)
  cargo clippy --target x86_64-apple-darwin --all-targets -- -D warnings  — clean
  cargo check --target x86_64-unknown-linux-gnu --all-targets  — clean
  cargo fmt --all -- --check        — clean

Replaces the scalar-fallback stubs in `src/block.rs::sse` and
`src/ops/x86_64.rs` with real SIMD bodies. Algorithm shape is
unchanged from the NEON path; only the intrinsics differ.

# div_magic (src/ops/x86_64.rs)

Lane-wise libdivide:

  m = _mm_set1_epi32(MAGIC)
  even = _mm_mul_epu32(input, m)             // touches lanes 0, 2
  odd  = _mm_mul_epu32(_mm_srli_epi64::<32>(input), m)  // lanes 1, 3
  pick the high u32 of each u64 product via _mm_shuffle_epi32::<0xF5>
  blend the two halves with _mm_blend_epi16::<0xCC>
  finally _mm_srli_epi32::<SHIFT>
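
As a Rust sketch of that sequence (the magic pair below is the
standard round-up constant for u32 division by 85, M = ceil(2^38 / 85)
= 0xC0C0_C0C1 with a post-shift of 6; it is assumed here rather than
copied from the crate):

  use core::arch::x86_64::*;

  const MAGIC: i32 = 0xC0C0_C0C1u32 as i32;

  #[target_feature(enable = "sse4.1,ssse3")]
  unsafe fn div_85(n: __m128i) -> __m128i {
      unsafe {
          let m = _mm_set1_epi32(MAGIC);
          let even = _mm_mul_epu32(n, m);                      // products for lanes 0, 2
          let odd = _mm_mul_epu32(_mm_srli_epi64::<32>(n), m); // products for lanes 1, 3
          let even_hi = _mm_shuffle_epi32::<0xF5>(even);       // high u32s into lanes 0, 2
          let hi = _mm_blend_epi16::<0xCC>(even_hi, odd);      // odd's highs already in lanes 1, 3
          _mm_srli_epi32::<6>(hi)                              // q = mulhi(n, M) >> 6
      }
  }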

# encode_block_x4 (src/block.rs::sse)

Same parallel-magic structure as NEON: q1, q2, q3 in parallel from n,
then q4 = q2 / 85². Five digits via lane-wise multiply-subtract
(_mm_mullo_epi32 + _mm_sub_epi32). Splice digits into out_1 (16 bytes)
+ out_2 (4 bytes) via PSHUFB+OR chains using the same byte-position
indexes as NEON. Finally byte_to_char85_x86 (digit → ASCII).
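
The multiply-subtract step in isolation (helper name hypothetical;
the lowest digit uses q = n / 85, higher digits apply the same shape
to successive quotients):

  use core::arch::x86_64::*;

  // digit_i = n_i - 85 * q_i in each u32 lane.
  #[target_feature(enable = "sse4.1,ssse3")]
  unsafe fn digit_mul_sub(n: __m128i, q: __m128i) -> __m128i {
      unsafe { _mm_sub_epi32(n, _mm_mullo_epi32(q, _mm_set1_epi32(85))) }
  }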

# byte_to_char85_x86 — corrected approach

Hit a real x86 gotcha here. PSHUFB only zeros its output when bit 7
of the index byte is set; indices in 16..127 produce
`table[index & 0x0F]`. The naive "subtract chunk-base, PSHUFB, OR"
pattern that works on NEON's TBX (which has merge semantics)
silently produces wrong bytes for most inputs (e.g. digit 84
returns t0[4]='4' from chunk 0 instead of 0).

Fix: split each digit into a 4-bit high-nibble (chunk index 0..5)
and a 4-bit low-nibble (entry within chunk). Run all 6 chunk
PSHUFBs unconditionally with the low nibble; mask each result with
`high_nib == N` (cmpeq) and OR. This costs ~22 ops vs NEON's
single-instruction vqtbl4q_u8, a cost inherent to PSHUFB having
only one 16-byte table source.
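
A minimal sketch of the fix (names and table layout are assumptions:
`chunks` holds six 16-byte slices of the 85-entry alphabet, the last
one padded):

  use core::arch::x86_64::*;

  #[target_feature(enable = "sse4.1,ssse3")]
  unsafe fn byte_to_char85(digits: __m128i, chunks: &[[u8; 16]; 6]) -> __m128i {
      unsafe {
          let lo = _mm_and_si128(digits, _mm_set1_epi8(0x0F)); // entry within chunk
          // No 8-bit shift exists; shift as u16 and re-mask to get digit >> 4.
          let hi = _mm_and_si128(_mm_srli_epi16::<4>(digits), _mm_set1_epi8(0x0F));
          let mut out = _mm_setzero_si128();
          for n in 0..6usize {
              let table = _mm_loadu_si128(chunks[n].as_ptr().cast());
              let sel = _mm_cmpeq_epi8(hi, _mm_set1_epi8(n as i8)); // 0xFF iff digit is in chunk n
              out = _mm_or_si128(out, _mm_and_si128(_mm_shuffle_epi8(table, lo), sel));
          }
          out
      }
  }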

# try_decode_block_x4 (src/block.rs::sse)

Mirrors NEON: load 16+4 chars, range-validate [33, 126] via cmpgt,
ASCII→digit via the same chunk_idx-selection trick (96-byte
CHAR_TO_85_PADDED table), permute to per-position vectors via two
PSHUFBs (main + tail) ORed together, Horner evaluation via mullo+add,
overflow detection by biasing and using signed cmpgt to emulate the
unsigned compare (SSE has no native unsigned cmplt), and a big-endian
store via PSHUFB + storeu.
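
The unsigned-compare emulation in isolation (a sketch; `limit` stands
in for the decoder's actual per-lane bound):

  use core::arch::x86_64::*;

  #[target_feature(enable = "sse4.1,ssse3")]
  unsafe fn any_lane_gt(values: __m128i, limit: u32) -> bool {
      unsafe {
          // XOR with 0x8000_0000 maps unsigned order onto signed order.
          let bias = _mm_set1_epi32(i32::MIN);
          let gt = _mm_cmpgt_epi32(
              _mm_xor_si128(values, bias),
              _mm_xor_si128(_mm_set1_epi32(limit as i32), bias),
          );
          _mm_movemask_epi8(gt) != 0
      }
  }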

# Target-feature gating

Every SIMD entry point carries
`#[target_feature(enable = "sse4.1,ssse3")]` so the file compiles
on any `x86_64-*-*` target regardless of default target-features
(`x86_64-unknown-linux-gnu` defaults to just SSE2). The callers in
`encode_into` / `decode_into` do a one-shot
`std::is_x86_feature_detected!` check at the entry point and route
to a scalar fallback (`encode_into_scalar` / `decode_into_scalar`)
for hosts without SSE4.1+SSSE3. The branch predicts perfectly so
per-call cost is negligible.

# Verified

  cargo test                                  — 47 tests pass (aarch64 native)
  cargo test --target x86_64-apple-darwin     — 47 tests pass (Rosetta)
  cargo clippy --all-targets -- -D warnings   — clean (aarch64)
  cargo clippy --target x86_64-apple-darwin
    --all-targets -- -D warnings              — clean (x86 mac)
  cargo check --target x86_64-unknown-linux-gnu --all-targets — clean
  cargo fmt --all -- --check                  — clean

Rosetta dynamically retargets SSE→NEON on Apple Silicon, so it
validates correctness only — the perf numbers it produces are not
representative of native x86. CI's Ubuntu x86_64 runner will give
the first real perf data point.

Rust 1.85 still applies `unsafe_op_in_unsafe_fn` to target-feature-
gated intrinsics inside an `unsafe fn`, so calls to `_mm_set1_epi32`
etc. need explicit `unsafe { }` blocks even though the surrounding
function is itself `unsafe fn` with matching `target_feature`. A
later toolchain change relaxes that requirement and instead flags
the same blocks as `unused_unsafe`.

The fix:
  - Re-add the inner `unsafe { ... }` blocks around each SSE function
    body so 1.85 accepts them.
  - Add `#![allow(unused_unsafe)]` at the module level for both
    `src/ops/x86_64.rs` and the `src/block.rs::sse` submodule so
    newer toolchains don't break CI on the spurious warning.
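
The resulting pattern, sketched (function and constant are
illustrative, not the crate's actual code):

  // At the top of src/ops/x86_64.rs and the sse module:
  #![allow(unused_unsafe)] // newer toolchains see the inner block as redundant

  use core::arch::x86_64::*;

  #[target_feature(enable = "sse4.1,ssse3")]
  unsafe fn splat_85() -> __m128i {
      // Rust 1.85 still applies unsafe_op_in_unsafe_fn to this intrinsic
      // call, so the inner block is required there.
      unsafe { _mm_set1_epi32(85) }
  }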

Verified locally:
  cargo +1.85 test --target x86_64-apple-darwin   — 47 tests pass
  cargo +1.85 clippy --target x86_64-apple-darwin
    --all-targets -- -D warnings                  — clean
  cargo clippy --target x86_64-apple-darwin
    --all-targets -- -D warnings                  — clean (stable)
  cargo test                                      — 47 tests pass (native)
coderdan added 2 commits May 2, 2026 14:16
The x86 SIMD path now uses AVX2 throughout, processing 8 blocks per
call (vs the previous SSE4.1 path's 4). Both 128-bit lanes of each
__m256i register independently run the algorithm we developed for
SSE4.1, doubling per-call throughput. AVX2's lane-restricted
byte/word ops (PSHUFB, BLEND, SHUFFLE_EPI32) work fine because every
step of the algorithm is per-lane anyway.

# Why drop SSE4.1 entirely

User direction: "if a target doesn't have avx2 it can fall back to
the soft impl". AVX2 has been universal on x86 server hardware since
Haswell (2013) and the bench workflow now guards on its presence
(PR #8). Maintaining a third tier (AVX2 → SSE4.1 → scalar) wasn't
worth the duplicated code.

# What changed

src/ops/x86_64.rs:
  - div_85, div_85_sq, div_85_cube, div_magic now take/return __m256i.
  - target_feature attribute changed from "sse4.1,ssse3" to "avx2".

src/block.rs::sse → src/block.rs::avx2:
  - SseEncoder → Avx2Encoder; encode_block_x4 → encode_block_x8
    (16-byte → 32-byte input, 20-byte → 40-byte output).
  - try_decode_block_x4 → try_decode_block_x8 (20-char → 40-char in,
    16-byte → 32-byte out).
  - Static index tables stay 16 bytes; loaded via _mm_loadu_si128 +
    _mm256_broadcastsi128_si256 to populate both 128-bit lanes.
  - Encoder output is non-contiguous between lanes (lane 0 → out[0..20],
    lane 1 → out[20..40]) so we extract each 128-bit lane and the
    two 4-byte tails separately. Decoder output is 32 contiguous
    bytes so a single _mm256_storeu_si256 works.
  - Decoder input is non-contiguous (chars 0..20 + 20..40 with a gap
    at the tail boundary of each side); two _mm_loadu_si128s are
    combined with _mm256_set_m128i. Both load shapes are sketched
    after this list.

src/lib.rs:
  - encode_into / decode_into runtime gate switches from
    is_x86_feature_detected!("sse4.1") && ...("ssse3") to just
    ...("avx2"). Falls back to the existing scalar paths otherwise.
  - Loop strides updated for the new chunk sizes (32→40 encode,
    40→32 decode).
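
The two load shapes from the block.rs changes, sketched with
hypothetical buffer names (`table16` is one static 16-byte index
table, `chars` the 40-char encoded input):

  use core::arch::x86_64::*;

  #[target_feature(enable = "avx2")]
  unsafe fn load_shapes(table16: &[u8; 16], chars: &[u8; 40]) -> (__m256i, __m256i) {
      unsafe {
          // One 16-byte table replicated into both 128-bit lanes:
          let idx = _mm256_broadcastsi128_si256(_mm_loadu_si128(table16.as_ptr().cast()));
          // Two non-contiguous 16-byte char groups fused into one register:
          let lo = _mm_loadu_si128(chars.as_ptr().cast());         // chars 0..16
          let hi = _mm_loadu_si128(chars.as_ptr().add(20).cast()); // chars 20..36
          (idx, _mm256_set_m128i(hi, lo))
      }
  }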

# Verified locally

  cargo test --target x86_64-apple-darwin   — 47 tests pass (Rosetta;
                                              hits the scalar fallback
                                              because Rosetta on this
                                              Mac doesn't expose AVX2)
  cargo test                                — 47 tests (aarch64 native)
  cargo +1.85 test --target x86_64-apple-darwin — 47 tests
  cargo clippy --target x86_64-apple-darwin
    --all-targets -- -D warnings            — clean
  cargo +1.85 clippy --target x86_64-apple-darwin
    --all-targets -- -D warnings            — clean
  cargo check --target x86_64-unknown-linux-gnu --all-targets — clean
  cargo fmt --all -- --check                — clean

CI on Ubuntu x86_64 (AMD EPYC 7763, has AVX2) will be the first run
that actually exercises the new AVX2 code paths end-to-end.

Splits the benchmarks section into per-architecture tables (aarch64
NEON / x86_64 AVX2) with the steady-state numbers from the GHA
Ubuntu runner. Also rewords the top-of-README description so it
mentions both SIMD targets up front instead of just NEON.

Steady-state at ≥ 256 B:

  aarch64 NEON: 4.40 GiB/s encode (1.61× ref), 4.49 GiB/s decode (10.5× ref)
  x86_64 AVX2:  2.07 GiB/s encode (2.71× ref), 2.32 GiB/s decode (9.6× ref)

NEON sustains roughly 2× the absolute throughput of AVX2 because its
`vqtbl4q_u8` does a 64-entry table lookup in one instruction, where
PSHUFB is limited to 16 entries (so the alphabet lookup expands to
~6 PSHUFB+OR per chunk on x86). AVX-512 VBMI's `vpermb` would close
that gap but isn't on the GHA fleet.
coderdan changed the title from "Scaffold x86_64 SSE encode/decode (WIP)" to "x86_64 AVX2 encode + decode" May 2, 2026
coderdan marked this pull request as ready for review May 2, 2026 04:39
coderdan added 2 commits May 2, 2026 18:49
…olidate scalar paths, hoist shared tables

- AVX2 module + ops::x86_64 use `#![allow(unsafe_op_in_unsafe_fn)]` at
  module scope, dropping every inner `unsafe { … }` block. Same code
  compiles on Rust 1.85 (MSRV) and on toolchains that would otherwise
  flag the wrappers as unused. Mirrors the comment thread on PR #6.
- Delete duplicated quickcheck tests in `ops::aarch64` and `ops::x86_64`.
  `tests/base85_parity.rs` round-trips through these on every iteration,
  and the dedicated tests added zero unique coverage (and required a
  runtime AVX2 skip that we'd rather not have).
- Collapse `encode_into_scalar` / `decode_into_scalar` (x86 fallback) and
  the `not(any(aarch64, x86_64))` `encode_into` / `decode_into` into a
  single `encode_scalar` / `decode_scalar` gated on `not(aarch64)`. The
  non-aarch64-non-x86_64 arm becomes a one-line wrapper.
- Hoist `IDX_OUT_1`, `IDX_OUT_2`, and `TAIL_VALID_MASK` to file scope in
  `block.rs` so NEON and AVX2 share one definition each. `REV32_IDX`
  and `IDX_DEC` stay in `mod avx2` — their shape is x86-specific (NEON
  uses `vrev32q_u8` and a 32-byte `vqtbl2q_u8` source instead).

No behavioural change. Verified: cargo fmt, cargo clippy --all-targets,
cargo test (24 lib + 18 parity + 1 doctest) on stable + 1.85, both
aarch64-apple-darwin and x86_64-apple-darwin.

- Shorten the `#![allow(unsafe_op_in_unsafe_fn)]` rationale in mod avx2
  and in src/ops/x86_64.rs to one sentence — the original prose read as
  change-history narration.
- Drop the "tests removed because…" tombstones at the bottom of
  src/ops/{aarch64,x86_64}.rs.
- Drop the explanatory comment on REV32_IDX in mod avx2; the location
  already says "x86 only".
- Drop redundant doc comments on encode_scalar / decode_scalar that
  just restated the cfg gate.
- Fix stale `crate::encode_into_scalar / crate::decode_into_scalar`
  cross-reference in block.rs (now `encode_scalar / decode_scalar`).
coderdan merged commit 4138aa8 into main May 2, 2026
9 checks passed