Stands up the cfg-gated module structure for an x86_64 SIMD
implementation, wired into encode_into / decode_into the same way
the aarch64 NEON path is. Bodies are scalar-fallback stubs for now
— the structural contract (function signatures, magic constants,
test harness) matches aarch64 so swapping in intrinsics one helper
at a time is safe.
# What's here
- src/ops/x86_64.rs — div_85, div_85_sq, div_85_cube (same magic
  constants as aarch64.rs; currently scalar-implemented stubs). Same
  quickcheck tests as the NEON path.
- src/ops/mod.rs — cfg arm exporting div_* on x86_64.
- src/block.rs::sse — `SseEncoder` + `try_decode_block_x4` stubs that
  delegate to the scalar block functions. Each is annotated with a
  comment block sketching the target intrinsic chain (load, divmod,
  shuffle, store).
- src/lib.rs — encode_into / decode_into gain a
  #[cfg(target_arch = "x86_64")] arm mirroring the aarch64 one
  (sketched below). The scalar-fallback gate becomes
  cfg(not(any(aarch64, x86_64))).
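A minimal sketch of that gating shape (the per-arch helper names here are placeholders, not the crate's real identifiers):

```rust
// Placeholder helpers standing in for the real per-arch bodies.
#[cfg(target_arch = "aarch64")]
fn encode_arch(_input: &[u8], _out: &mut Vec<u8>) { /* existing NEON path */ }

#[cfg(target_arch = "x86_64")]
fn encode_arch(_input: &[u8], _out: &mut Vec<u8>) { /* new stubs from this PR */ }

#[cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))]
fn encode_arch(_input: &[u8], _out: &mut Vec<u8>) { /* scalar fallback */ }

pub fn encode_into(input: &[u8], out: &mut Vec<u8>) {
    encode_arch(input, out)
}
```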
# How to develop x86 SIMD locally on Apple Silicon
Two checked-in workflows:
rustup target add x86_64-apple-darwin    # one-time setup
cargo test --target x86_64-apple-darwin  # workflow 1: runs under Rosetta
cargo build --target x86_64-apple-darwin # workflow 2: cross-compile only
cargo clippy --target x86_64-apple-darwin --all-targets -- -D warnings
Rosetta translates SSE/AVX→NEON dynamically, so it's good for
correctness but not for benchmarking. AVX-512 is unsupported.
For real x86 perf numbers, run `cargo bench` on the GitHub
Actions Ubuntu x86_64 runner (already in CI) or a cloud VM.
# How to fill in the SIMD bodies
Each stub has a numbered sketch of the intrinsic chain in its
docstring. The aarch64 → x86 mapping cheat-sheet lives at the top
of `src/ops/x86_64.rs`. The order I'd take is:
1. `div_magic` in src/ops/x86_64.rs (the simplest function,
just `_mm_mul_epu32` + `_mm_srli_epi32`). Quickcheck against
the existing tests in that file.
2. `SseEncoder::encode_block_x4` next, since the encoder's
parallel-magic structure is straightforward to port.
3. `try_decode_block_x4` last, with the same overflow-detection
trick the NEON version uses.
# Verified locally
cargo test — 47 tests pass on aarch64
cargo test --target x86_64-apple-darwin — 47 tests pass via Rosetta
cargo clippy --all-targets -- -D warnings — clean (aarch64)
cargo clippy --target x86_64-apple-darwin --all-targets -- -D warnings — clean
cargo check --target x86_64-unknown-linux-gnu --all-targets — clean
cargo fmt --all -- --check — clean
Replaces the scalar-fallback stubs in `src/block.rs::sse` and
`src/ops/x86_64.rs` with real SIMD bodies. Algorithm shape is
unchanged from the NEON path; only the intrinsics differ.
# div_magic (src/ops/x86_64.rs)
Lane-wise libdivide:
m = _mm_set1_epi32(MAGIC)
even = _mm_mul_epu32(input, m) // touches lanes 0, 2
odd = _mm_mul_epu32(_mm_srli_epi64::<32>(input), m) // lanes 1, 3
pick the high u32 of each u64 product via _mm_shuffle_epi32::<0xF5>
blend the two halves with _mm_blend_epi16::<0xCC>
finally _mm_srli_epi32::<SHIFT>
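In Rust that chain looks roughly like this — a sketch, with MAGIC and SHIFT standing in for the per-divisor constants (the real file wraps this for div_85, div_85_sq, div_85_cube):

```rust
use std::arch::x86_64::*;

/// Lane-wise u32 division by a constant, libdivide-style (sketch).
#[target_feature(enable = "sse4.1,ssse3")]
unsafe fn div_magic<const MAGIC: i32, const SHIFT: i32>(input: __m128i) -> __m128i {
    unsafe {
        let m = _mm_set1_epi32(MAGIC);
        // _mm_mul_epu32 multiplies only the even u32 lanes (0 and 2),
        // widening each product to u64.
        let even = _mm_mul_epu32(input, m);
        let odd = _mm_mul_epu32(_mm_srli_epi64::<32>(input), m);
        // 0xF5 copies the high u32 of each u64 product into lanes 0 and 2.
        let even_hi = _mm_shuffle_epi32::<0xF5>(even);
        // The odd products' high halves already sit in lanes 1 and 3;
        // 0xCC blends those dwords over the even results.
        let hi = _mm_blend_epi16::<0xCC>(even_hi, odd);
        _mm_srli_epi32::<SHIFT>(hi)
    }
}
```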
# encode_block_x4 (src/block.rs::sse)
Same parallel-magic structure as NEON: q1, q2, q3 in parallel from n,
then q4 = q2 / 85². Five digits via lane-wise multiply-subtract
(_mm_mullo_epi32 + _mm_sub_epi32). Splice digits into out_1 (16 bytes)
+ out_2 (4 bytes) via PSHUFB+OR chains using the same byte-position
indexes as NEON. Finally byte_to_char85_x86 (digit → ASCII).
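The multiply-subtract digit step in isolation (a sketch; `rem_85` is a made-up name):

```rust
use std::arch::x86_64::*;

/// digit = n - q * 85 per u32 lane, where q = n / 85 from div_85.
#[target_feature(enable = "sse4.1")]
unsafe fn rem_85(n: __m128i, q: __m128i) -> __m128i {
    unsafe { _mm_sub_epi32(n, _mm_mullo_epi32(q, _mm_set1_epi32(85))) }
}
```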
# byte_to_char85_x86 — corrected approach
Hit a real x86 gotcha here. PSHUFB only zeros its output when bit 7
of the index byte is set; indices in 16..127 produce
`table[index & 0x0F]`. The naive "subtract chunk-base, PSHUFB, OR"
pattern that works on NEON's TBX (which has merge semantics)
silently produces wrong bytes for most inputs (e.g. digit 84
looks up t0[84 & 0x0F] = t0[4] = '4' from chunk 0 instead of
producing the 0 byte the OR-merge relies on).
Fix: split each digit into a 4-bit high-nibble (chunk index 0..5)
and a 4-bit low-nibble (entry within chunk). Run all 6 chunk
PSHUFBs unconditionally with the low nibble; mask each result with
`high_nib == N` (cmpeq) and OR. This costs ~22 ops vs NEON's
single-instruction vqtbl4q_u8; the overhead is inherent to x86,
since PSHUFB reads only one 16-byte source.
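Sketched below, with ALPHABET_CHUNKS as a hypothetical padded copy of the 85-char alphabet split into six 16-byte chunks (placeholder data here; the real table lives in the file):

```rust
use std::arch::x86_64::*;

// Hypothetical table: alphabet[0..85] padded out to 6 x 16 bytes.
static ALPHABET_CHUNKS: [[u8; 16]; 6] = [[0; 16]; 6]; // placeholder data

/// digit -> ASCII, one base-85 digit (0..85) per byte lane (sketch).
#[target_feature(enable = "sse4.1,ssse3")]
unsafe fn byte_to_char85_x86(digits: __m128i) -> __m128i {
    unsafe {
        // Low nibble: entry within a chunk. High nibble: chunk index 0..5.
        let lo = _mm_and_si128(digits, _mm_set1_epi8(0x0F));
        let hi = _mm_and_si128(_mm_srli_epi16::<4>(digits), _mm_set1_epi8(0x0F));
        let mut acc = _mm_setzero_si128();
        for n in 0..6 {
            let table = _mm_loadu_si128(ALPHABET_CHUNKS[n].as_ptr() as *const __m128i);
            let looked_up = _mm_shuffle_epi8(table, lo);
            // Keep this chunk's bytes only in lanes whose high nibble is n.
            let keep = _mm_cmpeq_epi8(hi, _mm_set1_epi8(n as i8));
            acc = _mm_or_si128(acc, _mm_and_si128(looked_up, keep));
        }
        acc
    }
}
```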
# try_decode_block_x4 (src/block.rs::sse)
Mirrors NEON: load 16+4 chars, range-validate [33, 126] via cmpgt,
ASCII→digit via the same chunk_idx-selection trick (96-byte
CHAR_TO_85_PADDED table), permute to per-position vectors via two
PSHUFBs (main + tail) ORed together, Horner via mullo+add, overflow
detection via the bias-and-signed-cmpgt unsigned-cmplt trick
(SSE has no native unsigned cmplt), big-endian store via PSHUFB +
storeu.
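The unsigned-compare shim behind that trick, in isolation (a sketch; `cmpgt_epu32` is a made-up helper name):

```rust
use std::arch::x86_64::*;

/// Per-lane unsigned u32 `a > b`. SSE compares are signed only, so
/// flip the sign bit of both operands first (the bias trick).
fn cmpgt_epu32(a: __m128i, b: __m128i) -> __m128i {
    // SSE2 is baseline on x86_64, so no feature gate is needed here.
    unsafe {
        let bias = _mm_set1_epi32(i32::MIN); // 0x8000_0000
        _mm_cmpgt_epi32(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias))
    }
}
```

Unsigned less-than is the same helper with the arguments swapped.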
# Target-feature gating
Every SIMD entry point carries
`#[target_feature(enable = "sse4.1,ssse3")]` so the file compiles
on any `x86_64-*-*` target regardless of default target-features
(`x86_64-unknown-linux-gnu` defaults to just SSE2). The callers in
`encode_into` / `decode_into` do a one-shot
`std::is_x86_feature_detected!` check at the entry point and route
to a scalar fallback (`encode_into_scalar` / `decode_into_scalar`)
for hosts without SSE4.1+SSSE3. The branch predicts perfectly so
per-call cost is negligible.
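The entry-point shape, sketched (`encode_into_sse` is a placeholder name; `encode_into_scalar` is the fallback named above):

```rust
#[cfg(target_arch = "x86_64")]
pub fn encode_into(input: &[u8], out: &mut Vec<u8>) {
    if std::is_x86_feature_detected!("sse4.1") && std::is_x86_feature_detected!("ssse3") {
        // Safety: the required target features were just verified.
        unsafe { encode_into_sse(input, out) }
    } else {
        encode_into_scalar(input, out)
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1,ssse3")]
unsafe fn encode_into_sse(_input: &[u8], _out: &mut Vec<u8>) { /* SIMD body */ }

fn encode_into_scalar(_input: &[u8], _out: &mut Vec<u8>) { /* scalar body */ }
```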
# Verified
cargo test — 47 tests pass (aarch64 native)
cargo test --target x86_64-apple-darwin — 47 tests pass (Rosetta)
cargo clippy --all-targets -- -D warnings — clean (aarch64)
cargo clippy --target x86_64-apple-darwin
--all-targets -- -D warnings — clean (x86 mac)
cargo check --target x86_64-unknown-linux-gnu --all-targets — clean
cargo fmt --all -- --check — clean
Rosetta dynamically retargets SSE→NEON on Apple Silicon, so it
validates correctness only — the perf numbers it produces are not
representative of native x86. CI's Ubuntu x86_64 runner will give
the first real perf data point.
Rust 1.85 still applies `unsafe_op_in_unsafe_fn` to target-feature-
gated intrinsics inside an `unsafe fn`, so calls to `_mm_set1_epi32`
etc. need explicit `unsafe { }` blocks even though the surrounding
function is itself `unsafe fn` with matching `target_feature`. A
later toolchain change relaxes that requirement and instead flags
the same blocks as `unused_unsafe`.
The fix:
- Re-add the inner `unsafe { ... }` blocks around each SSE function
body so 1.85 accepts them.
- Add `#![allow(unused_unsafe)]` at the module level for both
`src/ops/x86_64.rs` and the `src/block.rs::sse` submodule so
newer toolchains don't break CI on the spurious warning.
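The resulting shape at the top of each module (`splat` is a made-up helper for illustration):

```rust
#![allow(unused_unsafe)] // newer toolchains flag the inner blocks below

use std::arch::x86_64::*;

#[target_feature(enable = "sse4.1,ssse3")]
unsafe fn splat(x: i32) -> __m128i {
    // Rust 1.85 requires this inner block even though the surrounding
    // fn is `unsafe` with a matching target_feature.
    unsafe { _mm_set1_epi32(x) }
}
```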
Verified locally:
cargo +1.85 test --target x86_64-apple-darwin — 47 tests pass
cargo +1.85 clippy --target x86_64-apple-darwin
--all-targets -- -D warnings — clean
cargo clippy --target x86_64-apple-darwin
--all-targets -- -D warnings — clean (stable)
cargo test — 47 tests pass (native)
The x86 SIMD path now uses AVX2 throughout, processing 8 blocks per call (vs the previous SSE4.1 path's 4). Both 128-bit lanes of each __m256i register independently run the algorithm we developed for SSE4.1, doubling per-call throughput. AVX2's lane-restricted byte/word ops (PSHUFB, BLEND, SHUFFLE_EPI32) work fine because every step of the algorithm is per-lane anyway.
# Why drop SSE4.1 entirely
User direction: "if a target doesn't have avx2 it can fall back to the soft impl". AVX2 has been universal on x86 server hardware since Haswell (2013), and the bench workflow now guards on its presence (PR #8). Maintaining a third tier (AVX2 → SSE4.1 → scalar) wasn't worth the duplicated code.
# What changed
src/ops/x86_64.rs:
- div_85, div_85_sq, div_85_cube, div_magic now take/return __m256i.
- target_feature attribute changed from "sse4.1,ssse3" to "avx2".
src/block.rs::sse → src/block.rs::avx2:
- SseEncoder → Avx2Encoder; encode_block_x4 → encode_block_x8 (16-byte → 32-byte input, 20-byte → 40-byte output).
- try_decode_block_x4 → try_decode_block_x8 (20-char → 40-char in, 16-byte → 32-byte out).
- Static index tables stay 16 bytes; loaded via _mm_loadu_si128 + _mm256_broadcastsi128_si256 to populate both 128-bit lanes (sketched below).
- Encoder output is non-contiguous between lanes (lane 0 → out[0..20], lane 1 → out[20..40]), so we extract each 128-bit lane and the two 4-byte tails separately. Decoder output is 32 contiguous bytes, so a single _mm256_storeu_si256 works.
- Decoder input is non-contiguous (chars 0..20 + 20..40, with a gap at the tail boundary of each side); two _mm_loadu_si128 loads are combined with _mm256_set_m128i.
src/lib.rs:
- encode_into / decode_into runtime gate switches from is_x86_feature_detected!("sse4.1") && ...("ssse3") to just ...("avx2"). Falls back to the existing scalar paths otherwise.
- Loop strides updated for the new chunk sizes (32→40 encode, 40→32 decode).
# Verified locally
cargo test --target x86_64-apple-darwin — 47 tests pass (Rosetta; hits the scalar fallback because Rosetta on this Mac doesn't expose AVX2)
cargo test — 47 tests (aarch64 native)
cargo +1.85 test --target x86_64-apple-darwin — 47 tests
cargo clippy --target x86_64-apple-darwin --all-targets -- -D warnings — clean
cargo +1.85 clippy --target x86_64-apple-darwin --all-targets -- -D warnings — clean
cargo check --target x86_64-unknown-linux-gnu --all-targets — clean
cargo fmt --all -- --check — clean
CI on Ubuntu x86_64 (AMD EPYC 7763, has AVX2) will be the first run that actually exercises the new AVX2 code paths end-to-end.
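The two load shapes described above, sketched (array sizes and the 20-byte offset illustrate the layout; the 4-byte tails are loaded and merged separately, per the bullets):

```rust
use std::arch::x86_64::*;

/// Broadcast a 16-byte static index table into both 128-bit lanes.
#[target_feature(enable = "avx2")]
unsafe fn broadcast_table(table: &[u8; 16]) -> __m256i {
    unsafe {
        let t = _mm_loadu_si128(table.as_ptr() as *const __m128i);
        _mm256_broadcastsi128_si256(t)
    }
}

/// Combine the decoder's two non-contiguous 16-byte loads,
/// one per 20-char half of the 40-char input.
#[target_feature(enable = "avx2")]
unsafe fn load_decode_input(chars: &[u8; 40]) -> __m256i {
    unsafe {
        let lo = _mm_loadu_si128(chars.as_ptr() as *const __m128i);
        let hi = _mm_loadu_si128(chars.as_ptr().add(20) as *const __m128i);
        _mm256_set_m128i(hi, lo) // hi -> lane 1, lo -> lane 0
    }
}
```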
Splits the benchmarks section into per-architecture tables (aarch64 NEON / x86_64 AVX2) with the steady-state numbers from the GHA Ubuntu runner. Also rewords the top-of-README description so it mentions both SIMD targets up front instead of just NEON.
Steady-state at ≥ 256 B:
- aarch64 NEON: 4.40 GiB/s encode (1.61× ref), 4.49 GiB/s decode (10.5× ref)
- x86_64 AVX2: 2.07 GiB/s encode (2.71× ref), 2.32 GiB/s decode (9.6× ref)
NEON sustains roughly 2× the absolute throughput of AVX2 because its `vqtbl4q_u8` does a 64-entry table lookup in one instruction, where PSHUFB is limited to 16 entries (so the alphabet lookup expands to ~6 PSHUFB+OR steps per chunk on x86). AVX-512 VBMI's `vpermb` would close that gap but isn't on the GHA fleet.
Consolidate scalar paths, hoist shared tables
- AVX2 module + ops::x86_64 use `#![allow(unsafe_op_in_unsafe_fn)]` at
module scope, dropping every inner `unsafe { … }` block. Same code
compiles on Rust 1.85 (MSRV) and on toolchains that would otherwise
flag the wrappers as unused. Mirrors the comment thread on PR #6.
- Delete duplicated quickcheck tests in `ops::aarch64` and `ops::x86_64`.
`tests/base85_parity.rs` round-trips through these on every iteration,
and the dedicated tests added zero unique coverage (and required a
runtime AVX2 skip that we'd rather not have).
- Collapse `encode_into_scalar` / `decode_into_scalar` (x86 fallback) and
  the `not(any(aarch64, x86_64))` `encode_into` / `decode_into` into a
  single `encode_scalar` / `decode_scalar` gated on `not(aarch64)`. The
  non-aarch64-non-x86_64 arm becomes a one-line wrapper (see the sketch
  after this list).
- Hoist `IDX_OUT_1`, `IDX_OUT_2`, and `TAIL_VALID_MASK` to file scope in
`block.rs` so NEON and AVX2 share one definition each. `REV32_IDX`
and `IDX_DEC` stay in `mod avx2` — their shape is x86-specific (NEON
uses `vrev32q_u8` and a 32-byte `vqtbl2q_u8` source instead).
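The consolidated gating, sketched (bodies elided; shape only):

```rust
// One scalar implementation shared by the x86_64 runtime fallback
// and the no-SIMD architectures.
#[cfg(not(target_arch = "aarch64"))]
fn encode_scalar(_input: &[u8], _out: &mut Vec<u8>) { /* scalar body */ }

// On targets that are neither aarch64 nor x86_64, the public entry
// point is the one-line wrapper mentioned above.
#[cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))]
pub fn encode_into(input: &[u8], out: &mut Vec<u8>) {
    encode_scalar(input, out)
}
```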
No behavioural change. Verified: cargo fmt, cargo clippy --all-targets,
cargo test (24 lib + 18 parity + 1 doctest) on stable + 1.85, both
aarch64-apple-darwin and x86_64-apple-darwin.
- Shorten the `#![allow(unsafe_op_in_unsafe_fn)]` rationale in mod avx2
and in src/ops/x86_64.rs to one sentence — the original prose read as
change-history narration.
- Drop the "tests removed because…" tombstones at the bottom of
src/ops/{aarch64,x86_64}.rs.
- Drop the explanatory comment on REV32_IDX in mod avx2; the location
already says "x86 only".
- Drop redundant doc comments on encode_scalar / decode_scalar that
just restated the cfg gate.
- Fix stale `crate::encode_into_scalar / crate::decode_into_scalar`
cross-reference in block.rs (now `encode_scalar / decode_scalar`).
# Summary
Adds an AVX2-accelerated SIMD path for x86_64, parallel to the existing aarch64 NEON path. Each AVX2 iteration processes 8 base85 blocks (32 input bytes → 40 output chars on encode; 40 chars → 32 output bytes on decode) by running the SSE4.1 algorithm in parallel across both 128-bit lanes of an `__m256i`.
Hosts without AVX2 fall back to the scalar implementation via a one-shot `is_x86_feature_detected!("avx2")` check at the public API entry. The branch predicts perfectly across calls, so the cost is negligible.
# Algorithm
Lane-restricted AVX2 byte/word ops (PSHUFB, BLEND, SHUFFLE_EPI32) work fine here because every step of the algorithm is per-lane anyway — each 128-bit half does an independent 4-block computation, equivalent to running the SSE4.1 algorithm twice in parallel within one register.
# Performance (GHA Ubuntu, AMD EPYC 7763)
Steady-state at ≥ 256 B: 2.07 GiB/s encode (2.71× ref), 2.32 GiB/s decode (9.6× ref).
Full per-size tables in the README. The decode ratio (~10×) matches what NEON gets on aarch64. Absolute throughput is roughly half of NEON (4.4 GiB/s) because PSHUFB is a 16-entry shuffle vs NEON's 64-entry TBL — AVX-512 VBMI's `vpermb` would close the gap but isn't on the runner fleet.
Small-input note: 16 B encode is slightly slower than the reference because AVX2's 32-byte chunk doesn't fit, so we hit the scalar fallback with one extra branch of overhead. Crossover happens around 64 B.