diff --git a/README.md b/README.md
index c8286d6..93bef02 100644
--- a/README.md
+++ b/README.md
@@ -2,10 +2,11 @@
 
 Fast Base85 (RFC 1924 / Z85-style) encoder and decoder for Rust.
 
-On `aarch64` the encode path uses NEON intrinsics to process 16 input bytes
-per iteration; the decode path and the fallback for other architectures use
-a portable scalar implementation. Output is byte-for-byte compatible with
-the [`base85`](https://crates.io/crates/base85) crate.
+SIMD-accelerated on aarch64 (NEON, 4 blocks per iteration) and x86_64
+(AVX2, 8 blocks per iteration), with a portable scalar fallback for
+everything else and for x86_64 hosts lacking AVX2 (rare on server
+hardware newer than Haswell, 2013). Output is byte-for-byte compatible
+with the [`base85`](https://crates.io/crates/base85) crate.
 
 ## Usage
 
@@ -33,7 +34,10 @@ for decoding.
 
 ## Status
 
 - Public API: `encode(&[u8]) -> String`, `decode(&str) -> Result<Vec<u8>, DecodeError>`.
-- Encode and decode are both NEON-accelerated on `aarch64`, scalar elsewhere.
+- aarch64: NEON-accelerated (4 blocks at a time, always available on aarch64).
+- x86_64 with AVX2: AVX2-accelerated (8 blocks at a time). Runtime feature
+  detection at the public API entry — hosts without AVX2 fall back to scalar.
+- Other architectures: portable scalar implementation.
 - The decode path validates char range and detects `u32` overflow lane-wise;
   any invalid input falls back to the scalar path so the resulting `DecodeError`
   carries a precise byte position.
@@ -74,43 +78,82 @@ strictly more diagnostic information. Code that pattern-matches on
 
 ## Benchmarks
 
-Apple M-series (aarch64), `cargo bench --bench encode`, criterion, release
-profile, single-threaded. Times are the criterion-reported median; throughput
-is computed from the median.
-
-### Encode
-
-| size | `base85` time | `base85-simd` time | speedup | `base85-simd` throughput |
-|-------|--------------:|-------------------:|--------:|-------------------------:|
-| 16 B | 17.4 ns | 16.3 ns | 1.07× | ~940 MiB/s |
-| 64 B | 33.0 ns | 22.6 ns | 1.46× | 2.71 GiB/s |
-| 256 B | 115.7 ns | 73.2 ns | 1.58× | 3.26 GiB/s |
-| 1 KiB | 378.3 ns | 225.1 ns | 1.68× | 4.24 GiB/s |
-| 16 KiB| 5.56 µs | 3.60 µs | 1.55× | 4.24 GiB/s |
-| 256 KiB| 89.3 µs | 55.4 µs | 1.61× | 4.41 GiB/s |
-| 1 MiB | 356 µs | 222 µs | 1.61× | 4.40 GiB/s |
-
-### Decode
-
-| size | `base85` time | `base85-simd` time | speedup | `base85-simd` throughput |
-|-------|--------------:|-------------------:|--------:|-------------------------:|
-| 16 B | 32.4 ns | 14.8 ns | 2.18× | ~1.0 GiB/s |
-| 64 B | 123.5 ns | 24.6 ns | 5.02× | 2.42 GiB/s |
-| 256 B | 579 ns | 57.6 ns | 10.05× | 4.14 GiB/s |
-| 1 KiB | 2.27 µs | 226 ns | 10.06× | 4.22 GiB/s |
-| 16 KiB| 36.8 µs | 3.49 µs | 10.55× | 4.38 GiB/s |
-| 256 KiB| 591 µs | 54.3 µs | 10.89× | 4.50 GiB/s |
-| 1 MiB | 2.28 ms | 217.6 µs | 10.49× | 4.49 GiB/s |
+`cargo bench --bench encode`, criterion, release profile, single-threaded.
+Times are the criterion-reported median; throughput is computed from it.
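+As a worked example: the 1 MiB encode row below reports 222 µs, and
+1,048,576 B / 222 µs ≈ 4.72 GB/s = 4.40 GiB/s.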
+ +### aarch64 (Apple M-series) + +#### Encode + +| size | `base85` | `base85-simd` (NEON) | speedup | throughput | +|-------|---------:|---------------------:|--------:|-----------:| +| 16 B | 17.4 ns | 16.3 ns | 1.07× | ~940 MiB/s | +| 64 B | 33.0 ns | 22.6 ns | 1.46× | 2.71 GiB/s | +| 256 B | 115.7 ns | 73.2 ns | 1.58× | 3.26 GiB/s | +| 1 KiB | 378.3 ns | 225.1 ns | 1.68× | 4.24 GiB/s | +| 16 KiB| 5.56 µs | 3.60 µs | 1.55× | 4.24 GiB/s | +| 256 KiB| 89.3 µs | 55.4 µs | 1.61× | 4.41 GiB/s | +| 1 MiB | 356 µs | 222 µs | 1.61× | 4.40 GiB/s | + +#### Decode + +| size | `base85` | `base85-simd` (NEON) | speedup | throughput | +|-------|---------:|---------------------:|--------:|-----------:| +| 16 B | 32.4 ns | 14.8 ns | 2.18× | ~1.0 GiB/s | +| 64 B | 123.5 ns | 24.6 ns | 5.02× | 2.42 GiB/s | +| 256 B | 579 ns | 57.6 ns | 10.05× | 4.14 GiB/s | +| 1 KiB | 2.27 µs | 226 ns | 10.06× | 4.22 GiB/s | +| 16 KiB| 36.8 µs | 3.49 µs | 10.55× | 4.38 GiB/s | +| 256 KiB| 591 µs | 54.3 µs | 10.89× | 4.50 GiB/s | +| 1 MiB | 2.28 ms | 217.6 µs | 10.49× | 4.49 GiB/s | + +### x86_64 (AMD EPYC 7763, Zen 3) + +Numbers from a GitHub Actions hosted Ubuntu runner — shared/virtualised +hardware so noise is higher than aarch64 (~5–15% variance), but the +relative speedups are stable. + +#### Encode + +| size | `base85` | `base85-simd` (AVX2) | speedup | throughput | +|-------|---------:|---------------------:|--------:|-----------:| +| 16 B | 41.0 ns | 57.8 ns | 0.71× | (scalar fallback; chunk doesn't fit) | +| 64 B | 100.5 ns | 61.7 ns | 1.63× | 989 MiB/s | +| 256 B | 341.9 ns | 165.9 ns | 2.06× | 1.44 GiB/s | +| 1 KiB | 1.32 µs | 507.0 ns | 2.61× | 1.88 GiB/s | +| 16 KiB| 20.4 µs | 7.48 µs | 2.72× | 2.04 GiB/s | +| 256 KiB| 323.9 µs| 118.9 µs | 2.72× | 2.05 GiB/s | +| 1 MiB | 1.31 ms | 482.9 µs | 2.71× | 2.07 GiB/s | + +#### Decode + +| size | `base85` | `base85-simd` (AVX2) | speedup | throughput | +|-------|---------:|---------------------:|--------:|-----------:| +| 16 B | 70.2 ns | 61.2 ns | 1.15× | (scalar fallback) | +| 64 B | 244.8 ns | 67.6 ns | 3.62× | 903 MiB/s | +| 256 B | 1.046 µs | 164.4 ns | 6.36× | 1.45 GiB/s | +| 1 KiB | 4.19 µs | 466.5 ns | 8.98× | 2.04 GiB/s | +| 16 KiB| 65.7 µs | 6.75 µs | 9.73× | 2.26 GiB/s | +| 256 KiB| 1.058 ms| 107.2 µs | 9.86× | 2.28 GiB/s | +| 1 MiB | 4.14 ms | 431.6 µs | 9.59× | 2.32 GiB/s | ### Steady-state summary -At sizes large enough to amortise loop setup (≥ 256 B), `base85-simd` -sustains **~4.4 GiB/s** for both encode and decode on Apple M-series, -roughly **1.6× faster** than the reference for encode and **~10× -faster** for decode. The decode advantage comes from `TBL`-based ASCII -→ digit lookup that replaces the reference's per-character branchy -match; the encode advantage comes from 4-lane parallel `divmod 85` -plus `TBX`-based digit → ASCII / output permutation. +At sizes large enough to amortise the SIMD loop setup (≥ 256 B): + +| arch / ISA | encode throughput | encode speedup | decode throughput | decode speedup | +|-------------|------------------:|---------------:|------------------:|---------------:| +| aarch64 NEON| 4.40 GiB/s | 1.61× | 4.49 GiB/s | 10.49× | +| x86_64 AVX2 | 2.07 GiB/s | 2.71× | 2.32 GiB/s | 9.59× | + +The decode speedup ratio is roughly the same on both architectures +(~10×), driven by SIMD-accelerated ASCII → digit table lookup +replacing the reference's per-character branchy match. 
NEON sustains
+roughly 2× the absolute throughput of AVX2 here. Part of that is
+simply different silicon (Apple M-series vs EPYC), but the structural
+difference is lookup width: `vqtbl4q_u8` does a 64-entry table lookup
+in a single instruction, whereas x86 PSHUFB is limited to 16 entries,
+so the same lookup expands to six PSHUFBs plus mask-and-merge per
+chunk on x86. AVX-512 VBMI's `vpermb` would close that gap but isn't
+available on the AMD silicon used by GitHub's runner fleet.
 
 Reproduce with:
 
diff --git a/src/block.rs b/src/block.rs
index ef040fc..3a66f35 100644
--- a/src/block.rs
+++ b/src/block.rs
@@ -150,6 +150,48 @@ pub(crate) fn decode_tail(
     Ok(())
 }
 
+// ─────────────────────────────────────────────────────────────────────────
+// SIMD shared tables (used by both NEON and AVX2 paths)
+// ─────────────────────────────────────────────────────────────────────────
+
+// Per-digit splice indexes for the 4-block encode shuffle. Output position
+// of digit k of block i is `5*i + k`; positions ≥ 16 spill into out_2 (with
+// offset −16). Each v_dk is a u32x4 with the digit in byte 0/4/8/12 of its
+// lane, hence the source offsets `4*i`. Both NEON (vqtbx1q_u8) and AVX2
+// (PSHUFB, broadcast to both 128-bit halves) consume identical index
+// patterns.
+
+#[cfg(any(target_arch = "aarch64", target_arch = "x86_64"))]
+#[rustfmt::skip]
+pub(super) static IDX_OUT_1: [[u8; 16]; 5] = [
+    // k=0: out_1[0,5,10,15] ← src bytes 0,4,8,12
+    [0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF, 0xFF, 0xFF, 12],
+    // k=1: out_1[1,6,11] ← 0,4,8 (block 3 → out_2)
+    [0xFF, 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF, 0xFF, 0xFF],
+    // k=2: out_1[2,7,12] ← 0,4,8
+    [0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF, 0xFF],
+    // k=3: out_1[3,8,13] ← 0,4,8
+    [0xFF, 0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF],
+    // k=4: out_1[4,9,14] ← 0,4,8
+    [0xFF, 0xFF, 0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF],
+];
+
+// Indexes for digits 1..4 of block 3 (src byte 12) spilling into out_2.
+#[cfg(any(target_arch = "aarch64", target_arch = "x86_64"))]
+#[rustfmt::skip]
+pub(super) static IDX_OUT_2: [[u8; 16]; 4] = [
+    [12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF],
+    [0xFF, 12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF],
+    [0xFF, 0xFF, 12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF],
+    [0xFF, 0xFF, 0xFF, 12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF],
+];
+
+// Mask used by the SIMD decoders: 0xFF for the 4 real-input lanes of the
+// padded tail register, 0 for the 12 padding lanes. Lets us suppress
+// validation noise from the padding bytes.
+#[cfg(any(target_arch = "aarch64", target_arch = "x86_64"))]
+pub(super) static TAIL_VALID_MASK: [u8; 16] =
+    [0xFF, 0xFF, 0xFF, 0xFF, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
+
 // ─────────────────────────────────────────────────────────────────────────
 // NEON encoder (aarch64)
 // ─────────────────────────────────────────────────────────────────────────
@@ -172,52 +214,7 @@ mod neon {
         vreinterpretq_u8_u32, vreinterpretq_u32_u8, vrev32q_u8, vst1q_u8, vsubq_u8,
     };
 
-    // Per-digit splice indexes, computed at compile time and loaded
-    // from .rodata each call. Output position of digit k of block i is
-    // `5*i + k`; positions ≥ 16 spill into out_2 (with offset −16).
-    // Each v_dk is a u32x4 with the digit in byte 0/4/8/12 of its lane,
-    // hence the source offsets `4*i`.
- static IDX_OUT_1: [[u8; 16]; 5] = [ - // k=0: out_1[0,5,10,15] ← src bytes 0,4,8,12 - [ - 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF, 0xFF, 0xFF, 12, - ], - // k=1: out_1[1,6,11] ← 0,4,8 (block 3 → out_2) - [ - 0xFF, 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF, 0xFF, 0xFF, - ], - // k=2: out_1[2,7,12] ← 0,4,8 - [ - 0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF, 0xFF, - ], - // k=3: out_1[3,8,13] ← 0,4,8 - [ - 0xFF, 0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF, - ], - // k=4: out_1[4,9,14] ← 0,4,8 - [ - 0xFF, 0xFF, 0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, - ], - ]; - // Indexes for digits 1..4 spilling into out_2 (block 3, src byte 12). - static IDX_OUT_2: [[u8; 16]; 4] = [ - [ - 12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, - 0xFF, - ], - [ - 0xFF, 12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, - 0xFF, - ], - [ - 0xFF, 0xFF, 12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, - 0xFF, - ], - [ - 0xFF, 0xFF, 0xFF, 12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, - 0xFF, - ], - ]; + use super::{IDX_OUT_1, IDX_OUT_2, TAIL_VALID_MASK}; /// Encodes 16 input bytes into 20 output characters using NEON. /// @@ -339,10 +336,6 @@ mod neon { tail_buf[..4].copy_from_slice(&input[16..20]); let chars_tail = unsafe { vld1q_u8(tail_buf.as_ptr()) }; - // Mask: 0xFF for the 4 real-input lanes of `chars_tail`, 0 for the - // 12 padding lanes — used to suppress validation noise from padding. - static TAIL_VALID_MASK: [u8; 16] = - [0xFF, 0xFF, 0xFF, 0xFF, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]; let tail_valid_mask = unsafe { vld1q_u8(TAIL_VALID_MASK.as_ptr()) }; // Char range validation: any byte not in [33, 126] is invalid. @@ -514,6 +507,386 @@ mod neon { } } +// ───────────────────────────────────────────────────────────────────────── +// AVX2 encoder / decoder (x86_64) +// ───────────────────────────────────────────────────────────────────────── +// +// Processes 8 blocks per call (32 input bytes, 40 output chars) by +// running the same SSE4.1 algorithm across both 128-bit lanes of an +// `__m256i`. AVX2's lane-restricted byte/word ops (PSHUFB, PSHUFD, +// BLEND) work fine here because the algorithm is fully per-lane — +// each lane handles 4 independent blocks. +// +// On hosts without AVX2 the crate's runtime feature gate routes to +// the scalar fallback (see `crate::encode_scalar` / `crate::decode_scalar`). + +#[cfg(target_arch = "x86_64")] +pub(crate) use avx2::{Avx2Encoder, try_decode_block_x8}; + +#[cfg(target_arch = "x86_64")] +mod avx2 { + //! AVX2 encode/decode — 8 blocks per call. + //! + //! Both 128-bit lanes of each `__m256i` independently run the + //! algorithm we developed for SSE4.1 (4 blocks per lane, 8 total). + //! AVX2 byte/word ops are lane-restricted (PSHUFB / BLEND / + //! SHUFFLE_EPI32 / etc. operate per 128-bit half), but every + //! step of the algorithm is per-lane anyway, so that's fine. + //! + //! Every entry point is `unsafe fn` with + //! `#[target_feature(enable = "avx2")]`. The caller + //! (`crate::encode_into` / `crate::decode_into`) does a one-shot + //! `is_x86_feature_detected!("avx2")` check; on hosts lacking + //! AVX2 the crate routes to the scalar fallback. + //! + //! The module-level `#![allow(unsafe_op_in_unsafe_fn)]` keeps the + //! 
body of each `unsafe fn` an implicit unsafe scope on MSRV 1.85 + //! (without it, each intrinsic call needs its own `unsafe { … }`). + + #![allow(unsafe_op_in_unsafe_fn)] + #![allow(clippy::indexing_slicing)] + + use crate::ops::{div_85, div_85_cube, div_85_sq}; + use std::arch::x86_64::{ + __m128i, __m256i, _mm_loadu_si128, _mm_storeu_si128, _mm256_add_epi32, _mm256_and_si256, + _mm256_broadcastsi128_si256, _mm256_cmpeq_epi8, _mm256_cmpgt_epi8, _mm256_cmpgt_epi32, + _mm256_extract_epi32, _mm256_extracti128_si256, _mm256_loadu_si256, _mm256_movemask_epi8, + _mm256_mullo_epi32, _mm256_or_si256, _mm256_set_m128i, _mm256_set1_epi8, _mm256_set1_epi32, + _mm256_setzero_si256, _mm256_shuffle_epi8, _mm256_srli_epi16, _mm256_storeu_si256, + _mm256_sub_epi32, _mm256_xor_si256, + }; + + use super::{IDX_OUT_1, IDX_OUT_2, TAIL_VALID_MASK}; + + static REV32_IDX: [u8; 16] = [3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12]; + + #[rustfmt::skip] + static ALPHABET_PADDED: &[u8; 96] = + b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~\0\0\0\0\0\0\0\0\0\0\0"; + + #[rustfmt::skip] + static CHAR_TO_85_PADDED: [u8; 96] = [ + 62, 0xFF, 63, 64, 65, 66, 0xFF, 67, 68, 69, 70, 0xFF, 71, 0xFF, 0xFF, 0, + 1, 2, 3, 4, 5, 6, 7, 8, 9, 0xFF, 72, 73, 74, 75, 76, 77, + 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, + 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 0xFF, 0xFF, 0xFF, 78, 79, 80, + 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, + 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 81, 82, 83, 84, 0xFF, 0xFF, + ]; + + #[rustfmt::skip] + static IDX_DEC: [([u8; 16], [u8; 16]); 5] = [ + ( + [0, 0xFF, 0xFF, 0xFF, 5, 0xFF, 0xFF, 0xFF, 10, 0xFF, 0xFF, 0xFF, 15, 0xFF, 0xFF, 0xFF], + [0xFF; 16], + ), + ( + [1, 0xFF, 0xFF, 0xFF, 6, 0xFF, 0xFF, 0xFF, 11, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF], + [0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF], + ), + ( + [2, 0xFF, 0xFF, 0xFF, 7, 0xFF, 0xFF, 0xFF, 12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF], + [0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 1, 0xFF, 0xFF, 0xFF], + ), + ( + [3, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF, 0xFF, 13, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF], + [0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF], + ), + ( + [4, 0xFF, 0xFF, 0xFF, 9, 0xFF, 0xFF, 0xFF, 14, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF], + [0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 3, 0xFF, 0xFF, 0xFF], + ), + ]; + + /// Loads a 16-byte index from .rodata and broadcasts it to both + /// 128-bit lanes of an `__m256i`. + /// + /// # Safety + /// `ptr` must be valid for a 16-byte read. Caller has AVX2. + #[inline] + #[target_feature(enable = "avx2")] + unsafe fn broadcast16(ptr: *const u8) -> __m256i { + _mm256_broadcastsi128_si256(_mm_loadu_si128(ptr.cast::<__m128i>())) + } + + /// AVX2 8-block encoder. Mirrors the structure of NEON's + /// `NeonEncoder` but processes 8 input blocks (32 bytes → 40 chars) + /// per call. + pub(crate) struct Avx2Encoder; + + impl Avx2Encoder { + #[inline] + pub(crate) fn new() -> Self { + Self + } + + /// Encode exactly 32 input bytes into exactly 40 output bytes. + /// + /// # Safety + /// Requires AVX2 to be available on the host CPU. 
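+        // Digit math per block: the big-endian u32 n splits as
+        // n = d0·85⁴ + d1·85³ + d2·85² + d3·85 + d4 (d0 most
+        // significant); e.g. n = 84 gives digits [0, 0, 0, 0, 84].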
+        #[inline]
+        #[target_feature(enable = "avx2")]
+        pub(crate) unsafe fn encode_block_x8(&self, in_bytes: &[u8; 32], out_bytes: &mut [u8; 40]) {
+            // Load 32 bytes (8 × 4-byte blocks) and reverse each 4-byte
+            // group to get 8 BE u32s.
+            let raw = _mm256_loadu_si256(in_bytes.as_ptr().cast::<__m256i>());
+            let n = _mm256_shuffle_epi8(raw, broadcast16(REV32_IDX.as_ptr()));
+
+            // Parallel-magic divides — 8 lanes wide.
+            let q1 = div_85(n);
+            let q2 = div_85_sq(n);
+            let q3 = div_85_cube(n);
+            let q4 = div_85_sq(q2);
+
+            // Five digits per block via lane-wise multiply-subtract.
+            let m85 = _mm256_set1_epi32(85);
+            let d0 = q4;
+            let d1 = _mm256_sub_epi32(q3, _mm256_mullo_epi32(q4, m85));
+            let d2 = _mm256_sub_epi32(q2, _mm256_mullo_epi32(q3, m85));
+            let d3 = _mm256_sub_epi32(q1, _mm256_mullo_epi32(q2, m85));
+            let d4 = _mm256_sub_epi32(n, _mm256_mullo_epi32(q1, m85));
+
+            // Splice 5 digits into out_1 (32 bytes; 16 per lane — each
+            // lane holds the first 16 chars of its 4-block group).
+            // PSHUFB is per-lane; the same index broadcast to both lanes
+            // does the right thing.
+            let idx0 = broadcast16(IDX_OUT_1[0].as_ptr());
+            let idx1 = broadcast16(IDX_OUT_1[1].as_ptr());
+            let idx2 = broadcast16(IDX_OUT_1[2].as_ptr());
+            let idx3 = broadcast16(IDX_OUT_1[3].as_ptr());
+            let idx4 = broadcast16(IDX_OUT_1[4].as_ptr());
+            let p0 = _mm256_shuffle_epi8(d0, idx0);
+            let p1 = _mm256_shuffle_epi8(d1, idx1);
+            let p2 = _mm256_shuffle_epi8(d2, idx2);
+            let p3 = _mm256_shuffle_epi8(d3, idx3);
+            let p4 = _mm256_shuffle_epi8(d4, idx4);
+            let out_1 = _mm256_or_si256(
+                _mm256_or_si256(p0, p1),
+                _mm256_or_si256(p2, _mm256_or_si256(p3, p4)),
+            );
+
+            // Same for out_2 (8 bytes total; 4 per lane — the trailing
+            // 4 chars of each 4-block group).
+            let i20 = broadcast16(IDX_OUT_2[0].as_ptr());
+            let i21 = broadcast16(IDX_OUT_2[1].as_ptr());
+            let i22 = broadcast16(IDX_OUT_2[2].as_ptr());
+            let i23 = broadcast16(IDX_OUT_2[3].as_ptr());
+            let q21 = _mm256_shuffle_epi8(d1, i20);
+            let q22 = _mm256_shuffle_epi8(d2, i21);
+            let q23 = _mm256_shuffle_epi8(d3, i22);
+            let q24 = _mm256_shuffle_epi8(d4, i23);
+            let out_2 = _mm256_or_si256(_mm256_or_si256(q21, q22), _mm256_or_si256(q23, q24));
+
+            // Digit value (0..84) → ASCII char.
+            let out_1 = byte_to_char85_avx2(out_1);
+            let out_2 = byte_to_char85_avx2(out_2);
+
+            // Store. Output layout is non-contiguous between lanes:
+            //   bytes  0..16 — lane 0's 16-char main
+            //   bytes 16..20 — lane 0's 4-char tail
+            //   bytes 20..36 — lane 1's 16-char main
+            //   bytes 36..40 — lane 1's 4-char tail
+            let lo_main = _mm256_extracti128_si256::<0>(out_1);
+            let hi_main = _mm256_extracti128_si256::<1>(out_1);
+            let lo_tail = _mm256_extract_epi32::<0>(out_2) as u32;
+            let hi_tail = _mm256_extract_epi32::<4>(out_2) as u32;
+
+            _mm_storeu_si128(out_bytes.as_mut_ptr().cast::<__m128i>(), lo_main);
+            std::ptr::write_unaligned(out_bytes.as_mut_ptr().add(16).cast::<u32>(), lo_tail);
+            _mm_storeu_si128(out_bytes.as_mut_ptr().add(20).cast::<__m128i>(), hi_main);
+            std::ptr::write_unaligned(out_bytes.as_mut_ptr().add(36).cast::<u32>(), hi_tail);
+        }
+    }
+
+    /// Maps 32 lane-wise digit values (0..84) to ASCII characters.
+    ///
+    /// Same chunk-index-selection trick as the SSE port (PSHUFB zeroes
+    /// an output byte when its index has bit 7 set), just doubled.
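+    ///
+    /// Worked example of the nibble split: digit 62 = 0x3E has high
+    /// nibble 3 and low nibble 14, so the chunk mask selects `t3`
+    /// (alphabet bytes 48..64) and `t3[14]` = alphabet byte 62 = `b'!'`.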
+    #[inline]
+    #[target_feature(enable = "avx2")]
+    unsafe fn byte_to_char85_avx2(x85: __m256i) -> __m256i {
+        let high_nib = _mm256_and_si256(_mm256_srli_epi16::<4>(x85), _mm256_set1_epi8(0x0F));
+        let low_nib = _mm256_and_si256(x85, _mm256_set1_epi8(0x0F));
+
+        let t0 = broadcast16(ALPHABET_PADDED.as_ptr());
+        let t1 = broadcast16(ALPHABET_PADDED.as_ptr().add(16));
+        let t2 = broadcast16(ALPHABET_PADDED.as_ptr().add(32));
+        let t3 = broadcast16(ALPHABET_PADDED.as_ptr().add(48));
+        let t4 = broadcast16(ALPHABET_PADDED.as_ptr().add(64));
+        let t5 = broadcast16(ALPHABET_PADDED.as_ptr().add(80));
+
+        let r0 = _mm256_shuffle_epi8(t0, low_nib);
+        let r1 = _mm256_shuffle_epi8(t1, low_nib);
+        let r2 = _mm256_shuffle_epi8(t2, low_nib);
+        let r3 = _mm256_shuffle_epi8(t3, low_nib);
+        let r4 = _mm256_shuffle_epi8(t4, low_nib);
+        let r5 = _mm256_shuffle_epi8(t5, low_nib);
+
+        let m0 = _mm256_cmpeq_epi8(high_nib, _mm256_setzero_si256());
+        let m1 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(1));
+        let m2 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(2));
+        let m3 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(3));
+        let m4 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(4));
+        let m5 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(5));
+
+        _mm256_or_si256(
+            _mm256_or_si256(
+                _mm256_or_si256(_mm256_and_si256(r0, m0), _mm256_and_si256(r1, m1)),
+                _mm256_or_si256(_mm256_and_si256(r2, m2), _mm256_and_si256(r3, m3)),
+            ),
+            _mm256_or_si256(_mm256_and_si256(r4, m4), _mm256_and_si256(r5, m5)),
+        )
+    }
+
+    /// Maps 32 ASCII chars to base85 digit values (0..84) lane-wise.
+    /// In-range (33..=126) chars not in the alphabet come back as 0xFF;
+    /// bytes outside that range may map to anything, so callers pair
+    /// this with their own range check (the decoder does).
+    #[inline]
+    #[target_feature(enable = "avx2")]
+    unsafe fn char85_to_byte_avx2(chars: __m256i) -> __m256i {
+        // Byte-wise `chars - 33` done with a u32 subtract: a borrow can
+        // only cross a byte boundary when some byte is < 33, and such
+        // bytes are independently rejected by the caller's range check
+        // (tail padding is masked), so digits corrupted by a borrow
+        // never survive to the output.
+        let normalised = _mm256_sub_epi32(chars, _mm256_set1_epi8(33));
+        let high_nib = _mm256_and_si256(_mm256_srli_epi16::<4>(normalised), _mm256_set1_epi8(0x0F));
+        let low_nib = _mm256_and_si256(normalised, _mm256_set1_epi8(0x0F));
+
+        let t0 = broadcast16(CHAR_TO_85_PADDED.as_ptr());
+        let t1 = broadcast16(CHAR_TO_85_PADDED.as_ptr().add(16));
+        let t2 = broadcast16(CHAR_TO_85_PADDED.as_ptr().add(32));
+        let t3 = broadcast16(CHAR_TO_85_PADDED.as_ptr().add(48));
+        let t4 = broadcast16(CHAR_TO_85_PADDED.as_ptr().add(64));
+        let t5 = broadcast16(CHAR_TO_85_PADDED.as_ptr().add(80));
+
+        let r0 = _mm256_shuffle_epi8(t0, low_nib);
+        let r1 = _mm256_shuffle_epi8(t1, low_nib);
+        let r2 = _mm256_shuffle_epi8(t2, low_nib);
+        let r3 = _mm256_shuffle_epi8(t3, low_nib);
+        let r4 = _mm256_shuffle_epi8(t4, low_nib);
+        let r5 = _mm256_shuffle_epi8(t5, low_nib);
+
+        let m0 = _mm256_cmpeq_epi8(high_nib, _mm256_setzero_si256());
+        let m1 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(1));
+        let m2 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(2));
+        let m3 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(3));
+        let m4 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(4));
+        let m5 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(5));
+
+        _mm256_or_si256(
+            _mm256_or_si256(
+                _mm256_or_si256(_mm256_and_si256(r0, m0), _mm256_and_si256(r1, m1)),
+                _mm256_or_si256(_mm256_and_si256(r2, m2), _mm256_and_si256(r3, m3)),
+            ),
+            _mm256_or_si256(_mm256_and_si256(r4, m4), _mm256_and_si256(r5, m5)),
+        )
+    }
+
+    /// AVX2 8-block decoder. 40 chars → 32 bytes.
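+    /// Consumes the layout `encode_block_x8` produces: chars 0..16 and
+    /// 16..20 are lane 0's main/tail group, chars 20..36 and 36..40 are
+    /// lane 1's.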
+    ///
+    /// Returns `Err(())` (so the caller falls back to the scalar path,
+    /// which will surface a precise [`crate::DecodeError`]) when:
+    /// - any input byte is outside the base85 alphabet, or
+    /// - any of the eight 5-char blocks decodes to a value > `u32::MAX`.
+    ///
+    /// # Safety
+    /// Requires AVX2 on the host CPU.
+    #[inline]
+    #[target_feature(enable = "avx2")]
+    pub(crate) unsafe fn try_decode_block_x8(
+        input: &[u8; 40],
+        out: &mut [u8; 32],
+    ) -> Result<(), ()> {
+        // Lay out the 40 chars in the 256-bit register so each
+        // 128-bit lane handles 4 blocks. Lane 0 ← chars 0..16
+        // (first 16 chars of blocks 0..3); lane 1 ← chars 20..36
+        // (first 16 chars of blocks 4..7). The 4-char tails (chars
+        // 16..20 and 36..40) go into a separate "chars_tail" with
+        // 12 zero-padding bytes per lane.
+        let chars_main_lo = _mm_loadu_si128(input.as_ptr().cast::<__m128i>());
+        let chars_main_hi = _mm_loadu_si128(input.as_ptr().add(20).cast::<__m128i>());
+        let chars_main = _mm256_set_m128i(chars_main_hi, chars_main_lo);
+
+        let mut tail_buf_lo = [0u8; 16];
+        let mut tail_buf_hi = [0u8; 16];
+        tail_buf_lo[..4].copy_from_slice(&input[16..20]);
+        tail_buf_hi[..4].copy_from_slice(&input[36..40]);
+        let chars_tail = _mm256_set_m128i(
+            _mm_loadu_si128(tail_buf_hi.as_ptr().cast::<__m128i>()),
+            _mm_loadu_si128(tail_buf_lo.as_ptr().cast::<__m128i>()),
+        );
+        let tail_valid_mask = broadcast16(TAIL_VALID_MASK.as_ptr());
+
+        // Range validation: chars must be in [33, 126].
+        let too_low_main = _mm256_cmpgt_epi8(_mm256_set1_epi8(33), chars_main);
+        let too_high_main = _mm256_cmpgt_epi8(chars_main, _mm256_set1_epi8(126));
+        let invalid_range_main = _mm256_or_si256(too_low_main, too_high_main);
+
+        let too_low_tail = _mm256_cmpgt_epi8(_mm256_set1_epi8(33), chars_tail);
+        let too_high_tail = _mm256_cmpgt_epi8(chars_tail, _mm256_set1_epi8(126));
+        let invalid_range_tail = _mm256_and_si256(
+            _mm256_or_si256(too_low_tail, too_high_tail),
+            tail_valid_mask,
+        );
+
+        // ASCII → digit lookup (0xFF for in-range invalid chars).
+        let digits_main = char85_to_byte_avx2(chars_main);
+        let digits_tail = char85_to_byte_avx2(chars_tail);
+
+        let invalid_dig_main = _mm256_cmpeq_epi8(digits_main, _mm256_set1_epi8(0xFF_u8 as i8));
+        let invalid_dig_tail = _mm256_and_si256(
+            _mm256_cmpeq_epi8(digits_tail, _mm256_set1_epi8(0xFF_u8 as i8)),
+            tail_valid_mask,
+        );
+
+        let any_invalid = _mm256_or_si256(
+            _mm256_or_si256(invalid_range_main, invalid_range_tail),
+            _mm256_or_si256(invalid_dig_main, invalid_dig_tail),
+        );
+        // _mm256_movemask_epi8 returns a 32-bit mask (one bit per
+        // byte of the 256-bit register).
+        if _mm256_movemask_epi8(any_invalid) != 0 {
+            return Err(());
+        }
+
+        // Permute digits into per-position vectors. Same indexes
+        // as the SSE path, broadcast to both lanes.
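+        // Example of the mapping: v_d1 lane 3 (block 3's second digit)
+        // comes from chars_tail byte 0, i.e. char 16 of the 20-char
+        // group, while lanes 0..=2 come from chars_main bytes 1, 6, 11.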
+        let v_dk = |idx_main: &[u8; 16], idx_tail: &[u8; 16]| -> __m256i {
+            let im = broadcast16(idx_main.as_ptr());
+            let it = broadcast16(idx_tail.as_ptr());
+            _mm256_or_si256(
+                _mm256_shuffle_epi8(digits_main, im),
+                _mm256_shuffle_epi8(digits_tail, it),
+            )
+        };
+        let v_d0 = v_dk(&IDX_DEC[0].0, &IDX_DEC[0].1);
+        let v_d1 = v_dk(&IDX_DEC[1].0, &IDX_DEC[1].1);
+        let v_d2 = v_dk(&IDX_DEC[2].0, &IDX_DEC[2].1);
+        let v_d3 = v_dk(&IDX_DEC[3].0, &IDX_DEC[3].1);
+        let v_d4 = v_dk(&IDX_DEC[4].0, &IDX_DEC[4].1);
+
+        // Horner: n = ((((d0·85 + d1)·85 + d2)·85 + d3)·85 + d4)
+        let m85 = _mm256_set1_epi32(85);
+        let n = _mm256_add_epi32(v_d1, _mm256_mullo_epi32(v_d0, m85));
+        let n = _mm256_add_epi32(v_d2, _mm256_mullo_epi32(n, m85));
+        let n = _mm256_add_epi32(v_d3, _mm256_mullo_epi32(n, m85));
+        let n = _mm256_add_epi32(v_d4, _mm256_mullo_epi32(n, m85));
+
+        // Overflow detection: n_true = d0·85⁴ + rest with rest < 85⁴,
+        // so n_true ≥ 2³² shows up as the wrapped Horner result being
+        // unsigned-less-than d0·85⁴ (compared via bias-and-signed-cmpgt).
+        // That only holds while d0·85⁴ itself fits in u32, i.e. d0 ≤ 82;
+        // for d0 ∈ {83, 84} the product wraps too (83·85⁴ > 2³²) and the
+        // compare misses, so flag those lanes directly; they always
+        // overflow.
+        let d0_85_4 = _mm256_mullo_epi32(v_d0, _mm256_set1_epi32(52_200_625));
+        let bias = _mm256_set1_epi32(0x80000000_u32 as i32);
+        let n_b = _mm256_xor_si256(n, bias);
+        let d0_b = _mm256_xor_si256(d0_85_4, bias);
+        let wrapped = _mm256_cmpgt_epi32(d0_b, n_b);
+        let d0_too_big = _mm256_cmpgt_epi32(v_d0, _mm256_set1_epi32(82));
+        let overflow_mask = _mm256_or_si256(wrapped, d0_too_big);
+        if _mm256_movemask_epi8(overflow_mask) != 0 {
+            return Err(());
+        }
+
+        // Byte-swap each u32 lane to BE and store. Output is 32
+        // contiguous bytes (16 from each lane), so a single
+        // _mm256_storeu_si256 works.
+        let bytes_be = _mm256_shuffle_epi8(n, broadcast16(REV32_IDX.as_ptr()));
+        _mm256_storeu_si256(out.as_mut_ptr().cast::<__m256i>(), bytes_be);
+
+        Ok(())
+    }
+}
+
 #[cfg(test)]
 mod tests {
     use super::*;
diff --git a/src/lib.rs b/src/lib.rs
index 035490c..6a759cf 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -196,8 +196,69 @@ fn encode_into(input: &[u8], out: &mut [u8]) {
     }
 }
 
-#[cfg(not(target_arch = "aarch64"))]
+#[cfg(target_arch = "x86_64")]
 fn encode_into(input: &[u8], out: &mut [u8]) {
+    use block::Avx2Encoder;
+
+    // Runtime feature gate: the AVX2 intrinsics are gated on
+    // `#[target_feature(enable = "avx2")]`; calling them on a CPU
+    // lacking AVX2 is UB. Detect once at the entry point and fall
+    // back to the scalar path otherwise. The branch predicts
+    // perfectly across calls so the cost is negligible.
+    if !std::is_x86_feature_detected!("avx2") {
+        encode_scalar(input, out);
+        return;
+    }
+
+    let encoder = Avx2Encoder::new();
+    let mut in_off = 0usize;
+    let mut out_off = 0usize;
+
+    // 32 → 40 byte AVX2 chunks (8 blocks per iteration).
+    while in_off.saturating_add(32) <= input.len() {
+        let Some(in_chunk) = chunk_at::<32>(input, in_off) else {
+            return;
+        };
+        let Some(out_chunk) = chunk_at_mut::<40>(out, out_off) else {
+            return;
+        };
+        // SAFETY: AVX2 verified above.
+ unsafe { encoder.encode_block_x8(in_chunk, out_chunk) }; + in_off = in_off.saturating_add(32); + out_off = out_off.saturating_add(40); + } + + while in_off.saturating_add(4) <= input.len() { + let Some(block) = chunk_at::<4>(input, in_off) else { + return; + }; + let Some(chunk) = chunk_at_mut::<5>(out, out_off) else { + return; + }; + encode_block(block, chunk); + in_off = in_off.saturating_add(4); + out_off = out_off.saturating_add(5); + } + + let tail_in_len = input.len().saturating_sub(in_off); + if tail_in_len > 0 { + let tail_out_len = tail_in_len.saturating_add(1); + if let (Some(tail_in), Some(tail_out)) = ( + input.get(in_off..in_off.saturating_add(tail_in_len)), + out.get_mut(out_off..out_off.saturating_add(tail_out_len)), + ) { + encode_tail(tail_in, tail_out); + } + } +} + +#[cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))] +fn encode_into(input: &[u8], out: &mut [u8]) { + encode_scalar(input, out); +} + +#[cfg(not(target_arch = "aarch64"))] +fn encode_scalar(input: &[u8], out: &mut [u8]) { let mut in_off = 0usize; let mut out_off = 0usize; @@ -290,8 +351,80 @@ fn decode_into(input: &[u8], out: &mut [u8]) -> Result<(), DecodeError> { Ok(()) } -#[cfg(not(target_arch = "aarch64"))] +#[cfg(target_arch = "x86_64")] +#[allow( + clippy::unwrap_used, + clippy::indexing_slicing, + reason = "infallible / in-bounds by loop-condition invariant; see SAFETY comment" +)] fn decode_into(input: &[u8], out: &mut [u8]) -> Result<(), DecodeError> { + use block::try_decode_block_x8; + + // Runtime feature gate: AVX2 needed for the fast path; otherwise + // route to scalar fallback. + if !std::is_x86_feature_detected!("avx2") { + return decode_scalar(input, out); + } + + let mut in_off = 0usize; + let mut out_off = 0usize; + + // 40-char (8-block) AVX2 chunks. On per-chunk failure (invalid + // char or u32 overflow), replay the same chunk scalar so the + // resulting DecodeError carries the precise failing byte position. + while in_off + 40 <= input.len() && out_off + 32 <= out.len() { + let Some(in_chunk) = chunk_at::<40>(input, in_off) else { + return Ok(()); + }; + let Some(out_chunk) = chunk_at_mut::<32>(out, out_off) else { + return Ok(()); + }; + + // SAFETY: AVX2 verified above. 
+ let avx_ok = unsafe { try_decode_block_x8(in_chunk, out_chunk) }.is_ok(); + if avx_ok { + in_off += 40; + out_off += 32; + } else { + for _ in 0..8 { + let block: &[u8; 5] = (&input[in_off..in_off + 5]).try_into().unwrap(); + let chunk: &mut [u8; 4] = (&mut out[out_off..out_off + 4]).try_into().unwrap(); + decode_block(block, chunk, in_off)?; + in_off += 5; + out_off += 4; + } + } + } + + while in_off + 5 <= input.len() && out_off + 4 <= out.len() { + let block: &[u8; 5] = (&input[in_off..in_off + 5]).try_into().unwrap(); + let chunk: &mut [u8; 4] = (&mut out[out_off..out_off + 4]).try_into().unwrap(); + decode_block(block, chunk, in_off)?; + in_off += 5; + out_off += 4; + } + + let tail_in_len = input.len().saturating_sub(in_off); + if tail_in_len > 0 { + let tail_out_len = tail_in_len.saturating_sub(1); + if let (Some(tail_in), Some(tail_out)) = ( + input.get(in_off..in_off.saturating_add(tail_in_len)), + out.get_mut(out_off..out_off.saturating_add(tail_out_len)), + ) { + decode_tail(tail_in, tail_out, in_off)?; + } + } + + Ok(()) +} + +#[cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))] +fn decode_into(input: &[u8], out: &mut [u8]) -> Result<(), DecodeError> { + decode_scalar(input, out) +} + +#[cfg(not(target_arch = "aarch64"))] +fn decode_scalar(input: &[u8], out: &mut [u8]) -> Result<(), DecodeError> { let mut in_off = 0usize; let mut out_off = 0usize; diff --git a/src/ops/aarch64.rs b/src/ops/aarch64.rs index 1fa0a2f..0f017d8 100644 --- a/src/ops/aarch64.rs +++ b/src/ops/aarch64.rs @@ -43,62 +43,3 @@ fn div_magic(input: uint32x4_t) -> uint32x4_ vshrq_n_u32(high_u32, SHIFT) } } - -#[cfg(test)] -mod tests { - use super::*; - use quickcheck::{Arbitrary, quickcheck}; - use std::{ - arch::aarch64::{vld1q_u32, vst1q_u32}, - array, - }; - - #[derive(Debug, Clone, Copy)] - struct InputBlock([u32; 4]); - - impl Arbitrary for InputBlock { - fn arbitrary(g: &mut quickcheck::Gen) -> Self { - InputBlock(array::from_fn(|_| u32::arbitrary(g))) - } - } - - fn lanes(v: uint32x4_t) -> [u32; 4] { - let mut out = [0u32; 4]; - unsafe { vst1q_u32(out.as_mut_ptr(), v) }; - out - } - - fn load(input: &[u32; 4]) -> uint32x4_t { - unsafe { vld1q_u32(input.as_ptr()) } - } - - quickcheck! { - fn div_85_matches_scalar(block: InputBlock) -> bool { - let InputBlock(input) = block; - let q_out = lanes(div_85(load(&input))); - (0..4).all(|i| q_out[i] == input[i] / 85) - } - - fn div_85_sq_matches_scalar(block: InputBlock) -> bool { - let InputBlock(input) = block; - let q_out = lanes(div_85_sq(load(&input))); - (0..4).all(|i| q_out[i] == input[i] / (85 * 85)) - } - - fn div_85_cube_matches_scalar(block: InputBlock) -> bool { - let InputBlock(input) = block; - let q_out = lanes(div_85_cube(load(&input))); - (0..4).all(|i| q_out[i] == input[i] / (85 * 85 * 85)) - } - - // q4 = q2 / 85^2 — the composed path that yields n / 85^4. Direct - // /85^4 magic would not satisfy the libdivide precision bound. - fn div_85_to_the_4_via_composition(block: InputBlock) -> bool { - let InputBlock(input) = block; - let q2 = div_85_sq(load(&input)); - let q4 = div_85_sq(q2); - let q4_out = lanes(q4); - (0..4).all(|i| q4_out[i] == input[i] / (85u32.pow(4))) - } - } -} diff --git a/src/ops/mod.rs b/src/ops/mod.rs index aee763e..369e3b2 100644 --- a/src/ops/mod.rs +++ b/src/ops/mod.rs @@ -2,12 +2,18 @@ //! //! Production scalar code uses plain `n / 85` / `n % 85` because the //! compiler lowers those to the libdivide-style multiply-then-shift -//! sequence in release builds. The hand-written NEON helpers in -//! 
[`aarch64`] cover the parallel-divide-by-`85ᵏ` cases the compiler -//! cannot auto-vectorise across u32x4 lanes. +//! sequence in release builds. The hand-written SIMD helpers below +//! cover the parallel-divide-by-`85ᵏ` cases the compiler cannot +//! auto-vectorise across u32x4 lanes. #[cfg(target_arch = "aarch64")] mod aarch64; #[cfg(target_arch = "aarch64")] pub(crate) use aarch64::{div_85, div_85_cube, div_85_sq}; + +#[cfg(target_arch = "x86_64")] +mod x86_64; + +#[cfg(target_arch = "x86_64")] +pub(crate) use x86_64::{div_85, div_85_cube, div_85_sq}; diff --git a/src/ops/x86_64.rs b/src/ops/x86_64.rs new file mode 100644 index 0000000..91dfac5 --- /dev/null +++ b/src/ops/x86_64.rs @@ -0,0 +1,107 @@ +//! x86_64 AVX2 helpers — structural mirror of `aarch64.rs`, doubled. +//! +//! Every helper operates on a `__m256i` carrying **eight** u32 lanes +//! (vs the four-lane `uint32x4_t` of NEON). The lower 128-bit half +//! handles blocks 0..3 and the upper half handles blocks 4..7; +//! AVX2's lane-restricted byte/word ops (PSHUFB, BLEND, SHUFFLE) +//! work fine because the algorithm is fully per-lane — each 128-bit +//! half runs an independent copy of the SSE4.1 algorithm we replaced. +//! +//! The magic constants are the same libdivide values used on aarch64; +//! only the SIMD intrinsics differ. AVX2 instructions used: +//! +//! | NEON | AVX2 | what it does | +//! |------------------------|-------------------------------|-------------------------------------| +//! | `vmull_u32` | `_mm256_mul_epu32` | u32×u32 → u64 widening (4 lanes) | +//! | `vshrq_n_u32` | `_mm256_srli_epi32` | logical shift right per u32 lane | +//! | `vmulq_n_u32` | `_mm256_mullo_epi32` | u32×u32 → u32 (low 32 bits, 8 lanes)| +//! | `vsubq_u32` | `_mm256_sub_epi32` | u32 lane-wise subtract | +//! | `vqtbl1q_u8` | `_mm256_shuffle_epi8` | byte-shuffle, **per 128-bit lane** | +//! | `vrev32q_u8` | `_mm256_shuffle_epi8` + idx | reverse bytes in each u32 lane | +//! +//! All SIMD entry points carry `#[target_feature(enable = "avx2")]` +//! so the file compiles on any `x86_64-*-*` target regardless of the +//! default target-feature set. Callers must perform a runtime +//! `is_x86_feature_detected!("avx2")` check before invoking; on hosts +//! lacking AVX2 the crate routes to the scalar fallback. +//! +//! The module-level `#![allow(unsafe_op_in_unsafe_fn)]` keeps the body +//! of each `unsafe fn` an implicit unsafe scope on MSRV 1.85 (without +//! it, each intrinsic call needs its own `unsafe { … }`). + +#![allow(unsafe_op_in_unsafe_fn)] + +use std::arch::x86_64::{ + __m256i, _mm256_blend_epi16, _mm256_mul_epu32, _mm256_set1_epi32, _mm256_shuffle_epi32, + _mm256_srli_epi32, _mm256_srli_epi64, +}; + +/// Lane-wise `input / 85` (8 u32 lanes). Magic verified for full u32 range. +/// +/// # Safety +/// Requires AVX2 to be available on the host CPU. +#[inline] +#[target_feature(enable = "avx2")] +pub(crate) unsafe fn div_85(input: __m256i) -> __m256i { + // m = ceil(2^38 / 85) = 3_233_857_729; m * 85 - 2^38 = 21 < 2^6. + div_magic::<3_233_857_729, 6>(input) +} + +/// Lane-wise `input / 7225` (i.e. `input / 85²`). Valid for all u32 inputs. +/// +/// # Safety +/// Requires AVX2 to be available on the host CPU. +#[inline] +#[target_feature(enable = "avx2")] +pub(crate) unsafe fn div_85_sq(input: __m256i) -> __m256i { + // m = ceil(2^44 / 7225) = 2_434_904_643; m * 7225 - 2^44 = 1259 < 2^12. + div_magic::<2_434_904_643, 12>(input) +} + +/// Lane-wise `input / 614125` (i.e. `input / 85³`). Valid for all u32 inputs. 
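+/// (There is deliberately no `div_85_4th`: the encoder reaches n / 85⁴
+/// by composing `div_85_sq` twice, since a direct 2ᵏ/85⁴ magic would
+/// not satisfy the libdivide precision bound for all u32 inputs.)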
+///
+/// # Safety
+/// Requires AVX2 to be available on the host CPU.
+#[inline]
+#[target_feature(enable = "avx2")]
+pub(crate) unsafe fn div_85_cube(input: __m256i) -> __m256i {
+    // m = ceil(2^51 / 614125) = 3_666_679_933; m * 614125 - 2^51 = 168377 < 2^19.
+    div_magic::<3_666_679_933, 19>(input)
+}
+
+/// Generic libdivide-style div over 8 u32 lanes:
+/// `q = (n * MAGIC) >> (32 + SHIFT)`.
+///
+/// Same algorithm as the SSE4.1 version it replaces, just doubled.
+/// `_mm256_mul_epu32` multiplies the low u32 of each u64 lane (so it
+/// touches input lanes 0, 2, 4, 6 per call); we do two multiplies —
+/// one as-is, one after `_mm256_srli_epi64::<32>` — and combine the
+/// high u32 of each u64 product via `_mm256_shuffle_epi32::<0xF5>` +
+/// `_mm256_blend_epi16::<0xCC>`. Both of those are lane-restricted ops
+/// (per 128-bit half), but our blend pattern lives entirely within each
+/// half so that's fine.
+///
+/// # Safety
+/// Requires AVX2 (the function-level `target_feature` satisfies this
+/// for any caller that has done the runtime check).
+#[inline]
+#[target_feature(enable = "avx2")]
+unsafe fn div_magic<const MAGIC: u32, const SHIFT: i32>(input: __m256i) -> __m256i {
+    let magic = _mm256_set1_epi32(MAGIC as i32);
+
+    let even = _mm256_mul_epu32(input, magic);
+    let input_odd = _mm256_srli_epi64::<32>(input);
+    let odd = _mm256_mul_epu32(input_odd, magic);
+
+    // Per-128-bit-lane shuffle: broadcasts hi u32 of each u64 product
+    // across the pair, in each lane independently.
+    let even_hi = _mm256_shuffle_epi32::<0b11_11_01_01>(even);
+    let odd_hi = _mm256_shuffle_epi32::<0b11_11_01_01>(odd);
+
+    // Per-lane blend at u16 granularity. The 8-bit MASK is applied to
+    // each 128-bit lane, so 0xCC interleaves [hi0, hi1, hi2, hi3]
+    // within each lane independently.
+    let combined = _mm256_blend_epi16::<0xCC>(even_hi, odd_hi);
+
+    _mm256_srli_epi32::<SHIFT>(combined)
+}
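+
+// A sanity-check sketch in the spirit of the quickcheck suite removed
+// from `aarch64.rs` (illustrative only: exercises fixed edge values
+// rather than random blocks, and skips on hosts without AVX2).
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use std::arch::x86_64::{__m256i, _mm256_loadu_si256, _mm256_storeu_si256};
+
+    fn lanes(v: __m256i) -> [u32; 8] {
+        let mut out = [0u32; 8];
+        // SAFETY: storeu has no alignment requirement; the test entry
+        // point has already verified AVX2 (which implies AVX).
+        unsafe { _mm256_storeu_si256(out.as_mut_ptr().cast::<__m256i>(), v) };
+        out
+    }
+
+    #[test]
+    fn div_helpers_match_scalar_on_edge_values() {
+        if !std::is_x86_feature_detected!("avx2") {
+            return;
+        }
+        let inputs: [u32; 8] =
+            [0, 1, 84, 85, 7_224, 614_125, u32::MAX - 1, u32::MAX];
+        // SAFETY: AVX2 verified above.
+        let (q1, q2, q3) = unsafe {
+            let v = _mm256_loadu_si256(inputs.as_ptr().cast::<__m256i>());
+            (lanes(div_85(v)), lanes(div_85_sq(v)), lanes(div_85_cube(v)))
+        };
+        for (i, &n) in inputs.iter().enumerate() {
+            assert_eq!(q1[i], n / 85);
+            assert_eq!(q2[i], n / (85 * 85));
+            assert_eq!(q3[i], n / (85 * 85 * 85));
+        }
+    }
+}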