diff --git a/README.md b/README.md
index c8286d6..93bef02 100644
--- a/README.md
+++ b/README.md
@@ -2,10 +2,11 @@
 
 Fast Base85 (RFC 1924 / Z85-style) encoder and decoder for Rust.
 
-On `aarch64` the encode path uses NEON intrinsics to process 16 input bytes
-per iteration; the decode path and the fallback for other architectures use
-a portable scalar implementation. Output is byte-for-byte compatible with
-the [`base85`](https://crates.io/crates/base85) crate.
+SIMD-accelerated on aarch64 (NEON, 4 blocks per iteration) and x86_64
+(AVX2, 8 blocks per iteration), with a portable scalar fallback for
+everything else and for x86_64 hosts lacking AVX2 (rare on server
+hardware newer than Haswell, 2013). Output is byte-for-byte compatible
+with the [`base85`](https://crates.io/crates/base85) crate.
 
 ## Usage
 
@@ -33,7 +34,10 @@ for decoding.
 
 ## Status
 
 - Public API: `encode(&[u8]) -> String`, `decode(&str) -> Result<Vec<u8>, DecodeError>`.
-- Encode and decode are both NEON-accelerated on `aarch64`, scalar elsewhere.
+- aarch64: NEON-accelerated (4 blocks at a time, always available on aarch64).
+- x86_64 with AVX2: AVX2-accelerated (8 blocks at a time). Runtime feature
+  detection at the public API entry — hosts without AVX2 fall back to scalar.
+- Other architectures: portable scalar implementation.
 - The decode path validates char range and detects `u32` overflow lane-wise;
   any invalid input falls back to the scalar path so the resulting `DecodeError`
   carries a precise byte position.
@@ -74,43 +78,82 @@ strictly more diagnostic information. Code that pattern-matches on
 
 ## Benchmarks
 
-Apple M-series (aarch64), `cargo bench --bench encode`, criterion, release
-profile, single-threaded. Times are the criterion-reported median; throughput
-is computed from the median.
-
-### Encode
-
-| size | `base85` time | `base85-simd` time | speedup | `base85-simd` throughput |
-|-------|--------------:|-------------------:|--------:|-------------------------:|
-| 16 B | 17.4 ns | 16.3 ns | 1.07× | ~940 MiB/s |
-| 64 B | 33.0 ns | 22.6 ns | 1.46× | 2.71 GiB/s |
-| 256 B | 115.7 ns | 73.2 ns | 1.58× | 3.26 GiB/s |
-| 1 KiB | 378.3 ns | 225.1 ns | 1.68× | 4.24 GiB/s |
-| 16 KiB| 5.56 µs | 3.60 µs | 1.55× | 4.24 GiB/s |
-| 256 KiB| 89.3 µs | 55.4 µs | 1.61× | 4.41 GiB/s |
-| 1 MiB | 356 µs | 222 µs | 1.61× | 4.40 GiB/s |
-
-### Decode
-
-| size | `base85` time | `base85-simd` time | speedup | `base85-simd` throughput |
-|-------|--------------:|-------------------:|--------:|-------------------------:|
-| 16 B | 32.4 ns | 14.8 ns | 2.18× | ~1.0 GiB/s |
-| 64 B | 123.5 ns | 24.6 ns | 5.02× | 2.42 GiB/s |
-| 256 B | 579 ns | 57.6 ns | 10.05× | 4.14 GiB/s |
-| 1 KiB | 2.27 µs | 226 ns | 10.06× | 4.22 GiB/s |
-| 16 KiB| 36.8 µs | 3.49 µs | 10.55× | 4.38 GiB/s |
-| 256 KiB| 591 µs | 54.3 µs | 10.89× | 4.50 GiB/s |
-| 1 MiB | 2.28 ms | 217.6 µs | 10.49× | 4.49 GiB/s |
+`cargo bench --bench encode`, criterion, release profile, single-threaded.
+Times are the criterion-reported median; throughput is computed from it.
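+As a worked example: the 1 MiB encode row below reports 222 µs, and
+1,048,576 B / 222 µs ≈ 4.72 GB/s = 4.40 GiB/s.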
+ +### aarch64 (Apple M-series) + +#### Encode + +| size | `base85` | `base85-simd` (NEON) | speedup | throughput | +|-------|---------:|---------------------:|--------:|-----------:| +| 16 B | 17.4 ns | 16.3 ns | 1.07× | ~940 MiB/s | +| 64 B | 33.0 ns | 22.6 ns | 1.46× | 2.71 GiB/s | +| 256 B | 115.7 ns | 73.2 ns | 1.58× | 3.26 GiB/s | +| 1 KiB | 378.3 ns | 225.1 ns | 1.68× | 4.24 GiB/s | +| 16 KiB| 5.56 µs | 3.60 µs | 1.55× | 4.24 GiB/s | +| 256 KiB| 89.3 µs | 55.4 µs | 1.61× | 4.41 GiB/s | +| 1 MiB | 356 µs | 222 µs | 1.61× | 4.40 GiB/s | + +#### Decode + +| size | `base85` | `base85-simd` (NEON) | speedup | throughput | +|-------|---------:|---------------------:|--------:|-----------:| +| 16 B | 32.4 ns | 14.8 ns | 2.18× | ~1.0 GiB/s | +| 64 B | 123.5 ns | 24.6 ns | 5.02× | 2.42 GiB/s | +| 256 B | 579 ns | 57.6 ns | 10.05× | 4.14 GiB/s | +| 1 KiB | 2.27 µs | 226 ns | 10.06× | 4.22 GiB/s | +| 16 KiB| 36.8 µs | 3.49 µs | 10.55× | 4.38 GiB/s | +| 256 KiB| 591 µs | 54.3 µs | 10.89× | 4.50 GiB/s | +| 1 MiB | 2.28 ms | 217.6 µs | 10.49× | 4.49 GiB/s | + +### x86_64 (AMD EPYC 7763, Zen 3) + +Numbers from a GitHub Actions hosted Ubuntu runner — shared/virtualised +hardware so noise is higher than aarch64 (~5–15% variance), but the +relative speedups are stable. + +#### Encode + +| size | `base85` | `base85-simd` (AVX2) | speedup | throughput | +|-------|---------:|---------------------:|--------:|-----------:| +| 16 B | 41.0 ns | 57.8 ns | 0.71× | (scalar fallback; chunk doesn't fit) | +| 64 B | 100.5 ns | 61.7 ns | 1.63× | 989 MiB/s | +| 256 B | 341.9 ns | 165.9 ns | 2.06× | 1.44 GiB/s | +| 1 KiB | 1.32 µs | 507.0 ns | 2.61× | 1.88 GiB/s | +| 16 KiB| 20.4 µs | 7.48 µs | 2.72× | 2.04 GiB/s | +| 256 KiB| 323.9 µs| 118.9 µs | 2.72× | 2.05 GiB/s | +| 1 MiB | 1.31 ms | 482.9 µs | 2.71× | 2.07 GiB/s | + +#### Decode + +| size | `base85` | `base85-simd` (AVX2) | speedup | throughput | +|-------|---------:|---------------------:|--------:|-----------:| +| 16 B | 70.2 ns | 61.2 ns | 1.15× | (scalar fallback) | +| 64 B | 244.8 ns | 67.6 ns | 3.62× | 903 MiB/s | +| 256 B | 1.046 µs | 164.4 ns | 6.36× | 1.45 GiB/s | +| 1 KiB | 4.19 µs | 466.5 ns | 8.98× | 2.04 GiB/s | +| 16 KiB| 65.7 µs | 6.75 µs | 9.73× | 2.26 GiB/s | +| 256 KiB| 1.058 ms| 107.2 µs | 9.86× | 2.28 GiB/s | +| 1 MiB | 4.14 ms | 431.6 µs | 9.59× | 2.32 GiB/s | ### Steady-state summary -At sizes large enough to amortise loop setup (≥ 256 B), `base85-simd` -sustains **~4.4 GiB/s** for both encode and decode on Apple M-series, -roughly **1.6× faster** than the reference for encode and **~10× -faster** for decode. The decode advantage comes from `TBL`-based ASCII -→ digit lookup that replaces the reference's per-character branchy -match; the encode advantage comes from 4-lane parallel `divmod 85` -plus `TBX`-based digit → ASCII / output permutation. +At sizes large enough to amortise the SIMD loop setup (≥ 256 B): + +| arch / ISA | encode throughput | encode speedup | decode throughput | decode speedup | +|-------------|------------------:|---------------:|------------------:|---------------:| +| aarch64 NEON| 4.40 GiB/s | 1.61× | 4.49 GiB/s | 10.49× | +| x86_64 AVX2 | 2.07 GiB/s | 2.71× | 2.32 GiB/s | 9.59× | + +The decode speedup ratio is roughly the same on both architectures +(~10×), driven by SIMD-accelerated ASCII → digit table lookup +replacing the reference's per-character branchy match. 
NEON sustains
+roughly 2× the absolute throughput of AVX2 here. Part of that is
+simply different silicon (Apple M-series vs EPYC), but the structural
+difference is lookup width: `vqtbl4q_u8` does a 64-entry table lookup
+in a single instruction, whereas x86 PSHUFB is limited to 16 entries,
+so the same lookup expands to six PSHUFBs plus mask-and-merge per
+chunk on x86. AVX-512 VBMI's `vpermb` would close that gap but isn't
+available on the AMD silicon used by GitHub's runner fleet.
 
 Reproduce with:
 
diff --git a/src/block.rs b/src/block.rs
index ef040fc..3a66f35 100644
--- a/src/block.rs
+++ b/src/block.rs
@@ -150,6 +150,48 @@ pub(crate) fn decode_tail(
     Ok(())
 }
 
+// ─────────────────────────────────────────────────────────────────────────
+// SIMD shared tables (used by both NEON and AVX2 paths)
+// ─────────────────────────────────────────────────────────────────────────
+
+// Per-digit splice indexes for the 4-block encode shuffle. Output position
+// of digit k of block i is `5*i + k`; positions ≥ 16 spill into out_2 (with
+// offset −16). Each v_dk is a u32x4 with the digit in byte 0/4/8/12 of its
+// lane, hence the source offsets `4*i`. Both NEON (vqtbx1q_u8) and AVX2
+// (PSHUFB, broadcast to both 128-bit halves) consume identical index
+// patterns.
+
+#[cfg(any(target_arch = "aarch64", target_arch = "x86_64"))]
+#[rustfmt::skip]
+pub(super) static IDX_OUT_1: [[u8; 16]; 5] = [
+    // k=0: out_1[0,5,10,15] ← src bytes 0,4,8,12
+    [0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF, 0xFF, 0xFF, 12],
+    // k=1: out_1[1,6,11] ← 0,4,8 (block 3 → out_2)
+    [0xFF, 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF, 0xFF, 0xFF],
+    // k=2: out_1[2,7,12] ← 0,4,8
+    [0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF, 0xFF],
+    // k=3: out_1[3,8,13] ← 0,4,8
+    [0xFF, 0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF],
+    // k=4: out_1[4,9,14] ← 0,4,8
+    [0xFF, 0xFF, 0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF],
+];
+
+// Indexes for digits 1..4 of block 3 (src byte 12) spilling into out_2.
+#[cfg(any(target_arch = "aarch64", target_arch = "x86_64"))]
+#[rustfmt::skip]
+pub(super) static IDX_OUT_2: [[u8; 16]; 4] = [
+    [12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF],
+    [0xFF, 12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF],
+    [0xFF, 0xFF, 12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF],
+    [0xFF, 0xFF, 0xFF, 12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF],
+];
+
+// Mask used by the SIMD decoders: 0xFF for the 4 real-input lanes of the
+// padded tail register, 0 for the 12 padding lanes. Lets us suppress
+// validation noise from the padding bytes.
+#[cfg(any(target_arch = "aarch64", target_arch = "x86_64"))]
+pub(super) static TAIL_VALID_MASK: [u8; 16] =
+    [0xFF, 0xFF, 0xFF, 0xFF, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
+
 // ─────────────────────────────────────────────────────────────────────────
 // NEON encoder (aarch64)
 // ─────────────────────────────────────────────────────────────────────────
@@ -172,52 +214,7 @@ mod neon {
         vreinterpretq_u8_u32, vreinterpretq_u32_u8, vrev32q_u8, vst1q_u8, vsubq_u8,
     };
 
-    // Per-digit splice indexes, computed at compile time and loaded
-    // from .rodata each call. Output position of digit k of block i is
-    // `5*i + k`; positions ≥ 16 spill into out_2 (with offset −16).
-    // Each v_dk is a u32x4 with the digit in byte 0/4/8/12 of its lane,
-    // hence the source offsets `4*i`.
- static IDX_OUT_1: [[u8; 16]; 5] = [ - // k=0: out_1[0,5,10,15] ← src bytes 0,4,8,12 - [ - 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF, 0xFF, 0xFF, 12, - ], - // k=1: out_1[1,6,11] ← 0,4,8 (block 3 → out_2) - [ - 0xFF, 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF, 0xFF, 0xFF, - ], - // k=2: out_1[2,7,12] ← 0,4,8 - [ - 0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF, 0xFF, - ], - // k=3: out_1[3,8,13] ← 0,4,8 - [ - 0xFF, 0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF, - ], - // k=4: out_1[4,9,14] ← 0,4,8 - [ - 0xFF, 0xFF, 0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF, 0xFF, 4, 0xFF, 0xFF, 0xFF, 0xFF, 8, 0xFF, - ], - ]; - // Indexes for digits 1..4 spilling into out_2 (block 3, src byte 12). - static IDX_OUT_2: [[u8; 16]; 4] = [ - [ - 12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, - 0xFF, - ], - [ - 0xFF, 12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, - 0xFF, - ], - [ - 0xFF, 0xFF, 12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, - 0xFF, - ], - [ - 0xFF, 0xFF, 0xFF, 12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, - 0xFF, - ], - ]; + use super::{IDX_OUT_1, IDX_OUT_2, TAIL_VALID_MASK}; /// Encodes 16 input bytes into 20 output characters using NEON. /// @@ -339,10 +336,6 @@ mod neon { tail_buf[..4].copy_from_slice(&input[16..20]); let chars_tail = unsafe { vld1q_u8(tail_buf.as_ptr()) }; - // Mask: 0xFF for the 4 real-input lanes of `chars_tail`, 0 for the - // 12 padding lanes — used to suppress validation noise from padding. - static TAIL_VALID_MASK: [u8; 16] = - [0xFF, 0xFF, 0xFF, 0xFF, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]; let tail_valid_mask = unsafe { vld1q_u8(TAIL_VALID_MASK.as_ptr()) }; // Char range validation: any byte not in [33, 126] is invalid. @@ -514,6 +507,386 @@ mod neon { } } +// ───────────────────────────────────────────────────────────────────────── +// AVX2 encoder / decoder (x86_64) +// ───────────────────────────────────────────────────────────────────────── +// +// Processes 8 blocks per call (32 input bytes, 40 output chars) by +// running the same SSE4.1 algorithm across both 128-bit lanes of an +// `__m256i`. AVX2's lane-restricted byte/word ops (PSHUFB, PSHUFD, +// BLEND) work fine here because the algorithm is fully per-lane — +// each lane handles 4 independent blocks. +// +// On hosts without AVX2 the crate's runtime feature gate routes to +// the scalar fallback (see `crate::encode_scalar` / `crate::decode_scalar`). + +#[cfg(target_arch = "x86_64")] +pub(crate) use avx2::{Avx2Encoder, try_decode_block_x8}; + +#[cfg(target_arch = "x86_64")] +mod avx2 { + //! AVX2 encode/decode — 8 blocks per call. + //! + //! Both 128-bit lanes of each `__m256i` independently run the + //! algorithm we developed for SSE4.1 (4 blocks per lane, 8 total). + //! AVX2 byte/word ops are lane-restricted (PSHUFB / BLEND / + //! SHUFFLE_EPI32 / etc. operate per 128-bit half), but every + //! step of the algorithm is per-lane anyway, so that's fine. + //! + //! Every entry point is `unsafe fn` with + //! `#[target_feature(enable = "avx2")]`. The caller + //! (`crate::encode_into` / `crate::decode_into`) does a one-shot + //! `is_x86_feature_detected!("avx2")` check; on hosts lacking + //! AVX2 the crate routes to the scalar fallback. + //! + //! The module-level `#![allow(unsafe_op_in_unsafe_fn)]` keeps the + //! 
body of each `unsafe fn` an implicit unsafe scope on MSRV 1.85 + //! (without it, each intrinsic call needs its own `unsafe { … }`). + + #![allow(unsafe_op_in_unsafe_fn)] + #![allow(clippy::indexing_slicing)] + + use crate::ops::{div_85, div_85_cube, div_85_sq}; + use std::arch::x86_64::{ + __m128i, __m256i, _mm_loadu_si128, _mm_storeu_si128, _mm256_add_epi32, _mm256_and_si256, + _mm256_broadcastsi128_si256, _mm256_cmpeq_epi8, _mm256_cmpgt_epi8, _mm256_cmpgt_epi32, + _mm256_extract_epi32, _mm256_extracti128_si256, _mm256_loadu_si256, _mm256_movemask_epi8, + _mm256_mullo_epi32, _mm256_or_si256, _mm256_set_m128i, _mm256_set1_epi8, _mm256_set1_epi32, + _mm256_setzero_si256, _mm256_shuffle_epi8, _mm256_srli_epi16, _mm256_storeu_si256, + _mm256_sub_epi32, _mm256_xor_si256, + }; + + use super::{IDX_OUT_1, IDX_OUT_2, TAIL_VALID_MASK}; + + static REV32_IDX: [u8; 16] = [3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12]; + + #[rustfmt::skip] + static ALPHABET_PADDED: &[u8; 96] = + b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~\0\0\0\0\0\0\0\0\0\0\0"; + + #[rustfmt::skip] + static CHAR_TO_85_PADDED: [u8; 96] = [ + 62, 0xFF, 63, 64, 65, 66, 0xFF, 67, 68, 69, 70, 0xFF, 71, 0xFF, 0xFF, 0, + 1, 2, 3, 4, 5, 6, 7, 8, 9, 0xFF, 72, 73, 74, 75, 76, 77, + 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, + 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 0xFF, 0xFF, 0xFF, 78, 79, 80, + 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, + 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 81, 82, 83, 84, 0xFF, 0xFF, + ]; + + #[rustfmt::skip] + static IDX_DEC: [([u8; 16], [u8; 16]); 5] = [ + ( + [0, 0xFF, 0xFF, 0xFF, 5, 0xFF, 0xFF, 0xFF, 10, 0xFF, 0xFF, 0xFF, 15, 0xFF, 0xFF, 0xFF], + [0xFF; 16], + ), + ( + [1, 0xFF, 0xFF, 0xFF, 6, 0xFF, 0xFF, 0xFF, 11, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF], + [0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0, 0xFF, 0xFF, 0xFF], + ), + ( + [2, 0xFF, 0xFF, 0xFF, 7, 0xFF, 0xFF, 0xFF, 12, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF], + [0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 1, 0xFF, 0xFF, 0xFF], + ), + ( + [3, 0xFF, 0xFF, 0xFF, 8, 0xFF, 0xFF, 0xFF, 13, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF], + [0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 2, 0xFF, 0xFF, 0xFF], + ), + ( + [4, 0xFF, 0xFF, 0xFF, 9, 0xFF, 0xFF, 0xFF, 14, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF], + [0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 3, 0xFF, 0xFF, 0xFF], + ), + ]; + + /// Loads a 16-byte index from .rodata and broadcasts it to both + /// 128-bit lanes of an `__m256i`. + /// + /// # Safety + /// `ptr` must be valid for a 16-byte read. Caller has AVX2. + #[inline] + #[target_feature(enable = "avx2")] + unsafe fn broadcast16(ptr: *const u8) -> __m256i { + _mm256_broadcastsi128_si256(_mm_loadu_si128(ptr.cast::<__m128i>())) + } + + /// AVX2 8-block encoder. Mirrors the structure of NEON's + /// `NeonEncoder` but processes 8 input blocks (32 bytes → 40 chars) + /// per call. + pub(crate) struct Avx2Encoder; + + impl Avx2Encoder { + #[inline] + pub(crate) fn new() -> Self { + Self + } + + /// Encode exactly 32 input bytes into exactly 40 output bytes. + /// + /// # Safety + /// Requires AVX2 to be available on the host CPU. 
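+        // Digit math per block: the big-endian u32 n splits as
+        // n = d0·85⁴ + d1·85³ + d2·85² + d3·85 + d4 (d0 most
+        // significant); e.g. n = 84 gives digits [0, 0, 0, 0, 84].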
+        #[inline]
+        #[target_feature(enable = "avx2")]
+        pub(crate) unsafe fn encode_block_x8(&self, in_bytes: &[u8; 32], out_bytes: &mut [u8; 40]) {
+            // Load 32 bytes (8 × 4-byte blocks) and reverse each 4-byte
+            // group to get 8 BE u32s.
+            let raw = _mm256_loadu_si256(in_bytes.as_ptr().cast::<__m256i>());
+            let n = _mm256_shuffle_epi8(raw, broadcast16(REV32_IDX.as_ptr()));
+
+            // Parallel-magic divides — 8 lanes wide.
+            let q1 = div_85(n);
+            let q2 = div_85_sq(n);
+            let q3 = div_85_cube(n);
+            let q4 = div_85_sq(q2);
+
+            // Five digits per block via lane-wise multiply-subtract.
+            let m85 = _mm256_set1_epi32(85);
+            let d0 = q4;
+            let d1 = _mm256_sub_epi32(q3, _mm256_mullo_epi32(q4, m85));
+            let d2 = _mm256_sub_epi32(q2, _mm256_mullo_epi32(q3, m85));
+            let d3 = _mm256_sub_epi32(q1, _mm256_mullo_epi32(q2, m85));
+            let d4 = _mm256_sub_epi32(n, _mm256_mullo_epi32(q1, m85));
+
+            // Splice 5 digits into out_1 (32 bytes; 16 per lane — each
+            // lane holds the first 16 chars of its 4-block group).
+            // PSHUFB is per-lane; the same index broadcast to both lanes
+            // does the right thing.
+            let idx0 = broadcast16(IDX_OUT_1[0].as_ptr());
+            let idx1 = broadcast16(IDX_OUT_1[1].as_ptr());
+            let idx2 = broadcast16(IDX_OUT_1[2].as_ptr());
+            let idx3 = broadcast16(IDX_OUT_1[3].as_ptr());
+            let idx4 = broadcast16(IDX_OUT_1[4].as_ptr());
+            let p0 = _mm256_shuffle_epi8(d0, idx0);
+            let p1 = _mm256_shuffle_epi8(d1, idx1);
+            let p2 = _mm256_shuffle_epi8(d2, idx2);
+            let p3 = _mm256_shuffle_epi8(d3, idx3);
+            let p4 = _mm256_shuffle_epi8(d4, idx4);
+            let out_1 = _mm256_or_si256(
+                _mm256_or_si256(p0, p1),
+                _mm256_or_si256(p2, _mm256_or_si256(p3, p4)),
+            );
+
+            // Same for out_2 (8 bytes total; 4 per lane — the trailing
+            // 4 chars of each 4-block group).
+            let i20 = broadcast16(IDX_OUT_2[0].as_ptr());
+            let i21 = broadcast16(IDX_OUT_2[1].as_ptr());
+            let i22 = broadcast16(IDX_OUT_2[2].as_ptr());
+            let i23 = broadcast16(IDX_OUT_2[3].as_ptr());
+            let q21 = _mm256_shuffle_epi8(d1, i20);
+            let q22 = _mm256_shuffle_epi8(d2, i21);
+            let q23 = _mm256_shuffle_epi8(d3, i22);
+            let q24 = _mm256_shuffle_epi8(d4, i23);
+            let out_2 = _mm256_or_si256(_mm256_or_si256(q21, q22), _mm256_or_si256(q23, q24));
+
+            // Digit value (0..84) → ASCII char.
+            let out_1 = byte_to_char85_avx2(out_1);
+            let out_2 = byte_to_char85_avx2(out_2);
+
+            // Store. Output layout is non-contiguous between lanes:
+            //   bytes  0..16 — lane 0's 16-char main
+            //   bytes 16..20 — lane 0's 4-char tail
+            //   bytes 20..36 — lane 1's 16-char main
+            //   bytes 36..40 — lane 1's 4-char tail
+            let lo_main = _mm256_extracti128_si256::<0>(out_1);
+            let hi_main = _mm256_extracti128_si256::<1>(out_1);
+            let lo_tail = _mm256_extract_epi32::<0>(out_2) as u32;
+            let hi_tail = _mm256_extract_epi32::<4>(out_2) as u32;
+
+            _mm_storeu_si128(out_bytes.as_mut_ptr().cast::<__m128i>(), lo_main);
+            std::ptr::write_unaligned(out_bytes.as_mut_ptr().add(16).cast::<u32>(), lo_tail);
+            _mm_storeu_si128(out_bytes.as_mut_ptr().add(20).cast::<__m128i>(), hi_main);
+            std::ptr::write_unaligned(out_bytes.as_mut_ptr().add(36).cast::<u32>(), hi_tail);
+        }
+    }
+
+    /// Maps 32 lane-wise digit values (0..84) to ASCII characters.
+    ///
+    /// Same chunk-index-selection trick as the SSE port (PSHUFB zeroes
+    /// an output byte when its index has bit 7 set), just doubled.
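+    ///
+    /// Worked example of the nibble split: digit 62 = 0x3E has high
+    /// nibble 3 and low nibble 14, so the chunk mask selects `t3`
+    /// (alphabet bytes 48..64) and `t3[14]` = alphabet byte 62 = `b'!'`.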
+    #[inline]
+    #[target_feature(enable = "avx2")]
+    unsafe fn byte_to_char85_avx2(x85: __m256i) -> __m256i {
+        let high_nib = _mm256_and_si256(_mm256_srli_epi16::<4>(x85), _mm256_set1_epi8(0x0F));
+        let low_nib = _mm256_and_si256(x85, _mm256_set1_epi8(0x0F));
+
+        let t0 = broadcast16(ALPHABET_PADDED.as_ptr());
+        let t1 = broadcast16(ALPHABET_PADDED.as_ptr().add(16));
+        let t2 = broadcast16(ALPHABET_PADDED.as_ptr().add(32));
+        let t3 = broadcast16(ALPHABET_PADDED.as_ptr().add(48));
+        let t4 = broadcast16(ALPHABET_PADDED.as_ptr().add(64));
+        let t5 = broadcast16(ALPHABET_PADDED.as_ptr().add(80));
+
+        let r0 = _mm256_shuffle_epi8(t0, low_nib);
+        let r1 = _mm256_shuffle_epi8(t1, low_nib);
+        let r2 = _mm256_shuffle_epi8(t2, low_nib);
+        let r3 = _mm256_shuffle_epi8(t3, low_nib);
+        let r4 = _mm256_shuffle_epi8(t4, low_nib);
+        let r5 = _mm256_shuffle_epi8(t5, low_nib);
+
+        let m0 = _mm256_cmpeq_epi8(high_nib, _mm256_setzero_si256());
+        let m1 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(1));
+        let m2 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(2));
+        let m3 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(3));
+        let m4 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(4));
+        let m5 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(5));
+
+        _mm256_or_si256(
+            _mm256_or_si256(
+                _mm256_or_si256(_mm256_and_si256(r0, m0), _mm256_and_si256(r1, m1)),
+                _mm256_or_si256(_mm256_and_si256(r2, m2), _mm256_and_si256(r3, m3)),
+            ),
+            _mm256_or_si256(_mm256_and_si256(r4, m4), _mm256_and_si256(r5, m5)),
+        )
+    }
+
+    /// Maps 32 ASCII chars to base85 digit values (0..84) lane-wise.
+    /// In-range (33..=126) chars not in the alphabet come back as 0xFF;
+    /// bytes outside that range may map to anything, so callers pair
+    /// this with their own range check (the decoder does).
+    #[inline]
+    #[target_feature(enable = "avx2")]
+    unsafe fn char85_to_byte_avx2(chars: __m256i) -> __m256i {
+        // Byte-wise `chars - 33` done with a u32 subtract: a borrow can
+        // only cross a byte boundary when some byte is < 33, and such
+        // bytes are independently rejected by the caller's range check
+        // (tail padding is masked), so digits corrupted by a borrow
+        // never survive to the output.
+        let normalised = _mm256_sub_epi32(chars, _mm256_set1_epi8(33));
+        let high_nib = _mm256_and_si256(_mm256_srli_epi16::<4>(normalised), _mm256_set1_epi8(0x0F));
+        let low_nib = _mm256_and_si256(normalised, _mm256_set1_epi8(0x0F));
+
+        let t0 = broadcast16(CHAR_TO_85_PADDED.as_ptr());
+        let t1 = broadcast16(CHAR_TO_85_PADDED.as_ptr().add(16));
+        let t2 = broadcast16(CHAR_TO_85_PADDED.as_ptr().add(32));
+        let t3 = broadcast16(CHAR_TO_85_PADDED.as_ptr().add(48));
+        let t4 = broadcast16(CHAR_TO_85_PADDED.as_ptr().add(64));
+        let t5 = broadcast16(CHAR_TO_85_PADDED.as_ptr().add(80));
+
+        let r0 = _mm256_shuffle_epi8(t0, low_nib);
+        let r1 = _mm256_shuffle_epi8(t1, low_nib);
+        let r2 = _mm256_shuffle_epi8(t2, low_nib);
+        let r3 = _mm256_shuffle_epi8(t3, low_nib);
+        let r4 = _mm256_shuffle_epi8(t4, low_nib);
+        let r5 = _mm256_shuffle_epi8(t5, low_nib);
+
+        let m0 = _mm256_cmpeq_epi8(high_nib, _mm256_setzero_si256());
+        let m1 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(1));
+        let m2 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(2));
+        let m3 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(3));
+        let m4 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(4));
+        let m5 = _mm256_cmpeq_epi8(high_nib, _mm256_set1_epi8(5));
+
+        _mm256_or_si256(
+            _mm256_or_si256(
+                _mm256_or_si256(_mm256_and_si256(r0, m0), _mm256_and_si256(r1, m1)),
+                _mm256_or_si256(_mm256_and_si256(r2, m2), _mm256_and_si256(r3, m3)),
+            ),
+            _mm256_or_si256(_mm256_and_si256(r4, m4), _mm256_and_si256(r5, m5)),
+        )
+    }
+
+    /// AVX2 8-block decoder. 40 chars → 32 bytes.
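+    /// Consumes the layout `encode_block_x8` produces: chars 0..16 and
+    /// 16..20 are lane 0's main/tail group, chars 20..36 and 36..40 are
+    /// lane 1's.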
+    ///
+    /// Returns `Err(())` (so the caller falls back to the scalar path,
+    /// which will surface a precise [`crate::DecodeError`]) when:
+    /// - any input byte is outside the base85 alphabet, or
+    /// - any of the eight 5-char blocks decodes to a value > `u32::MAX`.
+    ///
+    /// # Safety
+    /// Requires AVX2 on the host CPU.
+    #[inline]
+    #[target_feature(enable = "avx2")]
+    pub(crate) unsafe fn try_decode_block_x8(
+        input: &[u8; 40],
+        out: &mut [u8; 32],
+    ) -> Result<(), ()> {
+        // Lay out the 40 chars in the 256-bit register so each
+        // 128-bit lane handles 4 blocks. Lane 0 ← chars 0..16
+        // (first 16 chars of blocks 0..3); lane 1 ← chars 20..36
+        // (first 16 chars of blocks 4..7). The 4-char tails (chars
+        // 16..20 and 36..40) go into a separate "chars_tail" with
+        // 12 zero-padding bytes per lane.
+        let chars_main_lo = _mm_loadu_si128(input.as_ptr().cast::<__m128i>());
+        let chars_main_hi = _mm_loadu_si128(input.as_ptr().add(20).cast::<__m128i>());
+        let chars_main = _mm256_set_m128i(chars_main_hi, chars_main_lo);
+
+        let mut tail_buf_lo = [0u8; 16];
+        let mut tail_buf_hi = [0u8; 16];
+        tail_buf_lo[..4].copy_from_slice(&input[16..20]);
+        tail_buf_hi[..4].copy_from_slice(&input[36..40]);
+        let chars_tail = _mm256_set_m128i(
+            _mm_loadu_si128(tail_buf_hi.as_ptr().cast::<__m128i>()),
+            _mm_loadu_si128(tail_buf_lo.as_ptr().cast::<__m128i>()),
+        );
+        let tail_valid_mask = broadcast16(TAIL_VALID_MASK.as_ptr());
+
+        // Range validation: chars must be in [33, 126].
+        let too_low_main = _mm256_cmpgt_epi8(_mm256_set1_epi8(33), chars_main);
+        let too_high_main = _mm256_cmpgt_epi8(chars_main, _mm256_set1_epi8(126));
+        let invalid_range_main = _mm256_or_si256(too_low_main, too_high_main);
+
+        let too_low_tail = _mm256_cmpgt_epi8(_mm256_set1_epi8(33), chars_tail);
+        let too_high_tail = _mm256_cmpgt_epi8(chars_tail, _mm256_set1_epi8(126));
+        let invalid_range_tail = _mm256_and_si256(
+            _mm256_or_si256(too_low_tail, too_high_tail),
+            tail_valid_mask,
+        );
+
+        // ASCII → digit lookup (0xFF for in-range invalid chars).
+        let digits_main = char85_to_byte_avx2(chars_main);
+        let digits_tail = char85_to_byte_avx2(chars_tail);
+
+        let invalid_dig_main = _mm256_cmpeq_epi8(digits_main, _mm256_set1_epi8(0xFF_u8 as i8));
+        let invalid_dig_tail = _mm256_and_si256(
+            _mm256_cmpeq_epi8(digits_tail, _mm256_set1_epi8(0xFF_u8 as i8)),
+            tail_valid_mask,
+        );
+
+        let any_invalid = _mm256_or_si256(
+            _mm256_or_si256(invalid_range_main, invalid_range_tail),
+            _mm256_or_si256(invalid_dig_main, invalid_dig_tail),
+        );
+        // _mm256_movemask_epi8 returns a 32-bit mask (one bit per
+        // byte of the 256-bit register).
+        if _mm256_movemask_epi8(any_invalid) != 0 {
+            return Err(());
+        }
+
+        // Permute digits into per-position vectors. Same indexes
+        // as the SSE path, broadcast to both lanes.
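+        // Example of the mapping: v_d1 lane 3 (block 3's second digit)
+        // comes from chars_tail byte 0, i.e. char 16 of the 20-char
+        // group, while lanes 0..=2 come from chars_main bytes 1, 6, 11.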
+        let v_dk = |idx_main: &[u8; 16], idx_tail: &[u8; 16]| -> __m256i {
+            let im = broadcast16(idx_main.as_ptr());
+            let it = broadcast16(idx_tail.as_ptr());
+            _mm256_or_si256(
+                _mm256_shuffle_epi8(digits_main, im),
+                _mm256_shuffle_epi8(digits_tail, it),
+            )
+        };
+        let v_d0 = v_dk(&IDX_DEC[0].0, &IDX_DEC[0].1);
+        let v_d1 = v_dk(&IDX_DEC[1].0, &IDX_DEC[1].1);
+        let v_d2 = v_dk(&IDX_DEC[2].0, &IDX_DEC[2].1);
+        let v_d3 = v_dk(&IDX_DEC[3].0, &IDX_DEC[3].1);
+        let v_d4 = v_dk(&IDX_DEC[4].0, &IDX_DEC[4].1);
+
+        // Horner: n = ((((d0·85 + d1)·85 + d2)·85 + d3)·85 + d4)
+        let m85 = _mm256_set1_epi32(85);
+        let n = _mm256_add_epi32(v_d1, _mm256_mullo_epi32(v_d0, m85));
+        let n = _mm256_add_epi32(v_d2, _mm256_mullo_epi32(n, m85));
+        let n = _mm256_add_epi32(v_d3, _mm256_mullo_epi32(n, m85));
+        let n = _mm256_add_epi32(v_d4, _mm256_mullo_epi32(n, m85));
+
+        // Overflow detection: n_true = d0·85⁴ + rest with rest < 85⁴,
+        // so n_true ≥ 2³² shows up as the wrapped Horner result being
+        // unsigned-less-than d0·85⁴ (compared via bias-and-signed-cmpgt).
+        // That only holds while d0·85⁴ itself fits in u32, i.e. d0 ≤ 82;
+        // for d0 ∈ {83, 84} the product wraps too (83·85⁴ > 2³²) and the
+        // compare misses, so flag those lanes directly; they always
+        // overflow.
+        let d0_85_4 = _mm256_mullo_epi32(v_d0, _mm256_set1_epi32(52_200_625));
+        let bias = _mm256_set1_epi32(0x80000000_u32 as i32);
+        let n_b = _mm256_xor_si256(n, bias);
+        let d0_b = _mm256_xor_si256(d0_85_4, bias);
+        let wrapped = _mm256_cmpgt_epi32(d0_b, n_b);
+        let d0_too_big = _mm256_cmpgt_epi32(v_d0, _mm256_set1_epi32(82));
+        let overflow_mask = _mm256_or_si256(wrapped, d0_too_big);
+        if _mm256_movemask_epi8(overflow_mask) != 0 {
+            return Err(());
+        }
+
+        // Byte-swap each u32 lane to BE and store. Output is 32
+        // contiguous bytes (16 from each lane), so a single
+        // _mm256_storeu_si256 works.
+        let bytes_be = _mm256_shuffle_epi8(n, broadcast16(REV32_IDX.as_ptr()));
+        _mm256_storeu_si256(out.as_mut_ptr().cast::<__m256i>(), bytes_be);
+
+        Ok(())
+    }
+}
+
 #[cfg(test)]
 mod tests {
     use super::*;
diff --git a/src/lib.rs b/src/lib.rs
index 035490c..6a759cf 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -196,8 +196,69 @@ fn encode_into(input: &[u8], out: &mut [u8]) {
     }
 }
 
-#[cfg(not(target_arch = "aarch64"))]
+#[cfg(target_arch = "x86_64")]
 fn encode_into(input: &[u8], out: &mut [u8]) {
+    use block::Avx2Encoder;
+
+    // Runtime feature gate: the AVX2 intrinsics are gated on
+    // `#[target_feature(enable = "avx2")]`; calling them on a CPU
+    // lacking AVX2 is UB. Detect once at the entry point and fall
+    // back to the scalar path otherwise. The branch predicts
+    // perfectly across calls so the cost is negligible.
+    if !std::is_x86_feature_detected!("avx2") {
+        encode_scalar(input, out);
+        return;
+    }
+
+    let encoder = Avx2Encoder::new();
+    let mut in_off = 0usize;
+    let mut out_off = 0usize;
+
+    // 32 → 40 byte AVX2 chunks (8 blocks per iteration).
+    while in_off.saturating_add(32) <= input.len() {
+        let Some(in_chunk) = chunk_at::<32>(input, in_off) else {
+            return;
+        };
+        let Some(out_chunk) = chunk_at_mut::<40>(out, out_off) else {
+            return;
+        };
+        // SAFETY: AVX2 verified above.
+ unsafe { encoder.encode_block_x8(in_chunk, out_chunk) }; + in_off = in_off.saturating_add(32); + out_off = out_off.saturating_add(40); + } + + while in_off.saturating_add(4) <= input.len() { + let Some(block) = chunk_at::<4>(input, in_off) else { + return; + }; + let Some(chunk) = chunk_at_mut::<5>(out, out_off) else { + return; + }; + encode_block(block, chunk); + in_off = in_off.saturating_add(4); + out_off = out_off.saturating_add(5); + } + + let tail_in_len = input.len().saturating_sub(in_off); + if tail_in_len > 0 { + let tail_out_len = tail_in_len.saturating_add(1); + if let (Some(tail_in), Some(tail_out)) = ( + input.get(in_off..in_off.saturating_add(tail_in_len)), + out.get_mut(out_off..out_off.saturating_add(tail_out_len)), + ) { + encode_tail(tail_in, tail_out); + } + } +} + +#[cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))] +fn encode_into(input: &[u8], out: &mut [u8]) { + encode_scalar(input, out); +} + +#[cfg(not(target_arch = "aarch64"))] +fn encode_scalar(input: &[u8], out: &mut [u8]) { let mut in_off = 0usize; let mut out_off = 0usize; @@ -290,8 +351,80 @@ fn decode_into(input: &[u8], out: &mut [u8]) -> Result<(), DecodeError> { Ok(()) } -#[cfg(not(target_arch = "aarch64"))] +#[cfg(target_arch = "x86_64")] +#[allow( + clippy::unwrap_used, + clippy::indexing_slicing, + reason = "infallible / in-bounds by loop-condition invariant; see SAFETY comment" +)] fn decode_into(input: &[u8], out: &mut [u8]) -> Result<(), DecodeError> { + use block::try_decode_block_x8; + + // Runtime feature gate: AVX2 needed for the fast path; otherwise + // route to scalar fallback. + if !std::is_x86_feature_detected!("avx2") { + return decode_scalar(input, out); + } + + let mut in_off = 0usize; + let mut out_off = 0usize; + + // 40-char (8-block) AVX2 chunks. On per-chunk failure (invalid + // char or u32 overflow), replay the same chunk scalar so the + // resulting DecodeError carries the precise failing byte position. + while in_off + 40 <= input.len() && out_off + 32 <= out.len() { + let Some(in_chunk) = chunk_at::<40>(input, in_off) else { + return Ok(()); + }; + let Some(out_chunk) = chunk_at_mut::<32>(out, out_off) else { + return Ok(()); + }; + + // SAFETY: AVX2 verified above. 
+ let avx_ok = unsafe { try_decode_block_x8(in_chunk, out_chunk) }.is_ok(); + if avx_ok { + in_off += 40; + out_off += 32; + } else { + for _ in 0..8 { + let block: &[u8; 5] = (&input[in_off..in_off + 5]).try_into().unwrap(); + let chunk: &mut [u8; 4] = (&mut out[out_off..out_off + 4]).try_into().unwrap(); + decode_block(block, chunk, in_off)?; + in_off += 5; + out_off += 4; + } + } + } + + while in_off + 5 <= input.len() && out_off + 4 <= out.len() { + let block: &[u8; 5] = (&input[in_off..in_off + 5]).try_into().unwrap(); + let chunk: &mut [u8; 4] = (&mut out[out_off..out_off + 4]).try_into().unwrap(); + decode_block(block, chunk, in_off)?; + in_off += 5; + out_off += 4; + } + + let tail_in_len = input.len().saturating_sub(in_off); + if tail_in_len > 0 { + let tail_out_len = tail_in_len.saturating_sub(1); + if let (Some(tail_in), Some(tail_out)) = ( + input.get(in_off..in_off.saturating_add(tail_in_len)), + out.get_mut(out_off..out_off.saturating_add(tail_out_len)), + ) { + decode_tail(tail_in, tail_out, in_off)?; + } + } + + Ok(()) +} + +#[cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))] +fn decode_into(input: &[u8], out: &mut [u8]) -> Result<(), DecodeError> { + decode_scalar(input, out) +} + +#[cfg(not(target_arch = "aarch64"))] +fn decode_scalar(input: &[u8], out: &mut [u8]) -> Result<(), DecodeError> { let mut in_off = 0usize; let mut out_off = 0usize; diff --git a/src/ops/aarch64.rs b/src/ops/aarch64.rs index 1fa0a2f..0f017d8 100644 --- a/src/ops/aarch64.rs +++ b/src/ops/aarch64.rs @@ -43,62 +43,3 @@ fn div_magic(input: uint32x4_t) -> uint32x4_ vshrq_n_u32(high_u32, SHIFT) } } - -#[cfg(test)] -mod tests { - use super::*; - use quickcheck::{Arbitrary, quickcheck}; - use std::{ - arch::aarch64::{vld1q_u32, vst1q_u32}, - array, - }; - - #[derive(Debug, Clone, Copy)] - struct InputBlock([u32; 4]); - - impl Arbitrary for InputBlock { - fn arbitrary(g: &mut quickcheck::Gen) -> Self { - InputBlock(array::from_fn(|_| u32::arbitrary(g))) - } - } - - fn lanes(v: uint32x4_t) -> [u32; 4] { - let mut out = [0u32; 4]; - unsafe { vst1q_u32(out.as_mut_ptr(), v) }; - out - } - - fn load(input: &[u32; 4]) -> uint32x4_t { - unsafe { vld1q_u32(input.as_ptr()) } - } - - quickcheck! { - fn div_85_matches_scalar(block: InputBlock) -> bool { - let InputBlock(input) = block; - let q_out = lanes(div_85(load(&input))); - (0..4).all(|i| q_out[i] == input[i] / 85) - } - - fn div_85_sq_matches_scalar(block: InputBlock) -> bool { - let InputBlock(input) = block; - let q_out = lanes(div_85_sq(load(&input))); - (0..4).all(|i| q_out[i] == input[i] / (85 * 85)) - } - - fn div_85_cube_matches_scalar(block: InputBlock) -> bool { - let InputBlock(input) = block; - let q_out = lanes(div_85_cube(load(&input))); - (0..4).all(|i| q_out[i] == input[i] / (85 * 85 * 85)) - } - - // q4 = q2 / 85^2 — the composed path that yields n / 85^4. Direct - // /85^4 magic would not satisfy the libdivide precision bound. - fn div_85_to_the_4_via_composition(block: InputBlock) -> bool { - let InputBlock(input) = block; - let q2 = div_85_sq(load(&input)); - let q4 = div_85_sq(q2); - let q4_out = lanes(q4); - (0..4).all(|i| q4_out[i] == input[i] / (85u32.pow(4))) - } - } -} diff --git a/src/ops/mod.rs b/src/ops/mod.rs index aee763e..369e3b2 100644 --- a/src/ops/mod.rs +++ b/src/ops/mod.rs @@ -2,12 +2,18 @@ //! //! Production scalar code uses plain `n / 85` / `n % 85` because the //! compiler lowers those to the libdivide-style multiply-then-shift -//! sequence in release builds. The hand-written NEON helpers in -//! 
[`aarch64`] cover the parallel-divide-by-`85ᵏ` cases the compiler -//! cannot auto-vectorise across u32x4 lanes. +//! sequence in release builds. The hand-written SIMD helpers below +//! cover the parallel-divide-by-`85ᵏ` cases the compiler cannot +//! auto-vectorise across u32x4 lanes. #[cfg(target_arch = "aarch64")] mod aarch64; #[cfg(target_arch = "aarch64")] pub(crate) use aarch64::{div_85, div_85_cube, div_85_sq}; + +#[cfg(target_arch = "x86_64")] +mod x86_64; + +#[cfg(target_arch = "x86_64")] +pub(crate) use x86_64::{div_85, div_85_cube, div_85_sq}; diff --git a/src/ops/x86_64.rs b/src/ops/x86_64.rs new file mode 100644 index 0000000..91dfac5 --- /dev/null +++ b/src/ops/x86_64.rs @@ -0,0 +1,107 @@ +//! x86_64 AVX2 helpers — structural mirror of `aarch64.rs`, doubled. +//! +//! Every helper operates on a `__m256i` carrying **eight** u32 lanes +//! (vs the four-lane `uint32x4_t` of NEON). The lower 128-bit half +//! handles blocks 0..3 and the upper half handles blocks 4..7; +//! AVX2's lane-restricted byte/word ops (PSHUFB, BLEND, SHUFFLE) +//! work fine because the algorithm is fully per-lane — each 128-bit +//! half runs an independent copy of the SSE4.1 algorithm we replaced. +//! +//! The magic constants are the same libdivide values used on aarch64; +//! only the SIMD intrinsics differ. AVX2 instructions used: +//! +//! | NEON | AVX2 | what it does | +//! |------------------------|-------------------------------|-------------------------------------| +//! | `vmull_u32` | `_mm256_mul_epu32` | u32×u32 → u64 widening (4 lanes) | +//! | `vshrq_n_u32` | `_mm256_srli_epi32` | logical shift right per u32 lane | +//! | `vmulq_n_u32` | `_mm256_mullo_epi32` | u32×u32 → u32 (low 32 bits, 8 lanes)| +//! | `vsubq_u32` | `_mm256_sub_epi32` | u32 lane-wise subtract | +//! | `vqtbl1q_u8` | `_mm256_shuffle_epi8` | byte-shuffle, **per 128-bit lane** | +//! | `vrev32q_u8` | `_mm256_shuffle_epi8` + idx | reverse bytes in each u32 lane | +//! +//! All SIMD entry points carry `#[target_feature(enable = "avx2")]` +//! so the file compiles on any `x86_64-*-*` target regardless of the +//! default target-feature set. Callers must perform a runtime +//! `is_x86_feature_detected!("avx2")` check before invoking; on hosts +//! lacking AVX2 the crate routes to the scalar fallback. +//! +//! The module-level `#![allow(unsafe_op_in_unsafe_fn)]` keeps the body +//! of each `unsafe fn` an implicit unsafe scope on MSRV 1.85 (without +//! it, each intrinsic call needs its own `unsafe { … }`). + +#![allow(unsafe_op_in_unsafe_fn)] + +use std::arch::x86_64::{ + __m256i, _mm256_blend_epi16, _mm256_mul_epu32, _mm256_set1_epi32, _mm256_shuffle_epi32, + _mm256_srli_epi32, _mm256_srli_epi64, +}; + +/// Lane-wise `input / 85` (8 u32 lanes). Magic verified for full u32 range. +/// +/// # Safety +/// Requires AVX2 to be available on the host CPU. +#[inline] +#[target_feature(enable = "avx2")] +pub(crate) unsafe fn div_85(input: __m256i) -> __m256i { + // m = ceil(2^38 / 85) = 3_233_857_729; m * 85 - 2^38 = 21 < 2^6. + div_magic::<3_233_857_729, 6>(input) +} + +/// Lane-wise `input / 7225` (i.e. `input / 85²`). Valid for all u32 inputs. +/// +/// # Safety +/// Requires AVX2 to be available on the host CPU. +#[inline] +#[target_feature(enable = "avx2")] +pub(crate) unsafe fn div_85_sq(input: __m256i) -> __m256i { + // m = ceil(2^44 / 7225) = 2_434_904_643; m * 7225 - 2^44 = 1259 < 2^12. + div_magic::<2_434_904_643, 12>(input) +} + +/// Lane-wise `input / 614125` (i.e. `input / 85³`). Valid for all u32 inputs. 
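+/// (There is deliberately no `div_85_4th`: the encoder reaches n / 85⁴
+/// by composing `div_85_sq` twice, since a direct 2ᵏ/85⁴ magic would
+/// not satisfy the libdivide precision bound for all u32 inputs.)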
+///
+/// # Safety
+/// Requires AVX2 to be available on the host CPU.
+#[inline]
+#[target_feature(enable = "avx2")]
+pub(crate) unsafe fn div_85_cube(input: __m256i) -> __m256i {
+    // m = ceil(2^51 / 614125) = 3_666_679_933; m * 614125 - 2^51 = 168377 < 2^19.
+    div_magic::<3_666_679_933, 19>(input)
+}
+
+/// Generic libdivide-style div over 8 u32 lanes:
+/// `q = (n * MAGIC) >> (32 + SHIFT)`.
+///
+/// Same algorithm as the SSE4.1 version it replaces, just doubled.
+/// `_mm256_mul_epu32` multiplies the low u32 of each u64 lane (so it
+/// touches input lanes 0, 2, 4, 6 per call); we do two multiplies —
+/// one as-is, one after `_mm256_srli_epi64::<32>` — and combine the
+/// high u32 of each u64 product via `_mm256_shuffle_epi32::<0xF5>` +
+/// `_mm256_blend_epi16::<0xCC>`. Both of those are lane-restricted ops
+/// (per 128-bit half), but our blend pattern lives entirely within each
+/// half so that's fine.
+///
+/// # Safety
+/// Requires AVX2 (the function-level `target_feature` satisfies this
+/// for any caller that has done the runtime check).
+#[inline]
+#[target_feature(enable = "avx2")]
+unsafe fn div_magic<const MAGIC: u32, const SHIFT: i32>(input: __m256i) -> __m256i {
+    let magic = _mm256_set1_epi32(MAGIC as i32);
+
+    let even = _mm256_mul_epu32(input, magic);
+    let input_odd = _mm256_srli_epi64::<32>(input);
+    let odd = _mm256_mul_epu32(input_odd, magic);
+
+    // Per-128-bit-lane shuffle: broadcasts hi u32 of each u64 product
+    // across the pair, in each lane independently.
+    let even_hi = _mm256_shuffle_epi32::<0b11_11_01_01>(even);
+    let odd_hi = _mm256_shuffle_epi32::<0b11_11_01_01>(odd);
+
+    // Per-lane blend at u16 granularity. The 8-bit MASK is applied to
+    // each 128-bit lane, so 0xCC interleaves [hi0, hi1, hi2, hi3]
+    // within each lane independently.
+    let combined = _mm256_blend_epi16::<0xCC>(even_hi, odd_hi);
+
+    _mm256_srli_epi32::<SHIFT>(combined)
+}
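+
+// A sanity-check sketch in the spirit of the quickcheck suite removed
+// from `aarch64.rs` (illustrative only: exercises fixed edge values
+// rather than random blocks, and skips on hosts without AVX2).
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use std::arch::x86_64::{__m256i, _mm256_loadu_si256, _mm256_storeu_si256};
+
+    fn lanes(v: __m256i) -> [u32; 8] {
+        let mut out = [0u32; 8];
+        // SAFETY: storeu has no alignment requirement; the test entry
+        // point has already verified AVX2 (which implies AVX).
+        unsafe { _mm256_storeu_si256(out.as_mut_ptr().cast::<__m256i>(), v) };
+        out
+    }
+
+    #[test]
+    fn div_helpers_match_scalar_on_edge_values() {
+        if !std::is_x86_feature_detected!("avx2") {
+            return;
+        }
+        let inputs: [u32; 8] =
+            [0, 1, 84, 85, 7_224, 614_125, u32::MAX - 1, u32::MAX];
+        // SAFETY: AVX2 verified above.
+        let (q1, q2, q3) = unsafe {
+            let v = _mm256_loadu_si256(inputs.as_ptr().cast::<__m256i>());
+            (lanes(div_85(v)), lanes(div_85_sq(v)), lanes(div_85_cube(v)))
+        };
+        for (i, &n) in inputs.iter().enumerate() {
+            assert_eq!(q1[i], n / 85);
+            assert_eq!(q2[i], n / (85 * 85));
+            assert_eq!(q3[i], n / (85 * 85 * 85));
+        }
+    }
+}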