
Guard bench workflow on SSE4.1 + SSSE3 + AVX2#8

Merged

coderdan merged 1 commit into main from bench-avx2-guard on May 2, 2026


Conversation

@coderdan (Collaborator) commented May 2, 2026

Summary

Adds a fail-fast step to the `bench` workflow that greps `/proc/cpuinfo` for SSE4.1 + SSSE3 + AVX2 and aborts before `cargo bench` if any are missing. AVX2 is added in anticipation of the upcoming AVX2 fast path on x86.

Why

The bench's job is to compare scalar vs SIMD on the same silicon. If GitHub ever rotates the runner fleet to hardware lacking the required features, the crate's runtime feature gate would silently route to scalar and the bench would degenerate into an apples-to-apples scalar comparison — a "speedup: 1.00×" result with no obvious signal that anything went wrong. Better to fail loudly.

The check uses GitHub's `::error::` annotation syntax so the Actions UI surfaces the failure clearly. Cost is one bash step (~10ms); no impact on the actual bench run.
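The guard step itself isn't shown in this page, but the check it describes can be sketched as follows (a hypothetical sketch: the real step greps `/proc/cpuinfo` directly; here a stand-in `flags` string keeps the snippet self-contained):

```shell
# Stand-in for: flags=$(grep -m1 '^flags' /proc/cpuinfo)
flags="fpu sse sse2 ssse3 sse4_1 sse4_2 avx avx2"

# /proc/cpuinfo spells SSE4.1 as "sse4_1"
for f in sse4_1 ssse3 avx2; do
  if ! echo "$flags" | grep -qw "$f"; then
    # GitHub Actions annotation: surfaces the failure in the Actions UI
    echo "::error::required CPU feature '$f' not present on this runner"
    exit 1
  fi
done
echo "CPU feature guard passed"
```

Because the step exits non-zero before `cargo bench` runs, the job fails loudly instead of producing a misleading 1.00× result.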


Required features:
- `sse4.1` + `ssse3` — needed for the existing SSE encode/decode path.
- `avx2` — needed for the upcoming AVX2 fast path.

All three have been universal on x86 server hardware since ~2013
and are present on the current GitHub-hosted Ubuntu runner fleet
(Azure Standard_D4ads_v5 / AMD EPYC 7763 / Zen 3).
@coderdan coderdan merged commit b5a9dca into main May 2, 2026
9 checks passed
coderdan added a commit that referenced this pull request May 2, 2026
The x86 SIMD path now uses AVX2 throughout, processing 8 blocks per
call (vs the previous SSE4.1 path's 4). Both 128-bit lanes of each
__m256i register independently run the algorithm we developed for
SSE4.1, doubling per-call throughput. AVX2's lane-restricted
byte/word ops (PSHUFB, BLEND, SHUFFLE_EPI32) work fine because every
step of the algorithm is per-lane anyway.

# Why drop SSE4.1 entirely

User direction: "if a target doesn't have avx2 it can fall back to
the soft impl". AVX2 has been universal on x86 server hardware since
Haswell (2013) and the bench workflow now guards on its presence
(PR #8). Maintaining a third tier (AVX2 → SSE4.1 → scalar) wasn't
worth the duplicated code.

# What changed

src/ops/x86_64.rs:
  - div_85, div_85_sq, div_85_cube, div_magic now take/return __m256i.
  - target_feature attribute changed from "sse4.1,ssse3" to "avx2".
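The diff itself isn't shown here, but `div_85`-style helpers are typically built on magic-multiply division. A scalar sketch of the technique, using the standard `ceil(2^38 / 85)` reciprocal (an assumed constant — the crate's actual magic values are not visible in this page):

```rust
// Magic-number division: n / 85 == (n * M) >> 38 for all u32 n,
// with M = ceil(2^38 / 85). This avoids an integer divide, which has
// no SIMD equivalent; the vector code does the same multiply-and-shift
// per element. Constant here is illustrative, not from the crate.
const M: u64 = 0xC0C0_C0C1; // 3_233_857_729 = ceil(2^38 / 85)

fn div_85(n: u32) -> u32 {
    ((n as u64 * M) >> 38) as u32
}

fn main() {
    // Spot-check against true division across the full u32 range edges.
    for n in 0..=200_000u32 {
        assert_eq!(div_85(n), n / 85);
    }
    assert_eq!(div_85(u32::MAX), u32::MAX / 85);
}
```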

src/block.rs::sse → src/block.rs::avx2:
  - SseEncoder → Avx2Encoder; encode_block_x4 → encode_block_x8
    (16-byte → 32-byte input, 20-byte → 40-byte output).
  - try_decode_block_x4 → try_decode_block_x8 (20-char → 40-char in,
    16-byte → 32-byte out).
  - Static index tables stay 16 bytes; loaded via _mm_loadu_si128 +
    _mm256_broadcastsi128_si256 to populate both 128-bit lanes.
  - Encoder output is non-contiguous between lanes (lane 0 → out[0..20],
    lane 1 → out[20..40]) so we extract each 128-bit lane and the
    two 4-byte tails separately. Decoder output is 32 contiguous
    bytes so a single _mm256_storeu_si256 works.
  - Decoder input is non-contiguous (chars 0..20 + 20..40 with a gap
    at the tail boundary of each side); two _mm_loadu_si128s are
    combined with _mm256_set_m128i.

src/lib.rs:
  - encode_into / decode_into runtime gate switches from
    is_x86_feature_detected!("sse4.1") && ...("ssse3") to just
    ...("avx2"). Falls back to the existing scalar paths otherwise.
  - Loop strides updated for the new chunk sizes (32→40 encode,
    40→32 decode).
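A minimal sketch of that runtime gate, with `encode_avx2` and `encode_scalar` as stand-in stubs (the real implementations live in the crate; the dispatch shape is what matters here):

```rust
// Runtime dispatch sketch: try the AVX2 path only when the CPU reports
// support; otherwise fall back to the portable scalar path. The stubs
// below are placeholders, not the crate's actual encoders.
fn encode_into(src: &[u8], dst: &mut Vec<u8>) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return encode_avx2(src, dst);
        }
    }
    encode_scalar(src, dst)
}

#[allow(dead_code)] // only reachable on x86_64 with AVX2
fn encode_avx2(src: &[u8], dst: &mut Vec<u8>) {
    // Stand-in: the real path processes 8 blocks (32 input bytes) per call.
    encode_scalar(src, dst)
}

fn encode_scalar(src: &[u8], dst: &mut Vec<u8>) {
    // Stand-in for the portable encoder: both stubs agree by construction.
    dst.extend_from_slice(src)
}

fn main() {
    let mut out = Vec::new();
    encode_into(b"abc", &mut out);
    assert_eq!(out, b"abc");
}
```

Because `is_x86_feature_detected!` is a runtime check, the same binary works on pre-AVX2 hardware; the `#[cfg]` block keeps the code compiling on non-x86 targets.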

# Verified locally

  cargo test --target x86_64-apple-darwin   — 47 tests pass (Rosetta;
                                              hits the scalar fallback
                                              because Rosetta on this
                                              Mac doesn't expose AVX2)
  cargo test                                — 47 tests (aarch64 native)
  cargo +1.85 test --target x86_64-apple-darwin — 47 tests
  cargo clippy --target x86_64-apple-darwin
    --all-targets -- -D warnings            — clean
  cargo +1.85 clippy --target x86_64-apple-darwin
    --all-targets -- -D warnings            — clean
  cargo check --target x86_64-unknown-linux-gnu --all-targets — clean
  cargo fmt --all -- --check                — clean

CI on Ubuntu x86_64 (AMD EPYC 7763, has AVX2) will be the first run
that actually exercises the new AVX2 code paths end-to-end.