Guard bench workflow on SSE4.1 + SSSE3 + AVX2 #8
Merged
The bench job's whole point is to compare scalar vs SIMD on the same silicon. Without the right CPU features, the crate's runtime feature gate would silently route to the scalar fallback, turning the benchmark into a scalar-vs-scalar comparison: a misleading "speedup: 1.00×" with no signal that anything went wrong. Add a fail-fast step that greps `/proc/cpuinfo` for the required features and aborts before `cargo bench` runs. GitHub annotation syntax (`::error::`) surfaces the failure clearly in the Actions UI.

Required features:

- sse4.1 + ssse3 — needed for the existing SSE encode/decode path.
- avx2 — needed for the upcoming AVX2 fast path.

All three have been universal on x86 server hardware since ~2013 and are present on the current GitHub-hosted Ubuntu runner fleet (Azure Standard_D4ads_v5 / AMD EPYC 7763 / Zen 3).
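A minimal sketch of such a guard, factored as a bash function so the check logic is testable in isolation. The function name and message wording are assumptions, not the workflow's exact text; note that `/proc/cpuinfo` spells the flags `sse4_1`, `ssse3`, and `avx2` (lowercase, underscore instead of dot):

```shell
#!/usr/bin/env bash
# Sketch of the fail-fast CPU-feature guard (hypothetical; the real step may differ).
# Checks each required flag against a cpuinfo-format file and emits a GitHub
# Actions ::error:: annotation for every missing one.
check_cpu_flags() {
  local cpuinfo=$1; shift
  local missing=0
  for flag in "$@"; do
    if ! grep -qw "$flag" "$cpuinfo"; then
      # ::error:: makes the failure show up as an error annotation in the Actions UI
      echo "::error::CPU feature '$flag' not present; bench would silently fall back to scalar"
      missing=1
    fi
  done
  return "$missing"
}
```

In the workflow this would run as `check_cpu_flags /proc/cpuinfo sse4_1 ssse3 avx2 || exit 1`. Checking all flags before returning (rather than aborting on the first miss) reports every gap in one run.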
coderdan added a commit that referenced this pull request on May 2, 2026
The x86 SIMD path now uses AVX2 throughout, processing 8 blocks per call (vs the previous SSE4.1 path's 4). Both 128-bit lanes of each `__m256i` register independently run the algorithm we developed for SSE4.1, doubling per-call throughput. AVX2's lane-restricted byte/word ops (PSHUFB, BLEND, SHUFFLE_EPI32) work fine because every step of the algorithm is per-lane anyway.

# Why drop SSE4.1 entirely

User direction: "if a target doesn't have avx2 it can fall back to the soft impl". AVX2 has been universal on x86 server hardware since Haswell (2013), and the bench workflow now guards on its presence (PR #8). Maintaining a third tier (AVX2 → SSE4.1 → scalar) wasn't worth the duplicated code.

# What changed

`src/ops/x86_64.rs`:

- `div_85`, `div_85_sq`, `div_85_cube`, `div_magic` now take/return `__m256i`.
- `target_feature` attribute changed from `"sse4.1,ssse3"` to `"avx2"`.

`src/block.rs::sse` → `src/block.rs::avx2`:

- `SseEncoder` → `Avx2Encoder`; `encode_block_x4` → `encode_block_x8` (16-byte → 32-byte input, 20-byte → 40-byte output).
- `try_decode_block_x4` → `try_decode_block_x8` (20-char → 40-char in, 16-byte → 32-byte out).
- Static index tables stay 16 bytes; loaded via `_mm_loadu_si128` + `_mm256_broadcastsi128_si256` to populate both 128-bit lanes.
- Encoder output is non-contiguous between lanes (lane 0 → `out[0..20]`, lane 1 → `out[20..40]`), so we extract each 128-bit lane and the two 4-byte tails separately. Decoder output is 32 contiguous bytes, so a single `_mm256_storeu_si256` works.
- Decoder input is non-contiguous (chars 0..20 + 20..40 with a gap at the tail boundary of each side); two `_mm_loadu_si128`s are combined with `_mm256_set_m128i`.

`src/lib.rs`:

- `encode_into` / `decode_into` runtime gate switches from `is_x86_feature_detected!("sse4.1") && ...("ssse3")` to just `...("avx2")`. Falls back to the existing scalar paths otherwise.
- Loop strides updated for the new chunk sizes (32 → 40 encode, 40 → 32 decode).
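The `div_85` family presumably implements division by 85 as a multiply-and-shift, since neither SSE4.1 nor AVX2 has an integer divide. A scalar model of that trick, with an assumed magic constant (386 = ceil(2^15 / 85), shift 15) that happens to be exact for all byte inputs; the crate's actual constants and the `__m256i` versions may well differ:

```rust
/// Hypothetical scalar model of a `div_85`-style helper: computes
/// floor(x / 85) for a byte via multiply-and-shift, the standard way
/// to divide in SIMD code where no integer-division instruction exists.
/// Magic constant 386 = ceil(2^15 / 85); the crate's real values may differ.
fn div_85(x: u8) -> u8 {
    ((x as u32 * 386) >> 15) as u8
}

fn main() {
    // Exhaustive check: the multiply-and-shift agrees with true division
    // for every possible byte value.
    for x in 0..=255u8 {
        assert_eq!(div_85(x), x / 85, "mismatch at x = {x}");
    }
    println!("div_85 matches x / 85 for all 256 byte inputs");
}
```

The same constant lifts directly to a vectorized form (`_mm256_mulhi_epu16` plus a shift) because the product of a byte and 386 fits in 17 bits, within a 32-bit lane's range.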
# Verified locally

- `cargo test --target x86_64-apple-darwin` — 47 tests pass (Rosetta; hits the scalar fallback because Rosetta on this Mac doesn't expose AVX2)
- `cargo test` — 47 tests (aarch64 native)
- `cargo +1.85 test --target x86_64-apple-darwin` — 47 tests
- `cargo clippy --target x86_64-apple-darwin --all-targets -- -D warnings` — clean
- `cargo +1.85 clippy --target x86_64-apple-darwin --all-targets -- -D warnings` — clean
- `cargo check --target x86_64-unknown-linux-gnu --all-targets` — clean
- `cargo fmt --all -- --check` — clean

CI on Ubuntu x86_64 (AMD EPYC 7763, has AVX2) will be the first run that actually exercises the new AVX2 code paths end-to-end.
Summary
Adds a fail-fast step to the `bench` workflow that greps `/proc/cpuinfo` for SSE4.1 + SSSE3 + AVX2 and aborts before `cargo bench` if any are missing. AVX2 is added in anticipation of the upcoming AVX2 fast path on x86.
Why
The bench's job is to compare scalar vs SIMD on the same silicon. If GitHub ever rotates the runner fleet to hardware lacking the required features, the crate's runtime feature gate would silently route to scalar and the bench would degenerate into a scalar-vs-scalar comparison: a "speedup: 1.00×" result with no obvious signal that anything went wrong. Better to fail loudly.
The check uses GitHub's `::error::` annotation syntax so the Actions UI surfaces the failure clearly. Cost is one bash step (~10ms); no impact on the actual bench run.
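Wired into the workflow, the step might look like the following. This is a sketch, not the PR's exact YAML: the step name, flag list, and inline script are assumptions.

```yaml
# Hypothetical workflow step; name and wording are assumptions.
- name: Verify required CPU features
  run: |
    for flag in sse4_1 ssse3 avx2; do
      if ! grep -qw "$flag" /proc/cpuinfo; then
        echo "::error::runner CPU lacks '$flag'; aborting before cargo bench"
        exit 1
      fi
    done
```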