Guard bench workflow on SSE4.1 + SSSE3 + AVX2 #8
Merged
The bench job's whole point is to compare scalar vs SIMD on the same silicon. Without the right CPU features, the crate's runtime feature gate would silently route to the scalar fallback, turning the benchmark into a scalar-vs-scalar comparison: a misleading "speedup: 1.00×" with no signal that anything went wrong. Add a fail-fast step that greps `/proc/cpuinfo` for the required features and aborts before `cargo bench` runs. GitHub annotation syntax (`::error::`) surfaces the failure clearly in the Actions UI.

Required features:

- sse4.1 + ssse3 — needed for the existing SSE encode/decode path.
- avx2 — needed for the upcoming AVX2 fast path.

All three have been universal on x86 server hardware since ~2013 and are present on the current GitHub-hosted Ubuntu runner fleet (Azure Standard_D4ads_v5 / AMD EPYC 7763 / Zen 3).
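A minimal sketch of such a guard, factored as a bash function so the check logic is testable in isolation. The function name and message wording are assumptions, not the workflow's exact text; note that `/proc/cpuinfo` spells the flags `sse4_1`, `ssse3`, and `avx2` (lowercase, underscore instead of dot):

```shell
#!/usr/bin/env bash
# Sketch of the fail-fast CPU-feature guard (hypothetical; the real step may differ).
# Checks each required flag against a cpuinfo-format file and emits a GitHub
# Actions ::error:: annotation for every missing one.
check_cpu_flags() {
  local cpuinfo=$1; shift
  local missing=0
  for flag in "$@"; do
    if ! grep -qw "$flag" "$cpuinfo"; then
      # ::error:: makes the failure show up as an error annotation in the Actions UI
      echo "::error::CPU feature '$flag' not present; bench would silently fall back to scalar"
      missing=1
    fi
  done
  return "$missing"
}
```

In the workflow this would run as `check_cpu_flags /proc/cpuinfo sse4_1 ssse3 avx2 || exit 1`. Checking all flags before returning (rather than aborting on the first miss) reports every gap in one run.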
coderdan added a commit that referenced this pull request on May 2, 2026
The x86 SIMD path now uses AVX2 throughout, processing 8 blocks per call (vs the previous SSE4.1 path's 4). Both 128-bit lanes of each `__m256i` register independently run the algorithm we developed for SSE4.1, doubling per-call throughput. AVX2's lane-restricted byte/word ops (PSHUFB, BLEND, SHUFFLE_EPI32) work fine because every step of the algorithm is per-lane anyway.

# Why drop SSE4.1 entirely

User direction: "if a target doesn't have avx2 it can fall back to the soft impl". AVX2 has been universal on x86 server hardware since Haswell (2013), and the bench workflow now guards on its presence (PR #8). Maintaining a third tier (AVX2 → SSE4.1 → scalar) wasn't worth the duplicated code.

# What changed

`src/ops/x86_64.rs`:

- `div_85`, `div_85_sq`, `div_85_cube`, `div_magic` now take/return `__m256i`.
- `target_feature` attribute changed from `"sse4.1,ssse3"` to `"avx2"`.

`src/block.rs::sse` → `src/block.rs::avx2`:

- `SseEncoder` → `Avx2Encoder`; `encode_block_x4` → `encode_block_x8` (16-byte → 32-byte input, 20-byte → 40-byte output).
- `try_decode_block_x4` → `try_decode_block_x8` (20-char → 40-char in, 16-byte → 32-byte out).
- Static index tables stay 16 bytes; loaded via `_mm_loadu_si128` + `_mm256_broadcastsi128_si256` to populate both 128-bit lanes.
- Encoder output is non-contiguous between lanes (lane 0 → `out[0..20]`, lane 1 → `out[20..40]`), so we extract each 128-bit lane and the two 4-byte tails separately. Decoder output is 32 contiguous bytes, so a single `_mm256_storeu_si256` works.
- Decoder input is non-contiguous (chars 0..20 + 20..40 with a gap at the tail boundary of each side); two `_mm_loadu_si128`s are combined with `_mm256_set_m128i`.

`src/lib.rs`:

- `encode_into` / `decode_into` runtime gate switches from `is_x86_feature_detected!("sse4.1") && ...("ssse3")` to just `...("avx2")`. Falls back to the existing scalar paths otherwise.
- Loop strides updated for the new chunk sizes (32 → 40 encode, 40 → 32 decode).
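The `div_85` family presumably implements division by 85 as a multiply-and-shift, since neither SSE4.1 nor AVX2 has an integer divide. A scalar model of that trick, with an assumed magic constant (386 = ceil(2^15 / 85), shift 15) that happens to be exact for all byte inputs; the crate's actual constants and the `__m256i` versions may well differ:

```rust
/// Hypothetical scalar model of a `div_85`-style helper: computes
/// floor(x / 85) for a byte via multiply-and-shift, the standard way
/// to divide in SIMD code where no integer-division instruction exists.
/// Magic constant 386 = ceil(2^15 / 85); the crate's real values may differ.
fn div_85(x: u8) -> u8 {
    ((x as u32 * 386) >> 15) as u8
}

fn main() {
    // Exhaustive check: the multiply-and-shift agrees with true division
    // for every possible byte value.
    for x in 0..=255u8 {
        assert_eq!(div_85(x), x / 85, "mismatch at x = {x}");
    }
    println!("div_85 matches x / 85 for all 256 byte inputs");
}
```

The same constant lifts directly to a vectorized form (`_mm256_mulhi_epu16` plus a shift) because the product of a byte and 386 fits in 17 bits, within a 32-bit lane's range.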
# Verified locally

- `cargo test --target x86_64-apple-darwin` — 47 tests pass (Rosetta; hits the scalar fallback because Rosetta on this Mac doesn't expose AVX2)
- `cargo test` — 47 tests (aarch64 native)
- `cargo +1.85 test --target x86_64-apple-darwin` — 47 tests
- `cargo clippy --target x86_64-apple-darwin --all-targets -- -D warnings` — clean
- `cargo +1.85 clippy --target x86_64-apple-darwin --all-targets -- -D warnings` — clean
- `cargo check --target x86_64-unknown-linux-gnu --all-targets` — clean
- `cargo fmt --all -- --check` — clean

CI on Ubuntu x86_64 (AMD EPYC 7763, has AVX2) will be the first run that actually exercises the new AVX2 code paths end-to-end.
Summary
Adds a fail-fast step to the `bench` workflow that greps `/proc/cpuinfo` for SSE4.1 + SSSE3 + AVX2 and aborts before `cargo bench` if any are missing. AVX2 is added in anticipation of the upcoming AVX2 fast path on x86.
Why
The bench's job is to compare scalar vs SIMD on the same silicon. If GitHub ever rotates the runner fleet to hardware lacking the required features, the crate's runtime feature gate would silently route to scalar and the bench would degenerate into a scalar-vs-scalar comparison: a "speedup: 1.00×" result with no obvious signal that anything went wrong. Better to fail loudly.
The check uses GitHub's `::error::` annotation syntax so the Actions UI surfaces the failure clearly. Cost is one bash step (~10ms); no impact on the actual bench run.
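Wired into the workflow, the step might look like the following. This is a sketch, not the PR's exact YAML: the step name, flag list, and inline script are assumptions.

```yaml
# Hypothetical workflow step; name and wording are assumptions.
- name: Verify required CPU features
  run: |
    for flag in sse4_1 ssse3 avx2; do
      if ! grep -qw "$flag" /proc/cpuinfo; then
        echo "::error::runner CPU lacks '$flag'; aborting before cargo bench"
        exit 1
      fi
    done
```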