Skip to content

Research: AVX-512 VBMI2 compress/expand for hybrid sorting #34

@jonathanpeppers

Description

@jonathanpeppers

Summary

Investigate AVX-512 VBMI2 compress/expand instructions (vpcompressb/vpexpandb) for potential use in hybrid sorting approaches beyond the current fixed sorting networks.

Current Architecture (Source Generators)

The library now uses a Roslyn incremental source generator (SortingNetworkGenerator). Users apply [SortingNetwork(size, typeof(T))] attributes to a partial class, and the generator emits:

  • Sort(Span<T>) / Sort(T[]) overloads with inline SIMD dispatch
  • Sort(Span<T>, IComparer<T>?) / Sort(T[], IComparer<T>?) overloads using a loop-based ApplyNetworkWithComparer path
  • Optional OnFallback(Span<T>) / OnFallback(Span<T>, IComparer<T>) static methods for unsupported sizes

Network sizes: 2–64 elements (optimal networks for 2–16, best-known for 17–64 via Dobbelaere's SorterHunter).

SIMD support already implemented:

ISA Types Max Elements
AVX2 byte, sbyte, int, uint, float 64 (1-byte), 64 (4-byte)
AVX-512F long, ulong, double 32
AVX-512BW short, ushort, char 64
AVX-512 VBMI byte, sbyte (sizes > 32) 64
ARM AdvSimd (NEON) byte, sbyte, short, ushort, char, int, uint, float 32

Scalar fallback: Unrolled compare-and-swap using ref + Unsafe.Add for all types, including string (via string.CompareOrdinal) and custom IComparable<T> value types.

Fallback for unsupported sizes: If the class declares static void OnFallback(Span<T>), the generator calls it for sizes without a configured network. Otherwise, it throws ArgumentException. There is no automatic span.Sort() fallback.

.NET 10 API Status

Avx512Vbmi2.Compress is available and shipping in .NET 10:

  • Avx512Vbmi2.Compress(Vector512<byte> merge, Vector512<byte> mask, Vector512<byte> value) — maps to vpcompressb
  • Overloads for short, and for 128/256/512-bit widths via Avx512Vbmi2.VL
  • Runtime detection: Avx512Vbmi2.IsSupported
  • dotnet/runtime#108109 (closed) — merge-masking semantics for Compress are implemented

Expand (vpexpandb) status: Not clearly surfaced as a named Expand method in the .NET 10 Avx512Vbmi2 API docs. May need Avx512BW or manual implementation. Worth verifying.

References:

Hardware Availability

CPU VBMI2 vpcompressb
Intel Ice Lake (2019+) Yes Yes
Intel Sapphire Rapids (2023+) Yes Yes
AMD Zen 4 / EPYC Genoa (2022+) Yes Yes
Intel Alder Lake (12th gen) No No
AMD Zen 3 and earlier No No

VBMI2 is available on modern server and desktop CPUs from both Intel and AMD. Notably, Zen 4 (Ryzen 7000 / EPYC Genoa) has full VBMI2 support including vpcompressb.

Use Cases for SortingNetworks

1. SIMD Partition (highest value for this library)

Compress is the backbone of vectorized quicksort partitioning. Given a pivot, you can:

var data = Avx512BW.LoadVector512(ref element);
var mask = Avx512BW.CompareGreaterThan(data, pivot);  // which elements > pivot
var left = Avx512Vbmi2.Compress(data, ~mask);          // elements <= pivot, packed
var right = Avx512Vbmi2.Compress(data, mask);           // elements > pivot, packed

This enables a vectorized partition step processing 64 bytes at a time — the key primitive for SIMD quicksort.

Relevance to this library: Currently, arrays with n > 64 have no built-in sorting — users must supply an OnFallback method. A SIMD partition + recursive network sort hybrid could handle larger arrays natively:

  • Use SIMD partition to split the array
  • Recursively partition until sub-arrays reach n <= 64
  • Apply the existing sorting network (with SIMD dispatch) to each sub-array
  • This could be emitted by the source generator as part of the OnFallback path, or as a separate generated method

2. Partial Sort / Top-K

A SIMD quickselect using compress/partition could efficiently find the K smallest/largest elements, then sort only those K elements with the network.

3. nth_element

Same partition primitive enables O(n) expected-time selection of the nth element — useful as a building block.

Prior Art — Intel x86-simd-sort

Intel's open-source x86-simd-sort library demonstrates this approach at production quality:

  • 10-17x faster sorting than scalar in NumPy/PyTorch (integrated upstream)
  • Hybrid quicksort: SIMD partition + bitonic sort for small arrays
  • Supports partial sort, nth_element, and argsort
  • Up to 15-20x speedup for partial sort and nth_element on 32-bit data
  • Available benchmarks and detailed analysis: sort-research-rs writeup

Key paper: Bramas, "A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512" (arXiv:1704.08579) — 4x faster than GNU std::sort, 1.4x faster than Intel IPP.

Implementation Considerations

This would expand the library's scope beyond fixed sorting networks:

Current scope: Source-generated fixed sorting networks for n=2..64, with user-supplied OnFallback for larger sizes. SIMD dispatch (AVX2/AVX-512/ARM NEON) is already emitted by the generator for supported types.

How this could fit the source generator model:

  1. Conservative — The generator emits a built-in OnFallback implementation using SIMD partition that recursively breaks arrays down to network-sortable sizes (n <= 64). Users opt in via an attribute flag (e.g., [SortingNetwork(64, typeof(int), Fallback = true)]). Public API stays the same.
  2. Moderate — New attributes for PartialSort / NthElement that emit additional generated methods alongside Sort.
  3. Ambitious — Full SIMD hybrid sort for arbitrary n, competing with Array.Sort().

Recommendation

Priority: Low (research/exploratory). The .NET 10 APIs exist and the hardware is widely available. The most practical path is option 1 (conservative) — emitting an OnFallback that uses SIMD partition internally, keeping the fixed sorting networks as the base case for small sub-arrays. This fits naturally into the existing source generator architecture since it would be a new emitter alongside ScalarEmitter, SimdX86Emitter, and SimdArmEmitter.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions