Summary
Investigate AVX-512 VBMI2 compress/expand instructions (vpcompressb/vpexpandb) for potential use in hybrid sorting approaches beyond the current fixed sorting networks.
Current Architecture (Source Generators)
The library now uses a Roslyn incremental source generator (SortingNetworkGenerator). Users apply [SortingNetwork(size, typeof(T))] attributes to a partial class, and the generator emits:
Sort(Span<T>) / Sort(T[]) overloads with inline SIMD dispatch
Sort(Span<T>, IComparer<T>?) / Sort(T[], IComparer<T>?) overloads using a loop-based ApplyNetworkWithComparer path
- Optional
OnFallback(Span<T>) / OnFallback(Span<T>, IComparer<T>) static methods for unsupported sizes
Network sizes: 2–64 elements (optimal networks for 2–16, best-known for 17–64 via Dobbelaere's SorterHunter).
SIMD support already implemented:
| ISA |
Types |
Max Elements |
| AVX2 |
byte, sbyte, int, uint, float |
64 (1-byte), 64 (4-byte) |
| AVX-512F |
long, ulong, double |
32 |
| AVX-512BW |
short, ushort, char |
64 |
| AVX-512 VBMI |
byte, sbyte (sizes > 32) |
64 |
| ARM AdvSimd (NEON) |
byte, sbyte, short, ushort, char, int, uint, float |
32 |
Scalar fallback: Unrolled compare-and-swap using ref + Unsafe.Add for all types, including string (via string.CompareOrdinal) and custom IComparable<T> value types.
Fallback for unsupported sizes: If the class declares static void OnFallback(Span<T>), the generator calls it for sizes without a configured network. Otherwise, it throws ArgumentException. There is no automatic span.Sort() fallback.
.NET 10 API Status
Avx512Vbmi2.Compress is available and shipping in .NET 10:
Avx512Vbmi2.Compress(Vector512<byte> merge, Vector512<byte> mask, Vector512<byte> value) — maps to vpcompressb
- Overloads for
short, and for 128/256/512-bit widths via Avx512Vbmi2.VL
- Runtime detection:
Avx512Vbmi2.IsSupported
- dotnet/runtime#108109 (closed) — merge-masking semantics for Compress are implemented
Expand (vpexpandb) status: Not clearly surfaced as a named Expand method in the .NET 10 Avx512Vbmi2 API docs. May need Avx512BW or manual implementation. Worth verifying.
References:
Hardware Availability
| CPU |
VBMI2 |
vpcompressb |
| Intel Ice Lake (2019+) |
Yes |
Yes |
| Intel Sapphire Rapids (2023+) |
Yes |
Yes |
| AMD Zen 4 / EPYC Genoa (2022+) |
Yes |
Yes |
| Intel Alder Lake (12th gen) |
No |
No |
| AMD Zen 3 and earlier |
No |
No |
VBMI2 is available on modern server and desktop CPUs from both Intel and AMD. Notably, Zen 4 (Ryzen 7000 / EPYC Genoa) has full VBMI2 support including vpcompressb.
Use Cases for SortingNetworks
1. SIMD Partition (highest value for this library)
Compress is the backbone of vectorized quicksort partitioning. Given a pivot, you can:
var data = Avx512BW.LoadVector512(ref element);
var mask = Avx512BW.CompareGreaterThan(data, pivot); // which elements > pivot
var left = Avx512Vbmi2.Compress(data, ~mask); // elements <= pivot, packed
var right = Avx512Vbmi2.Compress(data, mask); // elements > pivot, packed
This enables a vectorized partition step processing 64 bytes at a time — the key primitive for SIMD quicksort.
Relevance to this library: Currently, arrays with n > 64 have no built-in sorting — users must supply an OnFallback method. A SIMD partition + recursive network sort hybrid could handle larger arrays natively:
- Use SIMD partition to split the array
- Recursively partition until sub-arrays reach n <= 64
- Apply the existing sorting network (with SIMD dispatch) to each sub-array
- This could be emitted by the source generator as part of the
OnFallback path, or as a separate generated method
2. Partial Sort / Top-K
A SIMD quickselect using compress/partition could efficiently find the K smallest/largest elements, then sort only those K elements with the network.
3. nth_element
Same partition primitive enables O(n) expected-time selection of the nth element — useful as a building block.
Prior Art — Intel x86-simd-sort
Intel's open-source x86-simd-sort library demonstrates this approach at production quality:
- 10-17x faster sorting than scalar in NumPy/PyTorch (integrated upstream)
- Hybrid quicksort: SIMD partition + bitonic sort for small arrays
- Supports partial sort, nth_element, and argsort
- Up to 15-20x speedup for partial sort and nth_element on 32-bit data
- Available benchmarks and detailed analysis: sort-research-rs writeup
Key paper: Bramas, "A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512" (arXiv:1704.08579) — 4x faster than GNU std::sort, 1.4x faster than Intel IPP.
Implementation Considerations
This would expand the library's scope beyond fixed sorting networks:
Current scope: Source-generated fixed sorting networks for n=2..64, with user-supplied OnFallback for larger sizes. SIMD dispatch (AVX2/AVX-512/ARM NEON) is already emitted by the generator for supported types.
How this could fit the source generator model:
- Conservative — The generator emits a built-in
OnFallback implementation using SIMD partition that recursively breaks arrays down to network-sortable sizes (n <= 64). Users opt in via an attribute flag (e.g., [SortingNetwork(64, typeof(int), Fallback = true)]). Public API stays the same.
- Moderate — New attributes for
PartialSort / NthElement that emit additional generated methods alongside Sort.
- Ambitious — Full SIMD hybrid sort for arbitrary n, competing with
Array.Sort().
Recommendation
Priority: Low (research/exploratory). The .NET 10 APIs exist and the hardware is widely available. The most practical path is option 1 (conservative) — emitting an OnFallback that uses SIMD partition internally, keeping the fixed sorting networks as the base case for small sub-arrays. This fits naturally into the existing source generator architecture since it would be a new emitter alongside ScalarEmitter, SimdX86Emitter, and SimdArmEmitter.
Summary
Investigate AVX-512 VBMI2 compress/expand instructions (
vpcompressb/vpexpandb) for potential use in hybrid sorting approaches beyond the current fixed sorting networks.Current Architecture (Source Generators)
The library now uses a Roslyn incremental source generator (
SortingNetworkGenerator). Users apply[SortingNetwork(size, typeof(T))]attributes to apartial class, and the generator emits:Sort(Span<T>)/Sort(T[])overloads with inline SIMD dispatchSort(Span<T>, IComparer<T>?)/Sort(T[], IComparer<T>?)overloads using a loop-basedApplyNetworkWithComparerpathOnFallback(Span<T>)/OnFallback(Span<T>, IComparer<T>)static methods for unsupported sizesNetwork sizes: 2–64 elements (optimal networks for 2–16, best-known for 17–64 via Dobbelaere's SorterHunter).
SIMD support already implemented:
byte,sbyte,int,uint,floatlong,ulong,doubleshort,ushort,charbyte,sbyte(sizes > 32)byte,sbyte,short,ushort,char,int,uint,floatScalar fallback: Unrolled compare-and-swap using
ref+Unsafe.Addfor all types, includingstring(viastring.CompareOrdinal) and customIComparable<T>value types.Fallback for unsupported sizes: If the class declares
static void OnFallback(Span<T>), the generator calls it for sizes without a configured network. Otherwise, it throwsArgumentException. There is no automaticspan.Sort()fallback..NET 10 API Status
Avx512Vbmi2.Compressis available and shipping in .NET 10:Avx512Vbmi2.Compress(Vector512<byte> merge, Vector512<byte> mask, Vector512<byte> value)— maps tovpcompressbshort, and for 128/256/512-bit widths viaAvx512Vbmi2.VLAvx512Vbmi2.IsSupportedExpand (
vpexpandb) status: Not clearly surfaced as a namedExpandmethod in the .NET 10Avx512Vbmi2API docs. May needAvx512BWor manual implementation. Worth verifying.References:
Hardware Availability
VBMI2 is available on modern server and desktop CPUs from both Intel and AMD. Notably, Zen 4 (Ryzen 7000 / EPYC Genoa) has full VBMI2 support including
vpcompressb.Use Cases for SortingNetworks
1. SIMD Partition (highest value for this library)
Compress is the backbone of vectorized quicksort partitioning. Given a pivot, you can:
This enables a vectorized partition step processing 64 bytes at a time — the key primitive for SIMD quicksort.
Relevance to this library: Currently, arrays with n > 64 have no built-in sorting — users must supply an
OnFallbackmethod. A SIMD partition + recursive network sort hybrid could handle larger arrays natively:OnFallbackpath, or as a separate generated method2. Partial Sort / Top-K
A SIMD quickselect using compress/partition could efficiently find the K smallest/largest elements, then sort only those K elements with the network.
3. nth_element
Same partition primitive enables O(n) expected-time selection of the nth element — useful as a building block.
Prior Art — Intel x86-simd-sort
Intel's open-source x86-simd-sort library demonstrates this approach at production quality:
Key paper: Bramas, "A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512" (arXiv:1704.08579) — 4x faster than GNU
std::sort, 1.4x faster than Intel IPP.Implementation Considerations
This would expand the library's scope beyond fixed sorting networks:
Current scope: Source-generated fixed sorting networks for n=2..64, with user-supplied
OnFallbackfor larger sizes. SIMD dispatch (AVX2/AVX-512/ARM NEON) is already emitted by the generator for supported types.How this could fit the source generator model:
OnFallbackimplementation using SIMD partition that recursively breaks arrays down to network-sortable sizes (n <= 64). Users opt in via an attribute flag (e.g.,[SortingNetwork(64, typeof(int), Fallback = true)]). Public API stays the same.PartialSort/NthElementthat emit additional generated methods alongsideSort.Array.Sort().Recommendation
Priority: Low (research/exploratory). The .NET 10 APIs exist and the hardware is widely available. The most practical path is option 1 (conservative) — emitting an
OnFallbackthat uses SIMD partition internally, keeping the fixed sorting networks as the base case for small sub-arrays. This fits naturally into the existing source generator architecture since it would be a new emitter alongsideScalarEmitter,SimdX86Emitter, andSimdArmEmitter.