Research: AVX-512 VBMI2 compress/expand for hybrid sorting

## Summary

Investigate AVX-512 VBMI2 compress/expand instructions (`vpcompressb`/`vpexpandb`) for potential use in hybrid sorting approaches beyond the current fixed sorting networks.

## Current Architecture (Source Generators)

The library now uses a **Roslyn incremental source generator** (`SortingNetworkGenerator`). Users apply `[SortingNetwork(size, typeof(T))]` attributes to a `partial class`, and the generator emits:

- `Sort(Span<T>)` / `Sort(T[])` overloads with inline SIMD dispatch
- `Sort(Span<T>, IComparer<T>?)` / `Sort(T[], IComparer<T>?)` overloads using a loop-based `ApplyNetworkWithComparer` path
- Optional `OnFallback(Span<T>)` / `OnFallback(Span<T>, IComparer<T>)` static methods for unsupported sizes

**Network sizes:** 2–64 elements (optimal networks for 2–16, best-known for 17–64 via Dobbelaere's SorterHunter).

**SIMD support already implemented:**

| ISA | Types | Max Elements |
|---|---|---|
| AVX2 | `byte`, `sbyte`, `int`, `uint`, `float` | 64 (1-byte), 64 (4-byte) |
| AVX-512F | `long`, `ulong`, `double` | 32 |
| AVX-512BW | `short`, `ushort`, `char` | 64 |
| AVX-512 VBMI | `byte`, `sbyte` (sizes > 32) | 64 |
| ARM AdvSimd (NEON) | `byte`, `sbyte`, `short`, `ushort`, `char`, `int`, `uint`, `float` | 32 |

**Scalar fallback:** Unrolled compare-and-swap using `ref` + `Unsafe.Add` for all types, including `string` (via `string.CompareOrdinal`) and custom `IComparable<T>` value types.

**Fallback for unsupported sizes:** If the class declares `static void OnFallback(Span<T>)`, the generator calls it for sizes without a configured network. Otherwise, it throws `ArgumentException`. There is no automatic `span.Sort()` fallback.

## .NET 10 API Status

**`Avx512Vbmi2.Compress` is available and shipping in .NET 10:**

- `Avx512Vbmi2.Compress(Vector512<byte> merge, Vector512<byte> mask, Vector512<byte> value)` — maps to `vpcompressb`
- Overloads for `short`, and for 128/256/512-bit widths via `Avx512Vbmi2.VL`
- Runtime detection: `Avx512Vbmi2.IsSupported`
- [dotnet/runtime#108109](https://github.com/dotnet/runtime/issues/108109) (closed) — merge-masking semantics for Compress are implemented

**Expand (`vpexpandb`) status:** Not clearly surfaced as a named `Expand` method in the .NET 10 `Avx512Vbmi2` API docs. May need `Avx512BW` or manual implementation. Worth verifying.

**References:**
- [Avx512Vbmi2.Compress API docs](https://learn.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx512vbmi2.compress?view=net-10.0)
- [Avx512Vbmi2.VL.Compress API docs](https://learn.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.x86.avx512vbmi2.vl.compress?view=net-10.0)

## Hardware Availability

| CPU | VBMI2 | vpcompressb |
|---|---|---|
| Intel Ice Lake (2019+) | Yes | Yes |
| Intel Sapphire Rapids (2023+) | Yes | Yes |
| AMD Zen 4 / EPYC Genoa (2022+) | Yes | Yes |
| Intel Alder Lake (12th gen) | No | No |
| AMD Zen 3 and earlier | No | No |

VBMI2 is available on modern server and desktop CPUs from both Intel and AMD. Notably, Zen 4 (Ryzen 7000 / EPYC Genoa) has full VBMI2 support including `vpcompressb`.

## Use Cases for SortingNetworks

### 1. SIMD Partition (highest value for this library)

Compress is the backbone of vectorized quicksort partitioning. Given a pivot, you can:

```csharp
var data = Avx512BW.LoadVector512(ref element);
var mask = Avx512BW.CompareGreaterThan(data, pivot);  // which elements > pivot
var left = Avx512Vbmi2.Compress(data, ~mask);          // elements <= pivot, packed
var right = Avx512Vbmi2.Compress(data, mask);           // elements > pivot, packed
```

This enables a **vectorized partition step** processing 64 bytes at a time — the key primitive for SIMD quicksort.

**Relevance to this library:** Currently, arrays with n > 64 have no built-in sorting — users must supply an `OnFallback` method. A SIMD partition + recursive network sort hybrid could handle larger arrays natively:
- Use SIMD partition to split the array
- Recursively partition until sub-arrays reach n <= 64
- Apply the existing sorting network (with SIMD dispatch) to each sub-array
- This could be emitted by the source generator as part of the `OnFallback` path, or as a separate generated method

### 2. Partial Sort / Top-K

A SIMD quickselect using compress/partition could efficiently find the K smallest/largest elements, then sort only those K elements with the network.

### 3. nth_element

Same partition primitive enables O(n) expected-time selection of the nth element — useful as a building block.

## Prior Art — Intel x86-simd-sort

Intel's open-source [x86-simd-sort](https://github.com/intel/x86-simd-sort) library demonstrates this approach at production quality:

- **10-17x faster** sorting than scalar in NumPy/PyTorch (integrated upstream)
- Hybrid quicksort: SIMD partition + bitonic sort for small arrays
- Supports partial sort, nth_element, and argsort
- Up to **15-20x** speedup for partial sort and nth_element on 32-bit data
- Available benchmarks and detailed analysis: [sort-research-rs writeup](https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md)

**Key paper:** Bramas, "A Novel Hybrid Quicksort Algorithm Vectorized using AVX-512" ([arXiv:1704.08579](https://arxiv.org/pdf/1704.08579)) — 4x faster than GNU `std::sort`, 1.4x faster than Intel IPP.

## Implementation Considerations

This would expand the library's scope beyond fixed sorting networks:

**Current scope:** Source-generated fixed sorting networks for n=2..64, with user-supplied `OnFallback` for larger sizes. SIMD dispatch (AVX2/AVX-512/ARM NEON) is already emitted by the generator for supported types.

**How this could fit the source generator model:**
1. **Conservative** — The generator emits a built-in `OnFallback` implementation using SIMD partition that recursively breaks arrays down to network-sortable sizes (n <= 64). Users opt in via an attribute flag (e.g., `[SortingNetwork(64, typeof(int), Fallback = true)]`). Public API stays the same.
2. **Moderate** — New attributes for `PartialSort` / `NthElement` that emit additional generated methods alongside `Sort`.
3. **Ambitious** — Full SIMD hybrid sort for arbitrary n, competing with `Array.Sort()`.

## Recommendation

**Priority: Low (research/exploratory).** The .NET 10 APIs exist and the hardware is widely available. The most practical path is option 1 (conservative) — emitting an `OnFallback` that uses SIMD partition internally, keeping the fixed sorting networks as the base case for small sub-arrays. This fits naturally into the existing source generator architecture since it would be a new emitter alongside `ScalarEmitter`, `SimdX86Emitter`, and `SimdArmEmitter`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Research: AVX-512 VBMI2 compress/expand for hybrid sorting #34

Summary

Current Architecture (Source Generators)

.NET 10 API Status

Hardware Availability

Use Cases for SortingNetworks

1. SIMD Partition (highest value for this library)

2. Partial Sort / Top-K

3. nth_element

Prior Art — Intel x86-simd-sort

Implementation Considerations

Recommendation

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

ISA	Types	Max Elements
AVX2	`byte`, `sbyte`, `int`, `uint`, `float`	64 (1-byte), 64 (4-byte)
AVX-512F	`long`, `ulong`, `double`	32
AVX-512BW	`short`, `ushort`, `char`	64
AVX-512 VBMI	`byte`, `sbyte` (sizes > 32)	64
ARM AdvSimd (NEON)	`byte`, `sbyte`, `short`, `ushort`, `char`, `int`, `uint`, `float`	32

CPU	VBMI2	vpcompressb
Intel Ice Lake (2019+)	Yes	Yes
Intel Sapphire Rapids (2023+)	Yes	Yes
AMD Zen 4 / EPYC Genoa (2022+)	Yes	Yes
Intel Alder Lake (12th gen)	No	No
AMD Zen 3 and earlier	No	No

Research: AVX-512 VBMI2 compress/expand for hybrid sorting #34

Description

Summary

Current Architecture (Source Generators)

.NET 10 API Status

Hardware Availability

Use Cases for SortingNetworks

1. SIMD Partition (highest value for this library)

2. Partial Sort / Top-K

3. nth_element

Prior Art — Intel x86-simd-sort

Implementation Considerations

Recommendation

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions