Summary
Explore ARM SVE/SVE2 (Scalable Vector Extension) for sorting networks, leveraging predicated operations and scalable vector lengths.
.NET 10 SVE/SVE2 API Status
SVE support is now available in .NET as [Experimental] APIs:
- System.Runtime.Intrinsics.Arm.Sve: full SVE intrinsics, landed in .NET 9, refined in .NET 10
- System.Runtime.Intrinsics.Arm.Sve2: partial SVE2 APIs (non-streaming only) in .NET 10, full coverage expected in .NET 11+
- Runtime detection: Sve.IsSupported / Sve2.IsSupported
- Key APIs for sorting networks are available:
  - Sve.Min / Sve.Max: predicated element-wise min/max
  - Sve.ConditionalSelect: mask-based element selection (blend)
- Predicate generation: Sve.CreateWhileLessThan, which naturally handles n=27 without padding/masking unused lanes
Tracking issues:
Blog post: Engineering the Scalable Vector Extension in .NET
SVE's Key Advantage for Sorting Networks
SVE's predication model is fundamentally different from AdvSimd/NEON and is well-suited for sorting:
- Natural handling of n=27: Sve.CreateWhileLessThan(0, 27) creates a predicate mask whose active lanes are exactly the lane indices below 27. No need to zero-pad the 28th element or worry about garbage in unused lanes. When a vector holds fewer than 27 lanes (16 byte-lanes at 128 bits), a second predicate starting at element 16 covers the remaining 11 elements.
- Vector-length agnostic (VLA): A single implementation works across all SVE vector widths (128-bit to 2048-bit). On wider hardware, more elements are processed per instruction automatically.
- Predicated min/max: Sve.Min(predicate, a, b) leaves inactive lanes untouched, eliminating the shuffle/blend complexity of the current AdvSimd path.
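The predicate behavior described above can be modeled in plain scalar code. The following is a minimal sketch, not the Sve API: PredicationModel, WhileLessThan, and PredicatedMin are hypothetical names, byte elements are assumed, and inactive lanes are assumed to keep the value of the first operand.

```csharp
using System;

// Scalar model of SVE predication. Hypothetical helper for illustration;
// it does not call the System.Runtime.Intrinsics.Arm.Sve intrinsics.
public static class PredicationModel
{
    // Models a "while less than" predicate: lane i is active when start + i < end.
    public static bool[] WhileLessThan(int start, int end, int laneCount)
    {
        var pred = new bool[laneCount];
        for (int i = 0; i < laneCount; i++)
        {
            pred[i] = start + i < end;
        }
        return pred;
    }

    // Models a predicated min: active lanes receive min(a, b),
    // inactive lanes keep the value from a (an assumption of this sketch).
    public static byte[] PredicatedMin(bool[] pred, byte[] a, byte[] b)
    {
        var result = new byte[a.Length];
        for (int i = 0; i < a.Length; i++)
        {
            result[i] = pred[i] ? Math.Min(a[i], b[i]) : a[i];
        }
        return result;
    }
}
```

With 32 byte-lanes (256-bit), WhileLessThan(0, 27, 32) marks 27 lanes active. With 16 byte-lanes (128-bit), the first vector is fully active and WhileLessThan(16, 27, 16) marks 11 lanes active for the second vector.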
Hardware Landscape (Critical Context)
| Platform | Core | SVE/SVE2 | Vector Width | Status |
|---|---|---|---|---|
| Graviton 3 | Neoverse V1 | SVE | 256 bits | Shipping (AWS, 2022+) |
| Graviton 4 | Neoverse V2 | SVE2 | 128 bits | Shipping (AWS, 2024+) |
| Nvidia Grace | Neoverse V2 | SVE2 | 128 bits | Shipping |
| Apple M1-M4 | Apple custom | No SVE | NEON 128b only | No SVE planned |
Key insight: Neoverse V2 parts (Graviton 4, Nvidia Grace) implement SVE2 at only 128 bits, the same width as NEON. This means SVE offers no width advantage over the current AdvSimd path on the most common ARM64 server hardware shipping today. The main SVE width benefit is on Graviton 3 (256-bit) and future wider implementations.
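To make the width argument concrete, the lane arithmetic can be written out. This is a small sketch assuming byte elements; VectorLengthModel and its methods are hypothetical names used only for illustration.

```csharp
using System;

// Lane arithmetic for a given vector width. Hypothetical helper for illustration.
public static class VectorLengthModel
{
    // Number of lanes in one vector of the given width and element size.
    public static int LaneCount(int vectorBits, int elementBits)
    {
        return vectorBits / elementBits;
    }

    // Number of vectors needed to hold n elements (ceiling division).
    public static int VectorsNeeded(int n, int vectorBits, int elementBits)
    {
        int lanes = LaneCount(vectorBits, elementBits);
        return (n + lanes - 1) / lanes;
    }
}
```

For n=27 byte elements, VectorsNeeded(27, 128, 8) is 2, the same as NEON, while VectorsNeeded(27, 256, 8) is 1 on a 256-bit implementation such as Graviton 3.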
Apple Silicon (the macos-latest CI runner) has no SVE support at all — SVE code would never run on macOS CI.
Implementation Approach
SVE sorting would look conceptually like:
    if (Sve.IsSupported)
    {
        // Conceptual only: the method names below are illustrative and may not match the shipped Sve signatures.
        // Vector length is determined at runtime: could be 128b, 256b, 512b, etc.
        var pred = Sve.CreateWhileLessThan(0, n); // predicate for n elements
        var vec = Sve.LoadVector(pred, ref first);
        // Each network step: permute + predicated min/max + select
        var shuffled = Sve.PermuteVariable(vec, shuffleIndices); // bring each lane's partner alongside it
        var mins = Sve.Min(pred, vec, shuffled);
        var maxs = Sve.Max(pred, vec, shuffled);
        vec = Sve.ConditionalSelect(blendMask, maxs, mins); // blendMask lanes keep the max, others the min
    }
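The intended effect of one network step (permute, min/max, select) can be checked against a scalar model. The sketch below is an assumption-laden illustration, not the generator's output: NetworkStepModel and its parameter names are hypothetical, and inactive lanes are assumed to pass through unchanged.

```csharp
using System;

// Scalar model of one sorting-network step. Hypothetical helper for illustration;
// it mirrors the permute + min/max + select sequence without using Sve intrinsics.
public static class NetworkStepModel
{
    // vec:     current lane values
    // partner: partner[i] is the lane that lane i is compared with (the permute)
    // takeMax: true where lane i keeps the larger of the pair (the blend mask)
    // active:  predicate; inactive lanes are copied through unchanged
    public static byte[] Step(byte[] vec, int[] partner, bool[] takeMax, bool[] active)
    {
        var result = new byte[vec.Length];
        for (int i = 0; i < vec.Length; i++)
        {
            byte other = vec[partner[i]];
            byte min = Math.Min(vec[i], other);
            byte max = Math.Max(vec[i], other);
            byte selected = takeMax[i] ? max : min;
            result[i] = active[i] ? selected : vec[i];
        }
        return result;
    }
}
```

For example, with values {3, 1, 4, 2}, partners {1, 0, 3, 2}, and takeMax set on lanes 1 and 3, a fully active step yields {1, 3, 2, 4}. If only the first two lanes are active, the result is {1, 3, 4, 2}.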
Challenges:
- The generator would need a new WriteArmSveSortMethod path
- SVE shuffle/permute instructions differ from AdvSimd's TBL approach
- The VLA model means the number of elements per vector isn't known at compile time — the sorting network must be structured around this (though for n=27/28 with a minimum of 128-bit, we know we always have at least 16 byte-lanes)
- For 128-bit SVE (Graviton 4), performance would likely be similar to the existing AdvSimd path — the benefit comes from wider implementations
- Testing requires SVE-capable hardware (CI would need a Graviton instance or QEMU SVE emulation)
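One way to keep part of this testable without SVE hardware is to isolate the path-selection decision in a pure function. The sketch below is hypothetical: SortPath, SortPathPolicy, and the rule "prefer SVE only when wider than 128 bits" are assumptions drawn from the performance note above, not an existing part of the generator. In real code the arguments would come from Sve.IsSupported, AdvSimd.IsSupported, and a runtime query of the vector length.

```csharp
using System;

// Hypothetical code paths a generated sort could dispatch to.
public enum SortPath
{
    Scalar,
    AdvSimd,
    Sve
}

// Hypothetical selection policy, kept free of intrinsics so it runs on any CI machine.
public static class SortPathPolicy
{
    public static SortPath Choose(bool sveSupported, int sveVectorBits, bool advSimdSupported)
    {
        // Assumption: SVE is only worth taking when it is wider than NEON's 128 bits.
        if (sveSupported && sveVectorBits > 128)
        {
            return SortPath.Sve;
        }
        if (advSimdSupported)
        {
            return SortPath.AdvSimd;
        }
        return SortPath.Scalar;
    }
}
```

Under this policy a 256-bit implementation such as Graviton 3 selects the SVE path, while a 128-bit implementation such as Graviton 4 stays on the existing AdvSimd path. The SVE kernels themselves would still need SVE hardware or emulation to test.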
Research References
- Bramas 2021 — "A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE)" — PeerJ Computer Science. Demonstrates predicated bitonic sorting networks on SVE with up to 4x speedup. The most directly relevant academic work.
- Brank 2023 — Thesis on VLA SIMD parallelism with focus on ARM SVE.
Recommendation
Priority: Low-medium. The APIs exist in .NET 10 but are [Experimental]. The hardware picture is mixed: the most common ARM64 servers (Neoverse V2 parts such as Graviton 4) only have 128-bit SVE, offering no width advantage over the existing AdvSimd path. The main benefits would be:
- Code simplicity — predication eliminates the n=27 vs n=28 split and lane-crossing complexity
- Future-proofing — wider SVE implementations (512-bit+) would see automatic speedups
- Graviton 3 — the 256-bit SVE width would process more elements per instruction
Consider revisiting when .NET 11 stabilizes SVE2 APIs and wider SVE hardware (Neoverse V3+) ships.