2 changes: 1 addition & 1 deletion .github/instructions/performance.instructions.md
Original file line number Diff line number Diff line change
@@ -10,7 +10,7 @@ This is a high-performance sorting library. All code in the hot path must follow
- Use `MemoryMarshal.GetReference(span)` to get a ref to the first element.
- Mark hot-path private methods with `[MethodImpl(MethodImplOptions.AggressiveInlining)]`.
- Avoid heap allocations in sort methods — no LINQ, no closures, no boxing.
- Use inline compare-and-swap (`if (a > b) { T temp = a; a = b; b = temp; }`) for primitive types rather than `IComparer<T>`.
- Use platform-specific compare-and-swap for numeric primitive types (`byte`, `sbyte`, `short`, `ushort`, `int`, `uint`, `long`, `ulong`, `float`, `double`). By default, the generator emits a runtime `X86Base.IsSupported` check: branchless `Math.Min`/`Math.Max` on x86 (JIT lowers to `cmov`), branching `if/swap` on ARM (where branch prediction outperforms `csel` chains). The `Branchless` attribute property can force one strategy: `[SortingNetwork(27, typeof(int), Branchless = true)]`. For `char` and custom types, always use branching `if (a > b) { T temp = a; a = b; b = temp; }`.
- **NaN is not supported** for `float`/`double` sorting. Sorting networks use ordered comparisons where `NaN > x` is always false, so NaN values disrupt sort order. See [#10](https://github.com/jonathanpeppers/SortingNetworks/issues/10) and [#11](https://github.com/jonathanpeppers/SortingNetworks/issues/11).
- `IComparer<T>` overloads use the loop-based `ApplyNetworkWithComparer` path and are not unrolled.
- Fallback to `span.Sort()` or `Array.Sort()` for sizes outside the network range.
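
The NaN restriction above can be seen on a single comparator. This is a minimal sketch, not library code: `SwapBranching` and `SwapBranchless` are hypothetical helpers that copy the two swap shapes described in this file. It relies only on two documented behaviors: ordered comparisons with NaN are false, and `Math.Min`/`Math.Max` return NaN when either argument is NaN.

```csharp
using System;

// Hypothetical helper: the branching swap shape.
// NaN > x is always false, so a leading NaN is never moved.
static void SwapBranching(ref float a, ref float b)
{
    if (a > b) { float t = a; a = b; b = t; }
}

// Hypothetical helper: the branchless swap shape.
// Math.Min and Math.Max both return NaN if either input is NaN,
// so the non-NaN value is overwritten.
static void SwapBranchless(ref float a, ref float b)
{
    float t0 = Math.Min(a, b);
    float t1 = Math.Max(a, b);
    a = t0;
    b = t1;
}

float p = float.NaN, q = 1f;
SwapBranching(ref p, ref q);
Console.WriteLine($"branching:  {p}, {q}"); // p is still NaN, q is still 1

float r = float.NaN, s = 1f;
SwapBranchless(ref r, ref s);
Console.WriteLine($"branchless: {r}, {s}"); // both are NaN, the 1 is lost
```

Under these assumptions the two strategies fail differently on NaN input: the branching form leaves elements out of order, while the branchless form can replace real values with NaN.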
82 changes: 61 additions & 21 deletions README.md
@@ -121,7 +121,7 @@ MySorter.Sort(otherData); // any other size → OnFallback
the comparer path still throws.

The source generator emits optimized sort methods with:
- **Scalar unrolled** compare-and-swap for all sizes/types
- **Scalar unrolled** compare-and-swap for all sizes/types (platform-adaptive: branchless `Math.Min`/`Math.Max` on x86, branching `if/swap` on ARM)
- **x86 SIMD** (AVX2, AVX-512) when the type and size fit in SIMD registers
- **ARM64 SIMD** (AdvSimd/NEON) for supported types
- **IComparer&lt;T&gt;** overloads using loop-based network application
@@ -171,27 +171,45 @@ comparators that involve channel 27.
### Scalar implementation

The simplest path unrolls every compare-and-swap from the network into
straight-line code. For a 3-element example, a depth-3 network looks like:
straight-line code. For numeric types, the generator emits a runtime platform
check: on x86, branchless `Math.Min`/`Math.Max` calls are used (the JIT lowers
these to `cmov` instructions); on ARM, branching `if/swap` is used (branch
prediction outperforms `csel` data-dependency chains). The JIT
dead-code-eliminates the unused path. For a 3-element example on x86:

```csharp
// Sort 3 elements with a sorting network (depth 3, 3 comparators)
static void Sort3(ref int e0, ref int e1, ref int e2)
{
// Layer 1 — two independent comparators could go here, but
// for 3 elements there is only one pair per layer.
if (e0 > e1) { int t = e0; e0 = e1; e1 = t; }
// Layer 1
{ int t0 = Math.Min(e0, e1); int t1 = Math.Max(e0, e1); e0 = t0; e1 = t1; }

// Layer 2
if (e1 > e2) { int t = e1; e1 = e2; e2 = t; }
{ int t0 = Math.Min(e1, e2); int t1 = Math.Max(e1, e2); e1 = t0; e2 = t1; }

// Layer 3
if (e0 > e1) { int t = e0; e0 = e1; e1 = t; }
{ int t0 = Math.Min(e0, e1); int t1 = Math.Max(e0, e1); e0 = t0; e1 = t1; }
}
```
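
The listing above shows only the x86 half. As a rough sketch of the default auto-detect shape (`Sort3Adaptive` is a hypothetical hand-written name, not generator output, and the exact generated layout is an assumption), both strategies sit behind one `X86Base.IsSupported` check:

```csharp
using System;
using System.Runtime.Intrinsics.X86;

// Hypothetical hand-written equivalent of the auto-detect path for 3 ints.
static void Sort3Adaptive(ref int e0, ref int e1, ref int e2)
{
    if (X86Base.IsSupported)
    {
        // x86: branchless Math.Min/Math.Max compare-and-swap
        { int t0 = Math.Min(e0, e1); int t1 = Math.Max(e0, e1); e0 = t0; e1 = t1; }
        { int t0 = Math.Min(e1, e2); int t1 = Math.Max(e1, e2); e1 = t0; e2 = t1; }
        { int t0 = Math.Min(e0, e1); int t1 = Math.Max(e0, e1); e0 = t0; e1 = t1; }
    }
    else
    {
        // Other platforms (for example ARM): branching if/swap
        if (e0 > e1) { int t = e0; e0 = e1; e1 = t; }
        if (e1 > e2) { int t = e1; e1 = e2; e2 = t; }
        if (e0 > e1) { int t = e0; e0 = e1; e1 = t; }
    }
}

int a = 3, b = 1, c = 2;
Sort3Adaptive(ref a, ref b, ref c);
Console.WriteLine($"{a} {b} {c}"); // prints "1 2 3" on either path
```

Both branches apply the same three comparators, so the result is identical for integer input; only the instruction sequence differs.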

For `char` and custom types, a branching `if (a > b)` swap is used instead
(`char` has no `Math.Min`/`Math.Max` overloads). The swap strategy can also be
controlled explicitly with the `Branchless` attribute property:

```csharp
// Force branchless Math.Min/Max on all platforms
[SortingNetwork(27, typeof(int), Branchless = true)]

// Force branching if/swap on all platforms
[SortingNetwork(27, typeof(int), Branchless = false)]

// Default: auto-detect at runtime (recommended)
[SortingNetwork(27, typeof(int))]
```
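
The three attribute forms above amount to a small decision table. The sketch below is illustrative only: `ChooseSwapStrategy` and its parameter names are hypothetical, and the returned strings are labels rather than anything the generator emits. Because C# attribute arguments cannot be nullable, the "not specified" case is modeled here as `null`, on the assumption that an omitted property is detected by its absence.

```csharp
using System;

// Hypothetical helper: maps the Branchless setting to a swap strategy label.
static string ChooseSwapStrategy(bool typeHasMathMinMax, bool? branchless)
{
    // char, string, and custom types always take the branching path.
    if (!typeHasMathMinMax) return "branching";

    return branchless switch
    {
        true  => "branchless",                        // Branchless = true
        false => "branching",                         // Branchless = false
        null  => "runtime X86Base.IsSupported check", // property omitted
    };
}

Console.WriteLine(ChooseSwapStrategy(typeHasMathMinMax: true, branchless: null));
Console.WriteLine(ChooseSwapStrategy(typeHasMathMinMax: true, branchless: true));
Console.WriteLine(ChooseSwapStrategy(typeHasMathMinMax: false, branchless: true));
```

Note the last call: forcing `Branchless = true` on a type without `Math.Min`/`Math.Max` overloads still yields the branching path in this sketch, matching the statement that such types always use `if`/swap.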

For the real 27/28-element networks the same pattern is used — the code
generator emits all ~185 comparators across 13 layers as a flat `if`/swap
sequence. Elements are loaded into local variables via `Unsafe.Add(ref T, n)`
generator emits all ~185 comparators across 13 layers as a flat sequence.
Elements are loaded into local variables via `Unsafe.Add(ref T, n)`
to avoid bounds checks:

```csharp
@@ -200,10 +218,10 @@ int e0 = first;
int e1 = Unsafe.Add(ref first, 1);
// ... load e2 through e26 ...

// Layer 1 comparators:
if (e1 > e26) { int temp = e1; e1 = e26; e26 = temp; }
if (e2 > e25) { int temp = e2; e2 = e25; e25 = temp; }
if (e3 > e24) { int temp = e3; e3 = e24; e24 = temp; }
// Layer 1 comparators (branchless Math.Min/Max path, used for numeric types on x86):
{ int t0 = Math.Min(e1, e26); int t1 = Math.Max(e1, e26); e1 = t0; e26 = t1; }
{ int t0 = Math.Min(e2, e25); int t1 = Math.Max(e2, e25); e2 = t0; e25 = t1; }
{ int t0 = Math.Min(e3, e24); int t1 = Math.Max(e3, e24); e3 = t0; e24 = t1; }
// ... remaining comparators in layers 2–13 ...
```
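
To make the load / compare-and-swap pattern concrete at a size that fits on one screen, here is a minimal 4-element sketch. `Sort4` is a hypothetical name, the network is the standard 5-comparator network for 4 inputs, and the store-back step at the end is an assumption about how the locals are written back, since the excerpt above only shows the loads.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical 4-element version of the load / swap / store shape.
static void Sort4(Span<int> span)
{
    if (span.Length != 4)
        throw new ArgumentException("Expected exactly 4 elements.", nameof(span));

    // Take a ref to the first element, then load every element into a local
    // with Unsafe.Add so no per-element bounds check is needed.
    ref int first = ref MemoryMarshal.GetReference(span);
    int e0 = first;
    int e1 = Unsafe.Add(ref first, 1);
    int e2 = Unsafe.Add(ref first, 2);
    int e3 = Unsafe.Add(ref first, 3);

    // Layer 1
    if (e0 > e1) { int t = e0; e0 = e1; e1 = t; }
    if (e2 > e3) { int t = e2; e2 = e3; e3 = t; }
    // Layer 2
    if (e0 > e2) { int t = e0; e0 = e2; e2 = t; }
    if (e1 > e3) { int t = e1; e1 = e3; e3 = t; }
    // Layer 3
    if (e1 > e2) { int t = e1; e1 = e2; e2 = t; }

    // Assumed store-back: write the sorted locals to the span.
    first = e0;
    Unsafe.Add(ref first, 1) = e1;
    Unsafe.Add(ref first, 2) = e2;
    Unsafe.Add(ref first, 3) = e3;
}

int[] data = { 4, 1, 3, 2 };
Sort4(data);
Console.WriteLine(string.Join(", ", data)); // prints "1, 2, 3, 4"
```

The length check up front matters in this sketch because `Unsafe.Add` performs no bounds checking of its own.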

@@ -575,17 +593,39 @@ multi-stage TBL overhead exceeds SIMD benefit at these sizes:

| Size | Kind | GeneratedSort | Ratio vs ArraySort |
|---|---|---|---|
| 27 | Random | 74 ns | **0.74x** (26% faster) |
| 27 | Sorted | 30 ns | **0.52x** (48% faster) |
| 27 | Reversed | 78 ns | 1.22x |
| 27 | Duplicates | 77 ns | **0.72x** (28% faster) |
| 28 | Random | 80 ns | **0.65x** (35% faster) |
| 28 | Sorted | 30 ns | **0.51x** (49% faster) |
| 28 | Reversed | 73 ns | 1.15x |
| 28 | Duplicates | 80 ns | **0.73x** (27% faster) |
| 27 | Random | 74 ns | **0.76x** (24% faster) |
| 27 | Sorted | 28 ns | **0.52x** (48% faster) |
| 27 | Reversed | 73 ns | 1.21x |
| 27 | Duplicates | 77 ns | **0.77x** (23% faster) |
| 28 | Random | 72 ns | **0.60x** (40% faster) |
| 28 | Sorted | 30 ns | **0.52x** (48% faster) |
| 28 | Reversed | 72 ns | 1.17x |
| 28 | Duplicates | 71 ns | **0.68x** (32% faster) |

> With AVX2 SIMD, GeneratedSort is consistently faster than Array.Sort for `int` across all input patterns. On ARM64, the early-exit sorted check makes sorted input ~2x faster than ArraySort. Reversed input is slightly slower due to the overhead of cross-vector TBL/TBX shuffles with 7 registers.

### int scalar sizes 23-32 (platform-adaptive)

For sizes outside the SIMD range, the scalar unrolled network uses a runtime
`X86Base.IsSupported` check: branchless `Math.Min`/`Math.Max` on x86 (JIT
lowers to `cmov`), branching `if/swap` on ARM (where branch prediction
outperforms `csel` data-dependency chains):

| Size | Ubuntu x64 (EPYC) | Windows x64 (EPYC) | macOS ARM (M1) |
|---|---|---|---|
| 23 | **0.82x** (18% faster) | **0.68x** (32% faster) | **0.75x** (25% faster) |
| 24 | **0.73x** (27% faster) | 1.02x | **0.77x** (23% faster) |
| 25 | **0.87x** (13% faster) | **0.85x** (15% faster) | **0.75x** (25% faster) |
| 26 | **0.81x** (19% faster) | **0.84x** (16% faster) | **0.76x** (24% faster) |
| 29 | **0.95x** | — | — |
| 30 | **0.93x** | — | — |
| 31 | **0.75x** (25% faster) | — | — |
| 32 | **0.78x** (22% faster) | — | — |

> **Note:** The `Branchless` attribute property can force one strategy on all
> platforms: `[SortingNetwork(27, typeof(int), Branchless = true)]` for
> branchless-only, `Branchless = false` for branching-only.

### Sizes 33-64 (x86, scalar unrolled)

Networks for sizes 33-64 use best-known networks from [Dobbelaere's SorterHunter](https://github.com/bertdobbelaere/SorterHunter).
95 changes: 80 additions & 15 deletions SortingNetworks.Generators/ScalarEmitter.cs
@@ -1,18 +1,46 @@
using System.Text;
using Microsoft.CodeAnalysis;

namespace SortingNetworks.Generators
{
/// <summary>
/// Emits unrolled scalar compare-and-swap sorting network code.
/// Emits unrolled scalar sorting code for a given network size and element type.
/// uses ref + Unsafe.Add for element access, inline compare-and-swap.
/// Uses ref + Unsafe.Add for element access. For numeric types with
/// System.Math.Min/Max overloads:
/// - Default (branchless=null): emits a runtime X86Base.IsSupported check with
/// branchless min/max on x86 and branching if/swap elsewhere.
/// - Branchless=true: emits only Math.Min/Max swaps.
/// - Branchless=false: emits only branching if/swap.
/// For char, string, and custom types, always emits branching if/swap.
/// </summary>
internal static class ScalarEmitter
{
/// <summary>
/// Returns true if the given <see cref="SpecialType"/> has direct
/// System.Math.Min/Max overloads suitable for branchless emission.
/// Excludes char/nint/nuint (no Math.Min/Max overloads).
/// Float/double are included — NaN is unsupported (see issues #10, #11).
/// </summary>
internal static bool SupportsBranchlessMinMax(SpecialType specialType) => specialType switch
{
SpecialType.System_Byte => true,
SpecialType.System_SByte => true,
SpecialType.System_Int16 => true,
SpecialType.System_UInt16 => true,
SpecialType.System_Int32 => true,
SpecialType.System_UInt32 => true,
SpecialType.System_Int64 => true,
SpecialType.System_UInt64 => true,
SpecialType.System_Single => true,
SpecialType.System_Double => true,
_ => false,
};

/// <summary>
/// Emits a scalar Sort method for the given network size and element type.
/// <paramref name="branchless"/>: null = auto (runtime platform check), true = force branchless, false = force branching.
/// </summary>
internal static string EmitSortMethod(int size, string typeName, int[] network, bool useCompareTo = false)
internal static string EmitSortMethod(int size, string typeName, SpecialType specialType, int[] network, bool useCompareTo = false, bool? branchless = null)
{
var sb = new StringBuilder();
sb.AppendLine($" private static void Sort{size}(ref {typeName} first)");
@@ -26,20 +54,33 @@ internal static string EmitSortMethod(int size, string typeName, int[] network,
}
sb.AppendLine();

// Emit compare-and-swap for each pair
// Determine swap strategy
bool isString = typeName == "string";
for (int i = 0; i < network.Length; i += 2)
bool canUseMathMinMax = !useCompareTo && !isString && SupportsBranchlessMinMax(specialType);

if (canUseMathMinMax && branchless == true)
{
int a = network[i];
int b = network[i + 1];
string condition;
if (useCompareTo)
condition = $"e{a}.CompareTo(e{b}) > 0";
else if (isString)
condition = $"string.CompareOrdinal(e{a}, e{b}) > 0";
else
condition = $"e{a} > e{b}";
sb.AppendLine($" if ({condition}) {{ {typeName} temp = e{a}; e{a} = e{b}; e{b} = temp; }}");
// Force branchless: Math.Min/Max only
EmitComparators(sb, network, typeName, branchless: true, indent: " ");
}
else if (canUseMathMinMax && branchless != false)
{
// Auto-detect (branchless == null): emit runtime platform check
// The JIT treats X86Base.IsSupported as a constant and dead-code-eliminates the unused branch
sb.AppendLine(" if (System.Runtime.Intrinsics.X86.X86Base.IsSupported)");
sb.AppendLine(" {");
EmitComparators(sb, network, typeName, branchless: true, indent: " ");
sb.AppendLine(" }");
sb.AppendLine(" else");
sb.AppendLine(" {");
EmitComparators(sb, network, typeName, branchless: false, indent: " ");
sb.AppendLine(" }");
}
else
{
// Force branching, or type doesn't support Math.Min/Max
EmitComparators(sb, network, typeName, branchless: false, indent: " ",
useCompareTo: useCompareTo, isString: isString);
}
sb.AppendLine();

@@ -54,6 +95,30 @@ internal static string EmitSortMethod(int size, string typeName, int[] network,
return sb.ToString();
}

private static void EmitComparators(StringBuilder sb, int[] network, string typeName, bool branchless, string indent, bool useCompareTo = false, bool isString = false)
{
for (int i = 0; i < network.Length; i += 2)
{
int a = network[i];
int b = network[i + 1];
if (branchless)
{
sb.AppendLine($"{indent}{{ {typeName} t0 = System.Math.Min(e{a}, e{b}); {typeName} t1 = System.Math.Max(e{a}, e{b}); e{a} = t0; e{b} = t1; }}");
}
else
{
string condition;
if (useCompareTo)
condition = $"e{a}.CompareTo(e{b}) > 0";
else if (isString)
condition = $"string.CompareOrdinal(e{a}, e{b}) > 0";
else
condition = $"e{a} > e{b}";
sb.AppendLine($"{indent}if ({condition}) {{ {typeName} temp = e{a}; e{a} = e{b}; e{b} = temp; }}");
}
}
}

/// <summary>
/// Emits a static readonly int[] field containing the network comparator pairs for a given size.
/// </summary>
27 changes: 22 additions & 5 deletions SortingNetworks.Generators/SortingNetworkGenerator.cs
@@ -57,6 +57,20 @@ public class SortingNetworkGenerator : IIncrementalGenerator
_ => null,
};

/// <summary>
/// Reads the optional Branchless named argument from an attribute.
/// Returns null if not specified (auto-detect), true/false if explicitly set.
/// </summary>
private static bool? GetBranchlessArg(AttributeData attr)
{
foreach (var arg in attr.NamedArguments)
{
if (arg.Key == "Branchless" && arg.Value.Value is bool value)
return value;
}
return null;
}

public void Initialize(IncrementalGeneratorInitializationContext context)
{
// Find all class declarations with [SortingNetwork] attributes
@@ -108,7 +122,7 @@ public void Initialize(IncrementalGeneratorInitializationContext context)
}
}

attributes.Add(new NetworkRequest(size, typeName, typeSymbol.SpecialType, isComparable));
attributes.Add(new NetworkRequest(size, typeName, typeSymbol.SpecialType, isComparable, branchless: GetBranchlessArg(attr)));
}
}

@@ -862,7 +876,7 @@ private static void Execute(SourceProductionContext context, ImmutableArray<Gene
var scalarKey = $"Sort{request.Size}_{delegateTypeName}";
if (emittedScalarMethods.Add(scalarKey))
{
sb.Append(ScalarEmitter.EmitSortMethod(request.Size, delegateTypeName, network));
sb.Append(ScalarEmitter.EmitSortMethod(request.Size, delegateTypeName, delegateSpecialType, network, branchless: request.Branchless));
sb.AppendLine();
}
}
@@ -872,7 +886,7 @@ private static void Execute(SourceProductionContext context, ImmutableArray<Gene
var scalarKey = $"Sort{request.Size}_{request.TypeName}";
if (emittedScalarMethods.Add(scalarKey))
{
sb.Append(ScalarEmitter.EmitSortMethod(request.Size, request.TypeName, network));
sb.Append(ScalarEmitter.EmitSortMethod(request.Size, request.TypeName, request.SpecialType, network, branchless: request.Branchless));
sb.AppendLine();
}
}
@@ -881,7 +895,7 @@ private static void Execute(SourceProductionContext context, ImmutableArray<Gene
var scalarKey = $"Sort{request.Size}_{request.TypeName}";
if (emittedScalarMethods.Add(scalarKey))
{
sb.Append(ScalarEmitter.EmitSortMethod(request.Size, request.TypeName, network, useCompareTo: true));
sb.Append(ScalarEmitter.EmitSortMethod(request.Size, request.TypeName, request.SpecialType, network, useCompareTo: true));
sb.AppendLine();
}
}
@@ -1033,14 +1047,17 @@ private sealed class NetworkRequest : IEquatable<NetworkRequest>
public SpecialType SpecialType { get; }
public bool IsCustomType { get; }
public bool IsComparable { get; }
/// <summary>null = auto (platform-detect), true = force branchless, false = force branching</summary>
public bool? Branchless { get; }

public NetworkRequest(int size, string typeName, SpecialType specialType, bool isComparable)
public NetworkRequest(int size, string typeName, SpecialType specialType, bool isComparable, bool? branchless = null)
{
Size = size;
TypeName = typeName;
SpecialType = specialType;
IsCustomType = !SupportedSpecialTypes.Contains(specialType);
IsComparable = isComparable;
Branchless = branchless;
}

public bool Equals(NetworkRequest? other)