---
title: 'Nvidia SPIR-V Compiler Bug or Do Subgroup Shuffle Operations Not Imply Reconvergence?'
slug: 'subgroup-shuffle-reconvergence-on-nvidia'
description: "A look at the behavior behind Nabla's subgroup scan"
date: '2025-06-19'
authors: ['keptsecret']
tags: ['nabla', 'vulkan', 'article']
last_update:
  date: '2025-06-19'
  author: keptsecret
---

Reduce and scan operations are core building blocks in the world of parallel computing, and now Nabla has a new release that makes those operations even faster for Vulkan at the subgroup and workgroup levels.

This article takes a brief look at Nabla's implementation of reduce and scan on the GPU in Vulkan, followed by a discussion of the expected reconvergence behavior after subgroup operations.

<!-- truncate -->

## Reduce and Scan

Let's give a quick introduction, or a recap for those already familiar, to reduce and scan operations.

A reduction takes a binary associative operator $\oplus$ and an array of $n$ elements

$\left[x_0, x_1, ..., x_{n-1}\right]$,

and returns

$x_0 \oplus x_1 \oplus ... \oplus x_{n-1}$.

In other words, when $\oplus$ is addition, a reduction of the array $X$ is the sum of all elements of $X$.

```
Input:     4 6 2 3 7 1 0 5
Reduction: 28
```
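
For illustration, here is the same reduction as a minimal CPU-side sketch in C++ (the GPU versions below parallelize this, but compute the same thing):

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

int main()
{
    const std::vector<uint32_t> input = {4, 6, 2, 3, 7, 1, 0, 5};
    // fold the array with the binary associative operator, here addition with identity 0
    const uint32_t reduction = std::reduce(input.begin(), input.end(), 0u);
    return reduction == 28u ? 0 : 1; // exits with 0: the reduction is 28
}
```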

A scan is a generalization of a reduction: it takes a binary associative operator $\oplus$ with identity $I$ and an array of $n$ elements, then, for each element, performs the reduction from the first element up to the current element.
An _exclusive_ scan does so over all elements before the current element:

$\left[I, x_0, (x_0 \oplus x_1), ..., (x_0 \oplus x_1 \oplus ... \oplus x_{n-2})\right]$.

An _inclusive_ scan includes the current element as well:

$\left[x_0, (x_0 \oplus x_1), ..., (x_0 \oplus x_1 \oplus ... \oplus x_{n-1})\right]$.

Notice that the last element of the inclusive scan is the same as the reduction.

```
Input:     4  6  2  3  7  1  0  5
Exclusive: 0  4 10 12 15 22 23 23
Inclusive: 4 10 12 15 22 23 23 28
```
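
The standard library's scan algorithms reproduce the table above; again, a CPU-side sketch purely for illustration:

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

int main()
{
    const std::vector<uint32_t> input = {4, 6, 2, 3, 7, 1, 0, 5};
    std::vector<uint32_t> exclusive(input.size()), inclusive(input.size());

    // exclusive scan: element i is the sum of input[0..i-1], seeded with the identity 0
    std::exclusive_scan(input.begin(), input.end(), exclusive.begin(), 0u);
    // inclusive scan: element i is the sum of input[0..i]
    std::inclusive_scan(input.begin(), input.end(), inclusive.begin());

    // exclusive == {0, 4, 10, 12, 15, 22, 23, 23}
    // inclusive == {4, 10, 12, 15, 22, 23, 23, 28}
    return 0;
}
```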

## Nabla's subgroup scans

We start with the most basic of building blocks: doing a reduction or a scan within the local subgroup on a Vulkan device.
This is pretty simple actually, since Vulkan already supports subgroup arithmetic operations via SPIR-V, and it's all available in Nabla.

```cpp
nbl::hlsl::glsl::subgroupAdd(T value)
nbl::hlsl::glsl::subgroupInclusiveAdd(T value)
nbl::hlsl::glsl::subgroupExclusiveAdd(T value)
etc...
```

But wait, the SPIR-V-provided operations all require your Vulkan physical device to support the `GroupNonUniformArithmetic` capability.
So Nabla provides emulated versions too, and both paths are compiled into a single templated struct.

```cpp
template<class Params, class BinOp, uint32_t ItemsPerInvocation, bool native>
struct inclusive_scan;

template<class Params, class BinOp, uint32_t ItemsPerInvocation, bool native>
struct exclusive_scan;

template<class Params, class BinOp, uint32_t ItemsPerInvocation, bool native>
struct reduction;
```
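
One plausible way a `native` flag like this can select between the two paths is partial template specialization, so callers use one struct regardless of device capabilities. This is just a C++ sketch with hypothetical stand-in functions, not Nabla's actual internals:

```cpp
#include <cstdint>

// Stand-ins for the two code paths; in the real library these would be the
// SPIR-V intrinsic and the shuffle-based emulation (names here are made up).
template<typename T> T native_subgroup_add(T v)   { return v; } // would map to OpGroupNonUniformIAdd
template<typename T> T emulated_subgroup_add(T v) { return v; } // would run the shuffle loop below

// Primary template: emulated path.
template<class BinOp, uint32_t ItemsPerInvocation, bool native>
struct reduction
{
    template<typename T>
    static T __call(T value) { return emulated_subgroup_add(value); }
};

// Specialization for native == true: forward straight to the intrinsic.
template<class BinOp, uint32_t ItemsPerInvocation>
struct reduction<BinOp, ItemsPerInvocation, true>
{
    template<typename T>
    static T __call(T value) { return native_subgroup_add(value); }
};
```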

The implementation of the emulated subgroup scans makes use of subgroup shuffle operations to access partial sums from other invocations in the subgroup.

```cpp
T inclusive_scan(T value)
{
    // first pass: pull the partial sum from the invocation one lane below
    rhs = shuffleUp(value, 1)
    value = value + (firstInvocation ? identity : rhs)

    // stride doubles every pass, for SubgroupSizeLog2 passes in total
    for (i = 1; i < SubgroupSizeLog2; i++)
    {
        nextLevelStep = 1 << i
        rhs = shuffleUp(value, nextLevelStep)
        // lanes whose source lane would be negative contribute the identity instead
        value = value + (nextLevelStep out of bounds ? identity : rhs)
    }
    return value
}
```
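
To see why this works, here is a small CPU simulation of the same shuffle-up scan over a hypothetical subgroup of 8 lanes (assuming addition as the operator with identity 0):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    // one "lane" per element, mimicking a subgroup of size 8 (log2 = 3)
    std::vector<uint32_t> lanes = {4, 6, 2, 3, 7, 1, 0, 5};
    const uint32_t log2Size = 3;

    for (uint32_t pass = 0; pass < log2Size; pass++)
    {
        const uint32_t step = 1u << pass;
        // the copy stands in for the shuffle: every lane reads `step` lanes below itself
        std::vector<uint32_t> shuffled = lanes;
        for (uint32_t lane = step; lane < lanes.size(); lane++)
            lanes[lane] = shuffled[lane] + shuffled[lane - step];
        // lanes below `step` keep their value, i.e. they added the identity
    }

    for (uint32_t v : lanes)
        printf("%u ", v); // prints: 4 10 12 15 22 23 23 28, the inclusive scan from earlier
    printf("\n");
    return 0;
}
```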

In addition, Nabla also supports passing vectors into these subgroup operations, so you can perform a reduce or scan on up to SubgroupSize * 4 (for `vec4`) elements per call.
Note that it expects the elements in the vectors to be consecutive and in the same order as the input array, as illustrated below.
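
For example, with a hypothetical subgroup of size 4 operating on `vec2` values, the expected element-to-invocation mapping is:

```
Elements:   x0 x1 | x2 x3 | x4 x5 | x6 x7
Invocation:   0   |   1   |   2   |   3     (each invocation holds one consecutive vec2)
```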

You can find all the implementations in the [Nabla repository](https://github.com/Devsh-Graphics-Programming/Nabla/blob/master/include/nbl/builtin/hlsl/subgroup2/arithmetic_portability_impl.hlsl).

## An issue with subgroup sync and reconvergence

Now, onto a pretty significant, but strangely obscure, problem that I ran into while unit testing this prior to release.
Nabla also has implementations of workgroup reduce and scan that make use of the subgroup scans above, and one such section looks like this:

```cpp
... workgroup scan code ...

for (idx = 0; idx < VirtualWorkgroupSize / WorkgroupSize; idx++)
{
    value = getValueFromDataAccessor(memoryIdx)

    value = subgroup::inclusive_scan(value)

    setValueToDataAccessor(value, memoryIdx)

    // the last invocation of each subgroup publishes the subgroup's total
    if (lastSubgroupInvocation)
    {
        setValueToSharedMemory(smemIdx)
    }
}
control_barrier()

... workgroup scan code ...
```
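
For context on why the last invocation writes to shared memory: a workgroup scan is typically built as a two-level scan, where each subgroup's total is itself scanned and added back as a per-subgroup offset. A CPU sketch of that standard technique (assuming addition; this is not Nabla's actual workgroup code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

std::vector<uint32_t> workgroup_inclusive_scan(std::vector<uint32_t> data, size_t subgroupSize)
{
    // level 1: inclusive scan within each "subgroup", collecting each subgroup's total
    std::vector<uint32_t> subgroupTotals;
    for (size_t base = 0; base < data.size(); base += subgroupSize)
    {
        const size_t end = std::min(base + subgroupSize, data.size());
        for (size_t i = base + 1; i < end; i++)
            data[i] += data[i - 1];
        subgroupTotals.push_back(data[end - 1]); // the "last invocation" publishes this
    }

    // level 2: exclusive scan of the totals gives each subgroup its offset
    uint32_t offset = 0;
    for (size_t sg = 0; sg < subgroupTotals.size(); sg++)
    {
        const size_t base = sg * subgroupSize;
        const size_t end = std::min(base + subgroupSize, data.size());
        for (size_t i = base; i < end; i++)
            data[i] += offset;
        offset += subgroupTotals[sg];
    }
    return data;
}
```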

At first glance it looks fine, and it does produce the expected results for the most part... except in some very specific cases.
From more testing and debugging to identify the cause, I found the conditions to be:

* using an Nvidia GPU
* using the emulated versions of the subgroup operations
* a decent number of iterations in the loop (in this case, at least 8)

To be sure, I tested this on an Intel GPU, and the workgroup scan ran correctly there.
That was very baffling initially, and the results produced on the Nvidia device looked like a sync problem.

It was even more convincing when I moved the control barrier inside the loop and it immediately produced correct scan results.

```cpp
... workgroup scan code ...

for (idx = 0; idx < VirtualWorkgroupSize / WorkgroupSize; idx++)
{
    value = getValueFromDataAccessor(memoryIdx)

    value = subgroup::inclusive_scan(value)

    setValueToDataAccessor(value, memoryIdx)

    if (lastSubgroupInvocation)
    {
        setValueToSharedMemory(smemIdx)
    }
    // moved inside the loop: every invocation now finishes each iteration together
    control_barrier()
}

... workgroup scan code ...
```

Ultimately, we came to the conclusion that the subgroup invocations were somehow falling out of sync as the loop went on.
In particular, the last invocation, which spends some extra time writing to shared memory, may have been lagging behind the rest.
The fix to the emulated subgroup reduce and scan turned out to be simple: a memory barrier at the start was enough.

```cpp
T inclusive_scan(T value)
{
    // re-sync the subgroup before any shuffle reads a neighbor's value
    memory_barrier()

    rhs = shuffleUp(value, 1)
    value = value + (firstInvocation ? identity : rhs)

    for (i = 1; i < SubgroupSizeLog2; i++)
    {
        nextLevelStep = 1 << i
        rhs = shuffleUp(value, nextLevelStep)
        value = value + (nextLevelStep out of bounds ? identity : rhs)
    }
    return value
}
```

As a side note, surprisingly, using the `SPV_KHR_maximal_reconvergence` extension doesn't resolve this issue either.

However, this was only a problem on Nvidia devices.
And as the title of this article states, it's unclear whether this is a bug in Nvidia's SPIR-V compiler or whether subgroup shuffle operations simply do not imply reconvergence in the spec.

-------------------

P.S. You may note in the source code that the memory barrier uses the workgroup memory mask, despite us only needing synchronization at subgroup scope.

```cpp
spirv::memoryBarrier(spv::ScopeSubgroup, spv::MemorySemanticsWorkgroupMemoryMask | spv::MemorySemanticsAcquireMask);
```

This is because, unfortunately, subgroup memory doesn't seem to count as a storage class, at least according to the Vulkan SPIR-V validator; only the next step up in the memory hierarchy, workgroup memory, is valid.
I feel like there's possibly something missing here.