
Commit 516505e: possible final set of additions/changes to post
1 parent 786dcbc

4 files changed (+50, -20 lines)

blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/index.md

Lines changed: 48 additions & 18 deletions
@@ -1,6 +1,6 @@
 ---
-title: 'Nvidia SPIR-V Compiler Bug or Do Subgroup Shuffle Operations Not Imply Reconvergence?'
-slug: 'subgroup-shuffle-reconvergence-on-nvidia'
+title: 'Nvidia SPIR-V Compiler Bug or Do Subgroup Shuffle Operations Not Imply Execution Dependency?'
+slug: 'subgroup-shuffle-execution-dependency-on-nvidia'
 description: "A look at the behavior behind Nabla's subgroup scan"
 date: '2025-06-19'
 authors: ['keptsecret', 'devshgraphicsprogramming']
@@ -13,7 +13,8 @@ last_update:
 Reduce and scan operations are core building blocks in the world of parallel computing, and now [Nabla has a new release](https://github.com/Devsh-Graphics-Programming/Nabla/tree/v0.6.2-alpha1) with those operations made even faster for Vulkan at the subgroup and workgroup levels.
 
 This article takes a brief look at the Nabla implementation for reduce and scan on the GPU in Vulkan.
-Then, I discuss a missing reconvergence behavior that was expected after subgroup shuffle operations that was only observed on Nvidia devices.
+
+Then, I discuss a missing execution dependency expected for a subgroup shuffle operation, which was only a problem on Nvidia devices in some test cases.
 
 <!-- truncate -->
 

@@ -91,6 +92,7 @@ T inclusive_scan(T value)
 rhs = shuffleUp(value, 1)
 value = value + (firstInvocation ? identity : rhs)
 
+[unroll]
 for (i = 1; i < SubgroupSizeLog2; i++)
 {
 nextLevelStep = 1 << i
@@ -110,6 +112,7 @@ You can find all the implementations on the [Nabla repository](https://github.co
 ## An issue with subgroup sync and reconvergence
 
 Now, onto a pretty significant, but strangely obscure, problem that I ran into during unit testing this prior to release.
+[See the unit tests.](https://github.com/Devsh-Graphics-Programming/Nabla-Examples-and-Tests/blob/master/23_Arithmetic2UnitTest/app_resources/testSubgroup.comp.hlsl)
 Nabla also has implementations for workgroup reduce and scans that make use of the subgroup scans above, and one such section looks like this.
 
 ```cpp
@@ -134,17 +137,17 @@ workgroup_execution_and_memory_barrier()
 ... workgroup scan code ...
 ```
 
-_I should note that `memoryIdx` is unique and per-invocation, and also that shared memory is only written to in this step to be accessed in later steps._
+_I should note that this is the first level of scans for the workgroup scope. It is only one step of the algorithm and the data accesses are completely independent. Thus, `memoryIdx` is unique per invocation, and shared memory is only written to in this step to be accessed in later steps._
 
 At first glance, it looks fine, and it does produce the expected results for the most part... except in some very specific cases.
-And from some more testing and debugging to try and identify the cause, I've found the conditions to be:
+After some more testing and debugging to try and identify the cause, I've found the conditions to be:
 
 * using an Nvidia GPU
 * using emulated versions of subgroup operations
 * a decent number of iterations in the loop (in this case at least 8).
 
 I tested this on an Intel GPU, to be sure, and the workgroup scan ran correctly.
-That was very baffling initially. And the results produced on an Nvidia device looked like a sync problem.
+This was very baffling initially, and the results produced on an Nvidia device looked like a sync problem.
 
 It was even more convincing when I moved the control barrier inside the loop and it immediately produced correct scan results.
 
@@ -172,7 +175,8 @@ for (idx = 0; idx < VirtualWorkgroupSize / WorkgroupSize; idx++)
 
 Ultimately, we came to the conclusion that each subgroup invocation was probably somehow not in sync as each loop went on.
 Particularly, the effect we're seeing is a shuffle done as if `value` is not in lockstep at the call site.
-We tested using a subgroup execution barrier and maximal reconvergence, and found out a memory barrier is enough.
+We tested using a subgroup execution barrier and maximal reconvergence.
+Strangely enough, just a memory barrier also fixed it, which it shouldn't have, as subgroup shuffles are magical intrinsics that take arguments by copy and don't really deal with accessing any memory locations (SSA form).
 
 ```cpp
 T inclusive_scan(T value)
@@ -182,9 +186,11 @@ T inclusive_scan(T value)
 rhs = shuffleUp(value, 1)
 value = value + (firstInvocation ? identity : rhs)
 
+[unroll]
 for (i = 1; i < SubgroupSizeLog2; i++)
 {
 nextLevelStep = 1 << i
+memory_barrier()
 rhs = shuffleUp(value, nextLevelStep)
 value = value + (nextLevelStep out of bounds ? identity : rhs)
 }
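
For comparison, the same Hillis-Steele inclusive scan can be sketched with CUDA's warp intrinsics, where the member mask passed to `__shfl_up_sync` explicitly names the lanes that must arrive at the shuffle before data is exchanged, so the execution dependency we had to spell out above is part of the intrinsic's contract. This is only an illustrative sketch assuming a full, non-divergent warp; the helper and kernel names are made up and this is not Nabla code.

```cuda
#include <cstdio>

// Illustrative CUDA analogue of the pseudocode above (not Nabla's HLSL).
// The mask names the participating lanes, and the *_sync shuffle waits for
// all of them before exchanging data.
__device__ int warp_inclusive_scan(int value)
{
    const unsigned fullMask = 0xffffffffu; // assumes all 32 lanes are active
    const int lane = threadIdx.x & 31;
    for (int delta = 1; delta < 32; delta <<= 1)
    {
        const int rhs = __shfl_up_sync(fullMask, value, delta);
        if (lane >= delta) // out-of-range lanes effectively add the identity (0)
            value += rhs;
    }
    return value;
}

__global__ void scan_kernel(const int* in, int* out)
{
    out[threadIdx.x] = warp_inclusive_scan(in[threadIdx.x]);
}

int main()
{
    int h_in[32], h_out[32];
    for (int i = 0; i < 32; ++i) h_in[i] = 1; // expect h_out[i] == i + 1
    int *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    scan_kernel<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("h_out[31] = %d\n", h_out[31]); // prints 32
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Whether a SPIR-V subgroup shuffle carries an equivalent guarantee is exactly the question here.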
@@ -196,24 +202,29 @@ However, this problem was only observed on Nvidia devices.
 
 As a side note, using the `SPV_KHR_maximal_reconvergence` extension surprisingly doesn't resolve this issue.
 I feel I should point out that many presentations and code listings seem to give the impression that subgroup shuffle operations execute in lockstep, based on the very simple examples provided.
+
 For instance, [the example in this presentation](https://vulkan.org/user/pages/09.events/vulkanised-2025/T08-Hugo-Devillers-SaarlandUniversity.pdf) correctly demonstrates invocations in a tangle reading from and storing to an SSBO, but may mislead readers into not considering the Availability and Visibility for other scenarios that need it.
-Specifically, it does not have an intended read-after write if invocations in a tangle execute in lockstep.
+
+Such simple examples are good enough to demonstrate the purpose of the extension, but fail to elaborate on specific details.
+If it did have a read-after-write between subgroup invocations, subgroup scope memory dependencies would have been needed.
+
 (With that said, since subgroup operations are SSA and take arguments "by copy", this discussion of Memory Dependencies and availability-visibility is not relevant to our problem, but just something to be aware of.)
 
 ### A minor detour onto the performance of native vs. emulated on Nvidia devices
 
+Since all recent Nvidia GPUs support the subgroup arithmetic SPIR-V capability, why were we using emulation with shuffles?
 I think this observation warrants a small discussion section of its own.
 The tables below show some numbers from our benchmark, measured through Nvidia's Nsight Graphics profiler, of a subgroup inclusive scan using native SPIR-V instructions and our emulated version.
 
-_Native_
+#### Native
 
 | Workgroup size | SM throughput (%) | CS warp occupancy (%) | # registers | Dispatch time (ms) |
 | :------------: | :---------------: | :-------------------: | :---------: | :----------------: |
 | 256 | 41.6 | 90.5 | 16 | 27 |
 | 512 | 41.4 | 89.7 | 16 | 27.15 |
 | 1024 | 40.5 | 59.7 | 16 | 27.74 |
 
-_Emulated_
+#### Emulated
 
 | Workgroup size | SM throughput (%) | CS warp occupancy (%) | # registers | Dispatch time (ms) |
 | :------------: | :---------------: | :-------------------: | :---------: | :----------------: |
@@ -222,11 +233,11 @@ _Emulated_
 | 1024 | 37.1 | 60.5 | 16 | 12.47 |
 
 These numbers are baffling to say the least, particularly the fact that our emulated subgroup scans are twice as fast as the native solution.
-It should be noted that this is with the subgroup barrier in place, not that we saw any marked decrease in performance compared to earlier versions without it.
+It should be noted that this is with the subgroup barrier before every shuffle; we did not see any marked decrease in performance.
 
 A potential explanation for this may be that Nvidia has to consider any inactive invocations in a subgroup, having them behave as if they contribute the identity $I$ element to the scan.
 Our emulated scan instead requires people to call the arithmetic in a subgroup-uniform fashion.
-If that is not the case, this seems like a cause for concern for Nvidia's SPIR-V compiler.
+If that is not the case, this seems like a cause for concern for Nvidia's SPIR-V to SASS compiler.
 
 ### What could cause this behavior on Nvidia? The Independent Program Counter
 
@@ -236,25 +247,39 @@ Prior to Volta, all threads in a subgroup share the same program counter, which
 This means all threads in the same subgroup execute the same instruction at any given time.
 Therefore, when you have a branch in the program flow across threads in the same subgroup, all execution paths generally have to be executed, masking off threads that should not be active for that path.
 
+<figure class="image">
+![Pascal and prior SIMT model](pascal_simt_model.png "Pascal and prior SIMT model")
+<figcaption>Thread scheduling under the SIMT warp execution model of Pascal and earlier NVIDIA GPUs. Taken from [NVIDIA TESLA V100 GPU ARCHITECTURE](https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf)</figcaption>
+</figure>
+
 From Volta onwards, each thread has its own program counter that allows it to execute independently of other threads in the same subgroup.
 This also provides a new possibility on Nvidia devices, where you can now synchronize threads in the same subgroup.
+The active invocations still have to execute the same instruction, but it can be at different locations in the program (e.g. different iterations of a loop).
+
+<figure class="image">
+![Volta Independent Thread Scheduling model](volta_scheduling_model.png "Volta Independent Thread Scheduling model")
+<figcaption>Independent thread scheduling in Volta architecture onwards, interleaving execution from divergent branches and using an explicit sync to reconverge threads. Taken from [NVIDIA TESLA V100 GPU ARCHITECTURE](https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf)</figcaption>
+</figure>
+
 In CUDA, this is exposed through `__syncwarp()`, and we can do something similar in Vulkan using subgroup control barriers.
-It's entirely possible that each subgroup shuffle operation does not run in lockstep, with the branching introduced in the loop, which would be why that is our solution to the problem for now.
+It's entirely possible that the subgroup shuffle operations do not run in lockstep once branching is introduced, which is why the barrier is our solution to the problem for now.
+
+Unfortunately, I couldn't find any explicit mention in the SPIR-V specification that confirms whether subgroup shuffle operations actually imply an execution dependency, even after hours of scouring the spec.
 
-In the end, it's unclear whether this is a bug in Nvidia's SPIR-V compiler or subgroup shuffle operations actually do not imply reconvergence in the Vulkan specification.
-Unfortunately, I couldn't find anything explicit mention in the SPIR-V specification that confirmed this, even with hours of scouring the spec.
+So then we either have...
 
-## What does this implication mean for subgroup operations?
+## This is a gray area of the Subgroup Shuffle Spec and allowed Undefined Behaviour
 
 Consider what it means if subgroup convergence doesn't guarantee that active tangle invocations execute a subgroup operation in lockstep.
 
-Subgroup ballot and ballot arithmetic are two where you don't have to consider lockstepness, because it is expected that the return value of ballot to be uniform in a tangle and it is known exactly what it should be.
+Subgroup ballot and ballot arithmetic are two cases where you don't have to consider lockstepness, because the return value of ballot is expected to be uniform in a tangle and, as a corollary, it is known exactly what it should be.
+
 Similarly, for subgroup broadcasts, first the value being broadcast needs to be computed, say by invocation K.
 Even if other invocations don't run in lockstep, they can't read the value until invocation K broadcasts it if they want to read the same value (uniformity), and you know what value should be read (the broadcasting invocation can check it got the same value back).
 
 On the flip side, reductions will always produce a uniform return value for all invocations, even if you reduce a stale or out-of-lockstep input value.
 
-Meanwhile, subgroup operations that don't return tangle-uniform values, such as shuffles and scans, would only produce the expected result only if performed on constants or variables written with an execution and memory dependency.
+Meanwhile, subgroup operations that don't return tangle-uniform values, such as shuffles and scans, would produce the expected result only if performed on constants or variables written with an execution dependency.
 These operations can give different results per invocation, so there's no implied uniformity, which means there's no reason to expect any constraints on their apparent lockstepness being implied transitively through the properties of the return value.
 
 The important consideration then is how a subgroup operation is implemented.
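
To make the independent thread scheduling point concrete, here is a small CUDA sketch in the spirit of the Volta whitepaper figure above: after a divergent branch nothing brings the two paths back together on its own, and the explicit `__syncwarp()` is what reconverges the warp (and orders the shared-memory traffic) before code that assumes all 32 lanes have arrived. The kernel and helper names are made up for illustration; this is not Nabla code.

```cuda
#include <cstdio>

// Two different branch bodies so the compiler has a real divergent branch.
__device__ int path_a(int lane) { return lane * 2; }
__device__ int path_b(int lane) { return lane + 100; }

// Under independent thread scheduling (Volta and newer) the two sides of the
// branch may interleave and nothing reconverges them implicitly; __syncwarp()
// is the explicit point at which all named lanes are known to have arrived,
// making the read-after-write through shared memory below well defined.
__global__ void divergent_then_reconverge(int* out)
{
    __shared__ int exchange[32];
    const int lane = threadIdx.x & 31;

    int value;
    if (lane & 1)
        value = path_a(lane);
    else
        value = path_b(lane);

    exchange[lane] = value;
    __syncwarp();                         // reconverge the warp and order the writes
    const int neighbour = exchange[(lane + 1) & 31];

    out[lane] = value + neighbour;
}

int main()
{
    int* d_out;
    int h_out[32];
    cudaMalloc(&d_out, sizeof(h_out));
    divergent_then_reconverge<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("h_out[0] = %d\n", h_out[0]);  // 100 (path_b for lane 0) + 2 (path_a for lane 1) = 102
    cudaFree(d_out);
    return 0;
}
```

The Vulkan analogue of `__syncwarp()` here would be a subgroup execution (control) barrier, which is exactly the workaround described earlier.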
@@ -302,5 +327,10 @@ if (subgroupAny(needs_space)) {
 
 With all that said, it needs to be noted that one can't expect every instruction to run in lockstep, as that would negate the advantages of Nvidia's IPC.
 
+## Or a bug in Nvidia's SPIR-V to SASS compiler
+
+Crucially, it's impossible to know (or to discuss, in the case of a signed NDA) what's happening with the bug or performance regression on Nvidia's side.
+Unlike AMD's RDNA ISAs, where we can verify that the compiler is doing what it should be doing using the Radeon GPU Analyzer, the generated SASS is inaccessible and neither is the compiler public.
+
 ----------------------------
 _This issue was observed happening inconsistently on Nvidia driver version 576.80, released 17th June 2025._
Two binary image files changed (66.4 KB and 74.5 KB); contents not shown.

blog/authors.yml

Lines changed: 2 additions & 2 deletions
@@ -45,8 +45,8 @@ keptsecret:
 
 devshgraphicsprogramming:
   name: Mateusz Kielan
-  title: CTO of DevSH GP
-  url: https://github.com/devshgraphicsprogramming
+  title: CTO of DevSH Graphics Programming Sp. z O.O.
+  url: https://www.devsh.eu/
   image_url: https://avatars.githubusercontent.com/u/6894321?v=4
   page: true
   socials:
