Commit 786dcbc

last section of blog post
1 parent b29b0c8 commit 786dcbc

File tree: 1 file changed, 60 additions, 1 deletion
  • blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia


blog/2025/2025-06-19-subgroup-shuffle-reconvergence-on-nvidia/index.md

Lines changed: 60 additions & 1 deletion
@@ -241,7 +241,66 @@ This also provides a new possibility on Nvidia devices, where you can now synchr

In CUDA, this is exposed through `__syncwarp()`, and we can do something similar in Vulkan using subgroup control barriers.
It's entirely possible that, with the branching introduced in the loop, the subgroup shuffle operations do not run in lockstep, which is why this is our solution to the problem for now.

In the end, it's unclear whether this is a bug in Nvidia's SPIR-V compiler or whether subgroup shuffle operations actually do not imply reconvergence in the Vulkan specification.
Unfortunately, even after hours of scouring the SPIR-V specification, I couldn't find any explicit mention that would confirm this.
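
As a sketch of that workaround (GLSL, assuming `GL_KHR_shader_subgroup_basic` and `GL_KHR_shader_subgroup_shuffle`; the butterfly loop, `value`, and the XOR partner are illustrative stand-ins, not the original shader):

```cpp
// Hypothetical butterfly reduction; `value` starts as this invocation's input.
for (uint stride = gl_SubgroupSize / 2u; stride > 0u; stride /= 2u) {
    // Subgroup control barrier: an execution and memory dependency, so no
    // invocation's shuffle can run before its partner has written `value`.
    subgroupBarrier();
    value += subgroupShuffle(value, gl_SubgroupInvocationID ^ stride);
}
```

The barrier before each shuffle is what forces the tangle back into step even if the surrounding branching has let invocations drift apart.
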
## What does this implication mean for subgroup operations?

Consider what it means if subgroup convergence doesn't guarantee that the active invocations of a tangle execute a subgroup operation in lockstep.

Subgroup ballot and ballot arithmetic are two cases where you don't have to consider lockstepness, because the return value of a ballot is expected to be uniform in a tangle and it is known exactly what it should be.
Similarly, for subgroup broadcasts, the value being broadcast first needs to be computed, say by invocation K.
Even if the other invocations don't run in lockstep, they can't read the value until invocation K broadcasts it, assuming they want to read the same value (uniformity), and you know what value should be read (the broadcasting invocation can check that it got the same value back).

On the flip side, reductions will always produce a uniform return value for all invocations, even if you reduce a stale or out-of-lockstep input value.
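
As an aside in GLSL (with `v` a hypothetical per-invocation variable, and assuming `GL_KHR_shader_subgroup_arithmetic`), this is why a reduction's return value can be relied on even when its input can't:

```cpp
// If some invocation is out of lockstep, its `v` may be stale here, and the
// sum is then computed from that stale value...
uint total = subgroupAdd(v);
// ...but `total` is still tangle-uniform: every invocation gets the same sum,
// whether or not that sum is the one we intended.
```
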

Meanwhile, subgroup operations that don't return tangle-uniform values, such as shuffles and scans, produce the expected result only if performed on constants or on variables written with an execution and memory dependency.
These operations can give different results per invocation, so there's no implied uniformity, which means there's no reason to expect any constraints on their apparent lockstepness to be implied transitively through the properties of the return value.
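
To illustrate (GLSL sketch; `f`, `v`, and `partnerLane` are hypothetical stand-ins), the same shuffle is fine on a constant but needs an explicit dependency when its input was just written:

```cpp
// OK: shuffling a constant; there is no prior write to wait for.
uint a = subgroupShuffle(42u, partnerLane);

// NOT OK: another invocation's shuffle may snoop `v` before (or after)
// this write, with no execution or memory dependency between them.
v = f(v);
uint b = subgroupShuffle(v, partnerLane);

// OK: the barrier supplies the execution and memory dependency first.
v = f(v);
subgroupBarrier();
uint c = subgroupShuffle(v, partnerLane);
```
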

The important consideration, then, is how a subgroup operation is implemented.
When a subgroup operation isn't explicitly required to be executed by all invocations at the same time, we can imagine a scenario where a shuffle is as simple as the receiving invocation snooping another's register, without requiring any action on the latter's part.
And that comes with obvious IPC dangers: snooping the register before it gets written, or after it gets overwritten, will surely produce inconsistent results if there are no other execution dependencies.

This leads to code listings like the following becoming undefined behavior simply by changing the `Broadcast` into a `Shuffle`.

```cpp
// Broadcasting after computation
// OK, only counts active invocations in the tangle (doesn't change)
int count = int(subgroupBallotBitCount(subgroupBallot(true)));
// OK, done on a constant
int index = subgroupExclusiveAdd(1);
int base, base_slot;
if (subgroupElect())
    base_slot = atomicAdd(dst.size, count);
// NOT OK: `base_slot` is not guaranteed available or visible, and other
// invocations may even have raced ahead of the elected one.
// Without a memory dependency, not every invocation is ensured to see the
// elected invocation's value of `base_slot`.
base = subgroupBroadcastFirst(base_slot);
```
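
A hedged sketch of how the listing above could be made well-defined even for a `Shuffle`, by inserting the subgroup control barrier discussed earlier (assuming `subgroupBarrier()` provides the required execution and memory dependency):

```cpp
int count = int(subgroupBallotBitCount(subgroupBallot(true)));
int index = subgroupExclusiveAdd(1);
int base, base_slot;
if (subgroupElect())
    base_slot = atomicAdd(dst.size, count);
// Execution and memory dependency: every invocation waits here until the
// elected one has performed the atomic and written `base_slot`.
subgroupBarrier();
// Defined now, whether implemented as a broadcast or as a shuffle of lane 0.
base = subgroupBroadcastFirst(base_slot);
```
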

Similarly again, with [this example from the Khronos blog on maximal reconvergence](https://www.khronos.org/blog/khronos-releases-maximal-reconvergence-and-quad-control-extensions-for-vulkan-and-spir-v):

```cpp
// OK, thanks to subgroup uniform control flow there is no wiggle room here
// (we need to know all invocations' values)
if (subgroupAny(needs_space)) {
    // OK, narrowly, because `subgroupBallot` returns a ballot that's uniform in a tangle
    uvec4 mask = subgroupBallot(needs_space);
    // OK, because `mask` is tangle-uniform
    uint size = subgroupBallotBitCount(mask);
    uint base = 0;
    if (subgroupElect())
        base = atomicAdd(b.free, size);

    // NOT OK if `Broadcast` were replaced with `Shuffle`: non-elected invocations
    // could race ahead, or not see (visibility) the `base` value in the elected
    // invocation before that one would execute a shuffle
    base = subgroupBroadcastFirst(base);
    // OK, but only because `mask` is tangle-uniform
    uint offset = subgroupBallotExclusiveBitCount(mask);

    if (needs_space)
        b.data[base + offset] = ...;
}
```

With all that said, it needs to be noted that one can't expect every instruction to run in lockstep, as that would negate the advantages of Nvidia's IPC.

----------------------------

_This issue was observed happening inconsistently on Nvidia driver version 576.80, released 17th June 2025._
