---
title: 'Nvidia SPIR-V Compiler Bug or Do Subgroup Shuffle Operations Not Imply Reconvergence?'
slug: 'subgroup-shuffle-reconvergence-on-nvidia'
description: "A look at the behavior behind Nabla's subgroup scan"
date: '2025-06-19'
authors: ['keptsecret']
tags: ['nabla', 'vulkan', 'article']
last_update:
  date: '2025-06-19'
  author: keptsecret
---

Reduce and scan operations are core building blocks in the world of parallel computing, and the latest Nabla release makes them even faster for Vulkan at the subgroup and workgroup levels.

This article takes a brief look at Nabla's implementation of reduce and scan on the GPU in Vulkan, followed by a discussion of the reconvergence behavior expected after subgroup operations.

<!-- truncate -->

## Reduce and Scan

Let's start with a quick introduction to reduce and scan operations, or a recap for those already familiar with them.

A reduction takes a binary associative operator $\bigoplus$ and an array of $n$ elements

$\left[x_0, x_1, ..., x_{n-1}\right]$,

and returns

$x_0 \bigoplus x_1 \bigoplus ... \bigoplus x_{n-1}$.

In other words, when $\bigoplus$ is addition, a reduction of the array $X$ is simply the sum of all elements of $X$.

```
Input:     4 6 2 3 7 1 0 5
Reduction: 28
```

A scan is a generalization of reduction: it takes a binary associative operator $\bigoplus$ with identity $I$ and an array of $n$ elements, and for each element it performs the reduction from the first element up to the current one.
An _exclusive_ scan does so over all elements before the current element:

$\left[I, x_0, (x_0 \bigoplus x_1), ..., (x_0 \bigoplus x_1 \bigoplus ... \bigoplus x_{n-2})\right]$.

An _inclusive_ scan includes the current element as well:

$\left[x_0, (x_0 \bigoplus x_1), ..., (x_0 \bigoplus x_1 \bigoplus ... \bigoplus x_{n-1})\right]$.

Notice that the last element of the inclusive scan is the same as the reduction.

```
Input:     4 6 2 3 7 1 0 5
Exclusive: 0 4 10 12 15 22 23 23
Inclusive: 4 10 12 15 22 23 23 28
```

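For reference, here is a minimal sequential sketch of both scan flavors (plain CPU-side C++ for illustration only, not the Nabla API):

```cpp
#include <cstdint>
#include <vector>

// Inclusive scan: out[i] = x_0 + ... + x_i
std::vector<uint32_t> inclusive_scan_ref(const std::vector<uint32_t>& in)
{
    std::vector<uint32_t> out(in.size());
    uint32_t sum = 0u; // identity of addition
    for (size_t i = 0; i < in.size(); i++)
    {
        sum += in[i];   // include the current element first
        out[i] = sum;
    }
    return out;
}

// Exclusive scan: out[i] = I + x_0 + ... + x_{i-1}
std::vector<uint32_t> exclusive_scan_ref(const std::vector<uint32_t>& in)
{
    std::vector<uint32_t> out(in.size());
    uint32_t sum = 0u; // identity of addition
    for (size_t i = 0; i < in.size(); i++)
    {
        out[i] = sum;   // write before including the current element
        sum += in[i];
    }
    return out;
}
```
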
## Nabla's subgroup scans

We start with the most basic of building blocks: doing a reduction or a scan within the local subgroup of a Vulkan device.
This is pretty simple, actually, since Vulkan already supports subgroup arithmetic operations via SPIR-V, and they are all available in Nabla.

```cpp
nbl::hlsl::glsl::subgroupAdd(T value)
nbl::hlsl::glsl::subgroupInclusiveAdd(T value)
nbl::hlsl::glsl::subgroupExclusiveAdd(T value)
etc...
```

But wait: the SPIR-V-provided operations all require your Vulkan physical device to support the `GroupNonUniformArithmetic` capability.
So Nabla provides emulated versions too, and both paths are exposed through a single templated struct.

```cpp
template<class Params, class BinOp, uint32_t ItemsPerInvocation, bool native>
struct inclusive_scan;

template<class Params, class BinOp, uint32_t ItemsPerInvocation, bool native>
struct exclusive_scan;

template<class Params, class BinOp, uint32_t ItemsPerInvocation, bool native>
struct reduction;
```

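The `native` template parameter is what selects, at compile time, between the SPIR-V builtins and the emulated path. As a self-contained C++ analogy of that dispatch pattern (the types here are illustrative stand-ins, not Nabla's actual `Params` or implementation):

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Stand-in for the BinOp template parameter above.
struct plus_op
{
    uint32_t operator()(uint32_t a, uint32_t b) const { return a + b; }
};

// Primary template left undefined; the `native` flag picks a specialization.
template<class BinOp, bool native>
struct inclusive_scan_example;

// "native" path: defer to a library primitive (standing in for the SPIR-V builtin).
template<class BinOp>
struct inclusive_scan_example<BinOp, true>
{
    void operator()(std::vector<uint32_t>& v) const
    {
        std::inclusive_scan(v.begin(), v.end(), v.begin(), BinOp{});
    }
};

// "emulated" path: do the work by hand.
template<class BinOp>
struct inclusive_scan_example<BinOp, false>
{
    void operator()(std::vector<uint32_t>& v) const
    {
        BinOp op;
        for (size_t i = 1; i < v.size(); i++)
            v[i] = op(v[i - 1], v[i]);
    }
};
```
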
The implementation of the emulated subgroup scans makes use of subgroup shuffle operations to access partial sums from other invocations in the subgroup.

```cpp
T inclusive_scan(T value)
{
    // pull the partial sum from the invocation 1 lane below
    rhs = shuffleUp(value, 1)
    value = value + (firstInvocation ? identity : rhs)

    // each subsequent step doubles the shuffle distance
    for (i = 1; i < SubgroupSizeLog2; i++)
    {
        nextLevelStep = 1 << i
        rhs = shuffleUp(value, nextLevelStep)
        value = value + (nextLevelStep out of bounds ? identity : rhs)
    }
    return value
}
```

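To make the doubling pattern concrete, here is a small CPU simulation of the same Hillis-Steele style algorithm, where each array slot plays the role of one subgroup invocation and `shuffleUp` becomes a read from a lower index (plain C++ for illustration, not the actual HLSL):

```cpp
#include <cstdint>
#include <vector>

std::vector<uint32_t> simulate_subgroup_inclusive_scan(std::vector<uint32_t> lanes)
{
    const size_t subgroupSize = lanes.size(); // assume a power of two, e.g. 32
    for (size_t step = 1; step < subgroupSize; step <<= 1)
    {
        // the snapshot models all lanes shuffling simultaneously
        const std::vector<uint32_t> before = lanes;
        for (size_t lane = 0; lane < subgroupSize; lane++)
        {
            // an out-of-bounds shuffle contributes the identity (0 for addition)
            const uint32_t rhs = (lane >= step) ? before[lane - step] : 0u;
            lanes[lane] += rhs;
        }
    }
    return lanes; // lanes[i] now holds the inclusive prefix sum up to lane i
}
```
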
In addition, Nabla supports passing vectors into these subgroup operations, so you can perform reductions or scans on up to subgroup size × 4 (for `vec4`) elements per call.
Note that the elements of the vectors are expected to be consecutive and in the same order as the input array, i.e. invocation $i$ holds elements $4i$ through $4i+3$ in the `vec4` case.

You can find all the implementations in the [Nabla repository](https://github.com/Devsh-Graphics-Programming/Nabla/blob/master/include/nbl/builtin/hlsl/subgroup2/arithmetic_portability_impl.hlsl).

## An issue with subgroup sync and reconvergence

Now, onto a pretty significant, but strangely obscure, problem that I ran into while unit testing this prior to release.
Nabla also has implementations of workgroup reduce and scan that make use of the subgroup scans above, and one such section looks like this:

```cpp
... workgroup scan code ...

for (idx = 0; idx < VirtualWorkgroupSize / WorkgroupSize; idx++)
{
    value = getValueFromDataAccessor(memoryIdx)

    value = subgroup::inclusive_scan(value)

    setValueToDataAccessor(memoryIdx, value)

    if (lastSubgroupInvocation)
    {
        setValueToSharedMemory(smemIdx, value)
    }
}
control_barrier()

... workgroup scan code ...
```

At first glance it looks fine, and it does produce the expected results for the most part... except in some very specific cases.
After some more testing and debugging to identify the cause, I found the conditions to be:

* using an Nvidia GPU
* using the emulated versions of the subgroup operations
* a decent number of iterations in the loop (in this case, at least 8)

To be sure, I tested this on an Intel GPU, and the workgroup scan ran correctly there.
That was very baffling initially, but the results produced on the Nvidia device looked like a synchronization problem.
That suspicion became even more convincing when I moved the control barrier inside the loop and it immediately produced correct scan results.

```cpp
... workgroup scan code ...

for (idx = 0; idx < VirtualWorkgroupSize / WorkgroupSize; idx++)
{
    value = getValueFromDataAccessor(memoryIdx)

    value = subgroup::inclusive_scan(value)

    setValueToDataAccessor(memoryIdx, value)

    if (lastSubgroupInvocation)
    {
        setValueToSharedMemory(smemIdx, value)
    }
    control_barrier()
}

... workgroup scan code ...
```

Ultimately, we came to the conclusion that the subgroup invocations were probably somehow falling out of sync as the loop went on.
In particular, the last invocation, which spends some extra time writing to shared memory, may have been lagging behind the others.
The fix to the emulated subgroup reduce and scan was simple: a memory barrier was enough.

```cpp
T inclusive_scan(T value)
{
    // make sure all invocations are synchronized before any shuffles happen
    memory_barrier()

    rhs = shuffleUp(value, 1)
    value = value + (firstInvocation ? identity : rhs)

    for (i = 1; i < SubgroupSizeLog2; i++)
    {
        nextLevelStep = 1 << i
        rhs = shuffleUp(value, nextLevelStep)
        value = value + (nextLevelStep out of bounds ? identity : rhs)
    }
    return value
}
```

As a side note, using the `SPV_KHR_maximal_reconvergence` extension surprisingly doesn't resolve this issue.

However, this was only ever a problem on Nvidia devices.
And as the title of this article states, it's unclear whether this is a bug in Nvidia's SPIR-V compiler, or whether subgroup shuffle operations simply do not imply reconvergence in the spec.

-------------------

P.S. You may note in the source code that the memory barrier uses the workgroup memory mask, despite us only needing synchronization at subgroup scope.

```cpp
spirv::memoryBarrier(spv::ScopeSubgroup, spv::MemorySemanticsWorkgroupMemoryMask | spv::MemorySemanticsAcquireMask);
```

This is because, unfortunately, the subgroup memory mask doesn't seem to count as a storage class, at least according to the Vulkan SPIR-V validator.
Only the next level up, workgroup memory, is valid.
I feel like there's possibly something missing here.
