Support async mode for shm allreduce #484

Open

gaopengff wants to merge 8 commits into pytorch:main from gaopengff:gaopengf/support_torch_async

Conversation

@gaopengff (Contributor) commented:

This fixes a CI failure in PyTorch's bump PR pytorch/pytorch#172297.
In async mode, shmData should be owned exclusively. We added a lock for shmData to make it thread safe, and we use a unique tag for synchronization among the different ranks.
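The lock-plus-tag scheme described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `ShmData`, `beginAllreduce`, and the tag counter are hypothetical names standing in for the real `AllreduceSharedMemoryData` machinery.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for the PR's AllreduceSharedMemoryData.
struct ShmData {
  std::string name;
};

std::shared_ptr<ShmData> shmData;      // shared across async allreduce calls
std::mutex shmDataMutex;               // makes access to shmData thread safe
std::atomic<std::uint64_t> nextTag{0}; // unique tag per collective operation

// Each async allreduce takes the lock, lazily creates the shm handle,
// and draws a fresh tag so concurrent operations can be matched across ranks.
std::uint64_t beginAllreduce(const std::string& shmName) {
  std::lock_guard<std::mutex> guard(shmDataMutex);
  if (!shmData) {
    shmData = std::make_shared<ShmData>(ShmData{shmName});
  }
  return nextTag.fetch_add(1);
}
```

The mutex guarantees single-threaded initialization of the shared handle, while the monotonically increasing tag keeps overlapping async operations distinguishable.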

@meta-cla meta-cla bot added the CLA Signed label Jan 20, 2026
@gaopengff (Contributor, Author) commented:

@d4l3k Could you help review this?

@d4l3k (Member) left a comment:

LGTM


std::shared_ptr<AllreduceSharedMemoryData> shmData;

std::mutex shmDataMutex;
Member:

What happens if there are multiple gloo process groups? Does that cause issues at all?

Member:

Also, can we put this under shmData?

@gaopengff (Contributor, Author) replied:

  1. For the multiple-process-groups scenario, I used the gloo context's address to generate a unique ID for the shm name, so different groups use different shm buffers for the allreduce op. I've verified this with a PyTorch test using multiple process groups, and it passed.
  2. I don't think we can put this under shmData. On the first run, shmData is not initialized (nullptr); if multiple threads reach this point, we need to ensure the initialization work is done by only one thread.
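Point 1 might look roughly like the following sketch. This is hypothetical: the real PR derives the ID from the gloo context, and `shmNameForContext` is an invented name for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical sketch: derive a per-process-group shm name from the address
// of the gloo context object, so each group gets its own shm buffer.
std::string shmNameForContext(const void* context) {
  std::ostringstream oss;
  oss << "/gloo_shm_" << std::hex
      << reinterpret_cast<std::uintptr_t>(context);
  return oss.str();
}
```

Since each process group owns a distinct context object, distinct addresses yield distinct shm names, and groups never collide on the same buffer.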

Reply:

> I don't think we can put this under shmData. On the first run, shmData is not initialized (nullptr); if multiple threads reach this point, we need to ensure the initialization work is done by only one thread.

We could potentially move this to allreduce_shm.cc and make both static? That way we keep the global context clean of shm specifics. wdyt?

@gaopengff (Contributor, Author) replied:

Making them static only works in the multi-process scenario. In a multi-threaded scenario, like the gloo unit tests, each thread represents a rank, but the shm_data would be initialized only once, which is not as expected.
Although the thread_local keyword can make a static variable unique per thread, it may cause a performance issue. In a real workload such as PyTorch, the calling sequence is more like:

context init -> call allreduce -> call allreduce -> call allreduce

thread_local will make the shm data initialized every time allreduce is called. The initialization is very expensive, as it allocates the shm buffer.

Reply:

> thread_local will make the shm data initialized every time allreduce is called.

Is that correct? I think it will be initialized only the first time, but I see your point. Okay, I think we can at least wrap this in the macro you created.
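The reviewer's point can be checked with a small standalone experiment (hypothetical names): a thread_local static is constructed once per thread on first use, not on every call, though each new thread does pay the construction cost again.

```cpp
#include <cassert>

// Counts how many times the thread_local object below is constructed.
int initCount = 0;

struct Expensive {
  Expensive() { ++initCount; }  // stands in for allocating the shm buffer
};

// The thread_local static is initialized on the first call in each thread;
// repeated calls from the same thread reuse the existing object.
void callAllreduce() {
  thread_local Expensive shm;
  (void)shm;
}
```

Calling `callAllreduce()` several times from one thread constructs `Expensive` exactly once, which matches the reviewer's "only the first time"; the author's cost concern applies when many short-lived threads each trigger their own initialization.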

@gaopengff (Contributor, Author) replied:

OK, I've wrapped shm_data's declaration in the macro I created.

@kapilsh kapilsh self-requested a review March 7, 2026 11:09
@kapilsh left a comment:

Thanks, looks great. I left a few comments. Can we also add some tests covering the new shm all_reduce, and possibly resource-cleanup tests for shm?


#if !defined(_WIN32) && !defined(__aarch64__) && !defined(__arm__)
if (context->isIntraNode() && !context->getDevice()->hasGPUDirect()) {
algorithm = detail::AllreduceOptionsImpl::SHM;
Review comment:

I don't see a way for users to use an explicit algorithm: this will override anything the user explicitly specifies. Should we check for Algorithm::UNSPECIFIED before we override?

@gaopengff (Contributor, Author) replied:

I've modified it to only override Algorithm::UNSPECIFIED, and only when shm allreduce is applicable. I also added a unit test for shm allreduce in gloo/test/allreduce_test.cc.
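The selection logic being discussed might look roughly like this sketch. It is hypothetical: `pickAlgorithm` and the `RING` fallback are illustrative names, not gloo's actual defaults.

```cpp
#include <cassert>

// Illustrative algorithm enum; only auto-select SHM when the user
// left the algorithm unspecified.
enum class Algorithm { UNSPECIFIED, RING, SHM };

Algorithm pickAlgorithm(Algorithm requested, bool shmApplicable) {
  if (requested != Algorithm::UNSPECIFIED) {
    return requested;  // honor an explicit user choice
  }
  return shmApplicable ? Algorithm::SHM : Algorithm::RING;
}
```

With this shape, an explicit user choice always wins, and the shm path is only chosen as an automatic default on eligible platforms.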


#include <array>
#include <cstring>

#if !defined(_WIN32) && !defined(__aarch64__) && !defined(__arm__)
Review comment:

This seems to be copied in a bunch of places. Can we make it a macro?

@gaopengff (Contributor, Author) replied:

Sure, I've defined a macro for this in gloo/allreduce.h, which is also used in the unit tests:

#if !defined(_WIN32) && !defined(__aarch64__) && !defined(__arm__)
#define GLOO_SHM_ALLREDUCE_APPLICABLE 1
#else
#define GLOO_SHM_ALLREDUCE_APPLICABLE 0
#endif

@gaopengff gaopengff requested a review from kapilsh March 10, 2026 06:22
@meta-codesync

meta-codesync bot commented Mar 10, 2026

@kapilsh has imported this pull request. If you are a Meta employee, you can view this in D95938037.
