Skip to content

CollectivePermute based allgather in host for-loop#5963

Draft
Priya2698 wants to merge 14 commits intopm/stream_broadcastfrom
pm/collective_permute
Draft

CollectivePermute based allgather in host for-loop#5963
Priya2698 wants to merge 14 commits intopm/stream_broadcastfrom
pm/collective_permute

Conversation

@Priya2698
Copy link
Collaborator

No description provided.

@Priya2698
Copy link
Collaborator Author

!test

@github-actions
Copy link

github-actions bot commented Feb 12, 2026

Review updated until commit a7bc714

Description

  • Add new CollectivePermute communication type for ring-style allgather using send/recv

  • Implement Swizzle1D-based peer calculation to determine send and receive partners

  • Add CollectivePermute class with send_peer and recv_peer inputs for flexible data routing

  • Update resharding to handle Swizzle1D transforms when computing loop indices

Changes walkthrough

Relevant files

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review
Potential null pointer dereference

In lowerToCollectivePermute at line 354, the code accesses stream_id->definition()->as()
without validating that stream_id->definition() exists or is actually a Swizzle1D. This could cause
a null pointer dereference if the stream_id has no definition or the definition is a different type.
Consider adding validation before accessing the swizzle.

IterDomain* stream_id =
    getShardedIterDomain(output_tv, ParallelType::Stream, DomainType::kLoop);
Swizzle1D* swizzle = stream_id->definition()->as<Swizzle1D>();
ParallelType pt = swizzle->parallelType();
Missing nullptr handling in CollectivePermute

In postSingleCommunication for CollectivePermute (line 600-602), when the device is not in the team,
the function returns nullptr. However, the calling code in evaluator.cpp does not appear to handle
this nullptr return value, which could lead to issues if works_[communication] is expected to be valid.
Verify that nullptr returns are properly handled downstream.

const Team& team = communication->team();
if (std::find(team.begin(), team.end(), my_device_index) == team.end()) {
  return nullptr;
}
Backend validation could be more informative

At line 332, the code checks that CollectivePermute uses NCCL backend and throws a generic error
if not. Consider providing a more informative error message that explains why UCC is not supported
(the comment mentions "UCC does not support coalescing") so users understand the limitation.

// CollectivePermute is only supported with NCCL backend because
// UCC does not support coalescing.
NVF_CHECK_EQ(backend_type, CommunicatorBackend::kNccl);

@Priya2698 Priya2698 changed the base branch from main to pm/stream_broadcast March 4, 2026 22:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant