Enable rocm/7.2.1 #67

Open
michaelmckinsey1 wants to merge 1 commit into LBANN:main from michaelmckinsey1:rocm-721

@michaelmckinsey1
Collaborator

Enable rocm/7.2.1 using the public AMD wheels at https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2.1/. Apparently we should not rely on the WCI wheels going forward, so we will likely maintain scripts/install-tuolumne-torchpypi.sh. This change also uses the pre-installed rccl plugin at /collab/usr/global/tools/rccl/toss_4_x86_64_ib_cray/rocm-7.2.0/install/lib/librccl-net.so.

  • Requires python/3.12 (no 3.11 wheel when I last checked)
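
The two prerequisites above (Python 3.12 and the pre-installed rccl plugin) could be checked before running the install script. The following is only a sketch and is not part of this PR: the `preflight` helper and the `REQUIRED_PYTHON` constant are hypothetical names, and it assumes that 3.12 is the only interpreter version with a matching wheel.

```python
import os
import sys

# Plugin path quoted from the PR description; whether it exists depends on the machine.
RCCL_PLUGIN = (
    "/collab/usr/global/tools/rccl/toss_4_x86_64_ib_cray/"
    "rocm-7.2.0/install/lib/librccl-net.so"
)
# Assumption: only Python 3.12 has a matching wheel (the PR notes there is no 3.11 wheel).
REQUIRED_PYTHON = (3, 12)


def preflight(version=None, plugin_path=RCCL_PLUGIN):
    """Return a list of problem descriptions; an empty list means both checks passed."""
    if version is None:
        version = sys.version_info[:2]
    major_minor = tuple(version[:2])
    problems = []
    if major_minor != REQUIRED_PYTHON:
        problems.append(
            "python %d.%d found, expected %d.%d" % (major_minor + REQUIRED_PYTHON)
        )
    if not os.path.isfile(plugin_path):
        problems.append("rccl plugin not found: %s" % plugin_path)
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("preflight:", problem)
```

Running the file directly prints one line per failed check and nothing when both pass.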

At scale 7, 1,1,2 sharding, 10 epochs, rocm/7.2.1 is slightly faster than 7.1.0:

  • 1% faster on 1 node
  • 6% faster on 2 nodes
  • 4% faster on 4 nodes

@michaelmckinsey1 michaelmckinsey1 self-assigned this May 7, 2026