Pull request overview
Adds a new ROCm 6.3 container recipe to the repo’s dockerfile/ collection, targeting the rocm/pytorch-training:v25.6 base image and layering SuperBench build/install steps plus common tooling needed for benchmarks.
Changes:
- Introduces `dockerfile/rocm6.3.x.dockerfile` for a ROCm 6.3.4 + PyTorch training base image.
- Installs additional system tools (Docker client, OFED if missing, Intel MLC) and configures SSH/limits.
- Builds SuperBench third-party dependencies and installs the package with AMD worker extras.
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage Diff

| Metric | main | #790 |
|---|---|---|
| Coverage | 85.69% | 85.69% |
| Files | 103 | 103 |
| Lines | 7890 | 7890 |
| Hits | 6761 | 6761 |
| Misses | 1129 | 1129 |
Pull request overview
Adds a new ROCm 6.3 build target to the repository’s Docker image build pipeline, enabling CI to produce a superbench/main:rocm6.3 image variant.
Changes:
- Introduces `dockerfile/rocm6.3.x.dockerfile` based on `rocm/pytorch-training:v25.6`, with additional system deps, Docker CLI, OFED (conditional), and SuperBench build/install steps.
- Updates `.github/workflows/build-image.yml` to build and tag the new ROCm 6.3 image on the self-hosted ROCm runner.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| dockerfile/rocm6.3.x.dockerfile | New ROCm 6.3 Dockerfile building SuperBench on top of rocm/pytorch-training:v25.6. |
| .github/workflows/build-image.yml | Adds a rocm6.3 entry to the build matrix to produce/push superbench/main:rocm6.3. |
Pull request overview
Adds a new ROCm 6.3 container build path to the repo, and updates third-party build logic to support explicit ROCm GPU arch selection during rccl-tests compilation.
Changes:
- Add `dockerfile/rocm6.3.x.dockerfile` based on `rocm/pytorch-training:v25.6` and install additional tooling (Docker client, OFED, MLC).
- Update `third_party/Makefile` to optionally build `rccl-tests` with explicit `--offload-arch` flags when `AMDGPU_TARGETS` is provided.
- Update the GitHub Actions image build matrix to include a `rocm6.3` build.
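The arch-aware `rccl-tests` build mentioned in the change list can be illustrated with a small shell sketch. This is not the code in `third_party/Makefile`: the function name `offload_arch_flags` is hypothetical, and the assumption that `AMDGPU_TARGETS` separates architectures with semicolons, commas, or spaces is mine. It only shows the general idea of expanding a target list into one `--offload-arch` flag per architecture, and emitting nothing when the list is empty.

```shell
#!/bin/sh
# Hypothetical helper: expand an AMDGPU_TARGETS value into --offload-arch flags.
# Assumption: targets are separated by semicolons, commas, or spaces.
offload_arch_flags() {
    targets="$1"
    flags=""
    # Empty value: print nothing so the compiler's default arch selection applies.
    [ -z "$targets" ] && return 0
    # Normalise separators to spaces, then rely on word splitting.
    for arch in $(printf '%s' "$targets" | tr ';,' '  '); do
        flags="${flags:+$flags }--offload-arch=$arch"
    done
    printf '%s\n' "$flags"
}

offload_arch_flags "gfx90a;gfx942"
# prints: --offload-arch=gfx90a --offload-arch=gfx942
```

Returning an empty string for an empty input mirrors the "only when `AMDGPU_TARGETS` is provided" behaviour described in the review summary.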
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| third_party/Makefile | Adds conditional arch-aware rccl-tests build flags driven by AMDGPU_TARGETS. |
| dockerfile/rocm6.3.x.dockerfile | Introduces a new ROCm 6.3 image definition and wires in third-party builds and package installs. |
| .github/workflows/build-image.yml | Adds the new rocm6.3 image to CI build/push matrix. |
- Pin botocore/boto3 to 1.35.98 for reproducible builds
- Remove unused `ARG NUM_MAKE_JOBS` and corresponding CI `build_args`
- Derive `UBUNTU_VERSION` dynamically via `lsb_release` inside OFED `RUN` block
- Fix OFED comment to match actual logic
- Switch OFED download from HTTP to HTTPS
- Split setuptools install into separate `RUN` layers to avoid masking failures
- Add `rm -rf .git` after build to reduce image size
- Change `ifdef` to `ifneq` for `AMDGPU_TARGETS` non-empty check in Makefile
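Deriving the Ubuntu release at build time, rather than hard-coding it, might look roughly like the sketch below. The helper names `release_from_file` and `ubuntu_version` are hypothetical and are not taken from the Dockerfile. The fallback to `/etc/os-release` is also my own addition for images where `lsb_release` is not installed. The commit itself only mentions `lsb_release`.

```shell
#!/bin/sh
# Hypothetical helper: read VERSION_ID (for example "22.04") from an
# os-release style file. The file is sourced in a subshell so that its
# variables do not leak into the caller's environment.
release_from_file() {
    ( . "$1" && printf '%s\n' "$VERSION_ID" )
}

# Hypothetical helper: prefer `lsb_release -rs`, which prints only the
# release number; otherwise fall back to /etc/os-release.
ubuntu_version() {
    if command -v lsb_release >/dev/null 2>&1; then
        lsb_release -rs
    else
        release_from_file /etc/os-release
    fi
}

# Possible use inside a RUN step:
#   UBUNTU_VERSION="$(ubuntu_version)"
```

Computing the value inside the same `RUN` block that consumes it keeps the version tied to whatever base image is actually in use.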
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
- Set `strategy.fail-fast: false` in `build-image.yml` so a transient ROCm self-hosted runner failure does not abort sibling CUDA image builds.
- Promote `AMDGPU_TARGETS` to a build `ARG` so it can be overridden via `--build-arg` at docker build time (e.g., to add gfx950 for newer cards).
- Add a comment documenting that RCCL is intentionally taken from the base image (no custom build / `LD_PRELOAD`) for ROCm 6.3.
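As a rough illustration of the `--build-arg` override, the sketch below prints the `docker build` command one might run. The wrapper name `rocm63_build_cmd` is hypothetical, the example target list and its semicolon separator are assumptions, and the default value of `AMDGPU_TARGETS` in the Dockerfile is not shown in this conversation. The Dockerfile path and the `superbench/main:rocm6.3` tag are taken from the review summaries.

```shell
#!/bin/sh
# Hypothetical wrapper: print the docker build command for the ROCm 6.3 image.
# --build-arg is added only when an AMDGPU_TARGETS override is supplied;
# otherwise the Dockerfile's own ARG default applies.
rocm63_build_cmd() {
    targets="$1"
    cmd="docker build -f dockerfile/rocm6.3.x.dockerfile -t superbench/main:rocm6.3"
    if [ -n "$targets" ]; then
        # Single quotes keep a semicolon-separated list intact when the
        # printed command is pasted into a shell.
        cmd="$cmd --build-arg 'AMDGPU_TARGETS=$targets'"
    fi
    printf '%s .\n' "$cmd"
}

# Example override that adds gfx950 (the target list is illustrative).
rocm63_build_cmd "gfx942;gfx950"
```

Printing the command instead of executing it keeps the sketch safe to run on a machine without Docker or a ROCm runner.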
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
# - cmake: 3.18.5
# - rocm-cmake: 0.14.0.60304-76
# - amd-smi: 25.1.0+8dc45db

ADD third_party third_party

RUN make RCCL_HOME=/opt/rocm ROCBLAS_BRANCH=release-staging/rocm-rel-6.3 HIPBLASLT_BRANCH=release-staging/rocm-rel-6.3 ROCM_VER=rocm-5.5.0 -C third_party rocm -o cpu_hpl -o cpu_stream -o megatron_lm -o rocm_megatron_lm
Description
Add a ROCm 6.3 Dockerfile.