Pin cuda-toolkit wheel to container's CTK major.minor in CI #8160
Conversation
Previously, Python test scripts extracted only the major version from nvcc (e.g. 12) and installed cuda-toolkit==12.*, which floated to the latest 12.x from PyPI regardless of the container's actual CTK version. This masked issues like the nvrtc compiler bug in CTK 12.4. Use PIP_CONSTRAINT to pin cuda-toolkit==X.Y.* (e.g. 12.9.*) matching the container's nvcc, ensuring CI tests exercise the exact same cuda-toolkit minor version as the devcontainer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
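As a standalone sketch of the parsing change described above (fed a canned `nvcc --version` release line rather than a real nvcc binary; the `V12.9.86` build number is illustrative, not taken from the PR), the pipeline now keeps the major.minor pair instead of just the major:

```shell
# Sketch: parse major.minor from a canned nvcc release line.
# The line below mimics the output of `nvcc --version | grep release`.
nvcc_line="Cuda compilation tools, release 12.9, V12.9.86"

# $6 is "V12.9.86"; strip commas, keep first two dot-fields, drop the "V".
cuda_version=$(echo "$nvcc_line" | awk '{print $6}' | tr -d ',' | cut -d '.' -f 1-2 | cut -d 'V' -f 2)
cuda_major_version=$(echo "$cuda_version" | cut -d '.' -f 1)

echo "old pin: cuda-toolkit==${cuda_major_version}.*"   # floats to latest 12.x
echo "new pin: cuda-toolkit==${cuda_version}.*"         # pinned to 12.9.x
```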
/ok to test 7f3f7a1
🥳 CI Workflow Results: 🟩 Finished in 2h 55m: Pass: 100%/445 | Total: 7d 04h | Max: 2h 54m | Hits: 86%/512649
```shell
cuda_version=$(nvcc --version | grep release | awk '{print $6}' | tr -d ',' | cut -d '.' -f 1-2 | cut -d 'V' -f 2)
cuda_major_version=$(echo "$cuda_version" | cut -d '.' -f 1)
export PIP_CONSTRAINT="${TMPDIR:-/tmp}/ctk-constraint.txt"
echo "cuda-toolkit==${cuda_version}.*" > "$PIP_CONSTRAINT"
```
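A minimal standalone sketch of the constraint mechanism (the `12.9` literal stands in for the value parsed from `nvcc --version` in the real script): once `PIP_CONSTRAINT` is exported, pip reads the file on every subsequent `pip install` in that shell, with no extra flags needed.

```shell
#!/usr/bin/env bash
# Sketch only: "12.9" is a stand-in for the parsed nvcc version.
set -euo pipefail

cuda_version="12.9"

# Exporting PIP_CONSTRAINT makes pip treat this file as if it were
# passed via `pip install -c <file>` on every later invocation.
export PIP_CONSTRAINT="${TMPDIR:-/tmp}/ctk-constraint.txt"
echo "cuda-toolkit==${cuda_version}.*" > "$PIP_CONSTRAINT"

cat "$PIP_CONSTRAINT"   # cuda-toolkit==12.9.*
```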
Question: I may be missing something, but how does the constraint file get propagated to the pip install command?
I understand that this fixes things in our CI, but can you clarify whether this is a temporary workaround? Longer term, I would prefer that no one is responsible for specifying a CUDA minor version; it's an additional burden that we essentially pass on to users. It would be helpful if we could answer the following questions:
I don't think this is a temporary workaround. My understanding is that this PR ensures that if we test with a 12.0 container, we use CTK 12.0. This would have helped us catch issues like #8138. This only helps with minor version compatibility insofar as we ensure that a larger set of minor versions works.
This is the problematic part. I think we should be able to use any 12.x on a 12.y container (or my understanding of MVC is wrong, and I would love to understand what the gap is).
For some reason I missed the notification. I think MVC does not work like that: all CTK components must come from the same major.minor. MVC concerns the CTK version with respect to the UMD version. Without pinning all CTK components at the same major.minor, we have a potential mix-and-match (between CTK from wheels and CTK from the container), which is not a supported use case. Either go with wheel pinning (to ensure the wheel and container have the same major.minor), or follow CUDA Python and not use any CTK container at all (the wheel alone decides the major.minor).
(The only outlier is nvJitLink, which is allowed to be higher than all other CTK components' major.minor. I assumed this is well-known.)
Summary
- Pin `cuda-toolkit` PyPI wheels in CI to match the container's exact CTK major.minor version (e.g. `cuda-toolkit==12.9.*`) instead of floating to the latest (e.g. `cuda-toolkit==12.*`)
- Implemented via the `PIP_CONSTRAINT` mechanism; no changes to `pyproject.toml`, so end users are unaffected
- Applies to the test scripts for `test_cuda_cccl_headers`, `test_cuda_compute`, `test_cuda_coop`, and `test_cuda_cccl_examples`

Motivation
The Python CI test scripts previously extracted only the major version from `nvcc` (e.g. `12`) and installed `cuda-toolkit==12.*`, which always resolved to the latest 12.x from PyPI regardless of the container's actual CTK version. This meant that even when running in a CTK 12.0 container, CI would install `cuda-toolkit` 12.9 wheels (as discovered in #8139 (comment)).

This masked issues like the nvrtc compiler bug in CTK 12.4, because the pip-installed nvrtc (latest) was always used instead of the container's version.
How it works
Each test script now also extracts the `X.Y` version from `nvcc --version` and writes a pip constraint file. `PIP_CONSTRAINT` is a standard pip environment variable that automatically constrains all subsequent `pip install` commands in the script.

Test plan
- Verified in a CTK 12.0 container that `cuda-toolkit==12.0.*` is installed instead of 12.9

🤖 Generated with Claude Code