Use module-only when a CUDA Compute Capability is requested that is incompatible with the CUDA toolkit version used #146
…module-only if this is CUDA-12.6 based but targets CC100 or CC120
I think we can make this a little more powerful, by defining a lookup-table that, for a given CUDA Compute Capability, returns the CUDA version in which it was first supported, and the CUDA version in which it was last supported (or "99.9.9" or something, if it is still supported). Then, we do a semantic version comparison to figure out if we are in that range. If not, we add an informative error message to the module, and generate with |
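A minimal sketch of what such a lookup table and range check could look like. The names `CC_SUPPORT_RANGE`, `parse_version` and `cc_supported_by_toolkit` are hypothetical, and the version ranges are illustrative placeholders that would need to be verified against the actual toolkits. Versions are compared as integer tuples here to keep the example free of dependencies; the real hooks may use a different version-comparison helper.

```python
from typing import Dict, Tuple

# Sentinel for "still supported in the newest toolkit", as suggested above
STILL_SUPPORTED = "99.9.9"

# Hypothetical table: compute capability -> (first, last) supporting CUDA version.
# Illustrative values only; verify against the toolkits before relying on them.
CC_SUPPORT_RANGE: Dict[str, Tuple[str, str]] = {
    "90": ("11.8.0", STILL_SUPPORTED),
    "100": ("12.8.0", STILL_SUPPORTED),
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '12.6.0' into (12, 6, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def cc_supported_by_toolkit(cuda_cc: str, cuda_version: str) -> bool:
    """Return True if cuda_version lies within the support range of cuda_cc."""
    first, last = CC_SUPPORT_RANGE[cuda_cc]
    return parse_version(first) <= parse_version(cuda_version) <= parse_version(last)
```

With these placeholder ranges, `cc_supported_by_toolkit("100", "12.6.0")` is false, which is the situation in which the module would get the error message.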
…ted configurations more generic. Then, also apply this to unsupported combinations of CUDA toolkit versions and requested CUDA compute capabilities. TODO: actually implement a function that checks this compatibility
…da_version actually returns 'None' if CUDA was not in the deps
…r in the pre-module hook
…ed by the generic X_prepare_hook_unsupported_modules
…laced by generic hooks
…nvironment variables don't contain invalid characters like commas and periods. Add some warning messages if installing a module that's unsupported.
| # Supported compute capabilities by CUDA toolkit version | ||
| # Obtained by installing all CUDAs from 12.0.0 to 13.1.0, then using: | ||
| # #!/bin/bash |
I think it's worth leaving this here as a breadcrumb to future contributors, since we'll have to update this list occasionally and doing it manually is silly - especially if you want to add compatibility for a range of toolkit versions
| # Clean cuda_cc of any suffixes like the 'a' in '9.0a' | ||
| # The regex expects one or more digits, a dot, one or more digits, and then optionally any number of characters | ||
| # It will strip all characters by only returning the first capture group (the digits and dot) | ||
| cuda_cc = re.sub(r'^(\d+\.\d+)[a-zA-Z]*$', r'\1', cuda_cc) |
The lookup table contains CCs in the format of 90, 100, etc., so no periods and no suffixes. The CUDA compute capabilities passed to EasyBuild always contain periods and can contain suffixes. So to compare, we need to strip the suffix from EB's CUDA CC and remove the period.
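As a small illustration of that normalization: the regex is the one from the diff, while the wrapper function `normalize_cuda_cc` is a hypothetical name used only for this sketch.

```python
import re


def normalize_cuda_cc(cuda_cc: str) -> str:
    """Convert an EasyBuild-style compute capability to the lookup-table format."""
    # Strip an optional letter suffix, keeping only '<digits>.<digits>'
    stripped = re.sub(r'^(\d+\.\d+)[a-zA-Z]*$', r'\1', cuda_cc)
    # Drop the period so that '9.0' becomes '90'
    return stripped.replace('.', '')
```

For example, `normalize_cuda_cc("9.0a")` gives `"90"` and `normalize_cuda_cc("10.0")` gives `"100"`.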
| # Always trigger this one, regardless of ec.name | ||
| cpu_target = get_eessi_envvar('EESSI_SOFTWARE_SUBDIR') | ||
| if cpu_target == CPU_TARGET_ZEN4: | ||
| parse_hook_zen4_module_only(ec, eprefix) |
This is now handled in the pre_module_hook_unsupported_modules.
| print_msg(msg % (new_parallel, curr_parallel, session_parallel, self.name, cpu_target), log=self.log) | ||
| def pre_prepare_hook_unsupported_modules(self, *args, **kwargs): |
Replaces the specific pre_prepare_hook_ignore_zen4_gcccore1220_error we had before.
| os.environ[unsup_mod.envvar] = "1" | ||
| def post_prepare_hook_unsupported_modules(self, *args, **kwargs): |
Replaces the post_prepare_hook_ignore_zen4_gcccore1220_error hook we had before
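For readers skimming the diff, the set/unset pattern of these two generic hooks can be sketched roughly as follows. This is a simplified stand-in, not the code in this PR: the function names, the value of `EESSI_UNSUPPORTED_MODULE_ATTR`, the `error_msg` field and the example environment variable name are assumptions, and the real hooks go through `is_unsupported_module` first.

```python
import os
from types import SimpleNamespace
from typing import NamedTuple


class UnsupportedModule(NamedTuple):
    envvar: str      # environment variable that suppresses the LmodError
    error_msg: str   # assumed field name for the LmodError message


# Assumed attribute name; the real constant's value may differ
EESSI_UNSUPPORTED_MODULE_ATTR = "eessi_unsupported_module"


def pre_prepare_sketch(self):
    """Before dependencies are loaded: suppress the LmodError."""
    unsup_mod = getattr(self, EESSI_UNSUPPORTED_MODULE_ATTR, None)
    if unsup_mod is not None:
        os.environ[unsup_mod.envvar] = "1"


def post_prepare_sketch(self):
    """After dependencies are loaded: remove the suppression again."""
    unsup_mod = getattr(self, EESSI_UNSUPPORTED_MODULE_ATTR, None)
    if unsup_mod is not None:
        os.environ.pop(unsup_mod.envvar, None)


# Usage with a stand-in object instead of a real EasyBlock
easyblock = SimpleNamespace()
setattr(easyblock, EESSI_UNSUPPORTED_MODULE_ATTR,
        UnsupportedModule("EXAMPLE_IGNORE_LMOD_ERROR", "example message"))
pre_prepare_sketch(easyblock)
post_prepare_sketch(easyblock)
```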
| if cpu_target == CPU_TARGET_ZEN4: | ||
| pre_prepare_hook_ignore_zen4_gcccore1220_error(self, *args, **kwargs) | ||
| # Always trigger this, regardless of ec.name | ||
| pre_prepare_hook_unsupported_modules(self, *args, **kwargs) |
Run the new hook instead of the old one. All the logic to check if something is an unsupported module is now contained within is_unsupported_module, so no more use for checking the cpu_target.
| if cpu_target == CPU_TARGET_ZEN4: | ||
| post_prepare_hook_ignore_zen4_gcccore1220_error(self, *args, **kwargs) | ||
| # Always trigger this, regardless of ec.name | ||
| post_prepare_hook_unsupported_modules(self, *args, **kwargs) |
Run the new hook instead of the old one. All the logic to check if something is an unsupported module is now contained within is_unsupported_module, so no more use for checking the cpu_target.
| print_msg("Changed toolchainopts for %s: %s", ec.name, ec['toolchainopts']) | ||
| def parse_hook_zen4_module_only(ec, eprefix): |
Adding the LmodError to the modluafooter is now done in the generic pre_module_hook_unsupported_module hook
| def is_unsupported_module(ec): | ||
| class UnsupportedModule(NamedTuple): |
Add a named tuple so that we can access the environment variable name and error message through clearly named attributes. That's less error-prone than a regular tuple, where you'd have to remember what is stored in the first element and what is stored in the second.
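To illustrate the difference, here is a minimal sketch. The `envvar` attribute appears in the diff; the `error_msg` field name and the example values are assumptions.

```python
from typing import NamedTuple


class UnsupportedModule(NamedTuple):
    envvar: str      # name of the environment variable that suppresses the LmodError
    error_msg: str   # assumed field name for the message embedded in the modulefile


unsup_mod = UnsupportedModule(
    envvar="EXAMPLE_IGNORE_LMOD_ERROR",
    error_msg="This module is not supported on this target.",
)

# Named access is self-documenting...
by_name = unsup_mod.envvar
# ...whereas positional access requires remembering the field order
by_index = unsup_mod[0]
```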
| if cpu_target == CPU_TARGET_ZEN4 and is_gcccore_1220_based(ecname=ec.name, ecversion=ec.version, tcname=ec.toolchain.name, tcversion=ec.toolchain.version): | ||
| return EESSI_IGNORE_ZEN4_GCC1220_ENVVAR | ||
| # If this function was already called by an earlier hook, evaluation of whether this is an unsupported module was |
At this point in time, the is_unsupported_module function is called 6 or 7 times. Since it may become quite lengthy with lots of logic if we keep adding cases for modules that are unsupported, we want an early return for optimization in case this has already been evaluated before. We can easily do that by checking if either the EESSI_SUPPORTED_MODULE_ATTR or EESSI_UNSUPPORTED_MODULE_ATTR have been set.
If neither has been set, this is the first time we are evaluating this function and we should go through the full logic.
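The early-return idea can be sketched as below. This is an assumption-laden illustration rather than the PR's code: the attribute name values, `DummyEasyBlock`, `evaluate_cases` and the `CALLS` counter are all hypothetical and only there to make the caching visible.

```python
# Assumed attribute names; the real constants' values may differ
EESSI_SUPPORTED_MODULE_ATTR = "eessi_supported_module"
EESSI_UNSUPPORTED_MODULE_ATTR = "eessi_unsupported_module"

CALLS = []  # records every time the (potentially lengthy) full logic runs


class DummyEasyBlock:
    """Stand-in for an EasyBlock instance."""
    def __init__(self, unsupported):
        self.unsupported = unsupported


def evaluate_cases(self):
    """Hypothetical placeholder for the full per-case logic."""
    CALLS.append(self)
    return "details of the unsupported case" if self.unsupported else None


def is_unsupported_module(self):
    # Early return if an earlier hook already evaluated this object
    if hasattr(self, EESSI_SUPPORTED_MODULE_ATTR):
        return False
    elif hasattr(self, EESSI_UNSUPPORTED_MODULE_ATTR):
        return True

    # First evaluation: run the full logic and cache the outcome on the object
    result = evaluate_cases(self)
    if result is None:
        setattr(self, EESSI_SUPPORTED_MODULE_ATTR, True)
        return False
    setattr(self, EESSI_UNSUPPORTED_MODULE_ATTR, result)
    return True
```

Storing the outcome on the object itself means later hooks can both skip the logic and retrieve the cached details with `getattr`.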
| elif hasattr(self, EESSI_UNSUPPORTED_MODULE_ATTR): | ||
| return True | ||
| # Foss-2022b is not supported on Zen4 |
Next time we have unsupported modules, this function is the only one that needs changing: we simply add a case to it. A case typically has:
- Logic (if statements) to determine if this is an unsupported module
- Print a warning message to stdout to make it clear we're doing something out-of-the-ordinary in this installation
- Define the LmodError message that should be embedded in the modulefile
- Define the environment variable name that can be used to suppress the LmodError
| ignore_lmoderror_envvar = is_unsupported_module(self) | ||
| if ignore_lmoderror_envvar: | ||
| if is_unsupported_module(self): | ||
| unsup_mod = getattr(self, EESSI_UNSUPPORTED_MODULE_ATTR) |
Get the UnsupportedModule tuple, so we can use it to set the environment variable that suppresses the LmodError.
| # Modules for dependencies are loaded in the prepare step. Thus, that's where we need this variable to be set | ||
| # so that the modules can be succesfully loaded without printing the error (so that we can create a module | ||
| # _with_ the warning for the current software being installed) | ||
| def pre_prepare_hook_ignore_zen4_gcccore1220_error(self, *args, **kwargs): |
Replaced by generic pre_prepare_hook_unsupported_modules
| os.environ[EESSI_IGNORE_ZEN4_GCC1220_ENVVAR] = "1" | ||
| def post_prepare_hook_ignore_zen4_gcccore1220_error(self, *args, **kwargs): |
Replaced by generic post_prepare_hook_unsupported_modules
Add some initial changes to the hooks to make sure to install with --module-only if this is CUDA-12.6 based but targets CC100 or CC120.
This still needs to be completed. Also, I could potentially make it more clever and match anything that's >=CC100.
Edit 08-01: This PR now does two things.
… is_unsupported_module
To achieve this, I first made the current mechanism that was there for handling the Zen4+foss-2022b incompatibility more generic.
- … parse_hook_zen4_module_only (took care of adding the LmodError in a modluafooter), pre_prepare_hook_ignore_zen4_gcccore1220_error (sets env var to suppress the LmodError when building other software on top, so that you can create modules dependent on the unsupported module) and post_prepare_hook_ignore_zen4_gcccore1220_error (unsets that env var), with more generic hooks
- … pre_module_hook, since some information relevant to determining if a module is unsupported may not be available as early as the parse_hook (such as the requested CUDA compute capability)
Then, I
- … is_unsupported_module.
I've left some reviewer-comments in the changed files to make it easier for anyone reviewing this to see why things were changed.
I then ran tests to a) validate that it still worked for zen4+foss-2022b and b) did what it should do for the CUDA CC vs Toolkit version (in)compatibility. Results are in summary below.
zen4+foss-2022b test
Environment:
Build log:
Test loading module:
This is indeed the LmodError we expected
zen4+foss-2023b test
Environment:
Build log:
Test loading module:
This is what we expect - this installation should be unaltered.
CUDA Compute Capability 10.0 with CUDA Toolkit 12.6.0 (incompatible)
Environment:
Build log:
Test loading module:
This is what we expect.
CUDA Compute Capability 9.0 with CUDA Toolkit 12.6.0 (compatible)
Environment:
Build log:
Test loading module:
This is, again, what we expect, since there is support for CC9.0 in CUDA 12.6.0