Use module-only when a CUDA Compute Capability is requested that is incompatible with the CUDA toolkit version used #146

casparvl · 2025-12-24T16:40:06Z

Add some initial changes to the hooks to make sure to install with --module-only if this is CUDA-12.6 based but targets CC100 or CC120.

This still needs to be completed. Also, I could potentially make it more clever and match anything that's >=CC100.

Edit 08-01: This PR now does two things.

- Make the handling of unsupported modules more generic. If we ever need to add a new case of an unsupported module, the only thing we need to do is add that case to is_unsupported_module.
- Use that generic approach to check the compatibility between the requested CUDA Compute Capability and the CUDA toolkit version, and treat the module as unsupported if they are incompatible.

To achieve this, I first made the current mechanism that was there for handling the Zen4+foss-2022b incompatibility more generic.

- Replace specific hooks with more generic ones. The replaced hooks are parse_hook_zen4_module_only (added the LmodError in a modluafooter), pre_prepare_hook_ignore_zen4_gcccore1220_error (set the env var that suppresses the LmodError when building other software on top, so that you can create modules that depend on the unsupported module) and post_prepare_hook_ignore_zen4_gcccore1220_error (unset that env var).
- Create a NamedTuple to hold two pieces of information:
  - The name of the environment variable to suppress the associated LmodError
  - The text for the LmodError
- Set that NamedTuple as an attribute on self
- Make other hooks use that attribute to set the relevant environment variable & error message
- Move setting the luafooter to the pre_module_hook, since some information relevant to determining if a module is unsupported may not be available as early as the parse_hook (such as the requested CUDA compute capability)
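
To make the mechanism above concrete, here is a minimal sketch of how the NamedTuple and the two prepare hooks could fit together. It is not the code in eb_hooks.py: the field name `errmsg`, the value of `EESSI_UNSUPPORTED_MODULE_ATTR` and the `FakeEasyBlock` stand-in are assumptions for illustration; only `UnsupportedModule`, its `envvar` field, the attribute constant name and the hook names are taken from this PR.

```python
import os
from typing import NamedTuple


class UnsupportedModule(NamedTuple):
    """Information attached to an easyblock whose module is unsupported."""
    envvar: str  # environment variable that suppresses the LmodError
    errmsg: str  # text of the LmodError (field name is an assumption)


# Hypothetical attribute name; the actual value in eb_hooks.py may differ
EESSI_UNSUPPORTED_MODULE_ATTR = "eessi_unsupported_module"


def pre_prepare_hook_unsupported_modules(self, *args, **kwargs):
    """Suppress the LmodError while dependency modules are loaded."""
    unsup_mod = getattr(self, EESSI_UNSUPPORTED_MODULE_ATTR, None)
    if unsup_mod is not None:
        os.environ[unsup_mod.envvar] = "1"


def post_prepare_hook_unsupported_modules(self, *args, **kwargs):
    """Remove the suppression once the dependencies have been loaded."""
    unsup_mod = getattr(self, EESSI_UNSUPPORTED_MODULE_ATTR, None)
    if unsup_mod is not None:
        os.environ.pop(unsup_mod.envvar, None)


class FakeEasyBlock:
    """Stand-in for an EasyBuild easyblock instance, for illustration only."""


eb = FakeEasyBlock()
setattr(eb, EESSI_UNSUPPORTED_MODULE_ATTR, UnsupportedModule(
    envvar="EESSI_IGNORE_LMOD_ERROR_ZEN4_GCC1220",
    errmsg="EasyConfigs using toolchains based on GCCcore-12.2.0 are not "
           "supported for the Zen4 architecture.",
))

pre_prepare_hook_unsupported_modules(eb)
during_prepare = os.environ.get("EESSI_IGNORE_LMOD_ERROR_ZEN4_GCC1220")
post_prepare_hook_unsupported_modules(eb)
after_prepare = os.environ.get("EESSI_IGNORE_LMOD_ERROR_ZEN4_GCC1220")
```

The variable is only set between the two hooks, which mirrors the "Setting ..." and "Unsetting ..." lines in the build logs below.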

Then, I:

- implemented logic to check for the CUDA compatibility
- implemented a message & env-var for the CUDA case in is_unsupported_module.

I've left some reviewer-comments in the changed files to make it easier for anyone reviewing this to see why things were changed.

I then ran tests to a) validate that it still works for zen4+foss-2022b and b) check that it does what it should for the CUDA CC vs. toolkit version (in)compatibility. Results are summarized below.

zen4+foss-2022b test

Environment:

module load EESSI/2023.06
module load EESSI-extend/2023.06-easybuild

Build log:

# Use eb_hooks from feature branch:
eb --hooks eb_hooks.py h5py-3.8.0-foss-2022b.eb --rebuild

== Temporary log file in case of crash /scratch-local/casparl.18163544/eb-37mpwjnj/easybuild-cuu9nk5g.log
...
== processing EasyBuild easyconfig
/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/EasyBuild/5.2.0/easybuild/easyconfigs/h/h5py/h5py-3.8.0-foss-2022b.eb
== building and installing h5py/3.8.0-foss-2022b...
  >> installation prefix: /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen4/software/h5py/3.8.0-foss-2022b
== fetching files and verifying checksums...
== Running pre-fetch hook...

WARNING: EasyConfigs using toolchains based on GCCcore-12.2.0 are not supported on Zen4 architectures. Building with
'--module-only --force' and injecting an LmodError into the modulefile.

== Updated build option 'module-only' to 'True'
== Updated build option 'force' to 'True'
...
== Setting EESSI_IGNORE_LMOD_ERROR_ZEN4_GCC1220 to allow loading dependencies that otherwise throw an LmodError
  >> loading toolchain module: foss/2022b
  >> loading modules for build dependencies:
  >>  * pkgconfig/1.5.5-GCCcore-12.2.0-python
  >> loading modules for (runtime) dependencies:
  >>  * Python/3.10.8-GCCcore-12.2.0
  >>  * SciPy-bundle/2023.02-gfbf-2022b
  >>  * mpi4py/3.1.4-gompi-2022b
  >>  * HDF5/1.14.0-gompi-2022b
  >> defining build environment for foss/2022b toolchain
== Running post-prepare hook...
== Resetting rpath_override_dirs to original value: None
== Unsetting EESSI_IGNORE_LMOD_ERROR_ZEN4_GCC1220
...
== Running pre-module hook...
== Setting EESSI_IGNORE_LMOD_ERROR_ZEN4_GCC1220 in initial environment
  >> generating module file @
/home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen4/modules/all/h5py/3.8.0-foss-2022b.lua
== Running post-module hook...
== Restored original build option 'module_only' to False
== Restored original build option 'force' to False
== Removing EESSI_IGNORE_LMOD_ERROR_ZEN4_GCC1220 in initial environment
== ... (took 1 secs)
...
== Results of the build can be found in the log file(s)
/home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen4/software/h5py/3.8.0-foss-2022b/easybuild/easybuild-h5py-3.8.0-20260108.171940.log.bz2
== Running post-easyblock hook...

== Build succeeded for 1 out of 1 (total: 5 secs)
== Summary:
   * [SUCCESS] h5py/3.8.0-foss-2022b
== Temporary log file(s) /scratch-local/casparl.18163544/eb-37mpwjnj/easybuild-cuu9nk5g.log* have been removed.
== Temporary directory /scratch-local/casparl.18163544/eb-37mpwjnj has been removed.

Test loading module:

module load h5py/3.8.0-foss-2022b

Lmod has detected the following error:  EasyConfigs using toolchains based on GCCcore-12.2.0 are not supported for
the Zen4 architecture.
See
https://www.eessi.io/docs/known_issues/eessi-2023.06/#gcc-1220-and-foss-2022b-based-modules-cannot-be-loaded-on-zen4-architecture
While processing the following module(s):
    Module fullname        Module Filename
    ---------------        ---------------
    GCCcore/12.2.0         /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/modules/all/GCCcore/12.2.0.lua
    GCC/12.2.0             /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/modules/all/GCC/12.2.0.lua
    foss/2022b             /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/modules/all/foss/2022b.lua
    h5py/3.8.0-foss-2022b  /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen4/modules/all/h5py/3.8.0-foss-2022b.lua

This is indeed the LmodError we expected

zen4+foss-2023b test

Environment:

module load EESSI/2023.06
module load EESSI-extend/2023.06-easybuild

Build log:

# Use eb_hooks from feature branch:
eb --hooks eb_hooks.py h5py-3.11.0-foss-2023b.eb --rebuild

...
== ... (took < 1 sec)
  >> running shell command:
        bzip2 /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen4/software/h5py/3.11.0-foss-2023b/easybuild/easybuild-h5py-3.11.0-20260108.174439.log
        [started at: 2026-01-08 17:44:39]
        [working dir: /gpfs/home4/casparl/EESSI/software-layer-scripts]
        [output and state saved to /scratch-local/casparl.18163544/eb-_pcfey5c/run-shell-cmd-output/bzip2-ld1r6gr3]
  >> command completed: exit 0, ran in < 1s
== COMPLETED: Installation ended successfully (took 2 mins 53 secs)

Test loading module:

$ module load h5py/3.11.0-foss-2023b
$ echo $EBROOTH5PY
/home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen4/software/h5py/3.11.0-foss-2023b

This is what we expect - this installation should be unaltered.

CUDA Compute Capability 10.0 with CUDA Toolkit 12.6.0 (incompatible)

Environment:

module load EESSI/2025.06
module load EESSI-extend/2025.06-easybuild

Build log:

# Use eb_hooks from feature branch:
eb --sourcepath=/home/casparl/.local/easybuild/sources --hooks eb_hooks.py --accept-eula-for=CUDA --cuda-compute-capabilities=10.0a CUDA-12.6.0.eb --rebuild

...
== Running pre-fetch hook...

WARNING: Requested a CUDA Compute Capability (['10.0a']) that is not supported by the CUDA toolkit version (12.6.0) used by this software. Switching to '--module-only --force' and injectiong an LmodError into the modulefile.

== Updated build option 'module-only' to 'True'
...
== Setting EESSI_IGNORE_CUDA_12_6_0_CC_10_0 to allow loading dependencies that otherwise throw an LmodError
== Running post-prepare hook...
== Unsetting EESSI_IGNORE_CUDA_12_6_0_CC_10_0
...
== Running pre-module hook...
== Setting EESSI_IGNORE_CUDA_12_6_0_CC_10_0 in initial environment
  >> generating module file @ /home/casparl/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/modules/all/CUDA/12.6.0.lua
== Running post-module hook...
== Restored original build option 'module_only' to False
== Restored original build option 'force' to False
== Removing EESSI_IGNORE_CUDA_12_6_0_CC_10_0 in initial environment
...
== COMPLETED: Installation ended successfully (took 18 secs)
== Results of the build can be found in the log file(s) /home/casparl/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/software/CUDA/12.6.0/easybuild/easybuild-CUDA-12.6.0-20260108.174859.log.bz2
== Running post-easyblock hook...

== Build succeeded for 1 out of 1 (total: 20 secs)
== Summary:
   * [SUCCESS] CUDA/12.6.0

Test loading module:

$ module load CUDA/12.6.0
Lmod has detected the following error:  EasyConfigs using CUDA 12.6.0 or older are not supported for (all) requested Compute Capabilities: ['10.0a'].

While processing the following module(s):
    Module fullname  Module Filename
    ---------------  ---------------
    CUDA/12.6.0      /home/casparl/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/modules/all/CUDA/12.6.0.lua

This is what we expect.

CUDA Compute Capability 9.0 with CUDA Toolkit 12.6.0 (compatible)

Environment:

module load EESSI/2025.06
module load EESSI-extend/2025.06-easybuild

Build log:

# Use eb_hooks from feature branch:
eb --sourcepath=/home/casparl/.local/easybuild/sources --hooks eb_hooks.py --accept-eula-for=CUDA --cuda-compute-capabilities=9.0 CUDA-12.6.0.eb --rebuild

...
== COMPLETED: Installation ended successfully (took 2 mins 52 secs)
== Results of the build can be found in the log file(s) /home/casparl/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/software/CUDA/12.6.0/easybuild/easybuild-CUDA-12.6.0-20260108.175337.log.bz2
== Running post-easyblock hook...

== Build succeeded for 1 out of 1 (total: 2 mins 54 secs)
== Summary:
   * [SUCCESS] CUDA/12.6.0

Test loading module:

$ module load CUDA/12.6.0
$ echo $EBROOTCUDA
/home/casparl/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/software/CUDA/12.6.0

This is, again, what we expect, since there is support for CC9.0 in CUDA 12.6.0

casparvl · 2026-01-05T10:34:27Z

I think we can make this a little more powerful, by defining a lookup-table that, for a given CUDA Compute Capability, returns the CUDA version in which it was first supported, and the CUDA version in which it was last supported (or "99.9.9" or something, if it is still supported). Then, we do a semantic version comparison to figure out if we are in that range. If not, we add an informative error message to the module, and generate with --module-only.
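
A rough sketch of the lookup-table idea from this comment is shown below. The table entries are illustrative values that have not been verified against the toolkits, and the names `CUDA_CC_SUPPORT`, `parse_version` and `is_cc_in_supported_range` are made up for this example. `None` is used instead of a sentinel like "99.9.9" to mean "still supported". Note that the table that ended up in eb_hooks.py is organized differently (supported compute capabilities per CUDA toolkit version).

```python
from typing import Dict, Optional, Tuple

# Illustrative entries only: compute capability -> (first toolkit version that
# supports it, last toolkit version that supports it or None if still supported)
CUDA_CC_SUPPORT: Dict[str, Tuple[str, Optional[str]]] = {
    "3.5": ("5.0.0", "11.8.0"),
    "9.0": ("11.8.0", None),
    "10.0": ("12.8.0", None),
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Convert '12.6.0' into (12, 6, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_cc_in_supported_range(cuda_cc: str, toolkit_version: str) -> bool:
    """Check if toolkit_version lies within the supported range for cuda_cc."""
    if cuda_cc not in CUDA_CC_SUPPORT:
        # Unknown compute capability: treat it as unsupported
        return False
    first, last = CUDA_CC_SUPPORT[cuda_cc]
    version = parse_version(toolkit_version)
    if version < parse_version(first):
        return False
    return last is None or version <= parse_version(last)
```

Comparing tuples of integers avoids the pitfall of plain string comparison, where "12.10.0" would sort before "12.6.0".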

…ted configurations more generic. Then, also apply this to unsupported combinations of CUDA toolkit versions and requested CUDA compute capabilities. TODO: actually implement a function that checks this compatibility

casparvl · 2026-01-08T17:04:14Z

eb_hooks.py

+# Supported compute capabilities by CUDA toolkit version
+# Obtained by installing all CUDAs from 12.0.0 to 13.1.0, then using:
+
+# #!/bin/bash


I think it's worth leaving this here as a breadcrumb to future contributors, since we'll have to update this list occasionally and doing it manually is silly - especially if you want to add compatibility for a range of toolkit versions

casparvl · 2026-01-08T17:06:04Z

eb_hooks.py

+    # Clean cuda_cc of any suffixes like the 'a' in '9.0a'
+    # The regex expects one or more digits, a dot, one or more digits, and then optionally any number of letters
+    # It strips the suffix by returning only the first capture group (the digits and dot)
+    cuda_cc = re.sub(r'^(\d+\.\d+)[a-zA-Z]*$', r'\1', cuda_cc)


The lookup table contains CCs in the format of 90, 100, etc., so no periods and no suffixes. The CUDA compute capabilities passed to EasyBuild always contain periods and can contain suffixes. So to compare, we need to strip the suffix from EB's CUDA CC and remove the period.
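
As a small self-contained illustration of that normalization, the sketch below wraps the regex from the diff in a helper. The function name `normalize_cuda_cc` is made up for this example and does not appear in eb_hooks.py.

```python
import re


def normalize_cuda_cc(cuda_cc: str) -> str:
    """Turn an EasyBuild-style compute capability such as '9.0a' into '90'."""
    # Keep only the '<major>.<minor>' part, dropping any trailing letters
    stripped = re.sub(r'^(\d+\.\d+)[a-zA-Z]*$', r'\1', cuda_cc)
    # The lookup table uses keys without a period, e.g. '90' or '100'
    return stripped.replace('.', '')


normalized = [normalize_cuda_cc(cc) for cc in ('9.0a', '10.0a', '12.0')]
```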

casparvl · 2026-01-08T17:07:54Z

eb_hooks.py

-    # Always trigger this one, regardless of ec.name
-    cpu_target = get_eessi_envvar('EESSI_SOFTWARE_SUBDIR')
-    if cpu_target == CPU_TARGET_ZEN4:
-        parse_hook_zen4_module_only(ec, eprefix)


This is now handled in the pre_module_hook_unsupported_modules.

casparvl · 2026-01-08T17:08:30Z

eb_hooks.py

        print_msg(msg % (new_parallel, curr_parallel, session_parallel, self.name, cpu_target), log=self.log)


+def pre_prepare_hook_unsupported_modules(self, *args, **kwargs):


Replaces the specific pre_prepare_hook_ignore_zen4_gcccore1220_error we had before.

casparvl · 2026-01-08T17:08:55Z

eb_hooks.py

+        os.environ[unsup_mod.envvar] = "1"
+
+
+def post_prepare_hook_unsupported_modules(self, *args, **kwargs):


Replaces the post_prepare_hook_ignore_zen4_gcccore1220_error hook we had before

casparvl · 2026-01-08T17:09:36Z

eb_hooks.py

-    if cpu_target == CPU_TARGET_ZEN4:
-        pre_prepare_hook_ignore_zen4_gcccore1220_error(self, *args, **kwargs)
+    # Always trigger this, regardless of ec.name
+    pre_prepare_hook_unsupported_modules(self, *args, **kwargs)


Run the new hook instead of the old one. All the logic to check if something is an unsupported module is now contained within is_unsupported_module, so no more use for checking the cpu_target.

casparvl · 2026-01-08T17:10:00Z

eb_hooks.py

-    if cpu_target == CPU_TARGET_ZEN4:
-        post_prepare_hook_ignore_zen4_gcccore1220_error(self, *args, **kwargs)
+    # Always trigger this, regardless of ec.name
+    post_prepare_hook_unsupported_modules(self, *args, **kwargs)


Run the new hook instead of the old one. All the logic to check if something is an unsupported module is now contained within is_unsupported_module, so no more use for checking the cpu_target.

casparvl · 2026-01-08T17:10:40Z

eb_hooks.py

            print_msg("Changed toolchainopts for %s: %s", ec.name, ec['toolchainopts'])


-def parse_hook_zen4_module_only(ec, eprefix):


Adding the LmodError to the modluafooter is now done in the generic pre_module_hook_unsupported_modules hook
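
For readers unfamiliar with the footer approach, the sketch below shows one way the Lua snippet for the modluafooter could be generated from an environment variable name and an error message. The helper name `build_lmod_error_footer` and the escaping rules are assumptions for this example; the actual hook may build the string differently.

```python
def build_lmod_error_footer(envvar: str, errmsg: str) -> str:
    """Return a Lua snippet that raises an LmodError unless envvar is set."""
    # Escape backslashes, double quotes and newlines so the message remains
    # a valid single-line Lua string literal
    escaped = (errmsg.replace('\\', '\\\\')
                     .replace('"', '\\"')
                     .replace('\n', '\\n'))
    return 'if (not os.getenv("%s")) then LmodError("%s") end' % (envvar, escaped)


footer = build_lmod_error_footer(
    "EESSI_IGNORE_CUDA_12_6_0_CC_10_0",
    'Not "supported".\nSee docs.',
)
```

The resulting string is what would be assigned to the easyconfig's modluafooter, so that loading the module fails with the message unless the environment variable is set.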

casparvl · 2026-01-08T17:12:42Z

eb_hooks.py



-def is_unsupported_module(ec):
+class UnsupportedModule(NamedTuple):


Add a named tuple so that we can access the environment variable name and error message through clearly named attributes. That's less error-prone than a regular tuple, where you'd have to remember what is stored in the first element and what is stored in the second.

casparvl · 2026-01-08T17:14:59Z

eb_hooks.py


-    if cpu_target == CPU_TARGET_ZEN4 and is_gcccore_1220_based(ecname=ec.name, ecversion=ec.version, tcname=ec.toolchain.name, tcversion=ec.toolchain.version):
-        return EESSI_IGNORE_ZEN4_GCC1220_ENVVAR
+    # If this function was already called by an earlier hook, evaluation of whether this is an unsupported module was


At this point in time, the is_unsupported_module function is called 6 or 7 times per installation. Since it may become quite lengthy if we keep adding cases for unsupported modules, we want an early return in case it has already been evaluated. We can easily do that by checking if either EESSI_SUPPORTED_MODULE_ATTR or EESSI_UNSUPPORTED_MODULE_ATTR has been set.

If neither has been set, this is the first time we are evaluating this function and we should go through the full logic.
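
The early-return pattern can be sketched as follows. This is a simplified illustration: the attribute values are hypothetical, and the `evaluate` callback is an assumption added only to keep the example self-contained (the real function contains the cases inline and takes just `self`).

```python
from typing import Callable, Optional

# Hypothetical attribute names; the actual values in eb_hooks.py may differ
EESSI_SUPPORTED_MODULE_ATTR = "eessi_supported_module"
EESSI_UNSUPPORTED_MODULE_ATTR = "eessi_unsupported_module"


def is_unsupported_module(self, evaluate: Callable[[object], Optional[object]]) -> bool:
    """Return True if this module is unsupported, evaluating the cases only once."""
    # Early return: an earlier hook already evaluated this easyblock
    if hasattr(self, EESSI_SUPPORTED_MODULE_ATTR):
        return False
    elif hasattr(self, EESSI_UNSUPPORTED_MODULE_ATTR):
        return True

    # First call: run the (potentially lengthy) checks and cache the outcome
    result = evaluate(self)
    if result is None:
        setattr(self, EESSI_SUPPORTED_MODULE_ATTR, True)
        return False
    setattr(self, EESSI_UNSUPPORTED_MODULE_ATTR, result)
    return True


class FakeEasyBlock:
    """Stand-in for an EasyBuild easyblock instance, for illustration only."""


calls = []


def evaluate_cases(eb):
    """Pretend check that records each invocation and finds nothing unsupported."""
    calls.append(eb)
    return None


eb = FakeEasyBlock()
first = is_unsupported_module(eb, evaluate_cases)
second = is_unsupported_module(eb, evaluate_cases)
```

Caching on the instance means both outcomes are remembered, so a supported module does not pay for the full evaluation on every hook either.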

casparvl · 2026-01-08T17:21:32Z

eb_hooks.py

+    elif hasattr(self, EESSI_UNSUPPORTED_MODULE_ATTR):
+        return True
+
+    # Foss-2022b is not supported on Zen4


Next time we have unsupported modules, this function is the only one that needs changing: we simply add a case to it. A case typically has:

- Logic (if statements) to determine if this is an unsupported module
- A warning message printed to stdout to make it clear we're doing something out-of-the-ordinary in this installation
- A definition of the LmodError message that should be embedded in the modulefile
- A definition of the environment variable name that can be used to suppress the LmodError
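
As an illustration of the last two parts of such a case, the sketch below builds the environment variable name and LmodError message for the CUDA case, and replaces characters like periods and commas that are invalid in an environment variable name. The helper name `make_cuda_unsupported_module`, the `errmsg` field name and the way multiple compute capabilities are joined are assumptions; the output for a single capability was chosen to match the logs in the PR description.

```python
import re
from typing import List, NamedTuple


class UnsupportedModule(NamedTuple):
    envvar: str  # environment variable that suppresses the LmodError
    errmsg: str  # text of the LmodError (field name is an assumption)


def make_cuda_unsupported_module(cuda_version: str, cuda_ccs: List[str]) -> UnsupportedModule:
    """Build the env var name and error message for an unsupported CUDA CC."""
    # Drop suffixes such as the 'a' in '10.0a' before building the name
    stripped = [re.sub(r'^(\d+\.\d+)[a-zA-Z]*$', r'\1', cc) for cc in cuda_ccs]
    raw_name = "EESSI_IGNORE_CUDA_%s_CC_%s" % (cuda_version, "_".join(stripped))
    # Only letters, digits and underscores are safe in an env var name
    envvar = re.sub(r'[^A-Za-z0-9_]', '_', raw_name)
    errmsg = ("EasyConfigs using CUDA %s or older are not supported for (all) "
              "requested Compute Capabilities: %s." % (cuda_version, cuda_ccs))
    return UnsupportedModule(envvar=envvar, errmsg=errmsg)


unsup_mod = make_cuda_unsupported_module("12.6.0", ["10.0a"])
```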

casparvl · 2026-01-08T17:23:19Z

eb_hooks.py

-    ignore_lmoderror_envvar = is_unsupported_module(self)
-    if ignore_lmoderror_envvar:
+    if is_unsupported_module(self):
+        unsup_mod = getattr(self, EESSI_UNSUPPORTED_MODULE_ATTR)


Get the UnsupportedModule tuple so we can use it to set the environment variable that suppresses the LmodError.

casparvl · 2026-01-08T17:24:14Z

eb_hooks.py

-# Modules for dependencies are loaded in the prepare step. Thus, that's where we need this variable to be set
-# so that the modules can be succesfully loaded without printing the error (so that we can create a module
-# _with_ the warning for the current software being installed)
-def pre_prepare_hook_ignore_zen4_gcccore1220_error(self, *args, **kwargs):


Replaced by generic pre_prepare_hook_unsupported_modules

casparvl · 2026-01-08T17:24:24Z

eb_hooks.py

-        os.environ[EESSI_IGNORE_ZEN4_GCC1220_ENVVAR] = "1"
-
-
-def post_prepare_hook_ignore_zen4_gcccore1220_error(self, *args, **kwargs):


Replaced by generic post_prepare_hook_unsupported_modules

Add some initial changes to the hooks to make sure to install with --…

3c9618b

…module-only if this is CUDA-12.6 based but targets CC100 or CC120

casparvl marked this pull request as draft December 24, 2025 16:40

Caspar van Leeuwen and others added 6 commits January 7, 2026 18:11

Remove some variables that have become obsolete, and make sure get_cu…

b5fa942

…da_version actually returns 'None' if CUDA was not in the deps

Remove the now obsolete zen4 parse hook - we now inject the lmodfoote…

74351d4

…r in the pre-module hook

Remove zen4-specific pre and post prepare hooks, as these were replac…

2d2cdff

…ed by the generic X_prepare_hook_unsupported_modules

Remove the prepare_hooks that were specific to zen4, as they were rep…

e5f5cd2

…laced by generic hooks

Actually implement is_cuda_cc_supported_by_toolkit. Also, make sure e…

0d40193

…nvironment variables don't contain invalid characters like commas and periods. Add some warning messages if installing a module that's unsupported.

casparvl changed the title ~~Use module-only for Cuda 12.6 and CC100 or CC120~~ Use module-only when a CUDA Compute Capability is requested that is incompatible with the CUDA toolkit version used Jan 8, 2026

casparvl marked this pull request as ready for review January 8, 2026 17:02

Move import to the top

5a2256b

Fix description for 'is_supported_module' as it no longer returns an …

0d745e7

…environment name

