@casparvl casparvl commented Dec 24, 2025

Add some initial changes to the hooks to make sure to install with --module-only if this is CUDA-12.6 based but targets CC100 or CC120.

This still needs to be completed. Also, I could potentially make it more clever and match anything that's >=CC100.

Edit 08-01: This PR now does two things.

  1. Make the way in which unsupported modules are handled more generic. With that, if we ever need to add new cases of unsupported modules, the only thing we need to do is add the new case to is_unsupported_module
  2. Use that generic approach to check for compatibility between CUDA Compute Capability and CUDA toolkit version - and treat it as an unsupported module if it's not compatible

To achieve this, I first made the current mechanism that was there for handling the Zen4+foss-2022b incompatibility more generic.

  • Replace specific hooks with more generic ones. The replaced hooks are parse_hook_zen4_module_only (added the LmodError in a modluafooter), pre_prepare_hook_ignore_zen4_gcccore1220_error (set an environment variable to suppress the LmodError when building other software on top, so that you can create modules that depend on the unsupported module) and post_prepare_hook_ignore_zen4_gcccore1220_error (unset that environment variable)
  • Create a NamedTuple to hold two pieces of information:
    • The name of the environment variable to suppress the associated LmodError
    • The text for the LmodError
  • Set that NamedTuple as an attribute
  • Make other hooks use that attribute to set the relevant environment variable & error message
  • Move setting the luafooter to the pre_module_hook, since some information relevant to determining if a module is unsupported may not be available as early as the parse_hook (such as the requested CUDA compute capability)
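
As a rough illustration of the mechanism described in the list above, the following is a minimal sketch and not the code from this PR. The helper names (build_lmoderror_footer, suppress_lmoderror, restore_lmoderror) are hypothetical; only the environment variable name and the general idea of a Lua footer that calls LmodError unless a variable is set come from the description and logs.

```python
import os


def build_lmoderror_footer(envvar: str, errmsg: str) -> str:
    """Return a Lua snippet that raises an LmodError unless `envvar` is set."""
    # Escape backslashes and double quotes so the message stays a valid Lua string
    safe_msg = errmsg.replace('\\', '\\\\').replace('"', '\\"')
    return 'if (not os.getenv("%s")) then LmodError("%s") end' % (envvar, safe_msg)


def suppress_lmoderror(envvar: str) -> None:
    """Set the variable before dependencies are loaded (pre-prepare step)."""
    os.environ[envvar] = "1"


def restore_lmoderror(envvar: str) -> None:
    """Unset the variable again once dependencies are loaded (post-prepare step)."""
    os.environ.pop(envvar, None)


if __name__ == "__main__":
    envvar = "EESSI_IGNORE_LMOD_ERROR_ZEN4_GCC1220"
    msg = ("EasyConfigs using toolchains based on GCCcore-12.2.0 are not "
           "supported for the Zen4 architecture.")
    print(build_lmoderror_footer(envvar, msg))
```

The footer string would be appended to the generated modulefile, while the two environment helpers bracket the step in which dependency modules are loaded.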

Then, I

  • implemented logic to check for the CUDA compatibility
  • implemented a message & env-var for the CUDA case in is_unsupported_module.

I've left some reviewer-comments in the changed files to make it easier for anyone reviewing this to see why things were changed.

I then ran tests to validate that a) it still works for zen4+foss-2022b and b) it does what it should for the CUDA CC vs. toolkit version (in)compatibility. The results are summarized below.

zen4+foss-2022b test

Environment:

module load EESSI/2023.06
module load EESSI-extend/2023.06-easybuild

Build log:

# Use eb_hooks from feature branch:
eb --hooks eb_hooks.py h5py-3.8.0-foss-2022b.eb --rebuild

== Temporary log file in case of crash /scratch-local/casparl.18163544/eb-37mpwjnj/easybuild-cuu9nk5g.log
...
== processing EasyBuild easyconfig
/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/EasyBuild/5.2.0/easybuild/easyconfigs/h/h5py/h5py-3.8.0-foss-2022b.eb
== building and installing h5py/3.8.0-foss-2022b...
  >> installation prefix: /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen4/software/h5py/3.8.0-foss-2022b
== fetching files and verifying checksums...
== Running pre-fetch hook...

WARNING: EasyConfigs using toolchains based on GCCcore-12.2.0 are not supported on Zen4 architectures. Building with
'--module-only --force' and injecting an LmodError into the modulefile.

== Updated build option 'module-only' to 'True'
== Updated build option 'force' to 'True'
...
== Setting EESSI_IGNORE_LMOD_ERROR_ZEN4_GCC1220 to allow loading dependencies that otherwise throw an LmodError
  >> loading toolchain module: foss/2022b
  >> loading modules for build dependencies:
  >>  * pkgconfig/1.5.5-GCCcore-12.2.0-python
  >> loading modules for (runtime) dependencies:
  >>  * Python/3.10.8-GCCcore-12.2.0
  >>  * SciPy-bundle/2023.02-gfbf-2022b
  >>  * mpi4py/3.1.4-gompi-2022b
  >>  * HDF5/1.14.0-gompi-2022b
  >> defining build environment for foss/2022b toolchain
== Running post-prepare hook...
== Resetting rpath_override_dirs to original value: None
== Unsetting EESSI_IGNORE_LMOD_ERROR_ZEN4_GCC1220
...
== Running pre-module hook...
== Setting EESSI_IGNORE_LMOD_ERROR_ZEN4_GCC1220 in initial environment
  >> generating module file @
/home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen4/modules/all/h5py/3.8.0-foss-2022b.lua
== Running post-module hook...
== Restored original build option 'module_only' to False
== Restored original build option 'force' to False
== Removing EESSI_IGNORE_LMOD_ERROR_ZEN4_GCC1220 in initial environment
== ... (took 1 secs)
...
== Results of the build can be found in the log file(s)
/home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen4/software/h5py/3.8.0-foss-2022b/easybuild/easybuild-h5py-3.8.0-20260108.171940.log.bz2
== Running post-easyblock hook...

== Build succeeded for 1 out of 1 (total: 5 secs)
== Summary:
   * [SUCCESS] h5py/3.8.0-foss-2022b
== Temporary log file(s) /scratch-local/casparl.18163544/eb-37mpwjnj/easybuild-cuu9nk5g.log* have been removed.
== Temporary directory /scratch-local/casparl.18163544/eb-37mpwjnj has been removed.

Test loading module:

module load h5py/3.8.0-foss-2022b

Lmod has detected the following error:  EasyConfigs using toolchains based on GCCcore-12.2.0 are not supported for
the Zen4 architecture.
See
https://www.eessi.io/docs/known_issues/eessi-2023.06/#gcc-1220-and-foss-2022b-based-modules-cannot-be-loaded-on-zen4-architecture
While processing the following module(s):
    Module fullname        Module Filename
    ---------------        ---------------
    GCCcore/12.2.0         /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/modules/all/GCCcore/12.2.0.lua
    GCC/12.2.0             /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/modules/all/GCC/12.2.0.lua
    foss/2022b             /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/modules/all/foss/2022b.lua
    h5py/3.8.0-foss-2022b  /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen4/modules/all/h5py/3.8.0-foss-2022b.lua

This is indeed the LmodError we expected

zen4+foss-2023b test

Environment:

module load EESSI/2023.06
module load EESSI-extend/2023.06-easybuild

Build log:

# Use eb_hooks from feature branch:
eb --hooks eb_hooks.py h5py-3.11.0-foss-2023b.eb --rebuild

...
== ... (took < 1 sec)
  >> running shell command:
        bzip2 /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen4/software/h5py/3.11.0-foss-2023b/easybuild/easybuild-h5py-3.11.0-20260108.174439.log
        [started at: 2026-01-08 17:44:39]
        [working dir: /gpfs/home4/casparl/EESSI/software-layer-scripts]
        [output and state saved to /scratch-local/casparl.18163544/eb-_pcfey5c/run-shell-cmd-output/bzip2-ld1r6gr3]
  >> command completed: exit 0, ran in < 1s
== COMPLETED: Installation ended successfully (took 2 mins 53 secs)

Test loading module:

$ module load h5py/3.11.0-foss-2023b
$ echo $EBROOTH5PY
/home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen4/software/h5py/3.11.0-foss-2023b

This is what we expect - this installation should be unaltered.

CUDA Compute Capability 10.0 with CUDA Toolkit 12.6.0 (incompatible)

Environment:

module load EESSI/2025.06
module load EESSI-extend/2025.06-easybuild

Build log:

# Use eb_hooks from feature branch:
eb --sourcepath=/home/casparl/.local/easybuild/sources --hooks eb_hooks.py --accept-eula-for=CUDA --cuda-compute-capabilities=10.0a CUDA-12.6.0.eb --rebuild

...
== Running pre-fetch hook...

WARNING: Requested a CUDA Compute Capability (['10.0a']) that is not supported by the CUDA toolkit version (12.6.0) used by this software. Switching to '--module-only --force' and injectiong an LmodError into the modulefile.

== Updated build option 'module-only' to 'True'
...
== Setting EESSI_IGNORE_CUDA_12_6_0_CC_10_0 to allow loading dependencies that otherwise throw an LmodError
== Running post-prepare hook...
== Unsetting EESSI_IGNORE_CUDA_12_6_0_CC_10_0
...
== Running pre-module hook...
== Setting EESSI_IGNORE_CUDA_12_6_0_CC_10_0 in initial environment
  >> generating module file @ /home/casparl/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/modules/all/CUDA/12.6.0.lua
== Running post-module hook...
== Restored original build option 'module_only' to False
== Restored original build option 'force' to False
== Removing EESSI_IGNORE_CUDA_12_6_0_CC_10_0 in initial environment
...
== COMPLETED: Installation ended successfully (took 18 secs)
== Results of the build can be found in the log file(s) /home/casparl/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/software/CUDA/12.6.0/easybuild/easybuild-CUDA-12.6.0-20260108.174859.log.bz2
== Running post-easyblock hook...

== Build succeeded for 1 out of 1 (total: 20 secs)
== Summary:
   * [SUCCESS] CUDA/12.6.0

Test loading module:

$ module load CUDA/12.6.0
Lmod has detected the following error:  EasyConfigs using CUDA 12.6.0 or older are not supported for (all) requested Compute Capabilities: ['10.0a'].

While processing the following module(s):
    Module fullname  Module Filename
    ---------------  ---------------
    CUDA/12.6.0      /home/casparl/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/modules/all/CUDA/12.6.0.lua

This is what we expect.

CUDA Compute Capability 9.0 with CUDA Toolkit 12.6.0 (compatible)

Environment:

module load EESSI/2025.06
module load EESSI-extend/2025.06-easybuild

Build log:

# Use eb_hooks from feature branch:
eb --sourcepath=/home/casparl/.local/easybuild/sources --hooks eb_hooks.py --accept-eula-for=CUDA --cuda-compute-capabilities=9.0 CUDA-12.6.0.eb --rebuild

...
== COMPLETED: Installation ended successfully (took 2 mins 52 secs)
== Results of the build can be found in the log file(s) /home/casparl/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/software/CUDA/12.6.0/easybuild/easybuild-CUDA-12.6.0-20260108.175337.log.bz2
== Running post-easyblock hook...

== Build succeeded for 1 out of 1 (total: 2 mins 54 secs)
== Summary:
   * [SUCCESS] CUDA/12.6.0

Test loading module:

$ module load CUDA/12.6.0
$ echo $EBROOTCUDA
/home/casparl/eessi/versions/2025.06/software/linux/x86_64/amd/zen2/software/CUDA/12.6.0

This is, again, what we expect, since there is support for CC9.0 in CUDA 12.6.0

…module-only if this is CUDA-12.6 based but targets CC100 or CC120
@casparvl casparvl marked this pull request as draft December 24, 2025 16:40

casparvl commented Jan 5, 2026

I think we can make this a little more powerful, by defining a lookup-table that, for a given CUDA Compute Capability, returns the CUDA version in which it was first supported, and the CUDA version in which it was last supported (or "99.9.9" or something, if it is still supported). Then, we do a semantic version comparison to figure out if we are in that range. If not, we add an informative error message to the module, and generate with --module-only.
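
A minimal sketch of the range-based idea proposed in this comment, under stated assumptions: the table name, the function names and the sentinel constant are hypothetical, and the version entries are illustrative placeholders that would need to be verified against the actual toolkits. The only facts taken from this PR are that CUDA 12.6.0 supports CC 9.0 and does not support CC 10.0.

```python
from typing import Dict, Tuple

# Sentinel for "still supported", as suggested in the comment
STILL_SUPPORTED = "99.9.9"

# Hypothetical lookup table: CC -> (first supporting toolkit, last supporting toolkit).
# The version numbers are placeholders for illustration, not verified data.
CC_SUPPORT_RANGE: Dict[str, Tuple[str, str]] = {
    "9.0": ("11.8.0", STILL_SUPPORTED),
    "10.0": ("12.8.0", STILL_SUPPORTED),
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '12.6.0' into (12, 6, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def cc_supported_by_toolkit(cuda_cc: str, toolkit_version: str) -> bool:
    """Check whether the toolkit version falls inside the support range of the CC."""
    first, last = CC_SUPPORT_RANGE[cuda_cc]
    return parse_version(first) <= parse_version(toolkit_version) <= parse_version(last)


if __name__ == "__main__":
    print(cc_supported_by_toolkit("10.0", "12.6.0"))
    print(cc_supported_by_toolkit("9.0", "12.6.0"))
```

Comparing tuples of integers rather than raw strings avoids the pitfall that "12.10.0" sorts before "12.8.0" lexicographically.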

Caspar van Leeuwen and others added 6 commits January 7, 2026 18:11
…ted configurations more generic. Then, also apply this to unsupported combinations of CUDA toolkit versions and requested CUDA compute capabilities. TODO: actually implement a function that checks this compatibility
…da_version actually returns 'None' if CUDA was not in the deps
…ed by the generic X_prepare_hook_unsupported_modules
…nvironment variables don't contain invalid characters like commas and periods. Add some warning messages if installing a module that's unsupported.
@casparvl casparvl changed the title Use module-only for Cuda 12.6 and CC100 or CC120 Use module-only when a CUDA Compute Capability is requested that is incompatible with the CUDA toolkit version used Jan 8, 2026
@casparvl casparvl marked this pull request as ready for review January 8, 2026 17:02
# Supported compute capabilities by CUDA toolkit version
# Obtained by installing all CUDAs from 12.0.0 to 13.1.0, then using:

# #!/bin/bash

@casparvl casparvl Jan 8, 2026

I think it's worth leaving this here as a breadcrumb to future contributors, since we'll have to update this list occasionally and doing it manually is silly - especially if you want to add compatibility for a range of toolkit versions

# Clean cuda_cc of any suffixes like the 'a' in '9.0a'
# The regex expects one or more digits, a dot, one or more digits, and then optionally any number of letters
# It strips the suffix by returning only the first capture group (the digits and dot)
cuda_cc = re.sub(r'^(\d+\.\d+)[a-zA-Z]*$', r'\1', cuda_cc)

@casparvl casparvl Jan 8, 2026

The lookup table contains CCs in the format 90, 100, etc., so without periods or suffixes. The CUDA compute capabilities passed to EasyBuild always contain periods and can contain suffixes. So to compare, we need to strip the suffix from EB's CUDA CC and remove the period.
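
The two normalization steps described here can be sketched as follows. The regex is the one shown in the diff; the wrapper function name normalize_cuda_cc is hypothetical.

```python
import re


def normalize_cuda_cc(cuda_cc: str) -> str:
    """Convert an EasyBuild-style CC such as '10.0a' to the table format '100'."""
    # Step 1: strip an optional letter suffix, keeping only '<digits>.<digits>'
    without_suffix = re.sub(r'^(\d+\.\d+)[a-zA-Z]*$', r'\1', cuda_cc)
    # Step 2: remove the period so the value matches lookup-table keys like '90'
    return without_suffix.replace('.', '')


if __name__ == "__main__":
    for cc in ("9.0", "10.0a", "12.0"):
        print(cc, "->", normalize_cuda_cc(cc))
```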

# Always trigger this one, regardless of ec.name
cpu_target = get_eessi_envvar('EESSI_SOFTWARE_SUBDIR')
if cpu_target == CPU_TARGET_ZEN4:
    parse_hook_zen4_module_only(ec, eprefix)

This is now handled in the pre_module_hook_unsupported_modules.

print_msg(msg % (new_parallel, curr_parallel, session_parallel, self.name, cpu_target), log=self.log)


def pre_prepare_hook_unsupported_modules(self, *args, **kwargs):

Replaces the specific pre_prepare_hook_ignore_zen4_gcccore1220_error we had before.

os.environ[unsup_mod.envvar] = "1"


def post_prepare_hook_unsupported_modules(self, *args, **kwargs):

Replaces the post_prepare_hook_ignore_zen4_gcccore1220_error hook we had before

if cpu_target == CPU_TARGET_ZEN4:
    pre_prepare_hook_ignore_zen4_gcccore1220_error(self, *args, **kwargs)
# Always trigger this, regardless of ec.name
pre_prepare_hook_unsupported_modules(self, *args, **kwargs)

@casparvl casparvl Jan 8, 2026

Run the new hook instead of the old one. All the logic to check if something is an unsupported module is now contained within is_unsupported_module, so no more use for checking the cpu_target.

if cpu_target == CPU_TARGET_ZEN4:
    post_prepare_hook_ignore_zen4_gcccore1220_error(self, *args, **kwargs)
# Always trigger this, regardless of ec.name
post_prepare_hook_unsupported_modules(self, *args, **kwargs)

Run the new hook instead of the old one. All the logic to check if something is an unsupported module is now contained within is_unsupported_module, so no more use for checking the cpu_target.

print_msg("Changed toolchainopts for %s: %s", ec.name, ec['toolchainopts'])


def parse_hook_zen4_module_only(ec, eprefix):

@casparvl casparvl Jan 8, 2026

Adding the LmodError to the modluafooter is now done in the generic pre_module_hook_unsupported_modules hook.



def is_unsupported_module(ec):
class UnsupportedModule(NamedTuple):

Add a named tuple so that we can access the environment variable name and the error message through clearly named attributes. That's less error-prone than a regular tuple, where you'd have to remember what is stored in the first element and what is stored in the second.
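
For illustration, a named tuple along these lines could look like the sketch below. The class name UnsupportedModule and the envvar field appear in the diff; the errmsg field name is an assumption.

```python
from typing import NamedTuple


class UnsupportedModule(NamedTuple):
    """Information attached to an installation that is flagged as unsupported."""
    envvar: str  # name of the environment variable that suppresses the LmodError
    errmsg: str  # text of the LmodError (field name is an assumption)


if __name__ == "__main__":
    unsup_mod = UnsupportedModule(
        envvar="EESSI_IGNORE_LMOD_ERROR_ZEN4_GCC1220",
        errmsg="EasyConfigs using toolchains based on GCCcore-12.2.0 are not "
               "supported for the Zen4 architecture.",
    )
    # Attribute access documents intent better than unsup_mod[0] / unsup_mod[1]
    print(unsup_mod.envvar)
    print(unsup_mod.errmsg)
```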


if cpu_target == CPU_TARGET_ZEN4 and is_gcccore_1220_based(ecname=ec.name, ecversion=ec.version, tcname=ec.toolchain.name, tcversion=ec.toolchain.version):
    return EESSI_IGNORE_ZEN4_GCC1220_ENVVAR
# If this function was already called by an earlier hook, evaluation of whether this is an unsupported module was

At this point, is_unsupported_module is called 6 or 7 times. Since it may become lengthy as we keep adding cases for unsupported modules, we want an early return in case it has already been evaluated. We can easily do that by checking whether either EESSI_SUPPORTED_MODULE_ATTR or EESSI_UNSUPPORTED_MODULE_ATTR has been set.

If neither has been set, this is the first time we are evaluating this function and we should go through the full logic.
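
The early-return pattern described above can be sketched roughly as follows. This is a simplified illustration and not the function from this PR: the two constant names come from the comment, but their string values, the FakeEasyBlock class and the placeholder check are hypothetical.

```python
# Constant names appear in the PR; the string values here are hypothetical
EESSI_SUPPORTED_MODULE_ATTR = "eessi_supported_module"
EESSI_UNSUPPORTED_MODULE_ATTR = "eessi_unsupported_module"


class FakeEasyBlock:
    """Hypothetical stand-in for the object that is passed to the hooks."""

    def __init__(self, toolchain_version):
        self.toolchain_version = toolchain_version
        self.full_evaluations = 0  # counts how often the full logic ran


def is_unsupported_module(self):
    # Early return: an earlier hook already evaluated this instance
    if hasattr(self, EESSI_SUPPORTED_MODULE_ATTR):
        return False
    elif hasattr(self, EESSI_UNSUPPORTED_MODULE_ATTR):
        return True

    # First call: run the full logic (a placeholder check stands in for it here)
    self.full_evaluations += 1
    if self.toolchain_version == "2022b":
        setattr(self, EESSI_UNSUPPORTED_MODULE_ATTR, "details of the unsupported case")
        return True
    setattr(self, EESSI_SUPPORTED_MODULE_ATTR, True)
    return False


if __name__ == "__main__":
    eb = FakeEasyBlock("2022b")
    results = [is_unsupported_module(eb) for _ in range(7)]
    print(results, "full evaluations:", eb.full_evaluations)
```

Recording the negative result as well matters: without the "supported" attribute, every supported installation would rerun the full logic on each call.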

elif hasattr(self, EESSI_UNSUPPORTED_MODULE_ATTR):
    return True

# Foss-2022b is not supported on Zen4

Next time we have unsupported modules, this function is the only one that needs changing: we simply add a case to it. A case typically has:

  • Logic (if statements) to determine if this is an unsupported module
  • Print a warning message to stdout to make it clear we're doing something out-of-the-ordinary in this installation
  • Define the LmodError message that should be embedded in the modulefile
  • Define the environment variable name that can be used to suppress the LmodError
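
To make the four ingredients of a case concrete, here is a hedged sketch of what a CUDA-style case might look like in isolation. The function names, the supported_ccs parameter and the exact message wording are assumptions; the resulting variable name for CUDA 12.6.0 and CC 10.0 matches the one shown in the build log of this PR.

```python
import re
from typing import List, Optional, Tuple


def sanitize_for_envvar(text: str) -> str:
    """Replace characters that are invalid in variable names (periods, commas, ...)."""
    return re.sub(r'[^A-Za-z0-9_]', '_', text)


def check_cuda_case(toolkit_version: str, requested_ccs: List[str],
                    supported_ccs: List[str]) -> Optional[Tuple[str, str]]:
    """Return (envvar, errmsg) if this is an unsupported combination, else None."""
    # 1. Logic to determine whether this is an unsupported module
    unsupported = [cc for cc in requested_ccs if cc not in supported_ccs]
    if not unsupported:
        return None

    # 2. Warning on stdout to signal that this installation is out of the ordinary
    print("WARNING: requested CUDA Compute Capabilities %s are not all supported "
          "by CUDA %s" % (requested_ccs, toolkit_version))

    # 3. LmodError message to embed in the modulefile
    errmsg = ("EasyConfigs using CUDA %s are not supported for (all) requested "
              "Compute Capabilities: %s." % (toolkit_version, requested_ccs))

    # 4. Environment variable name that suppresses the LmodError
    envvar = "EESSI_IGNORE_CUDA_%s_CC_%s" % (
        sanitize_for_envvar(toolkit_version),
        sanitize_for_envvar("_".join(requested_ccs)),
    )
    return envvar, errmsg


if __name__ == "__main__":
    print(check_cuda_case("12.6.0", ["10.0"], supported_ccs=["8.0", "9.0"]))
```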

ignore_lmoderror_envvar = is_unsupported_module(self)
if ignore_lmoderror_envvar:
if is_unsupported_module(self):
    unsup_mod = getattr(self, EESSI_UNSUPPORTED_MODULE_ATTR)

Get the UnsupportedModule tuple, so we can use it to set the environment variable that suppresses the LmodError.

# Modules for dependencies are loaded in the prepare step. Thus, that's where we need this variable to be set
# so that the modules can be successfully loaded without printing the error (so that we can create a module
# _with_ the warning for the current software being installed)
def pre_prepare_hook_ignore_zen4_gcccore1220_error(self, *args, **kwargs):

Replaced by generic pre_prepare_hook_unsupported_modules

os.environ[EESSI_IGNORE_ZEN4_GCC1220_ENVVAR] = "1"


def post_prepare_hook_ignore_zen4_gcccore1220_error(self, *args, **kwargs):

Replaced by generic post_prepare_hook_unsupported_modules
