Ignore ROCm-LLVM on aarch64 #223

Open
zerefwayne wants to merge 3 commits into EESSI:main from zerefwayne:rocm-aarch
Conversation

@zerefwayne
Contributor

ROCm-LLVM 6.4.1 is not supported on the aarch64 family of CPUs.

See: EESSI/software-layer#1473 (comment)

Member

@ocaisa ocaisa left a comment

Why would we do this? If it is not supported at all for Arm then why create modules?

The EESSI module will not expose a MODULEPATH that does not exist. What I would suggest is that we instead print a message in the EESSI module when we see aarch64 and an AMD GPU.

@ocaisa
Member

ocaisa commented May 6, 2026

What I mean is, this shouldn't be a warning but an error.

@ocaisa
Member

ocaisa commented May 6, 2026

My assumption here is ROCm-LLVM is being installed under an accelerator path. If that is not the case then I think we should discuss that choice.

Comment thread eb_hooks.py
msg += "You can override this behaviour by setting the EESSI_OVERRIDE_ROCM_VERSION_CHECK environment variable."
print_warning(msg)
var=EESSI_IGNORE_AARCH64_ROCMLLVM641_ENVVAR
setattr(self, EESSI_UNSUPPORTED_MODULE_ATTR, UnsupportedModule(envvar=var, errmsg=errmsg))
Contributor

We should define a sensible errmsg before this line. Note that this is the errmsg that gets printed to the end user trying to load the module. Thus, it should make sense to such an end-user...
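
To make the request concrete, here is a minimal sketch of defining `errmsg` before the `setattr` call. The names `UnsupportedModule`, `EESSI_UNSUPPORTED_MODULE_ATTR` and `EESSI_IGNORE_AARCH64_ROCMLLVM641_ENVVAR` come from the diff above; their definitions below, the `build_errmsg` helper, the `FakeEasyBlock` class and the message wording are all assumptions made so the example runs on its own, not the actual contents of eb_hooks.py.

```python
from collections import namedtuple

# Stand-in for the type used in the diff (assumed shape: envvar + errmsg).
UnsupportedModule = namedtuple("UnsupportedModule", ["envvar", "errmsg"])

# Hypothetical values; the real constants live in eb_hooks.py.
EESSI_UNSUPPORTED_MODULE_ATTR = "eessi_unsupported_module"
EESSI_IGNORE_AARCH64_ROCMLLVM641_ENVVAR = "EESSI_IGNORE_AARCH64_ROCMLLVM641"


def build_errmsg(name, version, cpu_family):
    """Compose a message aimed at the end user who tries to load the module."""
    return (
        f"{name} {version} is not supported on {cpu_family} CPUs, "
        "so this module cannot be loaded on this system."
    )


class FakeEasyBlock:
    """Minimal stand-in for the object the hook receives as `self`."""
    name = "ROCm-LLVM"
    version = "6.4.1"


def mark_unsupported(self, cpu_family="aarch64"):
    # Define errmsg first, then attach it as in the diff under review.
    errmsg = build_errmsg(self.name, self.version, cpu_family)
    var = EESSI_IGNORE_AARCH64_ROCMLLVM641_ENVVAR
    setattr(self, EESSI_UNSUPPORTED_MODULE_ATTR, UnsupportedModule(envvar=var, errmsg=errmsg))
    return errmsg
```

The point of the helper is only that the message is written from the perspective of someone running `module load`, not of someone running the build.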

@zerefwayne zerefwayne requested a review from casparvl May 6, 2026 19:59
@casparvl
Contributor

casparvl commented May 6, 2026

> My assumption here is ROCm-LLVM is being installed under an accelerator path. If that is not the case then I think we should discuss that choice.

It is.

> What I mean is, this shouldn't be a warning but an error.

You get a warning at install time that this will trigger a --module-only install. The module that this will install will raise an LmodError when you try to load it. The is_unsupported_module function in eb_hooks.py provides standard functionality to put such modules in place. I.e. it triggers a --module-only installation, and appends an LmodError to the module with a message that is configurable. This makes it very easy to add other cases of 'unsupported modules'. Note that the original case for which we implemented something like this (though not in this generic form yet) was the fact that zen4 didn't support the GCCcore 12.2-based toolchains.
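
As a rough illustration of the two steps described above (a module-only install, plus an error appended to the generated module), here is a sketch. It is not the code in eb_hooks.py: `lua_error_footer` and `plan_unsupported_install` are hypothetical helpers, and the assumption that the error is skipped when the override environment variable is set is inferred from the `envvar` argument in the diff. `LmodError` and `os.getenv` are the standard Lmod and Lua calls.

```python
def lua_error_footer(envvar, errmsg):
    """Return a Lua snippet that raises LmodError unless `envvar` is set."""
    # Escape backslashes and double quotes so the message is a valid Lua string literal.
    escaped = errmsg.replace("\\", "\\\\").replace('"', '\\"')
    return f'if (not os.getenv("{envvar}")) then LmodError("{escaped}") end'


def plan_unsupported_install(envvar, errmsg):
    """Describe what an 'unsupported module' install would do in this sketch."""
    return {
        "module_only": True,  # skip the build, only generate the module file
        "module_footer": lua_error_footer(envvar, errmsg),
    }
```

Because both pieces are driven only by an environment variable name and a message, adding another unsupported case amounts to supplying a new pair of values.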

> Why would we do this? If it is not supported at all for Arm then why create modules?
>
> The EESSI module will not expose a MODULEPATH that does not exist. What I would suggest is that we instead print a message in the EESSI module when we see aarch64 and an AMD GPU.

This is another option. I'm ok with this as well. It depends a bit on your philosophy: if we want all module environments to look identical in terms of which modules are present, we should take the is_unsupported_module approach. The downside is that I'm not sure what we look at for our software overview / API (and maybe you can tell me more about this :)): if we look at the modules, then it would appear that these installations are present for AArch64+AMD GPU combinations, while actually they are not.

One other reason I'd have for opting for the is_unsupported_module approach is that it is more consistent: we did the same thing for GCCcore-12.2.0-based stuff not being supported on zen4. We could have added a generic warning in the EESSI module for that as well, and simply never built those modules for the zen4-target - but we decided to put modules in place that print errors. I'm not sure why the current situation would be different?

@ocaisa
Member

ocaisa commented May 6, 2026

This is not the same as the zen4 case: that was a CPU toolchain that didn't work on that CPU. In this case, this is an accelerator that will never be matched with a CPU. Just like A64FX+CUDA, there is no reason for those modules to exist, as they will never work.

@ocaisa
Member

ocaisa commented May 6, 2026

Think of it in terms of what usable modules are in the MODULEPATH. For the Zen4 case, it's a CPU and you have all the other toolchains that work just fine, so it made sense to fill the holes. For aarch64+ROCm there would be no usable modules at all, so there really isn't any point in going to the effort of creating that MODULEPATH (and the EESSI module handles that absence just fine)
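
The alternative argued for here can be sketched as a small decision function. This is an illustration in Python only: the actual EESSI module is not shown in this thread, and the function name, the `"amd"` vendor string, the directory argument and the message text are all hypothetical.

```python
import os
import tempfile


def accelerator_modulepath(cpu_family, gpu_vendor, candidate_dir):
    """Return (path_or_None, message_or_None) for an accelerator module tree."""
    if cpu_family == "aarch64" and gpu_vendor == "amd":
        # Combination that is never built: report it instead of exposing modules.
        return None, "ROCm-based software is not available for aarch64 CPUs with AMD GPUs."
    if not os.path.isdir(candidate_dir):
        # Nothing was built for this combination, so expose nothing.
        return None, None
    return candidate_dir, None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as existing:
        print(accelerator_modulepath("x86_64", "amd", existing))
        print(accelerator_modulepath("aarch64", "amd", existing))
```

Under this approach no per-module error stubs are needed, since the MODULEPATH entry is simply never added for the unsupported combination.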