Allow different overrides for compat (i.e. CPU family) and modules (i.e. SOFTWARE_SUBDIR) #222

Open
casparvl wants to merge 1 commit into EESSI:main from casparvl:allow_cpu_family_override_and_respect_riscv_override

casparvl commented May 6, 2026

This facilitates e.g. checking the available RISC-V modules on an x86_64 host.

This is useful for e.g. running the test suite on an x86_64-based login node when targeting RISC-V-based batch nodes. The reason is that the test suite generates tests based on the available modules, so on the login node (where the ReFrame runtime runs), we want to see the RISC-V modules.

This change allows running

EESSI_CPU_FAMILY_OVERRIDE=x86_64 EESSI_VERSION_OVERRIDE=2025.06-001 EESSI_SOFTWARE_SUBDIR_OVERRIDE=riscv64/generic module load EESSI/2025.06

on an x86_64-based host to make the RISC-V module stack available:

$ module av

--------------------------------------------------------- /cvmfs/dev.eessi.io/riscv/versions/2025.06-001/software/linux/riscv64/generic/modules/all ---------------------------------------------------------
   Abseil/20250512.1-GCCcore-14.3.0              FFTW.MPI/3.3.10-gompi-2025a                         libidn2/2.3.8-GCCcore-14.3.0           (D)    Perl/5.38.0
...
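
As a rough illustration of how the two override values show up in the module directory printed above, here is a minimal sketch. The helper name `eessi_riscv_modulepath` is hypothetical, and the repository root is simply copied from the `module av` header. This is not how the EESSI module file itself computes the path.

```shell
# Hypothetical helper: rebuild the module directory shown in the `module av` header.
# Assumption: the layout is <repo root>/versions/<version>/software/linux/<subdir>/modules/all,
# as seen in the output above; the real logic lives in the EESSI Lua module file.
eessi_riscv_modulepath() {
    version="$1"   # e.g. the value passed as EESSI_VERSION_OVERRIDE
    subdir="$2"    # e.g. the value passed as EESSI_SOFTWARE_SUBDIR_OVERRIDE
    printf '%s\n' "/cvmfs/dev.eessi.io/riscv/versions/${version}/software/linux/${subdir}/modules/all"
}

eessi_riscv_modulepath 2025.06-001 riscv64/generic
```

With the values from the command above, this prints the same directory that `module av` lists.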

Admittedly, the usage is a bit restricted since this is really only useful if module stacks are NOT in sync and if the CPU family for the target of interest is different from the one on the host. Then again, it is an override: nothing changes in the default behavior, and you don't set it unless you have very good reasons to. It simply allows us extra flexibility when testing things.

The issue came up in EESSI/dev.eessi.io-riscv#34 (comment) when trying to set up periodic tests on the RISC-V cluster.

ocaisa commented May 6, 2026

Looks good, can you just confirm that things still work fine without the override set (just in case)?

casparvl commented May 7, 2026

Good point, let me double check :D

ocaisa commented May 7, 2026

Now that I think about it, we should add CI for this as we do for many of the other EESSI_* environment variables. Something similar to https://github.com/EESSI/software-layer-scripts/blob/main/.github/workflows/tests_eessi_module.yml#L276 (but for 2025.06 only). You'll need the setting

          cvmfs_repositories: software.eessi.io,dev.eessi.io

and a test would be something like

source /cvmfs/...
module purge  # Unload EESSI
EESSI_CPU_FAMILY_OVERRIDE=x86_64 EESSI_VERSION_OVERRIDE=2025.06-001 EESSI_SOFTWARE_SUBDIR_OVERRIDE=riscv64/generic module load EESSI/2025.06  # Reload with new setting
module av  # if that works we know our compat layer is fine
[[ "$MODULEPATH" == *"riscv"* ]]  # check we have a riscv MODULEPATH
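
The MODULEPATH check in that test could be factored into a small function so that it can be exercised without Lmod. This is only a sketch: the name `modulepath_contains` is hypothetical, and the sample paths are taken from outputs quoted in this conversation.

```shell
# Hypothetical helper: succeed if a colon-separated path list contains the given substring.
# Defaults to the current $MODULEPATH when no list is passed as second argument.
modulepath_contains() {
    needle="$1"
    paths="${2-${MODULEPATH-}}"
    case "${paths}" in
        *"${needle}"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Sample values copied from this PR (not a live environment)
riscv_path="/cvmfs/dev.eessi.io/riscv/versions/2025.06-001/software/linux/riscv64/generic/modules/all"
x86_path="/cvmfs/software.eessi.io/versions/2025.06/software/linux/x86_64/amd/zen2/modules/all"

modulepath_contains riscv "${riscv_path}" && echo "riscv MODULEPATH detected"
modulepath_contains riscv "${x86_path}" || echo "no riscv entry in the x86_64 MODULEPATH"
```

In a CI step one would call `modulepath_contains riscv` right after the `module load` with the overrides, so the step fails when the RISC-V stack is not on the MODULEPATH.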

casparvl commented May 7, 2026

Works

[casparl@int4 ~]$ cd EESSI/software-layer-scripts/init/modules/
[casparl@int4 modules]$ module unuse /sw/noarch/environment
[casparl@int4 modules]$ module use $PWD
[casparl@int4 modules]$ module load EESSI/2025.06
Module for EESSI/2025.06 loaded successfully
[casparl@int4 modules]$ which uname
/cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/usr/bin/uname
[casparl@int4 modules]$ echo $MODULEPATH
/cvmfs/software.eessi.io/host_injections/2025.06/software/linux/x86_64/amd/zen2/modules/all:/cvmfs/software.eessi.io/versions/2025.06/software/linux/x86_64/amd/zen2/modules/all:/home/casparl/EESSI/software-layer-scripts/init/modules

However, I do see this when I'm using the override and then unloading:

[casparl@int4 modules]$ EESSI_CPU_FAMILY_OVERRIDE=x86_64 EESSI_VERSION_OVERRIDE=2025.06-001 EESSI_SOFTWARE_SUBDIR_OVERRIDE=riscv64/generic module load EESSI/2025.06
This EESSI production version only provides a RISC-V compatibility layer,
software installations are provided by the EESSI development repository at /cvmfs/dev.eessi.io/riscv.

Module for EESSI/2025.06 loaded successfully
[casparl@int4 modules]$ module unload EESSI/2025.06
Lmod Warning:  Software directory check for the detected architecture failed
While processing the following module(s):
    Module fullname  Module Filename
    ---------------  ---------------
    EESSI/2025.06    /home/casparl/EESSI/software-layer-scripts/init/modules/EESSI/2025.06.lua

I'm not quite sure why. It seems to come from this line:

LmodError("Software directory check for the detected architecture failed")
but... that's an LmodError, and here it's printed as a warning - that seems odd. I also don't really understand what's different on unload here that this suddenly produces this error.

casparvl commented May 7, 2026

[casparl@tcn166 modules]$ EESSI_CPU_FAMILY_OVERRIDE=x86_64 EESSI_VERSION_OVERRIDE=2025.06-001 EESSI_SOFTWARE_SUBDIR_OVERRIDE=riscv64/generic module unload
[casparl@tcn166 modules]$

Works fine. Makes sense: without the overrides, it has no idea it should look in the riscv repo when it checks the existence of the directory.
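
Since load and unload both need the same overrides, one option is a small wrapper that sets them for a single command only. A minimal sketch, assuming bash; the function name `with_riscv_overrides` is hypothetical and not part of this PR:

```shell
# Hypothetical wrapper: run one command with the three overrides from this PR set,
# without exporting them into the rest of the session.
with_riscv_overrides() {
    EESSI_CPU_FAMILY_OVERRIDE=x86_64 \
    EESSI_VERSION_OVERRIDE=2025.06-001 \
    EESSI_SOFTWARE_SUBDIR_OVERRIDE=riscv64/generic \
    "$@"
}

# Intended usage (needs Lmod and the EESSI module, so commented out here):
#   with_riscv_overrides module load EESSI/2025.06
#   with_riscv_overrides module unload EESSI/2025.06

# Demonstration that the wrapped command sees the override
with_riscv_overrides sh -c 'echo "$EESSI_SOFTWARE_SUBDIR_OVERRIDE"'
```

This keeps the load and unload invocations symmetric, which avoids the failing directory check seen when unloading without the overrides.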

casparvl commented May 7, 2026

OK, so all is looking fine then. I'll build it; if you're OK with it as well, you can deploy, @ocaisa

casparvl commented May 7, 2026

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen2
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen2

eessi-bot-aws Bot commented May 7, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2023.06-software
Building on: amd-zen2
Building for: x86_64/amd/zen2
Job dir: /project/def-users/SHARED/jobs/2026.05/pr_222/155134

date job status comment
May 07 08:35:57 UTC 2026 submitted job id 155134 awaits release by job manager
May 07 08:36:42 UTC 2026 released job awaits launch by Slurm scheduler
May 07 08:46:16 UTC 2026 running job 155134 is running
May 07 08:49:01 UTC 2026 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-155134.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-17781435360.tar.zst
size: 0 MiB (4503 bytes)
entries: 1
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/software
no software packages in tarball
reprod directories under 2023.06/software/linux/x86_64/amd/zen2/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/init/modules/EESSI/2023.06.lua
May 07 08:49:01 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] ( 1/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @BotBuildTests:x86-64-zen2+default
P: perf: 446.683 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 2/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:x86-64-zen2+default
P: perf: 451.536 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 3/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /775175bf @BotBuildTests:x86-64-zen2+default
P: latency: 2.79 us (r:0, l:None, u:None)
[ OK ] ( 4/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /52707c40 @BotBuildTests:x86-64-zen2+default
P: latency: 2.6 us (r:0, l:None, u:None)
[ OK ] ( 5/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /b1aacda9 @BotBuildTests:x86-64-zen2+default
P: latency: 6.13 us (r:0, l:None, u:None)
[ OK ] ( 6/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /c6bad193 @BotBuildTests:x86-64-zen2+default
P: latency: 5.62 us (r:0, l:None, u:None)
[ OK ] ( 7/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /15cad6c4 @BotBuildTests:x86-64-zen2+default
P: latency: 0.87 us (r:0, l:None, u:None)
[ OK ] ( 8/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /6672deda @BotBuildTests:x86-64-zen2+default
P: latency: 0.81 us (r:0, l:None, u:None)
[ OK ] ( 9/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /2a9a47b1 @BotBuildTests:x86-64-zen2+default
P: bandwidth: 6430.43 MB/s (r:0, l:None, u:None)
[ OK ] (10/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /1b24ab8e @BotBuildTests:x86-64-zen2+default
P: bandwidth: 6408.88 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-155134.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

eessi-bot-aws Bot commented May 7, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: amd-zen2
Building for: x86_64/amd/zen2
Job dir: /project/def-users/SHARED/jobs/2026.05/pr_222/155135

date job status comment
May 07 08:36:01 UTC 2026 submitted job id 155135 awaits release by job manager
May 07 08:36:39 UTC 2026 released job awaits launch by Slurm scheduler
May 07 08:44:57 UTC 2026 running job 155135 is running
May 07 08:47:44 UTC 2026 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-155135.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-amd-zen2-17781434970.tar.zst
size: 0 MiB (4503 bytes)
entries: 1
modules under 2025.06/software/linux/x86_64/amd/zen2/modules/all
no module files in tarball
software under 2025.06/software/linux/x86_64/amd/zen2/software
no software packages in tarball
reprod directories under 2025.06/software/linux/x86_64/amd/zen2/reprod
no reprod directories in tarball
other under 2025.06/software/linux/x86_64/amd/zen2
2025.06/init/modules/EESSI/2025.06.lua
May 07 08:47:44 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/5) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/22Jul2025-foss-2024a-kokkos %scale=1_node /ade8cad7 @BotBuildTests:x86-64-zen2+default
P: perf: 442.369 timesteps/s (r:0, l:None, u:None)
[ OK ] (2/5) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:x86-64-zen2+default
P: latency: 1.31 us (r:0, l:None, u:None)
[ OK ] (3/5) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:x86-64-zen2+default
P: latency: 2.03 us (r:0, l:None, u:None)
[ OK ] (4/5) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:x86-64-zen2+default
P: latency: 0.2 us (r:0, l:None, u:None)
[ OK ] (5/5) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:x86-64-zen2+default
P: bandwidth: 7935.81 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 5/5 test case(s) from 5 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-155135.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
