
Dart lowres cmeps#657

Open
kdraeder wants to merge 3 commits into ESCOMP:main from kdraeder:DART_lowres_cmeps

@kdraeder
Contributor

This issue stems from cime #4933, which is about developing a large ensemble test
motivated by DART applications.

Because of the large ensemble, the testing will be more manageable
if it uses a coarse resolution grid. An ne3 grid is available for CAM and CTSM,
and now a ~10 degree resolution is available in MOM6 (MOM_interface #311).
These have been combined into a new CESM grid and used in ERI and MCC tests,
which also use a new testmod tailored to DART needs.
I'm open to suggestions for a shorter testmod name,
but @billsacks and I feel that it will be helpful to have DART in it.
This grid (especially the MOM6 grid) limits the tasks/instance to 12
(6 for MOM, 6 for the other components).

An MCC test for a small ensemble passes all test stages
(/glade/work/raeder/Exp/CESM+DART_testing/MCC_cG.ne3pg3_10deg.B_DART.lowres)
but ensembles which require more than 1 node mostly fail
with an error in cmeps/cesm/driver/ensemble_driver.F90.
This seems to arise from smaller ensembles fitting into a single (develop queue) node,
where the exact number of processors needed is assigned to them,
while larger ensembles need multiple (cpu/main) nodes
and more processors are assigned to the job than are requested.
For example, 40 instances request 12 x 40 = 480 processors.
Filling that request takes 4 nodes, so 4 x 128 = 512 processors are assigned.
This difference causes an error:
PetCount ( 512) - Async IOtasks ( 0) must be evenly divisable by number of members ( 40).
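
The arithmetic behind this error can be reproduced with a minimal Python sketch. It is not the CMEPS code; the function name `petcount_remainder` is hypothetical and only mirrors the modulo expression quoted in the description below, using the 12 tasks/instance and 128 processors/node figures given above.

```python
def petcount_remainder(pet_count, async_io_tasks, number_of_members):
    """Mirror of the quoted Fortran expression:
    modulo(PetCount - pio_asyncio_ntasks*number_of_members, number_of_members).
    A nonzero result is what triggers the error message."""
    return (pet_count - async_io_tasks * number_of_members) % number_of_members


tasks_per_instance = 12   # from the PR text (6 for MOM, 6 for the rest)
members = 40              # ensemble size in the example
cores_per_node = 128      # processors per node, as in "4 nodes x 128"

requested = tasks_per_instance * members       # 480
nodes = -(-requested // cores_per_node)        # ceiling division -> 4
assigned = nodes * cores_per_node              # 512

# Whole nodes are assigned, so PetCount is 512 rather than 480.
print(petcount_remainder(assigned, 0, members))   # 32, so the check fails
print(petcount_remainder(requested, 0, members))  # 0, so the check would pass
```

The 32 leftover processors are exactly the difference between the 512 assigned and the 480 requested.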

When the check for this error is removed, the job goes farther,
but hangs just before the time stepping in CAM. This can be prevented by choosing MAX_TASKS_PER_NODE in a way that prevents any instance from being laid out across 2 nodes.
The changes required to do this are beyond the scope of this PR,
and are handled in CESM #398.
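
One way to pick such a value is to take the largest multiple of the tasks per instance that fits on a node. The sketch below is an illustration under that assumption, not the procedure from CESM #398; the helper name `whole_instance_tasks_per_node` is hypothetical. With 12 tasks per instance and 128 processors per node it gives 120, which is consistent with the `MAX_TASKS120` suffix of the 40-instance test case listed under "Testing performed".

```python
def whole_instance_tasks_per_node(cores_per_node, tasks_per_instance):
    """Largest task count per node that holds only whole instances,
    so that no instance is split across two nodes (assumed criterion)."""
    if tasks_per_instance > cores_per_node:
        raise ValueError("one instance does not fit on a single node")
    instances_per_node = cores_per_node // tasks_per_instance
    return instances_per_node * tasks_per_instance


tasks_per_instance = 12
members = 40
cores_per_node = 128

max_tasks = whole_instance_tasks_per_node(cores_per_node, tasks_per_instance)  # 120
requested = tasks_per_instance * members                                       # 480
nodes = -(-requested // max_tasks)                                             # 4
tasks_in_use = nodes * max_tasks                                               # 480

print(max_tasks, nodes, tasks_in_use % members)  # 120 4 0
```

Under this assumption each node holds 10 complete instances, and the number of tasks in use is again a multiple of the ensemble size.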

Description of changes

Commenting out the consistency check between PetCount and number_of_members,
if(modulo(PetCount-pio_asyncio_ntasks*number_of_members, number_of_members) .ne. 0) then
allows the test to proceed.
I could not trace the variables back through ESMF to figure out an if-test
which would handle this situation, and developers I talked to weren't certain that it's essential,
so my temporary solution is to comment out the test, without removing it.

Specific notes

Contributors other than yourself, if any:
@billsacks @jedwards4b

CMEPS Issues Fixed (include github issue #): #461
This is also essential for issues in other components:
ESMCI/cime #4933 (overview issue)
CESM PR #398
ESMCI/ccs_config PR #285
NCAR/MOM6 #413

Are changes expected to change answers? (specify if bfb, different at roundoff, more substantial)
This is not expected to change answers in tests which ran successfully before this change.
Some tests which would not run before will now run. It's possible that some of those should not run,
but I have not looked into those.

Any User Interface Changes (namelist or namelist defaults changes)?
Users who want to run ERI or MCC tests with an ensemble which can fit some,
but not all, instances on 1 node, will need to include the test_mods developed in CESM #398
and follow the instructions for setting MAX_TASKS_PER_NODE.

Testing performed

Please describe the tests along with the target model and machine(s)
If possible, please also add hashes that were used in the testing.
Extensive testing (development) of ERI and MCC tests was conducted in a version of cesm3_0_alpha08d,
modified to enable the 10-degree MOM6 grid, using a BHIST compset, on derecho.
The relevant changes (multiple components) were imported to the cesm3_0_alpha09a tag
and tested in cases in /glade/work/raeder/Exp/CESM+DART_testing:

  1. 40 instance MCC; alpha09a_MCC_cG_C40.ne3pg3_10deg.B_DART.MAX_TASKS120
  2. 2 instance ERI; alpha09a_ERI_cG_C2_Ld8.ne3pg3_10deg.B_DART.aux_lowres

kdraeder and others added 3 commits March 27, 2026 16:12
Changes are needed in multiple components: cesm, cime, cmeps, mom/MOM6.
The branches are labeled with DART_lowres_{component}.

cesm/driver/ensemble_driver.F90
   Remove PETcount versus NINST test to let middle-sized tests work.
     if(modulo(PetCount-pio_asyncio_ntasks*number_of_members, &
        number_of_members) .ne. 0) then