38 commits
df1105f
initial example of two year data
rogerkuou Feb 27, 2026
fef01cf
update examples notebook
rogerkuou Feb 27, 2026
2b3ac47
add example training scritps
rogerkuou Feb 27, 2026
2bdf8b5
add example slurm file
rogerkuou Feb 27, 2026
1159fbc
update fig dir
rogerkuou Feb 27, 2026
3e2c4b4
add README
rogerkuou Feb 27, 2026
994d36b
Merge branch 'main' into 25_test_two_year_data
rogerkuou Feb 27, 2026
8cd1c8f
Apply suggestions from code review
rogerkuou Mar 18, 2026
fe8f024
fix conflicts
rogerkuou Mar 18, 2026
a3ba05d
separate training and inference
rogerkuou Mar 18, 2026
3c99673
update model exportation with checkpoint
rogerkuou Mar 18, 2026
2b4c7c5
add inference scripts
rogerkuou Mar 18, 2026
9ed2e00
use logging to replace print
rogerkuou Mar 18, 2026
1de1cfb
update example slurm scripts
rogerkuou Mar 18, 2026
efa17a1
force example notebook to be identical as main
rogerkuou Mar 26, 2026
25297dc
Apply suggestions from code review
rogerkuou Mar 26, 2026
6b04a27
revert changes in model file
rogerkuou Mar 26, 2026
ba0408f
remove inference script
rogerkuou Mar 26, 2026
74a180c
maintain the same config in example script as example notebook
rogerkuou Mar 26, 2026
eeb29ff
update the training loop
rogerkuou Mar 26, 2026
8481c08
update logger and log file
rogerkuou Mar 26, 2026
baba960
update the training script and slurm file
rogerkuou Mar 27, 2026
e89d4e4
document the efficiency calculation in README
rogerkuou Mar 27, 2026
3a5dbd6
add an example slurm log
rogerkuou Mar 27, 2026
4fd95d9
Merge branch 'main' into 25_test_two_year_data
rogerkuou Apr 20, 2026
fb94642
add docstring to datasets
rogerkuou Apr 20, 2026
17a8294
update example training script with train_monthly_model function
rogerkuou Apr 20, 2026
13714f9
update slurm
rogerkuou Apr 20, 2026
228c18a
update training script
rogerkuou Apr 20, 2026
8898cbf
add logger to training script
rogerkuou Apr 23, 2026
c799baf
enable printing in slurm logv files
rogerkuou Apr 23, 2026
92fbdeb
add constraints on the lattitude
rogerkuou Apr 23, 2026
3b9cae6
add slurm output file of a subset
rogerkuou Apr 23, 2026
7494f4a
add a full SLURM log file
rogerkuou Apr 23, 2026
a13472b
update readme
rogerkuou Apr 23, 2026
9cbd7e6
update slurm job time to 4hrs default
rogerkuou Apr 23, 2026
e9996e4
Update scripts/README.md
rogerkuou Apr 29, 2026
6bd6391
update longitude constraint
rogerkuou May 4, 2026
17 changes: 17 additions & 0 deletions climanet/dataset.py
@@ -20,6 +20,23 @@ def __init__(
        spatial_dims: Tuple[str, str] = ("lat", "lon"),
        patch_size: Tuple[int, int] = (16, 16),  # (lat, lon)
    ):
        """Initialize the dataset with daily and monthly data, land mask, and patching parameters.

        Parameters
        ----------
        daily_da : xr.DataArray
            Daily data array.
        monthly_da : xr.DataArray
            Monthly data array.
        land_mask : xr.DataArray, optional
            Land mask array, by default None
        time_dim : str, optional
            Name of the time dimension, by default "time"
        spatial_dims : Tuple[str, str], optional
            Names of the spatial dimensions, by default ("lat", "lon")
        patch_size : Tuple[int, int], optional
            Size of the patches, by default (16, 16)
        """
        self.spatial_dims = spatial_dims
        self.patch_size = patch_size
        self.daily_da = daily_da
84 changes: 84 additions & 0 deletions scripts/README.md
@@ -0,0 +1,84 @@
# Example of training a SpatioTemporalModel on HPC

## Folder structure

- example_training.py: example training script
- example.slurm: example SLURM script to execute the training script on a SLURM system
- eso4clima_24438134_subset.out: example SLURM job output file from a run on a subset of the global dataset. The subset has two years of data (2020-2021) and covers 30S to 30N and 30W to 30E.
- eso4clima_24449471_full.out: example SLURM job output file from a run on the full dataset, with two years of data (2020-2021) and almost global coverage (from 80S to 80N and from 179.99W to 179.99E). The training ran for only 1 hour before it was cut off by the SLURM time limit.
Member

Why was the job cancelled after 1 hour when the script sets `#SBATCH --time=04:00:00`?

Collaborator Author

The example SLURM job no longer matches the SLURM log file exactly. As you mentioned in the general comment, I set the limit to 1 hour because I don't think it makes sense to complete the full two-year training at this stage.

Member

I checked two months of data and it looks like the values at lon = 179.9 are NaN, which might have happened during data processing. I didn’t find any issues with lat > 80 though, since the target still has data there. I made issue #41

Collaborator Author

@rogerkuou rogerkuou Apr 29, 2026

I think you are right. See my comments in #41

Member

In the eso4clima_24449471_full.out file, why is example_subset.slurm used? See:

* Command          : /home/b/b383704/eso4clima/train_twoyears/
*                    example_subset.slurm

Collaborator Author

The full run is technically still a very large subset. I modified the parameters in the subset Python script, which is called by example_subset.slurm. When pushing the example scripts here, I renamed the SLURM and Python script files.

Member

@SarahAlidoost SarahAlidoost Apr 24, 2026

Looking at the eso4clima_24449471_full.out file, there is this warning: `UserWarning: Patch size (120, 120) does not evenly divide image dimensions (H=720, W=640). Uncovered pixels: 0 in height, 40 in width. Consider adjusting patch_size or image dimensions for full coverage.` I remember we discussed this offline. I opened #42.
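
The numbers in this warning can be reproduced with a simple remainder calculation. The sketch below assumes that the uncovered pixels are the remainder of each image dimension divided by the patch size, which matches the values reported in the log. `uncovered_pixels` is a hypothetical helper written for illustration, not a ClimaNet function.

```python
import math


def uncovered_pixels(image_size, patch_size):
    """Return the (height, width) pixels left over after tiling with whole patches.

    Assumption: leftovers are the plain remainder of each dimension.
    """
    height, width = image_size
    patch_h, patch_w = patch_size
    return height % patch_h, width % patch_w


# Values taken from the warning in eso4clima_24449471_full.out
leftover = uncovered_pixels((720, 640), (120, 120))
print(leftover)  # (0, 40)

# Largest square patch edge that divides both dimensions exactly
largest_edge = math.gcd(720, 640)
print(largest_edge)  # 80
```

Under the same assumption, any patch edge that divides both 720 and 640 (that is, any divisor of 80) would tile this grid without leftover pixels.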


## Execute training tasks on a SLURM system

1. Make a working directory

```sh
mkdir training
cd training
```

2. Clone this repo
```sh
git clone git@github.com:ESMValGroup/ClimaNet.git
```

3. Install uv for dependency management. See the [uv documentation](https://docs.astral.sh/uv/getting-started/installation/).

4. Create a venv and install Python dependencies using uv
```sh
cd ClimaNet
uv sync
```

A `.venv` directory will appear inside the `ClimaNet` directory.

5. Go back to the working directory and copy the Python script and SLURM script into it:

```sh
cd ..
cp ClimaNet/scripts/example* .
```

6. Configure `example.slurm`: in the `source ...` line, make sure the venv you just created is activated.
Note that the account is the ESO4CLIMA project account, which is shared by multiple users.

7. Configure `example_training.py`: make sure the paths to the input data and the land mask data are correct.

8. Execute the SLURM job
```sh
sbatch example.slurm
```

## Check the efficiency of resource usage

In the SLURM job output, you can find lines like these:

```
==== Slurm accounting summary 23743544 ====
JobID|NTasks|AveCPU|AveRSS|MaxRSS|MaxVMSize|TRESUsageInAve|TRESUsageInMax
23743544.extern|1|00:00:00|856K|3752K|641376K|cpu=00:00:00,energy=0,fs/disk=2332,mem=856K,pages=2,vmem=217160K|cpu=00:00:00,energy=0,fs/disk=2332,mem=3752K,pages=2,vmem=641376K
23743544.batch|1|04:21:01|11964K|4102096K|37743716K|cpu=04:21:01,energy=0,fs/disk=22293117907,mem=11964K,pages=19,vmem=356724K|cpu=04:21:01,energy=0,fs/disk=22293117907,mem=4102096K,pages=7711,vmem=37743716K
```

This summary gives some information about the resource usage at the end of the job.

To have a better understanding of the efficiency of resource usage, you can run the following command after the job is finished:

```sh
sacct -j <slurm_job_id> \
--format=JobID,JobName%30,Partition,AllocCPUS,Elapsed,TotalCPU,MaxRSS,State,ExitCode \
--parsable2 >> "eso4clima_<slurm_job_id>.out"
```

This appends the resource usage information to the SLURM job output file. After running it, you can find lines like these in the output file:

```
JobID|JobName|Partition|AllocCPUS|Elapsed|TotalCPU|MaxRSS|State|ExitCode
23743544|eso4clima|compute|256|00:02:44|04:21:01||COMPLETED|0:0
23743544.batch|batch||256|00:02:44|04:21:01|4102096K|COMPLETED|0:0
23743544.extern|extern||256|00:02:44|00:00.001|3752K|COMPLETED|0:0
```
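
Because `--parsable2` produces pipe-delimited rows with a header line, the appended table can be read with Python's standard `csv` module. The following is a minimal sketch, assuming the table has already been extracted from the output file as a string; `parse_sacct` is a hypothetical helper name.

```python
import csv
import io

# The sacct table from the example above, pasted as a string for illustration.
SACCT_OUTPUT = """\
JobID|JobName|Partition|AllocCPUS|Elapsed|TotalCPU|MaxRSS|State|ExitCode
23743544|eso4clima|compute|256|00:02:44|04:21:01||COMPLETED|0:0
23743544.batch|batch||256|00:02:44|04:21:01|4102096K|COMPLETED|0:0
23743544.extern|extern||256|00:02:44|00:00.001|3752K|COMPLETED|0:0
"""


def parse_sacct(text):
    """Parse pipe-delimited sacct output into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return list(reader)


rows = parse_sacct(SACCT_OUTPUT)
job = rows[0]  # the first row describes the job as a whole
print(job["AllocCPUS"], job["Elapsed"], job["TotalCPU"])
```

All values are returned as strings, so fields such as `AllocCPUS` still need to be converted before doing arithmetic with them.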

The efficiency of resource usage can be calculated as `TotalCPU / (AllocCPUS * Elapsed)`. In the example above, the total CPU time is `04:21:01` (15661 s), the number of allocated CPUs is `256`, and the elapsed time is `00:02:44` (164 s), so the efficiency is `15661 / (256 * 164) = 0.37`.
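
The same calculation can be scripted. This is a minimal sketch, assuming the durations are given in the `[DD-]HH:MM:SS` or `MM:SS.mmm` forms that appear in `sacct` output; `slurm_time_to_seconds` and `cpu_efficiency` are hypothetical helper names, not part of SLURM or ClimaNet.

```python
def slurm_time_to_seconds(text):
    """Convert a duration such as '04:21:01', '1-02:03:04' or '00:00.001' to seconds."""
    days = 0
    if "-" in text:
        # Optional leading day count, e.g. '1-02:03:04'
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [float(p) for p in text.split(":")]
    # Pad missing leading fields (MM:SS.mmm has no hour field)
    while len(parts) < 3:
        parts.insert(0, 0.0)
    hours, minutes, seconds = parts
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(total_cpu, alloc_cpus, elapsed):
    """Fraction of the allocated CPU time that was actually used."""
    return slurm_time_to_seconds(total_cpu) / (alloc_cpus * slurm_time_to_seconds(elapsed))


# Values from the sacct example above
efficiency = cpu_efficiency("04:21:01", 256, "00:02:44")
print(f"{efficiency:.2f}")  # 0.37
```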