I want to use the `resample_iterations_idx` functionality to bootstrap evaluation metrics of hindcasts. The challenge with huge datasets is the memory needed to store all of the iteration samples.

I started with the mean squared error skill score and tried to understand the memory consumption. The following small example script demonstrates the usage of the metric and its memory footprint:
```python
import numpy
import xskillscore as xskill
import xarray as xr
from memory_profiler import profile


@profile
def AMSEss():
    alpha = 0.10
    coordtime = "time"
    dimbs = "iteration"
    dsskill = xr.DataArray(data=15 + 2.1 * numpy.random.randn(100, 100, 150), dims=["x", "y", "time"])
    dsref = xr.DataArray(data=15 + 0.15 * numpy.random.randn(100, 100, 150), dims=["x", "y", "time"])
    dsproof = xr.DataArray(data=15 + 2.0 * numpy.random.randn(100, 100, 150), dims=["x", "y", "time"])

    # bootstrap samples of the anomalies: 500 resamples along "time"
    bsp1 = xskill.resample_iterations_idx(dsproof - dsref, iterations=500, dim=coordtime, replace=True)
    bsp2 = xskill.resample_iterations_idx(dsskill - dsref, iterations=500, dim=coordtime, replace=True)

    # ratio of mean squared errors per iteration
    p2divp1 = (numpy.square(bsp2)).mean(dim=coordtime) / \
              (numpy.square(bsp1)).mean(dim=coordtime)
    amsessf = xr.where((p2divp1 - 1.0) > 0, (-1.0) / p2divp1 + 1.0,
                       p2divp1 - 1.0)  # could be that NaNs are not preserved
    amsesslb = amsessf.quantile(alpha / 2.0, dim=dimbs)  # , keep_attrs=True)
    amsessub = amsessf.quantile(1.0 - alpha / 2.0, dim=dimbs)  # , keep_attrs=True)
    del bsp1, bsp2, amsessf

    # significance mask: +1 if lower bound > 0, -1 if upper bound < 0, else 0
    BSms = xr.where(amsesslb > 0, amsesslb * 0.0 + 1.0, amsesslb * 0.0)
    BSms = BSms.where(amsessub > 0, -1.0)
    BSms = BSms.where((1 - (amsesslb <= 0)) == (1 - (amsessub <= 0)), 0.0)
    BSms = BSms.where(amsesslb.notnull().data)
    BSms.to_netcdf("bsms_chunked.nc")


if __name__ == '__main__':
    AMSEss()
```
Running the script from the Linux console gives me the following:
```
Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     7    189.4 MiB    189.4 MiB           1   @profile
     8                                         def AMSEss():
     9    189.4 MiB      0.0 MiB           1       alpha=0.10
    10    189.4 MiB      0.0 MiB           1       coordtime="time"
    11    189.4 MiB      0.0 MiB           1       dimbs="iteration"
    12    201.0 MiB     11.6 MiB           1       dsskill = xr.DataArray(data=15 + 20 * numpy.random.randn(100, 100, 150),dims=["x", "y", "time"])
    13    212.5 MiB     11.4 MiB           1       dsref = xr.DataArray(data=15 + 15 * numpy.random.randn(100, 100, 150),dims=["x", "y", "time"])
    14    223.9 MiB     11.4 MiB           1       dsproof = xr.DataArray(data=15 + 10 * numpy.random.randn(100, 100, 150),dims=["x", "y", "time"])
    15
    16   5946.6 MiB   5722.7 MiB           1       bsp1 = xskill.resample_iterations_idx( dsproof - dsref, iterations=500, dim=coordtime,replace=True)
    17  11680.0 MiB   5733.4 MiB           1       bsp2 = xskill.resample_iterations_idx( dsskill - dsref, iterations=500, dim=coordtime,replace=True)
    18
    19  11718.2 MiB      0.0 MiB           2       p2divp1 = ( numpy.square( bsp2 ) ).mean(dim=coordtime) / \
    20  11756.3 MiB     38.1 MiB           1                 ( numpy.square( bsp1 ) ).mean(dim=coordtime)
    21  11756.4 MiB      0.0 MiB           2       amsessf = xr.where( (p2divp1 - 1.0)>0, (-1.0)/p2divp1 + 1.0,
    22  11794.5 MiB     38.1 MiB           1                 p2divp1 - 1.0 ) # could be that nan's are not preserved
    23  11756.4 MiB      0.0 MiB           1       amsesslb = amsessf.quantile(alpha/2.0,dim=dimbs) #, keep_attrs=True)
    24  11756.4 MiB      0.0 MiB           1       amsessub = amsessf.quantile(1.0-alpha/2.0,dim=dimbs) #, keep_attrs=True)
    25    274.2 MiB -11482.2 MiB           1       del bsp1, bsp2, amsessf
    26    274.2 MiB      0.0 MiB           1       BSms = xr.where( (amsesslb)>0, amsesslb*0.0 +1.0, amsesslb*0.0 )
    27    274.2 MiB      0.0 MiB           1       BSms = BSms.where( (amsessub)>0, -1.0 )
    28    274.2 MiB      0.0 MiB           1       BSms = BSms.where( (1-(amsesslb<=0)) == (1-(amsessub<=0)), 0.0 )
    29    274.2 MiB      0.0 MiB           1       BSms = BSms.where( amsesslb.notnull().data )
```
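As a quick sanity check, the ~5722 MiB increments for `bsp1` and `bsp2` are exactly what 500 float64 copies of a 100 × 100 × 150 field should occupy:

```python
# 500 iterations of a 100 x 100 x 150 float64 array, 8 bytes per value
nbytes = 500 * 100 * 100 * 150 * 8
print(nbytes / 1024**2)  # ~5722 MiB, matching the profiler increments above
```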
Thus, we need about 5.7 GiB to store each of the iteration samples `bsp1` and `bsp2`, which is consistent with the size of the arrays. However, climate datasets are often much larger, so I started working with dask arrays and changed the three lines that create the input fields:
```python
dsskill = xr.DataArray(data=15 + 2.1 * numpy.random.randn(100, 100, 150), dims=["x", "y", "time"]).chunk({'time': 10})
dsref = xr.DataArray(data=15 + 0.15 * numpy.random.randn(100, 100, 150), dims=["x", "y", "time"]).chunk({'time': 10})
dsproof = xr.DataArray(data=15 + 2.0 * numpy.random.randn(100, 100, 150), dims=["x", "y", "time"]).chunk({'time': 10})
```
Now the dask scheduler collects all operations lazily and performs the computation when the netCDF file is written. However, `resample_iterations_idx` seems to generate indices that refer to the full, unchunked time axis, while each dask block only sees its own chunk:
```
  File "/sw/spack-rhel6/miniforge3-4.9.2-3-Linux-x86_64-pwdbqi/lib/python3.8/site-packages/dask/optimization.py", line 963, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/sw/spack-rhel6/miniforge3-4.9.2-3-Linux-x86_64-pwdbqi/lib/python3.8/site-packages/dask/core.py", line 151, in get
    result = _execute_task(task, cache)
  File "/sw/spack-rhel6/miniforge3-4.9.2-3-Linux-x86_64-pwdbqi/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/home/zmaw/m221054/.local/lib/python3.8/site-packages/xskillscore/core/resampling.py", line 193, in select_bootstrap_indices_ufunc
    return np.moveaxis(x.squeeze()[idx.squeeze().transpose()], 0, -1)
IndexError: index 144 is out of bounds for axis 0 with size 10
```
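If I understand the traceback correctly, the resample indices are drawn over the full time axis but then applied to each dask block separately. A numpy-only sketch of that mechanism (the variable names here are mine, not xskillscore's):

```python
import numpy as np

n_time = 150
iterations = 500
x = np.random.randn(n_time)                               # data along the resampled dim
idx = np.random.randint(0, n_time, size=(n_time, iterations))
idx[0, 0] = n_time - 1                                    # guarantee a large index occurs

sample = x[idx]                                           # full axis: works, shape (150, 500)

# A block of size 10 indexed with the same global indices, which is
# effectively what happens per chunk, fails:
block = x[:10]
try:
    block[idx]
except IndexError as err:
    print(err)  # index 149 is out of bounds for axis 0 with size 10
```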
Is there a way to use the resampling functionality on dask arrays to save memory? It is not clear to me whether this is really parallelizable. As already commented by @dougiesquire in issue #221 (#221 (comment)), there is a problem with boolean/fancy indexing of numpy arrays that are backed by dask arrays.
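For reference, one workaround I am considering (not yet verified against the `resample_iterations_idx` internals): keep the resampled dimension in a single chunk and chunk over the spatial dimensions instead, so that every block sees the full time axis:

```python
import numpy
import xarray as xr

dsref = xr.DataArray(
    data=15 + 0.15 * numpy.random.randn(100, 100, 150),
    dims=["x", "y", "time"],
)
# Chunk over x/y only; 'time': -1 keeps the resampled axis in one chunk,
# so globally drawn resample indices stay in bounds within each block.
dsref_chunked = dsref.chunk({"x": 10, "y": 10, "time": -1})
print(dsref_chunked.chunks)  # time axis is a single chunk of 150
```

This limits per-block memory by the spatial chunking while avoiding out-of-bounds indices along the resampled dimension, though the 500-iteration dimension would still multiply each block's size.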