I want to use the `resample_iterations_idx` functionality to bootstrap evaluation metrics of hindcasts. The challenge with huge datasets is the memory needed to store all of the iteration samples.

I started with the mean squared error skill score and tried to understand the memory consumption. The following small example script demonstrates the usage of the metric and its memory footprint:
```python
import numpy
import xskillscore as xskill
import xarray as xr
from memory_profiler import profile


@profile
def AMSEss():
    alpha = 0.10
    coordtime = "time"
    dimbs = "iteration"
    dsskill = xr.DataArray(data=15 + 2.1 * numpy.random.randn(100, 100, 150), dims=["x", "y", "time"])
    dsref = xr.DataArray(data=15 + 0.15 * numpy.random.randn(100, 100, 150), dims=["x", "y", "time"])
    dsproof = xr.DataArray(data=15 + 2.0 * numpy.random.randn(100, 100, 150), dims=["x", "y", "time"])

    # bootstrap samples of the anomalies: 500 resamples along "time"
    bsp1 = xskill.resample_iterations_idx(dsproof - dsref, iterations=500, dim=coordtime, replace=True)
    bsp2 = xskill.resample_iterations_idx(dsskill - dsref, iterations=500, dim=coordtime, replace=True)

    # ratio of mean squared errors per iteration
    p2divp1 = (numpy.square(bsp2)).mean(dim=coordtime) / \
              (numpy.square(bsp1)).mean(dim=coordtime)
    amsessf = xr.where((p2divp1 - 1.0) > 0, (-1.0) / p2divp1 + 1.0,
                       p2divp1 - 1.0)  # could be that NaNs are not preserved
    amsesslb = amsessf.quantile(alpha / 2.0, dim=dimbs)  # , keep_attrs=True)
    amsessub = amsessf.quantile(1.0 - alpha / 2.0, dim=dimbs)  # , keep_attrs=True)
    del bsp1, bsp2, amsessf

    # significance mask: +1 if lower bound > 0, -1 if upper bound < 0, else 0
    BSms = xr.where(amsesslb > 0, amsesslb * 0.0 + 1.0, amsesslb * 0.0)
    BSms = BSms.where(amsessub > 0, -1.0)
    BSms = BSms.where((1 - (amsesslb <= 0)) == (1 - (amsessub <= 0)), 0.0)
    BSms = BSms.where(amsesslb.notnull().data)
    BSms.to_netcdf("bsms_chunked.nc")


if __name__ == '__main__':
    AMSEss()
```
Running the script from the Linux console gives me the following:
```
Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     7    189.4 MiB    189.4 MiB           1   @profile
     8                                         def AMSEss():
     9    189.4 MiB      0.0 MiB           1       alpha=0.10
    10    189.4 MiB      0.0 MiB           1       coordtime="time"
    11    189.4 MiB      0.0 MiB           1       dimbs="iteration"
    12    201.0 MiB     11.6 MiB           1       dsskill = xr.DataArray(data=15 + 20 * numpy.random.randn(100, 100, 150),dims=["x", "y", "time"])
    13    212.5 MiB     11.4 MiB           1       dsref = xr.DataArray(data=15 + 15 * numpy.random.randn(100, 100, 150),dims=["x", "y", "time"])
    14    223.9 MiB     11.4 MiB           1       dsproof = xr.DataArray(data=15 + 10 * numpy.random.randn(100, 100, 150),dims=["x", "y", "time"])
    15
    16   5946.6 MiB   5722.7 MiB           1       bsp1 = xskill.resample_iterations_idx( dsproof - dsref, iterations=500, dim=coordtime,replace=True)
    17  11680.0 MiB   5733.4 MiB           1       bsp2 = xskill.resample_iterations_idx( dsskill - dsref, iterations=500, dim=coordtime,replace=True)
    18
    19  11718.2 MiB      0.0 MiB           2       p2divp1 = ( numpy.square( bsp2 ) ).mean(dim=coordtime) / \
    20  11756.3 MiB     38.1 MiB           1                 ( numpy.square( bsp1 ) ).mean(dim=coordtime)
    21  11756.4 MiB      0.0 MiB           2       amsessf = xr.where( (p2divp1 - 1.0)>0, (-1.0)/p2divp1 + 1.0,
    22  11794.5 MiB     38.1 MiB           1                 p2divp1 - 1.0 ) # could be that nan's are not preserved
    23  11756.4 MiB      0.0 MiB           1       amsesslb = amsessf.quantile(alpha/2.0,dim=dimbs) #, keep_attrs=True)
    24  11756.4 MiB      0.0 MiB           1       amsessub = amsessf.quantile(1.0-alpha/2.0,dim=dimbs) #, keep_attrs=True)
    25    274.2 MiB -11482.2 MiB           1       del bsp1, bsp2, amsessf
    26    274.2 MiB      0.0 MiB           1       BSms = xr.where( (amsesslb)>0, amsesslb*0.0 +1.0, amsesslb*0.0 )
    27    274.2 MiB      0.0 MiB           1       BSms = BSms.where( (amsessub)>0, -1.0 )
    28    274.2 MiB      0.0 MiB           1       BSms = BSms.where( (1-(amsesslb<=0)) == (1-(amsessub<=0)), 0.0 )
    29    274.2 MiB      0.0 MiB           1       BSms = BSms.where( amsesslb.notnull().data )
```
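As a quick sanity check, the ~5722 MiB increments for `bsp1` and `bsp2` are exactly what 500 float64 copies of a 100 × 100 × 150 field should occupy:

```python
# 500 iterations of a 100 x 100 x 150 float64 array, 8 bytes per value
nbytes = 500 * 100 * 100 * 150 * 8
print(nbytes / 1024**2)  # ~5722 MiB, matching the profiler increments above
```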
Thus, we need about 5.7 GiB to store each of the iteration samples `bsp1` and `bsp2`, which is consistent with the size of the arrays. However, climate datasets are often much larger, so I started working with dask arrays and changed the three lines that create the input fields:
```python
dsskill = xr.DataArray(data=15 + 2.1 * numpy.random.randn(100, 100, 150), dims=["x", "y", "time"]).chunk({'time': 10})
dsref = xr.DataArray(data=15 + 0.15 * numpy.random.randn(100, 100, 150), dims=["x", "y", "time"]).chunk({'time': 10})
dsproof = xr.DataArray(data=15 + 2.0 * numpy.random.randn(100, 100, 150), dims=["x", "y", "time"]).chunk({'time': 10})
```
Now the dask scheduler collects all operations lazily and performs the computation when the netCDF file is written. However, `resample_iterations_idx` seems to generate indices that refer to the full, unchunked time axis, while each dask block only sees its own chunk:
```
  File "/sw/spack-rhel6/miniforge3-4.9.2-3-Linux-x86_64-pwdbqi/lib/python3.8/site-packages/dask/optimization.py", line 963, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/sw/spack-rhel6/miniforge3-4.9.2-3-Linux-x86_64-pwdbqi/lib/python3.8/site-packages/dask/core.py", line 151, in get
    result = _execute_task(task, cache)
  File "/sw/spack-rhel6/miniforge3-4.9.2-3-Linux-x86_64-pwdbqi/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/home/zmaw/m221054/.local/lib/python3.8/site-packages/xskillscore/core/resampling.py", line 193, in select_bootstrap_indices_ufunc
    return np.moveaxis(x.squeeze()[idx.squeeze().transpose()], 0, -1)
IndexError: index 144 is out of bounds for axis 0 with size 10
```
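If I understand the traceback correctly, the resample indices are drawn over the full time axis but then applied to each dask block separately. A numpy-only sketch of that mechanism (the variable names here are mine, not xskillscore's):

```python
import numpy as np

n_time = 150
iterations = 500
x = np.random.randn(n_time)                               # data along the resampled dim
idx = np.random.randint(0, n_time, size=(n_time, iterations))
idx[0, 0] = n_time - 1                                    # guarantee a large index occurs

sample = x[idx]                                           # full axis: works, shape (150, 500)

# A block of size 10 indexed with the same global indices, which is
# effectively what happens per chunk, fails:
block = x[:10]
try:
    block[idx]
except IndexError as err:
    print(err)  # index 149 is out of bounds for axis 0 with size 10
```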
Is there a way to use the resampling functionality on dask arrays to save memory? It is not clear to me whether this is really parallelizable. As already commented by @dougiesquire in issue #221 (#221 (comment)), there is a problem with boolean/fancy indexing of numpy arrays that are backed by dask arrays.
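For reference, one workaround I am considering (not yet verified against the `resample_iterations_idx` internals): keep the resampled dimension in a single chunk and chunk over the spatial dimensions instead, so that every block sees the full time axis:

```python
import numpy
import xarray as xr

dsref = xr.DataArray(
    data=15 + 0.15 * numpy.random.randn(100, 100, 150),
    dims=["x", "y", "time"],
)
# Chunk over x/y only; 'time': -1 keeps the resampled axis in one chunk,
# so globally drawn resample indices stay in bounds within each block.
dsref_chunked = dsref.chunk({"x": 10, "y": 10, "time": -1})
print(dsref_chunked.chunks)  # time axis is a single chunk of 150
```

This limits per-block memory by the spatial chunking while avoiding out-of-bounds indices along the resampled dimension, though the 500-iteration dimension would still multiply each block's size.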