The problem
MultiBandStacBackendArray declares IndexingSupport.BASIC, so xarray collapses any vectorized point selection into a bounding-box slice before calling _raw_getitem. If you do:
da.sel(x=xr.DataArray(xs, dims="points"), y=xr.DataArray(ys, dims="points"))
lazycogs receives a single slice spanning the bounding rectangle of all your points, opens every COG overlapping that rectangle, and reads every pixel inside it. For sparse points over a large area this is very wasteful.
The per-point workaround — iterating and calling .sel() once per coordinate — is safe (each call resolves to a 1×1 spatial window) but doesn't batch COG opens. If N points land in the same source tile, that tile is opened N times.
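The workaround reads roughly like this. `sample_per_point` is an illustrative helper, not lazycogs API; it assumes `da` is the lazily-backed DataArray:

```python
import numpy as np
import xarray as xr

def sample_per_point(da, xs, ys):
    """Per-point workaround: one .sel() call per coordinate.

    Safe (each call resolves to a 1x1 spatial window), but a COG is
    re-opened for every point that lands in the same source tile.
    """
    samples = [
        da.sel(x=x, y=y, method="nearest")  # one lazy read per point
        for x, y in zip(xs, ys)
    ]
    # Stack the scalar selections along a new "points" dimension.
    return xr.concat(samples, dim="points")
```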
This is a real gap for common workflows: training data extraction at labeled sites, validation against field measurements, time series at monitoring station locations.
Proposed solution
A lazycogs.sample() function (and/or a .lazycogs.sample() accessor method) that bypasses xarray's indexing machinery and batches reads by item footprint:
- Accept arrays of x/y coordinates in the array's CRS.
- Per time step: one DuckDB query using the bbox of all points combined.
- Per returned item: find which of the N points fall within its footprint, open the COG once, read the minimal window spanning those points, extract the specific pixel values.
- Apply the mosaic method across items at each point (same priority ordering as the regular mosaic).
Return a DataArray with a points dimension, shape (band, time, points).
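The steps above can be sketched as follows for a single time step. All names here are illustrative stand-ins, not lazycogs API: `items` plays the role of the DuckDB query result (in mosaic priority order), `read_fn` the batched COG window read, and the fill rule is the first-valid mosaic:

```python
import numpy as np

def sample_points(xs, ys, items, read_fn, n_band):
    """Illustrative batching skeleton: one read per item footprint.

    `items`: footprint dicts with a "bbox" key, in mosaic priority order.
    `read_fn(item, xs, ys)`: stand-in for the batched COG read; returns
    an array of shape (band, n_points_passed).
    """
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    out = np.full((n_band, len(xs)), np.nan)
    for item in items:
        xmin, ymin, xmax, ymax = item["bbox"]
        hit = (xs >= xmin) & (xs <= xmax) & (ys >= ymin) & (ys <= ymax)
        todo = hit & np.isnan(out).all(axis=0)  # only still-missing points
        idx = np.flatnonzero(todo)
        if idx.size == 0:
            continue
        out[:, idx] = read_fn(item, xs[idx], ys[idx])  # one read per item
        if not np.isnan(out).any():  # early exit once every point is filled
            break
    return out
```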
Implementation notes
The main piece of new logic is _points_in_item: given a STAC item and a set of pixel coordinates, determine which points fall within the item's spatial footprint. Item geometries are available in the parquet via DuckDB — no extra I/O needed.
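As a stand-in for `_points_in_item`, here is a plain vectorized ray-casting test against a geometry's exterior ring; real code might instead hand the geometry from the parquet to shapely:

```python
import numpy as np

def points_in_polygon(xs, ys, ring):
    """Vectorized ray-casting test: which (x, y) points fall inside the
    polygon whose exterior is `ring`, a list of (x, y) vertices.
    Illustrative stand-in for _points_in_item's geometry check.
    """
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    inside = np.zeros(xs.shape, dtype=bool)
    x1, y1 = ring[-1]
    for x2, y2 in ring:
        crosses = (y1 > ys) != (y2 > ys)  # edge straddles the scanline
        with np.errstate(divide="ignore", invalid="ignore"):
            x_at = (x2 - x1) * (ys - y1) / (y2 - y1) + x1
        inside ^= crosses & (xs < x_at)  # toggle parity on each crossing
        x1, y1 = x2, y2
    return inside
```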
The COG read per item would use async_geotiff's existing window read: compute the bounding pixel window for all matching points within that item, read it once, then index out the specific pixel values using local offsets. This reuses _read_item_bands and the warp cache logic with minimal changes.
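The local-offset indexing described above, in miniature (`read_window` is a hypothetical stand-in for async_geotiff's window read):

```python
import numpy as np

def extract_pixels(read_window, rows, cols):
    """Read the bounding window spanning all matching points once, then
    index out individual pixels with local offsets.

    `read_window(r0, r1, c0, c1)` stands in for the item's COG window
    read and returns an array of shape (band, r1 - r0, c1 - c0).
    """
    rows, cols = np.asarray(rows), np.asarray(cols)
    r0, r1 = rows.min(), rows.max() + 1
    c0, c1 = cols.min(), cols.max() + 1
    block = read_window(r0, r1, c0, c1)    # single read for all points
    return block[:, rows - r0, cols - c0]  # shape (band, n_points)
```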
The FirstMethod early-exit optimization applies here too: once all N points have a valid value, remaining items are skipped.
Risks and unknowns
- Items with irregular or non-rectangular footprints (e.g., scene edge polygons) need a proper point-in-polygon test, not just a bbox check. Need to decide whether to use the item geometry from the parquet or fall back to bbox.
- Reprojection for points in a different CRS than the source COG adds complexity. The current reproject_array approach works on rectangular windows — scattered pixel extraction in native coordinates may need a different path.
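One possible different path, sketched with pyproj: transform the sample points into each COG's native CRS instead of warping the raster window. Whether this fits alongside the warp cache is an open question; `points_to_native` is a hypothetical helper:

```python
from pyproj import Transformer

def points_to_native(xs, ys, src_crs, cog_crs):
    """Transform sample coordinates into a COG's native CRS so pixels
    can be extracted without warping the raster window (sketch only)."""
    tf = Transformer.from_crs(src_crs, cog_crs, always_xy=True)
    return tf.transform(xs, ys)
```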
- The interface for dask parallelism over time steps is not yet designed. Simplest first cut is synchronous over time steps; dask support can come later.