Skip to content

Add sample() method for efficient point extraction from scattered coordinates #16

@hrodmn

Description

@hrodmn

The problem

MultiBandStacBackendArray declares IndexingSupport.BASIC, so xarray collapses any vectorized point selection into a bounding-box slice before calling _raw_getitem. If you do:

da.sel(x=xr.DataArray(xs, dims="points"), y=xr.DataArray(ys, dims="points"))

lazycogs receives a single slice spanning the bounding rectangle of all your points, opens every COG overlapping that rectangle, and reads every pixel inside it. For sparse points over a large area this is very wasteful.

The per-point workaround — iterating and calling .sel() once per coordinate — is safe (each call resolves to a 1×1 spatial window) but doesn't batch COG opens. If N points land in the same source tile, that tile is opened N times.

This is a real gap for common workflows: training data extraction at labeled sites, validation against field measurements, time series at monitoring station locations.

Proposed solution

A lazycogs.sample() function (and/or a .lazycogs.sample() accessor method) that bypasses xarray's indexing machinery and batches reads by item footprint:

  1. Accept arrays of x/y coordinates in the array's CRS.
  2. Per time step: one DuckDB query using the bbox of all points combined.
  3. Per returned item: find which of the N points fall within its footprint, open the COG once, read the minimal window spanning those points, extract the specific pixel values.
  4. Apply the mosaic method across items at each point (same priority ordering as the regular mosaic).

Return a DataArray with a points dimension, shape (band, time, points).

Implementation notes

The main piece of new logic is _points_in_item: given a STAC item and a set of pixel coordinates, determine which points fall within the item's spatial footprint. Item geometries are available in the parquet via DuckDB — no extra I/O needed.

The COG read per item would use async_geotiff's existing window read: compute the bounding pixel window for all matching points within that item, read it once, then index out the specific pixel values using local offsets. This reuses _read_item_bands and the warp cache logic with minimal changes.

The FirstMethod early-exit optimization applies here too: once all N points have a valid value, remaining items are skipped.

Risks and unknowns

  • Items with irregular or non-rectangular footprints (e.g., scene edge polygons) need a proper point-in-polygon test, not just a bbox check. Need to decide whether to use the item geometry from the parquet or fall back to bbox.
  • Reprojection for points in a different CRS than the source COG adds complexity. The current reproject_array approach works on rectangular windows — scattered pixel extraction in native coordinates may need a different path.
  • The interface for dask parallelism over time steps is not yet designed. Simplest first cut is synchronous over time steps; dask support can come later.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions