Skip to content

geotiff: split __init__.py into per-backend modules with shared validation #1813

@brendancol

Description

@brendancol

Reason or Problem

xrspatial/geotiff/__init__.py is over 4000 lines and mixes:

  • public API entry points (open_geotiff, read_vrt, read_geotiff_dask, read_geotiff_gpu, to_geotiff, write_vrt, write_geotiff_gpu)
  • dask backend orchestration
  • GPU backend orchestration and host/device transitions
  • writer dispatch and CRS resolution
  • fallback policy (on_gpu_failure, missing-source handling)
  • coordinate construction from GeoTransform
  • dtype / window / band validation

Several recent parity bugs trace back to drift between branches inside this file:

The same kwarg or behaviour needs to be wired into 3+ places that have grown apart. Each parity fix is small individually, but the rate at which they arrive shows the file is past the size where the structure helps.

Proposal

Split __init__.py into smaller modules organised by responsibility:

  • _coords.py -- GeoTransform to (x, y) coord arrays, half-pixel shift for AREA vs POINT, window-aware origin offsets, transform attr construction. Currently inlined separately in every backend.
  • _validation.py -- _validate_chunks_arg, _validate_dtype_cast, band/window/sample-count checks. Already partially extracted into _reader.py; finish the move and let every backend call the same checker.
  • _backends/eager.py -- the local-file synchronous path (the body of today's open_geotiff non-VRT branch).
  • _backends/dask.py -- read_geotiff_dask.
  • _backends/gpu.py -- read_geotiff_gpu.
  • _backends/vrt.py -- read_vrt, with the eager and dask paths inside.
  • _backends/http.py -- the COG-over-HTTP path currently in _reader.py.
  • _writers/ -- to_geotiff, write_geotiff_gpu, write_vrt, plus the streaming strip writer.
  • __init__.py -- thin dispatch only: open_geotiff routing on suffix, public re-exports, version constants.

Add a parity-test matrix: one fixture file, all read paths, assert byte-identical pixel output and identical attrs/coords. Wire it into CI so a future kwarg drift is caught immediately rather than as a follow-up bug report.

Value:

  • Reduces the failure mode that produced the recent parity-bug cluster.
  • Makes each backend reviewable in isolation.
  • Lets chunks=, missing_sources=, on_gpu_failure=, dtype= etc. be wired into one place rather than every backend's signature.

Stakeholders and Impacts

  • All downstream callers of the public API. Public signatures stay identical; the split is internal.
  • CI test runtime grows by the parity-matrix tests. Probably small (one fixture, ~5 backends).

Drawbacks

  • A multi-PR refactor. Each module split lands separately, with the parity tests serving as the contract for "no regression".
  • Git history for the moved code becomes a stack of moves, harder to git blame across.

Alternatives

  • Keep the current layout and rely on review discipline. The pattern of recent parity bugs is the argument against this.
  • Auto-generate per-backend wrappers from a single source-of-truth signature. Heavier machinery than warranted.

Unresolved Questions

  • Whether _backends should be a private subpackage or a flat module per backend at xrspatial.geotiff root.
  • Whether to migrate one backend at a time (preserving the old __init__.py until the last backend is moved) or all at once.

Additional Notes or Context

This is a tracking issue, not a PR. The split is multi-PR scope; individual PRs should reference this issue.

Metadata

Metadata

Assignees

No one assigned

    Labels

    apiAPI design and consistencyenhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions