Parallelize per-tile compression in streaming write#1531

Merged
brendancol merged 2 commits into xarray-contrib:main from brendancol:perf/streaming-write-parallel-compress
May 9, 2026

Conversation

@brendancol
Contributor

Summary

Performance audit finding P4: the streaming tile-write path in xrspatial/geotiff/_writer.py (write_streaming, around line 1403) compressed tiles serially inside each horizontal segment, while the non-streaming write path at _writer.py:~568 already fans compress out to a ThreadPoolExecutor sized at os.cpu_count(). Since zlib, zstd, LZW and LERC release the GIL inside their C codecs, the streaming path was leaving cores idle on compress-bound writes.

This change mirrors the non-streaming pool pattern inside the segment loop:

  1. Walk the segment to materialise the per-tile arrays (the slicing and edge-padding step is cheap and memory-local).
  2. Submit _compress_block to a ThreadPoolExecutor(max_workers=min(n_seg_tiles, os.cpu_count())).
  3. Collect compressed buffers in original tc order and write them to the file sequentially so the on-disk tile layout is unchanged.

The public API of write_streaming is untouched. The serial path is kept for n_seg_tiles <= 1 so single-tile segments avoid the pool overhead.
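The three-step pattern above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `write_segment` and the tile/segment structures are stand-ins for the private helpers in `_writer.py`, and `zlib` stands in for the real codec dispatch behind `_compress_block`.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor


def _compress_block(tile_bytes: bytes) -> bytes:
    # Stand-in for the real codec dispatch (deflate/zstd/LZW/LERC).
    return zlib.compress(tile_bytes)


def write_segment(out, seg_tile_arrs):
    """Compress a segment's tiles in parallel, then write them in order."""
    n_seg_tiles = len(seg_tile_arrs)
    if n_seg_tiles <= 1:
        # Serial fallback: single-tile segments skip the pool overhead.
        for arr in seg_tile_arrs:
            out.write(_compress_block(arr))
        return
    n_workers = min(n_seg_tiles, os.cpu_count() or 4)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map yields results in submission order, so the
        # sequential write below keeps the on-disk tile layout unchanged.
        compressed = list(pool.map(_compress_block, seg_tile_arrs))
    for buf in compressed:
        out.write(buf)
```

Because the file write stays sequential and ordered, only the compression step is reordered across threads; the bytes that reach the file are identical to the serial path.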

Memory cost

Peak transient memory grows by tiles_per_segment compressed tiles held in a list before the sequential write phase. For a typical 32-tile segment of ~50 KB compressed tiles that is ~1.6 MB. The pre-existing streaming_buffer_bytes cap already bounds tiles_per_segment, so the worst-case memory footprint is bounded by user configuration.

Measured speedup

4096x4096 float32 deflate streaming write on a local box:

| variant | time |
| --- | --- |
| serial-equivalent (os.cpu_count=1) | 1.69 s |
| parallel pool | 0.27 s |

About 6.2x, matching the audit estimate.

Test plan

  • New tests in xrspatial/geotiff/tests/test_streaming_write_parallel.py:
    • parametrised round-trip across dtype (float32, uint16, uint8) x compression (deflate, zstd, lzw, none) x tile size, with a forced small streaming_buffer_bytes so the multi-segment branch is exercised. Bit-exact vs eager write and vs the source array.
    • parallelism observed: monkeypatch _compress_block to record threading.get_ident() per call, confirm more than one worker thread participates.
    • perf sanity tripwire: 4096x4096 deflate streaming write must finish under 5 s.
    • single-thread fallback placeholder, currently skipped because write_streaming does not expose a threads kwarg.
  • Regression: pytest xrspatial/geotiff/tests/test_streaming_write.py xrspatial/geotiff/tests/test_writer.py xrspatial/geotiff/tests/test_writer_matrix.py -> 123 passed.
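The "parallelism observed" check can be sketched as below. This is a simplified stand-in: the real test monkeypatches `_compress_block` inside the writer, while here a plain `ThreadPoolExecutor` plays the role of the streaming writer's pool and `zlib` plays the codec.

```python
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor


def make_recording_compress(seen_threads):
    """Wrap zlib.compress so each call logs the ident of its worker thread."""
    def recording_compress(data: bytes) -> bytes:
        seen_threads.add(threading.get_ident())
        return zlib.compress(data)
    return recording_compress


def compress_tiles(tiles, max_workers=4):
    """Compress tiles through a pool, returning buffers and observed threads."""
    seen = set()
    fn = make_recording_compress(seen)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        bufs = list(pool.map(fn, tiles))
    return bufs, seen
```

The real test then asserts `len(seen_threads) > 1`; pinning the pool size (the PR later monkeypatches `os.cpu_count` to 4) keeps that assertion deterministic on small CI runners.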

The streaming tile-write path in write_streaming previously walked
each segment's tiles serially, calling _compress_block inline. The
non-streaming path at _writer.py:~568 already fans compress out to a
ThreadPoolExecutor sized at os.cpu_count(), since zlib, zstd, LZW and
LERC all release the GIL inside their C codecs.

Mirror that pattern inside the segment loop: build the tile arrays
sequentially, submit compress to a per-segment thread pool sized
min(n_seg_tiles, os.cpu_count()), then write the resulting buffers
to the file sequentially. The file write stays serial so the on-disk
tile layout is unchanged.

Memory cost is bounded by the segment size: tiles_per_segment
compressed buffers held briefly in RAM. For a 32-tile segment with
~50 KB compressed tiles that is ~1.6 MB. The streaming buffer cap
already bounds segment size, so peak memory growth is small.

Measured on a 4096x4096 float32 deflate streaming write: 1.69 s
serial-equivalent vs 0.27 s with the pool, a 6.2x speedup that
matches the audit estimate.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 8, 2026
@brendancol brendancol requested a review from Copilot May 9, 2026 12:57

Copilot AI left a comment


Pull request overview

This PR improves GeoTIFF streaming writes (write_streaming) by parallelizing per-tile compression within each horizontal segment, mirroring the existing parallel compression approach used in the non-streaming tiled writer. This targets compress-bound streaming writes where codecs release the GIL.

Changes:

  • Parallelize per-segment tile compression in write_streaming using a ThreadPoolExecutor, then write compressed tiles sequentially to preserve on-disk order.
  • Add a new test module covering round-trip correctness, observable parallelism, and a performance regression guard for streaming writes.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| xrspatial/geotiff/_writer.py | Builds per-tile arrays per segment, compresses them in parallel via a thread pool, then writes buffers sequentially. |
| xrspatial/geotiff/tests/test_streaming_write_parallel.py | Adds correctness/parallelism/perf-guard tests for the new streaming parallel compression behavior. |


Comment on lines +138 to +145

```python
path = str(tmp_path / 'parallel_check.tif')
to_geotiff(dask_da, path,
           compression='deflate', tile_size=256)

assert len(seen_threads) > 1, (
    f"Expected >1 worker threads in streaming compress, "
    f"saw {len(seen_threads)}: {seen_threads}")
```
Comment on lines +163 to +173

```python
t0 = time.perf_counter()
to_geotiff(dask_da, path, compression='deflate', tile_size=256)
elapsed = time.perf_counter() - t0

# Sanity check the file was written.
result = open_geotiff(path)
assert result.shape == shape

assert elapsed < 5.0, (
    f"Streaming 4096x4096 deflate write took {elapsed:.2f}s, "
    f"expected <5s (regression guard)")
```
Comment on lines +180 to +195

```python
def test_streaming_write_with_single_thread_fallback(tmp_path):
    """If write_streaming exposes a ``threads`` kwarg, callers can opt
    into deterministic single-thread compress. Currently it does not -
    skip so the test stays as a placeholder for when a knob is added.
    """
    sig = inspect.signature(writer_mod.write_streaming)
    if 'threads' not in sig.parameters:
        pytest.skip(
            "write_streaming has no 'threads' kwarg yet; skipping "
            "deterministic single-thread fallback test")

    da = _make_dataarray((400, 400), dtype=np.float32)
    dask_da = da.chunk({'y': 200, 'x': 200})
    path = str(tmp_path / 'threads1.tif')
    to_geotiff(dask_da, path, compression='deflate', threads=1)
    result = open_geotiff(path)
```
Comment thread xrspatial/geotiff/_writer.py Outdated
Comment on lines +1442 to +1448

```python
from concurrent.futures import (
    ThreadPoolExecutor)
n_workers = min(n_seg_tiles,
                os.cpu_count() or 4)
with ThreadPoolExecutor(
        max_workers=n_workers) as pool:
    futures = [
```
Comment thread xrspatial/geotiff/_writer.py Outdated
Comment on lines +1429 to +1431

```python
# Memory cost is bounded by the segment size
# (tiles_per_segment compressed buffers held in RAM
# before the sequential write phase below).
```
Six findings, all acted on:

- Hoist the ThreadPoolExecutor over the entire tiled streaming write
  rather than recreating it per segment. For wide rasters with many
  horizontal segments the per-segment construction was paying the
  thread-startup cost on every stripe and offsetting the parallel
  speedup. Now one pool spans every (tile_row, segment) iteration.
- Skip the pool entirely when compression is uncompressed
  (COMPRESSION_NONE has no C-level work to release the GIL on) or
  when the host has only one usable core. Both cases fall through to
  the existing serial path.
- Update the per-segment memory-cost comment to mention BOTH the
  uncompressed seg_tile_arrs and the compressed buffers, since both
  are held simultaneously while futures resolve.
- test_streaming_write_parallelism_observed now monkeypatches
  os.cpu_count to return 4 so the assertion stays deterministic on
  single-core CI containers. Without the patch the pool would size
  to 1 and the test would fail for environment reasons.
- The wall-clock perf tripwire is gated behind XRSPATIAL_RUN_PERF_TESTS=1
  to avoid CI flakiness on shared/throttled runners; deterministic
  parallel-branch coverage already lives in the parallelism-observed
  test.
- Drop the test_streaming_write_with_single_thread_fallback placeholder.
  It gated on a `threads=` kwarg that doesn't exist and isn't planned;
  reviewing the gate-vs-call path for a non-existent kwarg adds noise
  to future readers.
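The first two findings (one pool over the whole write, serial fall-through for uncompressed output or single-core hosts) shape the loop roughly as follows. Names are illustrative stand-ins for the `_writer.py` internals, with `zlib` standing in for the codec; the `cpu_count` parameter is added here only to make the sketch testable.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor


def write_all_segments(out, segments, compression='deflate', cpu_count=None):
    """One pool spans every segment; fall back to serial when it cannot help."""
    n_cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    comp = (lambda b: bytes(b)) if compression == 'none' else zlib.compress
    # No C-level work releases the GIL for uncompressed tiles, and a
    # single-core host gains nothing, so skip the pool in both cases.
    use_pool = compression != 'none' and n_cpus > 1
    pool = ThreadPoolExecutor(max_workers=n_cpus) if use_pool else None
    try:
        for seg_tile_arrs in segments:
            if pool is not None and len(seg_tile_arrs) > 1:
                # The pool is created once and reused across all segments,
                # avoiding per-stripe thread-startup cost.
                bufs = list(pool.map(comp, seg_tile_arrs))
            else:
                bufs = [comp(t) for t in seg_tile_arrs]
            # Sequential write keeps the on-disk tile layout unchanged.
            for buf in bufs:
                out.write(buf)
    finally:
        if pool is not None:
            pool.shutdown()
```

Creating the executor once outside the segment loop is what the first finding buys: for wide rasters with many stripes the per-segment pool construction was a fixed cost on every iteration.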
@brendancol brendancol merged commit d114de5 into xarray-contrib:main May 9, 2026
11 checks passed
