Parallelize per-tile compression in streaming write#1531

Merged
brendancol merged 2 commits into xarray-contrib:main from brendancol:perf/streaming-write-parallel-compress
May 9, 2026

Conversation

@brendancol
Contributor

Summary

Performance audit finding P4: the streaming tile-write path in xrspatial/geotiff/_writer.py (write_streaming, around line 1403) compressed tiles serially inside each horizontal segment, while the non-streaming write path at _writer.py:~568 already fans compress out to a ThreadPoolExecutor sized at os.cpu_count(). Since zlib, zstd, LZW and LERC release the GIL inside their C codecs, the streaming path was leaving cores idle on compress-bound writes.

This change mirrors the non-streaming pool pattern inside the segment loop:

  1. Walk the segment to materialise the per-tile arrays (the slicing and edge-padding step is cheap and memory-local).
  2. Submit _compress_block to a ThreadPoolExecutor(max_workers=min(n_seg_tiles, os.cpu_count())).
  3. Collect compressed buffers in original tc order and write them to the file sequentially so the on-disk tile layout is unchanged.

The public API of write_streaming is untouched. The serial path is kept for n_seg_tiles <= 1 so single-tile segments avoid the pool overhead.
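The three-step pattern above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `write_segment` and the tile/segment structures are stand-ins for the private helpers in `_writer.py`, and `zlib` stands in for the real codec dispatch behind `_compress_block`.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor


def _compress_block(tile_bytes: bytes) -> bytes:
    # Stand-in for the real codec dispatch (deflate/zstd/LZW/LERC).
    return zlib.compress(tile_bytes)


def write_segment(out, seg_tile_arrs):
    """Compress a segment's tiles in parallel, then write them in order."""
    n_seg_tiles = len(seg_tile_arrs)
    if n_seg_tiles <= 1:
        # Serial fallback: single-tile segments skip the pool overhead.
        for arr in seg_tile_arrs:
            out.write(_compress_block(arr))
        return
    n_workers = min(n_seg_tiles, os.cpu_count() or 4)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map yields results in submission order, so the
        # sequential write below keeps the on-disk tile layout unchanged.
        compressed = list(pool.map(_compress_block, seg_tile_arrs))
    for buf in compressed:
        out.write(buf)
```

Because the file write stays sequential and ordered, only the compression step is reordered across threads; the bytes that reach the file are identical to the serial path.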

Memory cost

Peak transient memory grows by tiles_per_segment compressed tiles held in a list before the sequential write phase. For a typical 32-tile segment of ~50 KB compressed tiles that is ~1.6 MB. The pre-existing streaming_buffer_bytes cap already bounds tiles_per_segment, so the worst-case memory footprint is bounded by user configuration.

Measured speedup

4096x4096 float32 deflate streaming write on a local box:

| variant | time |
| --- | --- |
| serial-equivalent (os.cpu_count=1) | 1.69 s |
| parallel pool | 0.27 s |

About 6.2x, matching the audit estimate.

Test plan

  • New tests in xrspatial/geotiff/tests/test_streaming_write_parallel.py:
    • parametrised round-trip across dtype (float32, uint16, uint8) x compression (deflate, zstd, lzw, none) x tile size, with a forced small streaming_buffer_bytes so the multi-segment branch is exercised. Bit-exact vs eager write and vs the source array.
    • parallelism observed: monkeypatch _compress_block to record threading.get_ident() per call, confirm more than one worker thread participates.
    • perf sanity tripwire: 4096x4096 deflate streaming write must finish under 5 s.
    • single-thread fallback placeholder, currently skipped because write_streaming does not expose a threads kwarg.
  • Regression: pytest xrspatial/geotiff/tests/test_streaming_write.py xrspatial/geotiff/tests/test_writer.py xrspatial/geotiff/tests/test_writer_matrix.py -> 123 passed.
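The "parallelism observed" check can be sketched as below. This is a simplified stand-in: the real test monkeypatches `_compress_block` inside the writer, while here a plain `ThreadPoolExecutor` plays the role of the streaming writer's pool and `zlib` plays the codec.

```python
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor


def make_recording_compress(seen_threads):
    """Wrap zlib.compress so each call logs the ident of its worker thread."""
    def recording_compress(data: bytes) -> bytes:
        seen_threads.add(threading.get_ident())
        return zlib.compress(data)
    return recording_compress


def compress_tiles(tiles, max_workers=4):
    """Compress tiles through a pool, returning buffers and observed threads."""
    seen = set()
    fn = make_recording_compress(seen)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        bufs = list(pool.map(fn, tiles))
    return bufs, seen
```

The real test then asserts `len(seen_threads) > 1`; pinning the pool size (the PR later monkeypatches `os.cpu_count` to 4) keeps that assertion deterministic on small CI runners.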

The streaming tile-write path in write_streaming previously walked
each segment's tiles serially, calling _compress_block inline. The
non-streaming path at _writer.py:~568 already fans compress out to a
ThreadPoolExecutor sized at os.cpu_count(), since zlib, zstd, LZW and
LERC all release the GIL inside their C codecs.

Mirror that pattern inside the segment loop: build the tile arrays
sequentially, submit compress to a per-segment thread pool sized
min(n_seg_tiles, os.cpu_count()), then write the resulting buffers
to the file sequentially. The file write stays serial so the on-disk
tile layout is unchanged.

Memory cost is bounded by the segment size: tiles_per_segment
compressed buffers held briefly in RAM. For a 32-tile segment with
~50 KB compressed tiles that is ~1.6 MB. The streaming buffer cap
already bounds segment size, so peak memory growth is small.

Measured on a 4096x4096 float32 deflate streaming write: 1.69 s
serial-equivalent vs 0.27 s with the pool, a 6.2x speedup that
matches the audit estimate.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 8, 2026
@brendancol brendancol requested a review from Copilot May 9, 2026 12:57

Copilot AI left a comment


Pull request overview

This PR improves GeoTIFF streaming writes (write_streaming) by parallelizing per-tile compression within each horizontal segment, mirroring the existing parallel compression approach used in the non-streaming tiled writer. This targets compress-bound streaming writes where codecs release the GIL.

Changes:

  • Parallelize per-segment tile compression in write_streaming using a ThreadPoolExecutor, then write compressed tiles sequentially to preserve on-disk order.
  • Add a new test module covering round-trip correctness, observable parallelism, and a performance regression guard for streaming writes.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| xrspatial/geotiff/_writer.py | Builds per-tile arrays per segment, compresses them in parallel via a thread pool, then writes buffers sequentially. |
| xrspatial/geotiff/tests/test_streaming_write_parallel.py | Adds correctness/parallelism/perf-guard tests for the new streaming parallel compression behavior. |


Comment on lines +138 to +145

```python
path = str(tmp_path / 'parallel_check.tif')
to_geotiff(dask_da, path,
           compression='deflate', tile_size=256)

assert len(seen_threads) > 1, (
    f"Expected >1 worker threads in streaming compress, "
    f"saw {len(seen_threads)}: {seen_threads}")
```
Comment on lines +163 to +173

```python
t0 = time.perf_counter()
to_geotiff(dask_da, path, compression='deflate', tile_size=256)
elapsed = time.perf_counter() - t0

# Sanity check the file was written.
result = open_geotiff(path)
assert result.shape == shape

assert elapsed < 5.0, (
    f"Streaming 4096x4096 deflate write took {elapsed:.2f}s, "
    f"expected <5s (regression guard)")
```
Comment on lines +180 to +195

```python
def test_streaming_write_with_single_thread_fallback(tmp_path):
    """If write_streaming exposes a ``threads`` kwarg, callers can opt
    into deterministic single-thread compress. Currently it does not -
    skip so the test stays as a placeholder for when a knob is added.
    """
    sig = inspect.signature(writer_mod.write_streaming)
    if 'threads' not in sig.parameters:
        pytest.skip(
            "write_streaming has no 'threads' kwarg yet; skipping "
            "deterministic single-thread fallback test")

    da = _make_dataarray((400, 400), dtype=np.float32)
    dask_da = da.chunk({'y': 200, 'x': 200})
    path = str(tmp_path / 'threads1.tif')
    to_geotiff(dask_da, path, compression='deflate', threads=1)
    result = open_geotiff(path)
```
Comment thread xrspatial/geotiff/_writer.py Outdated
Comment on lines +1442 to +1448

```python
from concurrent.futures import (
    ThreadPoolExecutor)
n_workers = min(n_seg_tiles,
                os.cpu_count() or 4)
with ThreadPoolExecutor(
        max_workers=n_workers) as pool:
    futures = [
```
Comment thread xrspatial/geotiff/_writer.py Outdated
Comment on lines +1429 to +1431

```python
# Memory cost is bounded by the segment size
# (tiles_per_segment compressed buffers held in RAM
# before the sequential write phase below).
```
Six findings, all acted on:

- Hoist the ThreadPoolExecutor over the entire tiled streaming write
  rather than recreating it per segment. For wide rasters with many
  horizontal segments the per-segment construction was paying the
  thread-startup cost on every stripe and offsetting the parallel
  speedup. Now one pool spans every (tile_row, segment) iteration.
- Skip the pool entirely when compression is uncompressed
  (COMPRESSION_NONE has no C-level work to release the GIL on) or
  when the host has only one usable core. Both cases fall through to
  the existing serial path.
- Update the per-segment memory-cost comment to mention BOTH the
  uncompressed seg_tile_arrs and the compressed buffers, since both
  are held simultaneously while futures resolve.
- test_streaming_write_parallelism_observed now monkeypatches
  os.cpu_count to return 4 so the assertion stays deterministic on
  single-core CI containers. Without the patch the pool would size
  to 1 and the test would fail for environment reasons.
- The wall-clock perf tripwire is gated behind XRSPATIAL_RUN_PERF_TESTS=1
  to avoid CI flakiness on shared/throttled runners; deterministic
  parallel-branch coverage already lives in the parallelism-observed
  test.
- Drop the test_streaming_write_with_single_thread_fallback placeholder.
  It gated on a `threads=` kwarg that doesn't exist and isn't planned;
  reviewing the gate-vs-call path for a non-existent kwarg adds noise
  to future readers.
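The first two findings (one pool over the whole write, serial fall-through for uncompressed output or single-core hosts) shape the loop roughly as follows. Names are illustrative stand-ins for the `_writer.py` internals, with `zlib` standing in for the codec; the `cpu_count` parameter is added here only to make the sketch testable.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor


def write_all_segments(out, segments, compression='deflate', cpu_count=None):
    """One pool spans every segment; fall back to serial when it cannot help."""
    n_cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    comp = (lambda b: bytes(b)) if compression == 'none' else zlib.compress
    # No C-level work releases the GIL for uncompressed tiles, and a
    # single-core host gains nothing, so skip the pool in both cases.
    use_pool = compression != 'none' and n_cpus > 1
    pool = ThreadPoolExecutor(max_workers=n_cpus) if use_pool else None
    try:
        for seg_tile_arrs in segments:
            if pool is not None and len(seg_tile_arrs) > 1:
                # The pool is created once and reused across all segments,
                # avoiding per-stripe thread-startup cost.
                bufs = list(pool.map(comp, seg_tile_arrs))
            else:
                bufs = [comp(t) for t in seg_tile_arrs]
            # Sequential write keeps the on-disk tile layout unchanged.
            for buf in bufs:
                out.write(buf)
    finally:
        if pool is not None:
            pool.shutdown()
```

Creating the executor once outside the segment loop is what the first finding buys: for wide rasters with many stripes the per-segment pool construction was a fixed cost on every iteration.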
@brendancol brendancol merged commit d114de5 into xarray-contrib:main May 9, 2026
11 checks passed
