diff --git a/.gitignore b/.gitignore
index df17db9..090f568 100644
--- a/.gitignore
+++ b/.gitignore
@@ -7,3 +7,14 @@ EBCC/
 venv/
 out-*
 *.csv
+# Exception: the L1 error threshold lookup table is the local fallback
+# used by `evaluate_combos` when the remote Google Sheet is unreachable.
+# It must be tracked despite the blanket *.csv rule above.
+!src/dc_toolkit/data/l1_error_thresholds.csv
+*.parquet
+*manifest*.json
+sample*.json
+res_folder_dyamond_*/
+dyamond_inventory_*/
+sweep_batches*/
+VAST_analysis/
diff --git a/Dockerfile b/Dockerfile
index 5306406..dc1495a 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -25,7 +25,5 @@ ENV PATH="/opt/venv/bin:$PATH"
 
 RUN bash install_dc_toolkit.sh
 
-RUN pip install --force-reinstall "dask[complete]==2025.7.0" "numpy==2.2.6"
-
 ENTRYPOINT ["dc_toolkit"]
 CMD ["--help"]
diff --git a/README.md b/README.md
index e5c5fe1..a135c77 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@
   <img src="./data-compression_logo.png" alt="Logo" width="300"/>
 </div>
 
-Set of tools for compressing netCDF files with Zarr. 
+Set of tools for compressing netCDF files with Zarr.
 
 The tools use the following compression libraries:
 
@@ -39,10 +39,12 @@ uenv start --view=default $UENV_NAME
 once the above is complete (just for Santis, locally it is not needed):
 
 ```commandline
-git clone git@github.com:C2SM/data-compression.git
+git clone git@github.com:C2SM/data-compression.git dc_toolkit
+cd dc_toolkit
 python -m venv venv
 source venv/bin/activate
 bash install_dc_toolkit.sh
+source venv/bin/activate
 ```
 
 ## Usage
@@ -50,25 +52,79 @@ bash install_dc_toolkit.sh
 ```
 --------------------------------------------------------------------------------
 
-Usage: dc_toolkit --help # List of available commands
-
-Usage: dc_toolkit COMMAND --help # Documentation per command
+Usage: dc_toolkit --help           # List of available commands
+Usage: dc_toolkit COMMAND --help   # Documentation per command
 
 Example:
 
-dc_toolkit \ # CLI-tool
-  evaluate_combos \ # command
-  netCDF_files/tigge_pl_t_q_dx=2_2024_08_02.nc \ # netCDF file to compress
-  ./dump \ # where to write the compressed file(s)
-  --field-to-compress t # field of netCDF to compress
+dc_toolkit \                                                # CLI-tool
+  evaluate_combos \                                         # command
+  netCDF_files/tigge_pl_t_q_dx=2_2024_08_02.nc \            # netCDF file
+  --where-to-write ./dump \                                 # output directory
+  --field-to-compress t \                                   # field to sweep
+  --eval-data-size-limit 5GB                                # sample size
 
 --------------------------------------------------------------------------------
 ```
 
+### End-to-end workflow
+
+The typical pipeline is three commands:
+
+1. **`evaluate_combos`** — sweep `(compressor × filter × serializer)` combinations on a representative sample of the field and record compression ratio / error metrics per combo.
+2. **`compress_with_optimal`** (one field at a time) or **`compress_fields_from_results`** (batch: all fields at once, dataset opened once) — persist the field(s) into a shared `.zarr` store using the winning combo from step 1, at production chunk/shard sizes.
+3. **`merge_compressed_fields`** — consolidate metadata on the shared store so downstream readers can open it quickly without scanning every array.
+
+> **Important:** pass the **same `--eval-data-size-limit`** to step 2 as you used in step 1. The `(comp_idx, filt_idx, ser_idx)` tuple from the sweep indexes into a codec space whose statistical parameters (e.g. `Asinh.linear_width`) are derived from the sample — change the sample size and the tuple can resolve to a slightly different codec object. Symptom: worse compression ratio at persist time than the sweep reported, no error.
+
+### Output files
+
+`evaluate_combos` writes the following per variable `{var}` into `--where-to-write`:
+
+| File | What it is |
+|------|------------|
+| `config_space_{var}.csv` | Full planning space — the Cartesian product that was going to be evaluated. Input to `analyze_clustering`. |
+| `config_space_{var}_rank{N}.csv` | Per-rank streaming audit trail, flushed per row. Useful to tail during long sweeps or to inspect after a crash. |
+| `results_{var}.parquet` | Consolidated results across all ranks, with a `keep` column distinguishing passing and filtered-out combos. The canonical file for analysis. |
+| `*_scored_results_with_names.npy` | Kept-only scored configs in numpy structured-array form. Input to `perform_clustering` / `analyze_clustering`. |
+| `manifest_{var}.json` | Best combo per variable. Read by `compress_fields_from_results` to drive the batch persist. |
+
+`compress_with_optimal` and `compress_fields_from_results` write the compressed data into `{where_to_write}/{dataset_basename}.zarr`, under one group per variable. `batch_manifest.json` summarises a batch run.
+
+### HPC parallelism (SLURM / MPI)
+
+> For a thorough walkthrough of how every command parallelizes work — including how the `--bypass-zarr-sync` machinery actually works, why we cap at 32 threads on a 288-core node, and the chunk-vs-shard distinction — see [`docs/PARALLELIZATION.md`](docs/PARALLELIZATION.md).
+
+`evaluate_combos` runs as **one MPI rank per node**, with each rank driving 32 user threads via the `--bypass-zarr-sync` machinery (default on).  Scale out by increasing `--nodes` and keeping `--ntasks-per-node=1`:
+
+```bash
+#SBATCH --nodes=8 --ntasks-per-node=1 --cpus-per-task=32
+
+srun --unbuffered dc_toolkit evaluate_combos input.nc \
+    --where-to-write ./out \
+    --field-to-compress t \
+    --eval-data-size-limit 5GB \
+    --threads-per-rank 32
+```
+
+This topology was selected over the original multi-rank-per-node design (32 ranks × 1 thread) to avoid OOM on large fields: with 32 ranks per node, every rank holds its own copy of the sample buffer.  See `santis.run` for the validated production driver and the inline comment block summarising the experiments behind the choice.
+
+Codec-internal thread pools must be pinned to 1 to avoid nested oversubscription (the tool checks this at startup and aborts by default; `--no-oversubscription-check` disables the guard):
+
+```bash
+export OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 \
+       BLOSC_NTHREADS=1 NUMBA_NUM_THREADS=1 \
+       VECLIB_MAXIMUM_THREADS=1 OMP_THREAD_LIMIT=1
+```
+
+The `--codec-threads N` flag (default 1) is available on `evaluate_combos`, `compress_with_optimal`, `compress_fields_from_results`, and `from_zarr_to_netcdf` for use cases where codec-internal threading is genuinely needed (e.g. very large chunks on workloads that aren't memory-bandwidth-bound).  Direct testing on Santis with the production Dyamond data showed `--codec-threads > 1` does **not** help on this workload; leave it at the default unless you have a specific reason and can A/B test the change.
+
+`compress_with_optimal`, `compress_fields_from_results`, and `merge_compressed_fields` are single-process commands — launch with `srun -n 1 ...` or plain invocation. Parallelism inside the write comes from dask's threaded scheduler, tuned via `--threads` (default: auto-detected from visible cores), `--inner-chunk-mib` (default: 16), and `--shard-mib` (default: 512). `--verify/--no-verify` (default on) re-reads the store to compute error norms — skip with `--no-verify` on re-compression runs where the combo is already trusted.
+
 ## UI implementation
 
 Two User Interfaces have been implemented to make the file compression process more user-friendly.
-Both UIs provide functionlaities for compressors similarity metrics and file compression.
+Both UIs provide functionalities for compressors similarity metrics and file compression.
 
 Outside of the mutual UI functionalities, this UI allows users to download similarity metrics plots and tweak parameters more dynamically.
 
@@ -77,11 +133,11 @@ If launched from santis, make sure to ssh correctly:
 ssh -L 8501:localhost:8501 santis
 ```
 ```
-dc_toolkit run_web_ui_vcluster \ 
-  --user_account "YOUR_USER_ACCOUNT" \ 
+dc_toolkit run_web_ui_vcluster \
+  --user_account "YOUR_USER_ACCOUNT" \
   --uenv_image UENV_NAME \
-  --uploaded_file "PATH_TO_FILE" \ 
-  --time "00:15:00" \ 
+  --uploaded_file "PATH_TO_FILE" \
+  --time "00:15:00" \
   --nodes "1" --ntasks-per-node "72"
 ```
 Local web-versions and non are also available:
@@ -99,8 +155,8 @@ A self-contained image has been setup in the `Dockerfile`. You can copy the file
 ```commandline
 docker build -t dc-toolkit .
 ```
-The image contains all dependencies and automatically clones the repository. 
-Once this build is complete, you can run commands with docker. An example: 
+The image contains all dependencies and automatically clones the repository.
+Once this build is complete, you can run commands with docker. An example:
 
 ```commandline
 docker run \
@@ -110,7 +166,7 @@ docker run \
   -e XDG_CACHE_HOME=/tmp/.cache \
   --entrypoint /bin/bash \
   dc-toolkit \
-  -c 'mkdir -p docker_saved_files && dc_toolkit evaluate_combos /opt/data-compression/netCDF_files/tigge_pl_t_q_dx=2_2024_08_02.nc /mnt/data/docker_saved_files --field-to-compress t'
+  -c 'mkdir -p docker_saved_files && dc_toolkit evaluate_combos /opt/data-compression/netCDF_files/tigge_pl_t_q_dx=2_2024_08_02.nc --where-to-write /mnt/data/docker_saved_files --field-to-compress t'
 ```
 
 **Command Breakdown:**
@@ -123,8 +179,15 @@ docker run \
 * **`dc-toolkit`**: The name of the Docker image to run.
 * **`-c '...'`**: Executes a custom shell command to handle the complex environment setup:
   * **`mkdir -p docker_saved_files`**: Creates an output directory on your host.
-  * **`dc_toolkit evaluate_combos ...`**: Executes the actual compression tool, using a file inside the container and saving the results to your mounted volume.
+  * **`dc_toolkit evaluate_combos ...`**: Executes the actual compression tool, using a file inside the container and saving the results (under `--where-to-write`) to your mounted volume.
 
+Single-machine runs (Docker included) get their parallelism from the node-local `ThreadPoolExecutor` inside a single MPI rank — no multi-rank `mpirun` is needed. Also make sure to pin codec-internal thread pools so they don't fight the outer threads:
+
+```bash
+-e OMP_NUM_THREADS=1 -e MKL_NUM_THREADS=1 -e OPENBLAS_NUM_THREADS=1 \
+-e BLOSC_NTHREADS=1 -e NUMBA_NUM_THREADS=1 \
+-e VECLIB_MAXIMUM_THREADS=1 -e OMP_THREAD_LIMIT=1
+```
 
 Or for the web UI:
 
@@ -132,69 +195,73 @@ Or for the web UI:
 docker run -p 8501:8501 dc-toolkit run_web_ui
 ```
 
-### Running with MPI (Parallel Processing)
-
-To drastically speed up the evaluation process, you can run the toolkit in parallel using OpenMPI. 
+### Running with MPI (single-container, exercises the MPI code path)
 
-Because running MPI inside Docker requires some specific file permission and cache handling, use the following commands to securely mount your directories and isolate the process caches depending on your operating system.
+OpenMPI + Docker requires specific file permission and cache handling. Note that on a single container `evaluate_combos` runs with **one** MPI rank (`-n 1`) — the rank-per-node invariant means multi-rank on one node is not supported. Parallel work inside the single rank is done by the `ThreadPoolExecutor`; the `mpirun` launch is useful for exercising the MPI code path in CI or smoke tests. For real multi-node speedup, use SLURM (see the HPC section above).
 
 ---
 
 #### Mac and Linux
 
-For Unix-based systems, we map your local user ID to the container to avoid permission issues and assign unique temporary directories to isolate caches.
-
 ```bash
 docker run \
   -u $(id -u):$(id -g) \
   -w /mnt/data/docker_saved_files \
   -v $(pwd)/netCDF_files:/mnt/data \
+  -e OMP_NUM_THREADS=1 -e MKL_NUM_THREADS=1 -e OPENBLAS_NUM_THREADS=1 \
+  -e BLOSC_NTHREADS=1 -e NUMBA_NUM_THREADS=1 \
+  -e VECLIB_MAXIMUM_THREADS=1 -e OMP_THREAD_LIMIT=1 \
   --entrypoint mpirun \
   dc-toolkit \
-  -n 8 \
-  bash -c 'HOME=/tmp/$OMPI_COMM_WORLD_RANK exec dc_toolkit evaluate_combos /opt/data-compression/netCDF_files/tigge_pl_t_q_dx=2_2024_08_02.nc /mnt/data/docker_saved_files --field-to-compress t'
+  -n 1 \
+  bash -c 'HOME=/tmp/$OMPI_COMM_WORLD_RANK exec dc_toolkit evaluate_combos /opt/data-compression/netCDF_files/tigge_pl_t_q_dx=2_2024_08_02.nc --where-to-write /mnt/data/docker_saved_files --field-to-compress t --eval-data-size-limit 5GB'
 ```
 
 **Command Breakdown:**
 
-* **`-u $(id -u):$(id -g)`**: Runs the container using your local machine's User and Group IDs rather than the Docker default `root`. This guarantees that compressed files output to your machine, are fully owned by you and aren't locked behind root permissions.
+* **`-u $(id -u):$(id -g)`**: Runs the container as your local user so outputs aren't locked behind `root` permissions.
 * **`-w /mnt/data/docker_saved_files`**: Sets the Working Directory.
-* **`-v $(pwd)/netCDF_files:/mnt/data`**: The volume mount. This creates a bridge between your local computer and the container so the toolkit can read your input data and write the results back to your hard drive.
-* **`--entrypoint mpirun`**: Tells Docker to bypass the image's default entrypoint and boot up using OpenMPI's runner instead.
-* **`dc-toolkit`**: The name of the Docker image to run.
-* **`-n 8`**: Tells `mpirun` to spin up 8 parallel processes.
-* **`bash -c '...'`**: Executes a custom shell command across all 8 processes to handle the complex environment setup:
-  * **`HOME=/tmp/$OMPI_COMM_WORLD_RANK`**: Assigns a mathematically unique, temporary "Home" directory to each process. This completely eliminates race conditions where multiple processes try to write to the exact same  cache simultaneously.
-  * **`exec dc_toolkit evaluate_combos ...`**: Executes the actual compression tool, passing the paths (as they appear *inside* the container's `/mnt/data` mount) to the input NetCDF file and the designated output directory.
+* **`-v $(pwd)/netCDF_files:/mnt/data`**: Volume mount bridging local and container filesystems.
+* **`-e OMP_NUM_THREADS=1 ...`**: Pins codec-internal thread pools to 1 so they don't nest against the `ThreadPoolExecutor` inside the rank.
+* **`--entrypoint mpirun`**: Bypasses the default entrypoint to launch via OpenMPI.
+* **`dc-toolkit`**: The image name.
+* **`-n 1`**: One MPI rank per node; on a Docker container that's one rank total. Parallelism inside the rank comes from threads, not from multiple ranks.
+* **`bash -c '...'`**: Executes the dc_toolkit command:
+  * **`HOME=/tmp/$OMPI_COMM_WORLD_RANK`**: Assigns a unique `$HOME` per rank — harmless with `-n 1`, kept for parity with multi-rank launches.
+  * **`exec dc_toolkit evaluate_combos ... --where-to-write /mnt/data/docker_saved_files ...`**: Runs the sweep, writing all outputs into the mounted volume.
 
 ---
 
 #### Windows (PowerShell)
 
-When using Docker Desktop on Windows via WSL 2, Docker handles file permissions differently. You do not need to pass your user ID (as Docker Desktop handles the translation automatically), but you do need to explicitly allow OpenMPI to run as root and format your paths for PowerShell.
+When using Docker Desktop on Windows via WSL 2, Docker handles file permissions differently. You don't need to pass your user ID (Docker Desktop handles the translation automatically), but you do need to explicitly allow OpenMPI to run as root and format your paths for PowerShell.
 
 ```powershell
 docker run `
   -e HOME=/tmp `
+  -e OMP_NUM_THREADS=1 -e MKL_NUM_THREADS=1 -e OPENBLAS_NUM_THREADS=1 `
+  -e BLOSC_NTHREADS=1 -e NUMBA_NUM_THREADS=1 `
+  -e VECLIB_MAXIMUM_THREADS=1 -e OMP_THREAD_LIMIT=1 `
   -w /mnt/data/docker_saved_files `
   -v "${PWD}\netCDF_files:/mnt/data" `
   --entrypoint mpirun `
   dc-toolkit `
   --allow-run-as-root `
-  -n 8 `
-  bash -c "HOME=/tmp/`$OMPI_COMM_WORLD_RANK exec dc_toolkit evaluate_combos /mnt/data/tigge_pl_t_q_dx=2_2024_08_02.nc /mnt/data/docker_saved_files --field-to-compress t"
+  -n 1 `
+  bash -c "HOME=/tmp/`$OMPI_COMM_WORLD_RANK exec dc_toolkit evaluate_combos /mnt/data/tigge_pl_t_q_dx=2_2024_08_02.nc --where-to-write /mnt/data/docker_saved_files --field-to-compress t --eval-data-size-limit 5GB"
 ```
 
 **Command Breakdown:**
 
 * **`-e HOME=/tmp`**: Sets a base temporary home directory for the container environment.
-* **`-w /mnt/data/docker_saved_files`**: Sets the Working Directory inside the container so output files (like `config_space.csv`) drop exactly into your mounted folder.
-* **`-v "${PWD}\netCDF_files:/mnt/data"`**: The Windows equivalent of the volume mount. `${PWD}` dynamically grabs your current PowerShell directory to link your local files to the container.
+* **`-e OMP_NUM_THREADS=1 ...`**: Pins codec-internal thread pools to 1 (prevents nested oversubscription).
+* **`-w /mnt/data/docker_saved_files`**: Sets the Working Directory inside the container so output files (like `config_space_{var}.csv` and `results_{var}.parquet`) drop exactly into your mounted folder.
+* **`-v "${PWD}\netCDF_files:/mnt/data"`**: Windows equivalent of the volume mount. `${PWD}` dynamically grabs your current PowerShell directory to link your local files to the container.
 * **`--entrypoint mpirun`**: Bypasses the default container start command to run OpenMPI.
-* **`dc-toolkit`**: The name of the Docker image.
-* **`--allow-run-as-root`**: Because the container defaults to the `root` user on Windows, this flag is required to bypass OpenMPI's built-in safety restrictions against running parallel jobs as root.
-* **`-n 8`**: Tells `mpirun` to spin up 8 parallel processes.
-* **`bash -c "..."`**: Executes the parallel command. Notice that double-quotes are used here for PowerShell, with an escaped backtick (` `$ `) in front of the MPI variable to prevent PowerShell from prematurely evaluating it on your host machine before it reaches the container.
+* **`dc-toolkit`**: The image name.
+* **`--allow-run-as-root`**: The container defaults to `root` on Windows; this flag bypasses OpenMPI's built-in safety restrictions against running parallel jobs as root.
+* **`-n 1`**: One rank per node; on a Docker container that's one rank total.
+* **`bash -c "..."`**: Executes the command inside the single rank. Note the double quotes for PowerShell, with a backtick (` `$ `) in front of the MPI variable to prevent PowerShell from evaluating it on your host before it reaches the container.
 
 ## Slides
 
diff --git a/docs/PARALLELIZATION.md b/docs/PARALLELIZATION.md
new file mode 100644
index 0000000..9740b43
--- /dev/null
+++ b/docs/PARALLELIZATION.md
@@ -0,0 +1,228 @@
+# Parallelization strategies in `dc_toolkit`
+
+This document explains, in plain English, how each `dc_toolkit` command uses parallelism. It's intended both for new users trying to understand how the toolkit makes use of an HPC node, and as a future-reference note for the maintainers about *why* we made the choices we did.
+
+It is deliberately not exhaustive — for the full architectural details, read `src/dc_toolkit/cli.py` and `src/dc_toolkit/utils.py`. This doc is the user-friendly tour.
+
+---
+
+## The big picture
+
+`dc_toolkit` has three kinds of commands, and each kind uses a different parallelism style:
+
+| Kind | Commands | What runs in parallel |
+|---|---|---|
+| **A. Big sweep** | `evaluate_combos` | Many independent codec configurations, across nodes and threads |
+| **B. Big single write** | `compress_with_optimal`, `compress_fields_from_results`, `from_nc_to_zarr`, `from_zarr_to_netcdf` | The chunks of one big array, on one node |
+| **C. Light single-threaded utilities** | `merge_compressed_fields`, `open_zarr_and_inspect`, `perform_clustering`, `analyze_clustering`, `plot_compression_errors`, all UI commands | Nothing — pure single-threaded Python |
+
+The first two kinds are the interesting ones, and most of this document is about them. Kind C commands do small post-processing work (metadata consolidation, clustering of result tables, plot generation, web UIs) where parallelism would not change anything meaningful.
+
+---
+
+## Why threading works at all in Python (the GIL question)
+
+A common worry: "Python has a Global Interpreter Lock (the GIL), so threads can't actually run in parallel — right?"
+
+This is *technically* true but *practically* irrelevant for our workload. The GIL only blocks Python code from running on multiple threads at the same time. It does **not** block code written in C/C++/Rust — and the heavy work in `dc_toolkit` is all in such libraries:
+
+- Compression itself (Blosc, zfp, EBCC, Quantize, BitRound, …) is C/C++ code that releases the GIL while compressing.
+- numpy reductions used to compute error norms (L1/L2/L∞) are also C code that releases the GIL.
+- HDF5 / NetCDF / zarr file I/O releases the GIL during reads and writes.
+
+Roughly **99% of each compression operation's wall time** is spent inside such GIL-releasing C code. So 32 threads in a Python process can keep 32 cores genuinely busy doing real compression work, all in parallel, with the GIL only briefly serializing the small Python orchestration glue between calls.
+
+The rule of thumb: Python threading works well when the heavy work is in compiled libraries. Compression is one of those workloads. Pure-Python loops (think: parsing a big JSON file in Python code) would be a different story — but that's not what we do.
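+
+A minimal way to see this from plain Python, without touching `dc_toolkit` itself: the sketch below uses the standard-library `zlib` as a stand-in for the real codecs (it is C code that releases the GIL while compressing) and drives it from a `ThreadPoolExecutor`. The `encode` helper and the block sizes are illustrative assumptions, not part of the toolkit.
+
+```python
+import os
+import zlib
+from concurrent.futures import ThreadPoolExecutor
+
+
+def encode(block: bytes) -> bytes:
+    # Stand-in codec call: zlib.compress runs in C and releases the GIL
+    # while it works, so several threads can be inside it at once.
+    return zlib.compress(block, 6)
+
+
+# Eight independent blocks of 256 KiB each (random 64 KiB repeated 4 times).
+blocks = [os.urandom(1 << 16) * 4 for _ in range(8)]
+
+# Threaded path: four worker threads pull blocks and encode them.
+with ThreadPoolExecutor(max_workers=4) as pool:
+    threaded = list(pool.map(encode, blocks))
+
+# Serial path for comparison: same inputs, same codec, one thread.
+serial = [encode(b) for b in blocks]
+
+# Threading changes who does the work, not the encoded bytes.
+assert threaded == serial
+```
+
+Timing the two paths on a multi-core machine is the quickest way to check whether a given library actually releases the GIL for your workload.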
+
+---
+
+## `evaluate_combos`: parallelism in two layers
+
+This is the most parallelism-intensive command in the toolkit. It evaluates ~13,000 different codec configurations against a representative sample of a field, scores each one, and picks the best.
+
+The work is *embarrassingly parallel*: each combo is independent. The toolkit takes advantage of this on two layers.
+
+### Layer 1: across nodes (MPI)
+
+Run with N nodes, one MPI rank per node:
+
+```bash
+#SBATCH --nodes=8 --ntasks-per-node=1 --cpus-per-task=32
+```
+
+The 13,000 combos get split across the 8 ranks deterministically (rank 0 takes every 8th combo starting at 0, rank 1 takes every 8th starting at 1, etc). Each rank works on its slice independently — no coordination during the sweep, just a final result-gather at the end.
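+
+The round-robin split described above can be sketched in a few lines. This is an illustration of the scheme, not the toolkit's own code; `combos_for_rank` is a hypothetical helper name.
+
+```python
+def combos_for_rank(n_combos: int, rank: int, n_ranks: int) -> range:
+    """Indices of the combos one rank would evaluate under a round-robin split."""
+    # Rank r takes combo r, r + n_ranks, r + 2 * n_ranks, ...
+    return range(rank, n_combos, n_ranks)
+
+
+# 13,000 combos over 8 ranks.
+slices = [combos_for_rank(13_000, r, 8) for r in range(8)]
+
+# Every combo is assigned exactly once, with no coordination between ranks.
+assert sorted(i for s in slices for i in s) == list(range(13_000))
+
+# Each rank ends up with the same share of the work.
+assert {len(s) for s in slices} == {1_625}
+```
+
+Because the assignment depends only on the rank number and the rank count, each rank can compute its own slice locally.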
+
+Adding more nodes gives a clean linear speedup at this layer. Two nodes process combos roughly 2× faster than one; eight nodes process them 8× faster than one.
+
+### Layer 2: within one node (threads + the bypass)
+
+Now zoom into one rank — one Python process on one node. It has ~1,625 combos to evaluate. To use the node's 32 cores, the rank uses a `ThreadPoolExecutor` with 32 worker threads. Each worker thread grabs a combo, evaluates it end to end, returns the result, then grabs the next.
+
+This is where things get interesting. The codec libraries we use are wrapped by zarr (a chunked-array storage library), and zarr has an internal scheduling mechanism (technically: an *asyncio event loop* — a single-threaded scheduler that runs *coroutines*, async functions that pause and resume cooperatively) that — in its standard form — assumes only one thread is calling it at a time. With our 32 threads all calling zarr concurrently, this scheduler becomes a bottleneck.
+
+We measured this directly: 32 threads with the standard zarr scheduling was ~5× slower than 32 separate MPI ranks doing the same total work. The 5× slowdown was *not* the GIL — it was zarr's internal queuing serializing all 32 threads through one scheduler (the single event loop running on one OS thread, processing one coroutine at a time).
+
+**The bypass** (`--bypass-zarr-sync`, default on) fixes this. Instead of all 32 threads sharing zarr's one scheduler, the bypass gives each thread its own scheduler (technically: a *per-thread event loop* — every user thread runs its own asyncio loop on its own OS thread, so the 32 loops progress in parallel with no shared chokepoint). Now all 32 threads can call zarr concurrently with no shared bottleneck. Coupled with one shared, bounded pool of 32 codec worker threads (technically: a `concurrent.futures.ThreadPoolExecutor` registered as the *default executor* on every per-thread loop, so the loops don't each spawn their own pool), the rank's parallelism becomes:
+
+```
+   32 user threads (one combo each, mostly waiting)
+                        │
+                        ▼
+   32 codec worker threads (the actual compression work)
+                        │
+                        ▼
+   32 cores busy doing real work in parallel
+```
+
+In other words: the bypass removes a serialization point that exists in zarr's defaults. It is *not* a workaround for Python or for the GIL — it's an adaptation to the specific multi-threaded usage pattern we need.
+
+### Why 32 threads and not 288?
+
+A Grace node has 288 cores. Why do we cap at 32?
+
+Because compression is **memory-bandwidth-bound**, not compute-bound. Each codec call streams data through memory rather than doing dense arithmetic. A Grace node has 4 sockets, and the aggregate memory bandwidth saturates at roughly 32 well-distributed worker threads. Adding more threads beyond that doesn't speed anything up — they just queue waiting for memory. Direct testing on Santis confirmed this: a 9× larger thread budget came out 14% *slower* on production-size files due to cache pressure, and the result holds across multiple configurations.
+
+So 32 is the right number for this hardware and workload. It also happens to match the cap that Python's `concurrent.futures.ThreadPoolExecutor` applies to its default `max_workers` (`min(32, cpu_count + 4)`), so things line up nicely.
+
+### Codec-internal threading (`--codec-threads`, default off)
+
+The codec libraries themselves can use multiple threads inside a single encode call (Blosc has built-in support; zfp/EBCC use OpenMP). The `--codec-threads N` flag exposes this. We tested it; it doesn't help on production-size files for the same memory-bandwidth reason. Leave at the default (1) unless you have a specific reason and can A/B-test the change.
+
+### Summary for `evaluate_combos`
+
+```
+8 nodes × 32 threads × 1 codec-thread/encode = 256 cores of effective parallelism
+```
+
+Memory: each node holds one copy of the 5 GB sample (the bypass is what made this possible — the alternative of 32 ranks per node would duplicate the sample 32 times and OOM).
+
+---
+
+## `compress_with_optimal` and `compress_fields_from_results`: dask on chunks
+
+These commands take the winning codec from a sweep and apply it to persist a real field (or all fields). Unlike `evaluate_combos`, the work isn't 13,000 independent combos — it's *one* big array that needs to be encoded. So the parallelism strategy is different.
+
+### One Python process, no MPI
+
+Both commands explicitly run as a single Python process. They abort if accidentally launched with multiple MPI ranks, because the work doesn't decompose cleanly across ranks at the file level.
+
+### Parallelism via dask's threaded scheduler
+
+The compression of one big field is split into work units called **chunks** (one chunk = one codec call, ~16 MiB by default). Dask's threaded scheduler distributes those chunks across worker threads:
+
+```
+   Field (e.g., 5 GB)
+         │
+         ▼
+   Split into 320 chunks of 16 MiB each
+         │
+         ▼
+   Dask scheduler: 32 worker threads pull chunks from a queue
+         │
+         ▼
+   Each worker:  read chunk → encode chunk (in C) → emit encoded bytes
+         │
+         ▼
+   Encoded chunks bundled into shards (~512 MiB each, default)
+         │
+         ▼
+   Shards written to disk
+```
+
+The per-thread mechanics are the same as in `evaluate_combos`: codec calls in C release the GIL, so 32 threads run 32 codec calls in parallel. The difference is what the threads are *doing*: in `evaluate_combos` each thread drives a whole combo end-to-end; here each thread processes one chunk.
+
+### Chunks vs shards (a frequent confusion)
+
+These are two different cuts of the same data. Different jobs, different sizes:
+
+- **Chunk** (`--inner-chunk-mib`, default 16 MiB): the unit the codec works on. One codec call = one chunk. Smaller chunks have higher per-call overhead but more parallelism granularity; larger chunks compress more efficiently but use more memory.
+- **Shard** (`--shard-mib`, default 512 MiB): the unit zarr writes to disk. One shard = one file. Many chunks bundle into one shard so we don't end up with millions of tiny files (which would be a disaster on shared HPC filesystems).
+
+Dask operates on chunks. Shards are an output-side concern handled by zarr's writer. The two sizes are independent knobs.
+
+### Memory model
+
+Peak memory ≈ `--threads × --shard-mib`. With defaults that's `32 × 512 MiB = 16 GiB`, well under the per-node budget. If you bump `--shard-mib`, memory scales linearly.
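+
+The chunk, shard and memory figures quoted in this section follow from simple arithmetic on the defaults. The sketch below reproduces them; `persist_plan` is a hypothetical helper, and the peak-memory line is the rule of thumb stated above, not a measurement.
+
+```python
+MIB = 1024 ** 2
+GIB = 1024 ** 3
+
+
+def persist_plan(field_bytes, threads=32, inner_chunk_mib=16, shard_mib=512):
+    """Back-of-the-envelope numbers for one persist run (illustrative only)."""
+    chunk = inner_chunk_mib * MIB
+    shard = shard_mib * MIB
+    return {
+        # One codec call per chunk; ceiling division covers a partial last chunk.
+        "n_chunks": -(-field_bytes // chunk),
+        # How many chunks are bundled into one shard file.
+        "chunks_per_shard": shard // chunk,
+        # Rule of thumb from this section: threads x shard size.
+        "peak_gib": threads * shard / GIB,
+    }
+
+
+plan = persist_plan(5 * GIB)
+assert plan == {"n_chunks": 320, "chunks_per_shard": 32, "peak_gib": 16.0}
+```
+
+Doubling `shard_mib` in this model doubles `peak_gib` and `chunks_per_shard` but leaves `n_chunks` unchanged, which is the sense in which the two sizes are independent knobs.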
+
+### `--codec-threads` here too
+
+Same flag as in `evaluate_combos`, same default (1), same guidance: leave at 1 unless you've measured otherwise.
+
+### Difference between the two commands
+
+| | `compress_with_optimal` | `compress_fields_from_results` |
+|---|---|---|
+| Reads the manifest from a sweep | Yes — for one field | Yes — for all fields |
+| Runs the encode for | One field | All fields, sequentially (one at a time) |
+| When to use | One-off persist | Bulk persist after a full sweep |
+
+The "batch" version processes fields one after another, not in parallel — so the per-field memory and CPU profile is the same as the single-field version. The benefit is opening the dataset once and reading manifests automatically.
+
+---
+
+## `from_nc_to_zarr` and `from_zarr_to_netcdf`: format conversion
+
+These convert between NetCDF and zarr. Same architecture as `compress_with_optimal`: single Python process, dask threaded scheduler with `--threads` workers operating on chunks.
+
+The differences from `compress_with_optimal`:
+
+- `from_nc_to_zarr` writes uncompressed zarr (no codecs), so there's no `--codec-threads` flag.
+- `from_zarr_to_netcdf` decodes the zarr (which involves running the codec stack in reverse), so it does have `--codec-threads`. It also has a `--max-size` safety guard against accidentally producing huge `.nc` files from compressed zarr stores.
+
+Both should be launched single-process: `srun -n 1 ...` or no srun.
+
+---
+
+## Light commands (Kind C, summarized)
+
+- **`merge_compressed_fields`** — consolidates metadata on a `.zarr` store. Single-threaded, runs in seconds, just walks zarr's metadata structure.
+- **`open_zarr_and_inspect`** — read-only diagnostic that prints array shapes, dtypes, codec config. Single-threaded.
+- **`perform_clustering` / `analyze_clustering`** — k-means clustering on the small results table from a sweep. Single-threaded Python (with optional BLAS-internal threading if you don't pin `OMP_NUM_THREADS=1`).
+- **`plot_compression_errors`** — runs one codec configuration once for diagnostic plotting. Single-threaded.
+- **All UI commands** (`run_web_ui`, `run_local_ui`, `run_web_ui_vcluster`) — Streamlit web servers and SLURM submission helpers. Not compute-intensive; single-threaded.
+
+These commands are intentionally simple. They run on a login node or a small interactive allocation, complete quickly, and don't need parallelism.
+
+---
+
+## A few key invariants
+
+To keep the parallelism behaving correctly, the toolkit enforces some invariants at startup. Most of the time you don't need to think about them, but they're worth being aware of:
+
+- **One MPI rank per node** is the default for `evaluate_combos`. Multi-rank-per-node would duplicate the sample buffer once per rank and OOM on large fields.
+- **Codec env vars must be pinned to 1**:
+  ```bash
+  export OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 \
+         BLOSC_NTHREADS=1 NUMBA_NUM_THREADS=1 \
+         VECLIB_MAXIMUM_THREADS=1 OMP_THREAD_LIMIT=1
+  ```
+  This prevents the codec libraries from spawning their own internal thread pools that would compete with our outer threads. By default the toolkit aborts at startup if any of these are unset or non-1 (`--no-oversubscription-check` disables the guard).
+- **`--threads × --codec-threads ≤ physical cores`** is enforced by a built-in product check.
+- **Single-rank commands** (everything except `evaluate_combos`) abort if launched with `mpirun -n >1` or `srun -n >1`.
+
+---
+
+## Where to read further
+
+- `santis.run` — the production driver script, with an inline comment block summarizing the experiments on Santis that informed the topology choices.
+- `src/dc_toolkit/cli.py` — the actual implementation of every command.
+- `src/dc_toolkit/utils.py` — the bypass machinery (`_AsyncBypass`, `_get_or_create_shared_executor`, `_get_thread_event_loop`) and the various safety checks (`check_thread_oversubscription`, `_check_thread_product`, the memory-headroom checks).
+
+---
+
+## TL;DR
+
+| Command | Parallelism in one line |
+|---|---|
+| `evaluate_combos` | N nodes × 32 threads/node, each thread handles one full combo, with the bypass to avoid zarr's internal scheduler bottleneck |
+| `compress_with_optimal` | One process, dask's threaded scheduler with 32 workers, each worker processes one chunk |
+| `compress_fields_from_results` | Same as above, fields done sequentially (one at a time) |
+| `from_nc_to_zarr` | Same as above, no codec stack (uncompressed write) |
+| `from_zarr_to_netcdf` | Same as above, with codec stack to decode |
+| `merge_compressed_fields` | Single-threaded metadata consolidation |
+| `open_zarr_and_inspect` | Single-threaded read-only inspection |
+| `perform_clustering`, `analyze_clustering` | Single-threaded clustering |
+| `plot_compression_errors` | Single-threaded diagnostic plotting |
+| UI commands | Single-threaded Streamlit / SLURM helpers |
+
+The interesting commands are `evaluate_combos` (uses MPI + threads + the bypass) and the four dask-driven write commands (single process + dask threaded scheduler on chunks). Everything else is intentionally simple.
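Reviewer sketch: the codec env-var invariant from "A few key invariants" could be checked along the following lines. This is a minimal illustration, not the toolkit's `check_thread_oversubscription`; the helper name `find_unpinned_thread_vars` and its return shape are assumptions made for this example. Only the variable list is taken from the document.

```python
import os
from typing import Mapping, Optional

# Variables the document lists as needing to be pinned to 1.
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
    "BLOSC_NTHREADS", "NUMBA_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS", "OMP_THREAD_LIMIT",
)


def find_unpinned_thread_vars(
    env: Optional[Mapping[str, str]] = None,
    expected: int = 1,
) -> dict:
    """Return {name: current_value_or_None} for every variable that is
    unset or differs from `expected` (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        name: env.get(name)
        for name in THREAD_ENV_VARS
        if env.get(name) != str(expected)
    }
```

An empty result means every variable is pinned. A caller could print the offending names and exit non-zero otherwise, which is roughly the abort-at-startup behaviour the document describes.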
diff --git a/install_dc_toolkit.sh b/install_dc_toolkit.sh
old mode 100755
new mode 100644
index 2c13c4a..7ca3d92
--- a/install_dc_toolkit.sh
+++ b/install_dc_toolkit.sh
@@ -22,3 +22,18 @@ git clone --recursive "$EBCC_REMOTE" "$EBCC_DIR"
 pushd "$EBCC_DIR"
 pip install -e ".[zarr]"
 popd
+
+# Thread-pinning is no longer baked into the venv.  Export the codec-internal
+# thread caps manually (e.g. in your sbatch script):
+#
+#   export OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 \
+#          BLOSC_NTHREADS=1 NUMBA_NUM_THREADS=1 \
+#          VECLIB_MAXIMUM_THREADS=1 OMP_THREAD_LIMIT=1
+#
+# dc_toolkit's oversubscription-check (on by default) catches any lapse.
+# Use --codec-threads N (where supported) to deliberately allow internal
+# codec threading; --threads * --codec-threads must stay <= physical cores.
+echo "[install] Done.  Remember to export thread-pinning env vars manually:"
+echo "[install]   export OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 \\"
+echo "[install]          BLOSC_NTHREADS=1 NUMBA_NUM_THREADS=1 \\"
+echo "[install]          VECLIB_MAXIMUM_THREADS=1 OMP_THREAD_LIMIT=1"
diff --git a/pyproject.toml b/pyproject.toml
index d3ef89f..aec7bbe 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -9,17 +9,17 @@ authors = [
 dependencies = [
     "click",
     "pyyaml",
-    "numpy==2.2.6",
+    "numpy",
     "xarray",
     "humanize",
     "streamlit",
     "tslearn",
     "scipy",
     "PyWavelets",
-    "dask[complete]==2025.7.0",
+    "dask[complete]",
     "h5py >=3.11",
-    "zarr >=3.1.1, <3.1.3",
-    "numcodecs <0.16.3",
+    "zarr >=3.1.1",
+    "numcodecs",
     "zarr-any-numcodecs >=0.1.3",
     "json5",
     "rich",
@@ -31,6 +31,7 @@ dependencies = [
     "h5netcdf",
     "plotly",
     "cfgrib",
+    "psutil",
     "numcodecs_wasm",
     "numcodecs_combinators",
     "numcodecs_observers",
@@ -58,10 +59,16 @@ version = '0.0'
 
 [project.scripts]
 dc_toolkit = 'dc_toolkit.cli:cli'
-model_eval_cscs_exclaim = 'model_eval_cscs_exclaim.cli:cli'
 
 [tool.setuptools]
 package-dir = {"" = "src"}
 
+[tool.setuptools.package-data]
+# Ship the L1 error threshold CSV inside the wheel.  Without this entry,
+# `pip install .` (non-editable) builds a wheel containing only the .py
+# files, and the runtime lookup at `Path(__file__).parent / "data" / ...`
+# in cli.py fails to find the local fallback.
+dc_toolkit = ["data/*.csv"]
+
 [tool.setuptools.packages.find]
 where = ["src"]
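Reviewer sketch: the `package-data` entry matters because the runtime lookup resolves the CSV relative to the installed module. The snippet below illustrates that lookup pattern under stated assumptions: the function name `local_threshold_csv` is hypothetical, and the file name `l1_error_thresholds.csv` is taken from the `.gitignore` exception earlier in this diff rather than from `cli.py` itself.

```python
from pathlib import Path
from typing import Optional


def local_threshold_csv(module_file: str) -> Optional[Path]:
    """Return the packaged fallback CSV next to `module_file`, or None.

    Follows the `Path(__file__).parent / "data" / ...` pattern quoted in the
    pyproject comment; the function itself is a hypothetical illustration.
    """
    candidate = Path(module_file).parent / "data" / "l1_error_thresholds.csv"
    return candidate if candidate.is_file() else None
```

With a non-editable install that lacks the `package-data` entry, a lookup like this would return `None`, because the wheel would contain no `data/` directory beside the module.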
diff --git a/santis.run b/santis.run
index 28cc408..3601e1c 100644
--- a/santis.run
+++ b/santis.run
@@ -1,39 +1,226 @@
 #!/bin/bash
+#
+# Production codec sweep on Santis (Grace nodes, 288 cores/node).
+#
+# Topology: 1 MPI rank/node x 32 threads/rank, --bypass-zarr-sync.
+#
+# This setup was selected after extensive testing on Santis.  Three
+# alternatives were considered and rejected:
+#
+#   Alt A: codec-internal threading (--codec-threads > 1, larger chunks).
+#     Tested at --codec-threads=9 with --inner-chunk-mib=64 on a real
+#     production-size sweep (qv at R02B08, 4.6 GiB, 1000 combos).
+#     LOST by 14% vs baseline.  Cause: cache pressure and memory-
+#     bandwidth saturation; the codec-internal helper threads compete
+#     for the same memory channels as the outer parallelism.  A small
+#     in-cache file (120 MiB tot_prec) had shown a 17% gain at this
+#     setting, but the gain did not carry to production-size files.
+#     Lesson: codec-internal threading is not productive at production
+#     scales on this hardware.
+#
+#   Alt B: pure-MPI 1 rank x 1 thread, no bypass.
+#     Tested on the same qv file at 1000 combos.  Bypass 1x32 finished
+#     in 32 min; pure-MPI 1x1 was projected to 4+ hours (7.6x slower)
+#     and was canceled mid-run.  Cause: encode and decode DO parallelise
+#     in pure-MPI 1x1 — zarr's internal worker pool (32 threads) keeps
+#     32 chunk encodes in flight regardless of user-thread count, so
+#     those phases are fine.  But the metrics phase (numpy elementwise
+#     ops + L1/L2/Linf reductions) runs as plain numpy on the user
+#     thread itself, not under zarr.  With 1 user thread, only 1 core
+#     does all metrics work; with 32 user threads, 32 metrics
+#     computations run in parallel (one per user thread).  Metrics is
+#     ~39% of per-combo wall time, so this phase alone takes ~32x
+#     longer in 1x1, dominating the total slowdown.
+#
+#   Alt C: multi-rank-per-node (32 ranks x 1 thread).
+#     Faster per combo, but each rank duplicates the sample buffer in
+#     memory.  At 5 GB samples that is 32 x 5 GB = 160 GB per node,
+#     which OOMs on R02B08 and R02B10.  Bypass 1x32 keeps memory at
+#     1 sample per node and tolerates the modest throughput loss vs
+#     pure MPI in exchange for not OOMing.
+#
+# Verdict:  current setup is correct.  Do NOT bump --cpus-per-task,
+#          do NOT enable --codec-threads, do NOT switch topology
+#          without a fresh A/B test against this baseline.
+#
+# Codec space: numcodecs-wasm and EBCC are off by default (~13k combos
+# instead of ~26k); the full lossy family is still covered via
+# numcodecs.zarr3.ZFPY + BitRound + Quantize.
+#
+# See docs/PARALLELIZATION.md for a thorough walkthrough of the
+# parallelism architecture used by every dc_toolkit command.
 
-#SBATCH --account=...
-#SBATCH --uenv=prgenv-gnu/24.11:v2
+#SBATCH --account=YOUR_ACCOUNT_HERE
+#SBATCH --uenv=prgenv-gnu/26.3:v1
 #SBATCH --view=default
-#SBATCH --partition=debug
-#SBATCH --ntasks-per-node=32
-#SBATCH --nodes=1
-#SBATCH --time=00:30:00
+#SBATCH --partition=normal
+#SBATCH --nodes=8
+#SBATCH --ntasks-per-node=1
+#SBATCH --cpus-per-task=32
+#SBATCH --time=12:00:00
 #SBATCH --output=out-%j.out
 #SBATCH --error=out-%j.out
 
+set -uo pipefail   # NOT -e: continue past per-file failures
+
 export PYTHONUNBUFFERED=1
 source venv/bin/activate
 
-srun dc_toolkit \
-     evaluate_combos \
-     /capstor/store/cscs/userlab/cwp03/ppothapa/Data_Dyamond_PostProcessed/DYAMOND_R02B10L120_main/out_15/temp_20200410T000000Z.nc \
-     ./res_folder \
-     --field-to-compress temp \
-     --field-percentage-to-compress 10 \
-     --override-existing-l1-error 0.002 \
-     --compressor-class Zlib \
-     --serializer-class PCodec \
-     --without-lossy \
-     --without-numcodecs-wasm \
-     --without-ebcc
-
-# srun -n 1 dc_toolkit \
-#      compress_with_optimal \
-#      /capstor/store/cscs/userlab/cwp03/ppothapa/Data_Dyamond_PostProcessed/DYAMOND_R02B10L120_main/out_15/temp_20200410T000000Z.nc \
-#      ./res_folder \
-#      temp \
-#      2 0 9 \
-#      --compressor-class Zlib \
-#      --serializer-class PCodec \
-#      --without-lossy \
-#      --without-numcodecs-wasm \
-#      --without-ebcc
+# Codec-internal threadpools pinned to 1.  The bypass machinery already
+# parallelises codec dispatch across the 32 user threads; OpenMP/blosc
+# adding their own would oversubscribe.
+export OMP_NUM_THREADS=1 \
+       MKL_NUM_THREADS=1 \
+       OPENBLAS_NUM_THREADS=1 \
+       BLOSC_NTHREADS=1 \
+       NUMBA_NUM_THREADS=1 \
+       VECLIB_MAXIMUM_THREADS=1 \
+       OMP_THREAD_LIMIT=1
+
+# ---------------------------------------------------------------------------
+# Tunables
+# ---------------------------------------------------------------------------
+ROOT_R02B10=/capstor/store1/cscs/userlab/cwp03/zemanc/Data_Dyamond_PostProcessed
+ROOT_R02B08=/capstor/store1/cscs/userlab/cwp03/zemanc/Data_Dyamond_PostProcessed_R02B08
+ROOT_R02B06=/capstor/store1/cscs/userlab/cwp03/zemanc/Data_Dyamond_PostProcessed_R02B06
+
+RESULTS_BASE="${RESULTS_BASE:-./res_folder_dyamond_${SLURM_JOB_ID:-$$}}"
+mkdir -p "$RESULTS_BASE"
+
+echo "[santis.run] topology: 1 rank/node x 32 threads/rank x ${SLURM_JOB_NUM_NODES:-?} nodes"
+
+EVAL_FLAGS=(
+    --eval-data-size-limit         5GB
+    --threads-per-rank             32
+)
+
+# Quick-test escape hatch: cap evals (e.g. MAX_EVALS=50 sbatch santis.run)
+MAX_EVALS="${MAX_EVALS:-}"
+if [[ -n "$MAX_EVALS" && "$MAX_EVALS" != "0" ]]; then
+    EVAL_FLAGS+=(--max-evals "$MAX_EVALS")
+    echo "[santis.run] --max-evals=$MAX_EVALS active (test mode)"
+fi
+
+# ---------------------------------------------------------------------------
+# File list
+# ---------------------------------------------------------------------------
+# Format:  <resolution>|<stream>|<filename>|<variable>|<l1_threshold>
+# L1 thresholds are in the variable's native units; rule of thumb ~0.5%
+# of typical std(field), aligned with Klöwer et al. (2021).
+LIST=(
+    # R02B06 backbone
+    "R02B06|out_1_1|remap_qv_20230121T000000Z.nc|qv|5e-6"          # 3D water vapor (remap)
+    "R02B06|out_1_2|remap_qi_20230121T000000Z.nc|qi|1e-8"          # 3D cloud ice
+    "R02B06|out_1_3|remap_temp_20230121T000000Z.nc|temp|0.05"      # 3D temperature
+    "R02B06|out_1_4|remap_w_20230121T000000Z.nc|w|5e-3"            # 3D vertical wind (remap)
+    "R02B06|out_2|tot_prec_20230121T000000Z.nc|tot_prec|1e-2"      # 2D 15-min precip
+    "R02B06|out_3|t_2m_20230121T000000Z.nc|t_2m|0.05"              # 2D 2-m temperature
+    "R02B06|out_4|w_20230121T000000Z.nc|w|5e-3"                    # 3D vertical wind (native)
+    "R02B06|out_6|sob_s_20230121T000000Z.nc|sob_s|1.0"             # 2D net SW at surface
+    "R02B06|out_8|remap_cape_20230121T000000Z.nc|cape|1.0"         # 2D CAPE
+    "R02B06|out_9|remap_t_so_20230121T000000Z.nc|t_so|0.05"        # Soil temperature
+    "R02B06|out_11|geopot_20230121T000000Z.nc|geopot|1.0"          # 3D geopot
+
+    # Cross-resolution spot-checks
+    "R02B08|out_1_1|remap_qv_20230121T000000Z.nc|qv|5e-6"
+    "R02B10|out_1_1|remap_qv_20230121T000000Z.nc|qv|5e-6"
+    "R02B10|out_2|tot_prec_20230121T000000Z.nc|tot_prec|1e-2"
+)
+
+resolve_root () {
+    case "$1" in
+        R02B06) echo "$ROOT_R02B06" ;;
+        R02B08) echo "$ROOT_R02B08" ;;
+        R02B10) echo "$ROOT_R02B10" ;;
+        *)      echo ""; return 1 ;;
+    esac
+}
+
+# ---------------------------------------------------------------------------
+# Driver loop
+# ---------------------------------------------------------------------------
+n_total=${#LIST[@]}
+n_ok=0; n_fail=0; n_skip=0
+declare -a FAILED_ENTRIES=()
+
+echo "======================================================================="
+echo "dc_toolkit evaluate_combos sweep over ${n_total} files"
+echo "Results base: $RESULTS_BASE"
+echo "Started at:   $(date -Is)"
+echo "======================================================================="
+
+for i in "${!LIST[@]}"; do
+    entry="${LIST[$i]}"
+    IFS='|' read -r RES_TAG STREAM FNAME VAR L1_THRESHOLD <<<"$entry"
+
+    ROOT=$(resolve_root "$RES_TAG") || {
+        echo "[$((i+1))/$n_total] SKIP unknown resolution tag '$RES_TAG'"
+        n_skip=$((n_skip+1)); FAILED_ENTRIES+=("SKIP  $entry  (unknown resolution tag)")
+        continue
+    }
+    INPUT="$ROOT/$STREAM/$FNAME"
+    SUBDIR="${RES_TAG}_${STREAM}_${VAR}"
+    OUTDIR="$RESULTS_BASE/$SUBDIR"
+    LOG="$OUTDIR/run.log"
+
+    echo
+    echo "-----------------------------------------------------------------------"
+    echo "[$((i+1))/$n_total] $SUBDIR"
+    echo "  input    : $INPUT"
+    echo "  field    : $VAR"
+    echo "  l1 thresh: $L1_THRESHOLD"
+    echo "  outdir   : $OUTDIR"
+
+    if [[ ! -e "$INPUT" ]]; then
+        echo "  STATUS: SKIP - input file not found"
+        n_skip=$((n_skip+1))
+        FAILED_ENTRIES+=("SKIP  $SUBDIR  (missing input)")
+        continue
+    fi
+
+    mkdir -p "$OUTDIR"
+
+    srun --unbuffered dc_toolkit \
+         evaluate_combos \
+         "$INPUT" \
+         --where-to-write              "$OUTDIR" \
+         --field-to-compress           "$VAR" \
+         --override-existing-l1-error  "$L1_THRESHOLD" \
+         "${EVAL_FLAGS[@]}" \
+         2>&1 | stdbuf -oL -eL tee "$LOG"
+
+    rc=${PIPESTATUS[0]}
+    if [[ $rc -eq 0 ]]; then
+        echo "  STATUS: OK"
+        n_ok=$((n_ok+1))
+    else
+        echo "  STATUS: FAILED (exit $rc)"
+        n_fail=$((n_fail+1))
+        FAILED_ENTRIES+=("FAIL  $SUBDIR  (exit $rc)")
+    fi
+done
+
+# ---------------------------------------------------------------------------
+# Summary
+# ---------------------------------------------------------------------------
+echo
+echo "======================================================================="
+echo "Summary"
+echo "  total      : $n_total"
+echo "  ok         : $n_ok"
+echo "  failed     : $n_fail"
+echo "  skipped    : $n_skip"
+echo "  results    : $RESULTS_BASE"
+echo "  finished at: $(date -Is)"
+if (( n_fail + n_skip > 0 )); then
+    echo "  problems:"
+    for f in "${FAILED_ENTRIES[@]}"; do
+        echo "    $f"
+    done
+fi
+echo "======================================================================="
+
+if (( n_ok == 0 )); then
+    exit 1
+fi
+exit 0
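Reviewer sketch: the `LIST` entry format and the per-entry output directory naming used by the driver loop can be mirrored in Python, for example to validate a file list before submitting the job. The names `SweepEntry` and `parse_entry` are assumptions for this illustration and do not exist in the toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SweepEntry:
    resolution: str
    stream: str
    filename: str
    variable: str
    l1_threshold: float

    @property
    def subdir(self) -> str:
        # Same naming as SUBDIR="${RES_TAG}_${STREAM}_${VAR}" in the script.
        return f"{self.resolution}_{self.stream}_{self.variable}"


def parse_entry(entry: str) -> SweepEntry:
    """Split one '<resolution>|<stream>|<filename>|<variable>|<l1_threshold>' line."""
    parts = entry.split("|")
    if len(parts) != 5:
        raise ValueError(
            f"expected 5 '|'-separated fields, got {len(parts)}: {entry!r}"
        )
    resolution, stream, filename, variable, threshold = parts
    # float() also rejects a malformed threshold such as "5e-" early.
    return SweepEntry(resolution, stream, filename, variable, float(threshold))
```

Unlike the bash `read`, this sketch rejects entries with the wrong number of fields instead of silently leaving variables empty.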
diff --git a/src/dc_toolkit/cli.py b/src/dc_toolkit/cli.py
index b8ca6dc..54436e3 100644
--- a/src/dc_toolkit/cli.py
+++ b/src/dc_toolkit/cli.py
@@ -5,18 +5,21 @@
 #
 # Please, refer to the LICENSE file in the root directory.
 # SPDX-License-Identifier: BSD-3-Clause
+import hashlib
+import json
 import math
 import os
-import sys
 import io
-import traceback
-import click
-from tqdm import tqdm
+import sys
+import time
+from concurrent.futures import ThreadPoolExecutor, as_completed
 from pathlib import Path
-import math
-import zarr
-import shutil
 import itertools
+import subprocess
+import csv
+
+import click
+import zarr
 import numcodecs
 import numcodecs.zarr3
 import xarray as xr
@@ -25,32 +28,42 @@
 from ebcc.zarr_filter import EBCCZarrFilter
 import pandas as pd
 import numpy as np
-import matplotlib.pyplot as plt
-from sklearn.cluster import KMeans
-from sklearn.metrics import silhouette_score
 from mpi4py import MPI
 import dask
 import dask.array
-import plotly.io as pio
-import plotly.express as px
-import plotly.graph_objects as go
-from plotly.subplots import make_subplots
-import subprocess
 import humanize
+import psutil
+
+# Heavyweight optional imports (matplotlib / sklearn / plotly / tqdm) are
+# deferred: they are only used by the clustering and plotting commands below
+# and are imported lazily inside each of those commands.  Keeping them out of
+# the module-level import list means `dc_toolkit evaluate_combos` and
+# `compress_with_optimal` don't pay the import cost, and environments without
+# (e.g.) a matplotlib install can still run the main sweep.
 
 import warnings
 warnings.filterwarnings(
     "ignore",
-    message="Numcodecs codecs are not in the Zarr version 3 specification and may not be supported by other zarr implementations",
+    message="Numcodecs codecs are not in the Zarr version 3 specification.*",
     category=UserWarning,
-    module="numcodecs.zarr3"
 )
 warnings.filterwarnings(
-    "ignore",
-    message="Engine 'cfgrib' loading failed",
-    category=RuntimeWarning,
+    "ignore", message="Engine 'cfgrib' loading failed", category=RuntimeWarning,
 )
 warnings.filterwarnings("ignore", message="overflow encountered in square")
+# Cosmetic: at MPI step teardown, each rank's multiprocessing.resource_tracker
+# logs a "leaked semaphore" UserWarning because the parent dies before the
+# tracker has reaped its /dev/shm semaphores.  The semaphores are reclaimed
+# by the kernel at SLURM step end regardless, so the message is purely
+# noise; on a 256-rank job it produces several hundred lines that drown out
+# real warnings.  Suppress only this one specific message - leave the rest
+# of the UserWarning class active in case a real one appears elsewhere.
+warnings.filterwarnings(
+    "ignore",
+    message=r".*leaked semaphore objects.*",
+    category=UserWarning,
+    module=r"multiprocessing\.resource_tracker",
+)
 
 
 @click.group()
@@ -58,198 +71,1551 @@ def cli():
     pass
 
 
+# =============================================================================
+# Helpers specific to this CLI
+# =============================================================================
+
+def _size_option_callback(ctx, param, value):
+    if value is None:
+        return None
+    try:
+        return utils.parse_size(value)
+    except Exception as e:
+        raise click.BadParameter(f"Invalid size '{value}': {e}") from e
+
+
+def _is_ebcc_serializer(serializer) -> bool:
+    return (
+        isinstance(serializer, AnyNumcodecsArrayBytesCodec)
+        and isinstance(serializer.codec, EBCCZarrFilter)
+    )
+
+
+def _is_zfpy_serializer(serializer) -> bool:
+    return isinstance(serializer, numcodecs.zarr3.ZFPY)
+
+
+def _merged_store_path(where_to_write: str, dataset_file: str) -> str:
+    """One .zarr store per dataset; fields live as arrays inside.
+
+    Uses Path.stem so `foo.nc` -> `foo.zarr` (not `foo.nc.zarr`).  The store
+    name is derived only from the input filename, not its directory.
+    """
+    dataset_stem = Path(dataset_file).stem
+    return str(Path(where_to_write) / f"{dataset_stem}.zarr")
+
+
+def _version_banner(component_name: str) -> str:
+    """
+    Return a short string with the zarr, numpy, and dask versions, for
+    debugging provenance.  The CLI commands print it from rank 0 only.
+    """
+    zarr_ver = getattr(zarr, "__version__", "unknown")
+    np_ver = getattr(np, "__version__", "unknown")
+    dask_ver = getattr(dask, "__version__", "unknown")
+    return (
+        f"[env] {component_name} | zarr={zarr_ver} | numpy={np_ver} | "
+        f"dask={dask_ver}"
+    )
+
+
+def _abort(code: int = 1) -> None:
+    """Abort cleanly: comm.Abort under multi-rank, sys.exit otherwise."""
+    if MPI.COMM_WORLD.Get_size() > 1:
+        MPI.COMM_WORLD.Abort(code)
+    else:
+        sys.exit(code)
+
+
+def _apply_codec_threads(codec_threads: int, rank: int = 0) -> None:
+    """
+    Apply --codec-threads at runtime.  Blosc respects set_nthreads live;
+    OpenMP/MKL/OpenBLAS read env vars at lib-init and need shell exports.
+    """
+    if codec_threads is None or int(codec_threads) <= 1:
+        return
+    n = int(codec_threads)
+    env_vars = [
+        "OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
+        "BLOSC_NTHREADS", "NUMBA_NUM_THREADS",
+        "VECLIB_MAXIMUM_THREADS", "OMP_THREAD_LIMIT",
+    ]
+    mismatched = [(v, os.environ.get(v)) for v in env_vars
+                  if os.environ.get(v) != str(n)]
+    if mismatched and rank == 0:
+        click.echo(
+            f"[codec-threads] requested {n}; for full effect, export the "
+            f"following in your shell BEFORE running (Blosc is set live; "
+            f"OpenMP/MKL/OpenBLAS need shell exports):"
+        )
+        for v, cur in mismatched:
+            shown = "<unset>" if cur is None else cur
+            click.echo(f"  {v}={shown} -> export {v}={n}")
+    try:
+        import numcodecs.blosc as _blosc
+        _blosc.set_nthreads(n)
+    except Exception as e:
+        if rank == 0:
+            click.echo(f"[codec-threads] WARNING: blosc.set_nthreads failed: {e}")
+
+
+def _check_thread_product(threads: int, codec_threads: int, rank: int = 0) -> None:
+    """Abort if --threads * --codec-threads exceeds physical cores."""
+    cores = utils.detect_cores_available()
+    product = int(threads) * max(1, int(codec_threads or 1))
+    if product > cores:
+        if rank == 0:
+            click.echo(
+                f"[oversubscription] --threads * --codec-threads = "
+                f"{int(threads)} * {int(codec_threads or 1)} = {product} "
+                f"exceeds physical cores ({cores}). Reduce one of the flags."
+            )
+        _abort(1)
+
+
+def _per_rank_steady_estimate_bytes(
+    sample_bytes: int,
+    threads_per_rank: int,
+    inner_chunk_mib: int,
+    include_ebcc_overhead: bool = False,
+) -> int:
+    """
+    Steady-state memory footprint of one MPI rank during the codec sweep.
+
+    Components:
+      sample_bytes                          : the broadcast sample buffer,
+                                              alive for the entire sweep,
+                                              shared across all threads.
+      threads * sample_bytes *              : per-thread working set.
+        PER_THREAD_WORKING_FACTOR             Each ThreadPoolExecutor worker
+                                              runs evaluate_codec_pipeline
+                                              which allocates ITS OWN
+                                              decoded buffer (~1x sample),
+                                              ITS OWN MemoryStore of
+                                              encoded bytes (~0.01-1x
+                                              sample depending on codec
+                                              ratio), and small intermediate
+                                              codec scratch.  Pre-patch
+                                              treated this as 1x total
+                                              instead of threads x ~1.5x;
+                                              that under-count is what
+                                              produced the R02B10 out_15
+                                              OOMs (300 GiB fields at 32
+                                              threads).
+      threads * 2 * chunk_mib               : ThreadPoolExecutor float64
+                                              promotion in the metrics
+                                              loop; per-thread.
+      sample_bytes (EBCC)                   : optional float32 working
+                                              copy of the sample when
+                                              EBCC is in the search space
+                                              and the source dtype isn't
+                                              already float32.
+
+    This is the SAME formula `_max_sample_bytes_for_threads` inverts;
+    centralised here so the early-abort check, the user-facing banner,
+    and the auto-shrink budget can never drift apart.
+    """
+    threads = max(1, int(threads_per_rank))
+    decode_cache = int(
+        threads * sample_bytes * PER_THREAD_WORKING_FACTOR
+    )
+    thread_chunk = threads * 2 * max(1, int(inner_chunk_mib)) * (2 ** 20)
+    ebcc         = sample_bytes if include_ebcc_overhead else 0
+    return sample_bytes + decode_cache + thread_chunk + ebcc
+
+
+# Per-thread working-memory multiplier (decoded buffer + encoded
+# MemoryStore + codec scratch, in units of sample_bytes).  Empirically
+# 1.5x is a safe upper bound observed across the codec set:
+#   - decoded buffer:                    1.0x sample_bytes
+#   - encoded MemoryStore (worst case
+#     when ratio < 1 with bad codec):    0.0-1.0x sample_bytes (mean ~0.3x)
+#   - codec working/scratch:             0.1-0.3x sample_bytes
+# Defined at module level (not buried in the function) so the auto-shrink
+# inversion uses the SAME multiplier as the steady estimate -- they MUST
+# stay in lockstep.
+PER_THREAD_WORKING_FACTOR = 1.5
+
+
+def _max_sample_bytes_for_threads(
+    budget_bytes: int,
+    threads_per_rank: int,
+    inner_chunk_mib: int,
+    include_ebcc_overhead: bool = False,
+) -> int:
+    """
+    Inverse of `_per_rank_steady_estimate_bytes`: what's the largest
+    sample that fits within `budget_bytes`, given the thread/chunk
+    configuration?
+
+    Solving for sample:
+      sample
+      + threads * PER_THREAD_WORKING_FACTOR * sample
+      + threads * 2 * chunk_mib * MiB
+      + (sample if EBCC else 0)
+      <= budget
+
+    => sample * (1 + threads * factor + (1 if ebcc else 0))
+                + threads * 2 * chunk_mib * MiB
+       <= budget
+
+    => sample <= (budget - thread_chunk_bytes) / coeff
+
+    Returns 0 if no positive sample fits (caller should treat as
+    "cannot start sweep with this thread count").
+    """
+    threads = max(1, int(threads_per_rank))
+    coeff = (
+        1.0
+        + threads * PER_THREAD_WORKING_FACTOR
+        + (1.0 if include_ebcc_overhead else 0.0)
+    )
+    thread_chunk = threads * 2 * max(1, int(inner_chunk_mib)) * (2 ** 20)
+    available_for_sample = budget_bytes - thread_chunk
+    if available_for_sample <= 0:
+        return 0
+    return int(available_for_sample / coeff)
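+
+
+# Worked example of the two formulas above (illustrative inputs, not
+# measurements; inner_chunk_mib=16 is an assumed value, not a documented
+# default).  With threads=32, PER_THREAD_WORKING_FACTOR=1.5, no EBCC:
+#   thread_chunk = 32 * 2 * 16 MiB            =   1 GiB
+#   coeff        = 1 + 32 * 1.5               =  49
+#   forward:  sample = 5 GiB   ->  5 + 32 * 1.5 * 5 + 1 = 246 GiB per rank
+#   inverse:  budget = 246 GiB ->  (246 - 1) / 49       =   5 GiB sample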
+
+
+def _detect_node_memory_budget() -> tuple[int, str]:
+    """
+    Return (bytes_available, source_description) for the effective
+    node-memory budget.
+
+    Order of preference:
+      1. cgroup v2 limit (/sys/fs/cgroup/memory.max).  This is what
+         actually OOM-kills tasks under SLURM when --mem or
+         --mem-per-cpu is set, or under containers; psutil cannot
+         see it.
+      2. cgroup v1 limit (/sys/fs/cgroup/memory/memory.limit_in_bytes).
+         cgroup v1 stores a sentinel (~2^63) for "unlimited"; we
+         treat any value larger than 2x the host total as unlimited
+         and fall through.
+      3. sysconf SC_PHYS_PAGES * SC_PAGE_SIZE - the raw host total.
+         Last resort, accurate on bare metal.
+
+    Why not psutil.virtual_memory().available: that reports HOST
+    memory and ignores the cgroup limit.  On Santis the production
+    OOMs (job 843234) happened with ~290 GiB peak across 32 ranks
+    on a node nominally rated 480 GiB - the actual binding constraint
+    was the per-task cgroup, not the host total, and psutil missed it.
+    """
+    # cgroup v2 (unified hierarchy)
+    try:
+        with open("/sys/fs/cgroup/memory.max") as fh:
+            val = fh.read().strip()
+        if val and val != "max":
+            return int(val), "cgroup v2 memory.max"
+    except (OSError, ValueError):
+        pass
+
+    # cgroup v1 (legacy hierarchy)
+    for path in (
+        "/sys/fs/cgroup/memory/memory.limit_in_bytes",
+        "/sys/fs/cgroup/memory.limit_in_bytes",
+    ):
+        try:
+            with open(path) as fh:
+                val = int(fh.read().strip())
+        except (OSError, ValueError):
+            continue
+        # cgroup v1 reports a near-2^63 sentinel for "unlimited"; if
+        # the value is wildly larger than the host total, treat it as
+        # unset and fall through.
+        try:
+            host_total = os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
+        except (OSError, ValueError):
+            host_total = 0
+        if host_total and val < host_total * 2:
+            return val, f"cgroup v1 ({path})"
+        # else: looks like the unlimited sentinel; fall through
+
+    # sysconf host total
+    try:
+        return (
+            os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE"),
+            "host total RAM (sysconf)",
+        )
+    except (OSError, ValueError):
+        # Extremely rare; psutil as last resort.
+        return psutil.virtual_memory().total, "psutil host total"
+
+
+def _check_node_memory_headroom(
+    per_rank_steady_bytes: int,
+    ranks_on_node: int,
+    rank: int,
+    label: str,
+    threshold: float = 0.80,
+) -> None:
+    """
+    Memory check that adapts to whether the budget is per-task or per-node.
+
+    Under SLURM with cgroup-v2 task plugin, /sys/fs/cgroup/memory.max is
+    PER-TASK: each rank has its own cgroup with that limit, so the right
+    comparison is `per_rank_steady > threshold * budget`.  Under host-total
+    (no cgroup), all ranks share node RAM and the right comparison is
+    `ranks_on_node * per_rank_steady > threshold * budget`.
+
+    Only rank 0 evaluates and emits; comm.Abort propagates termination.
+    """
+    if rank != 0:
+        return
+
+    available, source = _detect_node_memory_budget()
+    is_cgroup = source.startswith("cgroup")
+
+    if is_cgroup:
+        required = per_rank_steady_bytes
+        scope_label = "per-rank (cgroup is per-task under SLURM)"
+    else:
+        required = max(1, ranks_on_node) * per_rank_steady_bytes
+        scope_label = f"per-node ({ranks_on_node} rank(s) x per-rank)"
+
+    if required > threshold * available:
+        click.echo(
+            f"[memcheck] REFUSING to start sweep: memory requirement "
+            f"{humanize.naturalsize(required, binary=True)} "
+            f"({scope_label}, "
+            f"{humanize.naturalsize(per_rank_steady_bytes, binary=True)} "
+            f"steady-state per rank) exceeds {int(threshold*100)}% of the "
+            f"detected budget {humanize.naturalsize(available, binary=True)} "
+            f"({source}).\n"
+            f"  Context: {label}\n"
+            f"  Fixes (any one):\n"
+            f"    - lower --ntasks-per-node in SBATCH\n"
+            f"    - lower --eval-data-size-limit (smaller sample buffer)\n"
+            f"    - request more RAM with #SBATCH --mem=0 (whole node) or "
+            f"--mem=<n>G\n"
+            f"    - drop --allow-multi-rank-per-node (1 rank/node + threads)\n"
+            f"  Override: raise --memory-threshold (default 0.80, max 0.95)."
+        )
+        _abort(1)
+
+
+_MEMCHECK_WARNED_HIGH = False
+
+
+def _reset_memcheck_state() -> None:
+    """Reset module-level memcheck state.  Call at the start of each command."""
+    global _MEMCHECK_WARNED_HIGH
+    _MEMCHECK_WARNED_HIGH = False
+
+
+def _check_memory_headroom(required_bytes: int, label: str, threshold: float = 0.80) -> None:
+    """
+    Refuse to allocate `required_bytes` if it would exceed `threshold` of
+    currently-available RAM.  Aborts the whole MPI world on violation.
+
+    Default threshold is 0.80: leaves headroom for rechunk transients
+    (1.5-2x), Python/dask/MPI overhead, and other processes.
+
+    Caveat: psutil.virtual_memory().available reports HOST memory, not the
+    cgroup limit when running inside a container or a SLURM allocation with
+    --mem set.  In that case the kernel/SLURM OOM-killer is the real guard.
+    """
+    global _MEMCHECK_WARNED_HIGH
+    if threshold > 0.80 and not _MEMCHECK_WARNED_HIGH:
+        click.echo(
+            f"[memcheck] WARNING: threshold {threshold:.2f} exceeds the "
+            f"recommended 0.80 ceiling.  The 1.5-2x rechunk transient at "
+            f"the write peak fits in the remaining headroom only up to "
+            f"~0.80; above that, OOM risk increases sharply."
+        )
+        _MEMCHECK_WARNED_HIGH = True
+    try:
+        avail = psutil.virtual_memory().available
+    except Exception as e:
+        click.echo(
+            f"[memcheck] WARNING: could not query available memory ({e}); "
+            f"skipping guard for {label}."
+        )
+        return
+    if required_bytes > threshold * avail:
+        click.echo(
+            f"[memcheck] REFUSING to proceed: {label} needs "
+            f"{humanize.naturalsize(required_bytes, binary=True)}, which "
+            f"exceeds {int(threshold*100)}% of currently-available RAM "
+            f"({humanize.naturalsize(avail, binary=True)}).\n"
+            f"  Reduce the relevant flag (e.g. --eval-data-size-limit, "
+            f"--threads, --shard-mib), raise --memory-threshold (max 0.95), "
+            f"or run on a larger node."
+        )
+        _abort(1)
+
+
+def _sample_signature(
+    dataset_file: str,
+    var: str,
+    eval_data_size_limit: int,
+    sample_np: np.ndarray,
+) -> dict:
+    """
+    Build a compact, deterministic signature of the representative sample
+    used to parameterise the codec space.
+
+    Purpose: the codec-space indices written by `evaluate_combos` are only
+    valid in `compress_with_optimal` when both commands see an *identical*
+    sample - because statistics (Asinh.linear_width, FixedOffsetScale.offset/
+    scale, EBCC chunk geometry) are derived from that sample.  We hash enough
+    of the sample's identity to detect a mismatch at the start of
+    `compress_with_optimal` and stop there instead of continuing silently.
+
+    Design:
+    - Hashes the full buffer bytes (sha256).  Fast enough for 5 GB (~5s on
+      modern CPUs); the alternative of hashing summary stats can alias on
+      pathological data.  We pay this once per variable, once per command.
+    - Includes `(shape, dtype, dataset_stem, var, eval_data_size_limit)`
+      alongside the content hash so a debug message can point at the
+      mismatch cause.
+
+    Returns a plain dict (json-serialisable).
+
+    Unsupported: object-dtype arrays.  numpy stores object arrays as a
+    buffer of pointer addresses, not their referents, so the hash would
+    include process-local memory addresses and be unreproducible across
+    runs.  We refuse early rather than silently emit a garbage signature.
+    Climate data is never object-dtype in practice, so this is defensive.
+    """
+    if sample_np.dtype == object:
+        raise ValueError(
+            "_sample_signature does not support object-dtype arrays: the "
+            "buffer holds pointer addresses, not values, so the hash would "
+            "be process-local and not reproducible across runs."
+        )
+    h = hashlib.sha256()
+    # memoryview over the numpy buffer avoids an extra copy.  We force
+    # contiguity at the broadcast site, so the buffer is already C-ordered.
+    mv = memoryview(np.ascontiguousarray(sample_np)).cast("B")
+    # Stream in 64 MiB chunks to keep the worst-case transient allocation low.
+    step = 64 * 1024 * 1024
+    n = len(mv)
+    for i in range(0, n, step):
+        h.update(mv[i:i + step])
+    return {
+        "dataset_stem": Path(dataset_file).stem,
+        "var": var,
+        "eval_data_size_limit": int(eval_data_size_limit),
+        "shape": list(sample_np.shape),
+        "dtype": str(sample_np.dtype),
+        "nbytes": int(sample_np.nbytes),
+        "sha256": h.hexdigest(),
+    }
+
+
+def _signature_path(where_to_write: str, var: str) -> Path:
+    return Path(where_to_write) / f"sample_signature_{var}.json"
+
+
+# =============================================================================
+# L1 error threshold lookup
+# =============================================================================
+# Source of truth: a Google Sheet ("Short Name", "Unit", "Existing L1 error").
+# Mirrored to a CSV next to this module for offline / CI fallback.
+
+_L1_THRESHOLDS_SHEET_ID = "1lHcX-HE2WpVCOeKyDvM4iFqjlWvkd14lJlA-CUoCxMM"
+_L1_THRESHOLDS_SHEET_URL = (
+    f"https://docs.google.com/spreadsheets/d/{_L1_THRESHOLDS_SHEET_ID}/export?format=csv"
+)
+_LOCAL_L1_THRESHOLDS_PATH = (
+    Path(__file__).parent / "data" / "l1_error_thresholds.csv"
+)
+
+
+def _load_l1_error_thresholds(rank: int) -> "pd.DataFrame | None":
+    """
+    Load the L1 error threshold table.
+
+    Resolution order:
+      1. Remote Google Sheet (authoritative).  On success the result is
+         mirrored to `_LOCAL_L1_THRESHOLDS_PATH` so the next offline run
+         has an up-to-date snapshot.
+      2. Local CSV at `_LOCAL_L1_THRESHOLDS_PATH` (committed to the repo).
+         Used only when the remote fetch raises, typically because there
+         is no network, the sheet was renamed or its sharing permissions
+         changed, or the CSV export endpoint is briefly down.
+
+    Returns the DataFrame on success, or `None` if neither source yields
+    one.  The caller is responsible for aborting with a useful message in
+    that case.
+
+    Only the calling rank reads/writes; the resulting DataFrame is
+    broadcast by the caller (we do not enter MPI here so the helper stays
+    usable from non-MPI contexts as well).
+
+    The local cache write is best-effort: if the package directory is
+    read-only (e.g. installed into a system site-packages, sandboxed CI),
+    we log a note and still return the in-memory DataFrame.  The remote
+    payload is the same data, so a failed cache is not a fatal condition.
+    """
+    # ---- 1) Try remote --------------------------------------------------
+    try:
+        thresholds = pd.read_csv(_L1_THRESHOLDS_SHEET_URL)
+    except Exception as remote_err:
+        click.echo(
+            f"[Rank {rank}] Remote L1 threshold fetch failed: {remote_err}.  "
+            f"Trying local fallback at {_LOCAL_L1_THRESHOLDS_PATH}."
+        )
+    else:
+        # Refresh the on-disk fallback.  Best-effort - see docstring.
+        try:
+            _LOCAL_L1_THRESHOLDS_PATH.parent.mkdir(parents=True, exist_ok=True)
+            thresholds.to_csv(_LOCAL_L1_THRESHOLDS_PATH, index=False)
+        except Exception as cache_err:
+            click.echo(
+                f"[Rank {rank}] Note: could not refresh local L1 threshold "
+                f"cache at {_LOCAL_L1_THRESHOLDS_PATH} ({cache_err}); "
+                f"continuing with the remote copy."
+            )
+        return thresholds
+
+    # ---- 2) Try local fallback -----------------------------------------
+    if _LOCAL_L1_THRESHOLDS_PATH.is_file():
+        try:
+            thresholds = pd.read_csv(_LOCAL_L1_THRESHOLDS_PATH)
+            click.echo(
+                f"[Rank {rank}] Loaded L1 thresholds from local fallback "
+                f"({_LOCAL_L1_THRESHOLDS_PATH})."
+            )
+            return thresholds
+        except Exception as local_err:
+            click.echo(
+                f"[Rank {rank}] Local L1 threshold fallback at "
+                f"{_LOCAL_L1_THRESHOLDS_PATH} is unreadable: {local_err}."
+            )
+    else:
+        click.echo(
+            f"[Rank {rank}] No local L1 threshold fallback found at "
+            f"{_LOCAL_L1_THRESHOLDS_PATH}."
+        )
+    return None
+
+
 @cli.command("evaluate_combos")
 @click.argument("dataset_file", type=click.Path(exists=True, dir_okay=True, file_okay=True))
-@click.argument("where_to_write", type=click.Path(dir_okay=True, file_okay=False, exists=False))
-@click.option("--field-to-compress", default=None, help="Field to compress [if not given, all fields will be compressed].")
-@click.option("--field-percentage-to-compress", default=None, callback=utils.validate_percentage, help="Compress a percentage of the field [1-99%]. If not given, the whole field will be compressed.")
-@click.option("--override-existing-l1-error", type=float, default=None, help="Override the existing L1 error threshold from the lookup table. If provided, this value will be used instead of the spreadsheet value.")
-@click.option("--compressor-class", default="all", help="Compressor class to use (case insensitive), i.e. specified one instead of the full list `all` [`none` skips all compressors].")
-@click.option("--filter-class", default="all", help="Filter class to use (case insensitive), i.e. specified one instead of the full list `all` [`none` skips all filters].")
-@click.option("--serializer-class", default="all", help="Serializer class to use (case insensitive), i.e. specified one instead of the full list `all` [`none` skips all serializers].")
-@click.option("--with-lossy/--without-lossy", default=True, show_default=True, help="Enable or disable lossy compressors/filters/serializers.")
-@click.option("--with-numcodecs-wasm/--without-numcodecs-wasm", default=True, show_default=True, help="Enable or disable Numcodecs-wasm codecs.")
-@click.option("--with-ebcc/--without-ebcc", default=True, show_default=True, help="Enable or disable EBCC serializer.")
-def evaluate_combos(dataset_file: str, where_to_write: str, 
-                    field_to_compress: str | None = None, field_percentage_to_compress: str | None = None, override_existing_l1_error: float | None = None,
-                    compressor_class: str = "all", filter_class: str = "all", serializer_class: str = "all",
-                    with_lossy: bool = True, with_numcodecs_wasm: bool = True, with_ebcc: bool = True):
-    """
-    Loop over combinations of compressors, filters, and serializers to find the optimal configuration for compressing a given field in a dataset file.
-
-    List of compressors : Blosc, LZ4, Zstd, Zlib, GZip, BZ2, LZMA \n
-    List of filters     : Delta, BitRound, Quantize, Asinh, FixedOffsetScale \n
-    List of serializers : PCodec, ZFPY, EBCCZarrFilter, Zfp
-
-    \b
-    Args:
-        dataset_file (str): Path to the input dataset file.
-        where_to_write (str): Directory where the output files will be written.
-        field_to_compress: --field-to-compress
-        field_percentage_to_compress: --field-percentage-to-compress
-        override_existing_l1_error: --override-existing-l1-error
-        compressor_class: --compressor-class
-        filter_class: --filter-class
-        serializer_class: --serializer-class
-        with_lossy: --with-lossy/--without-lossy
-        with_numcodecs_wasm: --with-numcodecs-wasm/--without-numcodecs-wasm
-        with_ebcc: --with-ebcc/--without-ebcc
+@click.option("--where-to-write", "where_to_write", required=True,
+              type=click.Path(dir_okay=True, file_okay=False, exists=False),
+              help="Directory where sweep outputs are written: per-var config "
+                   "space CSV, per-rank streaming partials, consolidated "
+                   "`results_{var}.parquet`, and the legacy scored-results "
+                   "`.npy`.  The directory is created if it doesn't exist.")
+@click.option("--field-to-compress", default=None,
+              help="Field to compress [if not given, all fields will be evaluated].")
+@click.option("--eval-data-size-limit", default="5GB", callback=_size_option_callback,
+              show_default=True,
+              help="Sample size budget (e.g. '5GB', '512MiB').  If the field "
+                   "fits, the full field is used; otherwise a strided "
+                   "subsample along the leading dim is taken.  Must match the "
+                   "value passed to compress_with_optimal for codec-space "
+                   "indices to resolve identically.")
+@click.option("--threads-per-rank", type=int, default=None,
+              help="Threads per MPI rank.  Default: auto-detected from cores/rank.")
+@click.option("--codec-threads", type=int, default=1, show_default=True,
+              help="Internal threads per codec call (Blosc set live; for "
+                   "OpenMP/MKL/OpenBLAS export the matching env vars in the "
+                   "shell BEFORE running).  --threads-per-rank * --codec-threads "
+                   "must be <= physical cores; the oversubscription check is "
+                   "skipped when --codec-threads > 1.")
+@click.option("--inner-chunk-mib", type=int, default=16, show_default=True,
+              help="Target zarr chunk size (MiB) during evaluation.  Pass "
+                   "the same value to compress_with_optimal for measured "
+                   "ratios to reflect production.")
+@click.option("--max-inner-chunk-mib", type=int, default=256, show_default=True,
+              help="Hard ceiling on inner chunk size (MiB) when "
+                   "--no-spatial-split is set; warns if exceeded.")
+@click.option("--spatial-split/--no-spatial-split", default=True, show_default=True,
+              help="Split spatial dims when one timestep exceeds "
+                   "--inner-chunk-mib (horizontal first, vertical last).")
+@click.option("--oversubscription-check/--no-oversubscription-check", default=True,
+              show_default=True,
+              help="At startup, warn/abort if OMP/BLOSC/MKL thread vars aren't pinned to 1.")
+@click.option("--memory-threshold", type=click.FloatRange(0.05, 0.95), default=0.80,
+              show_default=True,
+              help="Fraction of available RAM any single tracked allocation "
+                   "may occupy before the run aborts.  Values above 0.80 "
+                   "emit a one-time warning.")
+@click.option("--override-existing-l1-error", type=float, default=None,
+              help="L1 error threshold fallback when the variable isn't in "
+                   "the lookup table.  Table values win when present.")
+@click.option("--compressor-class", default="all",
+              help="Compressor class (case-insensitive) or 'none' to skip.")
+@click.option("--filter-class", default="all",
+              help="Filter class (case-insensitive) or 'none' to skip.")
+@click.option("--serializer-class", default="all",
+              help="Serializer class (case-insensitive) or 'none' to skip.")
+@click.option("--with-lossy/--without-lossy", default=True, show_default=True)
+@click.option("--with-numcodecs-wasm/--without-numcodecs-wasm", default=False, show_default=True)
+@click.option("--with-ebcc/--without-ebcc", default=False, show_default=True)
+@click.option("--resume/--no-resume", default=True, show_default=True,
+              help="If a `config_space_{var}_rank{rank}.csv` already exists, "
+                   "skip combos already present in it (matched by indices).")
+@click.option("--max-evals", type=int, default=None,
+              help="Cap total evaluations across all ranks.  Useful for "
+                   "quick test runs.  Slicing happens before rank partition.")
+@click.option("--allow-multi-rank-per-node/--no-allow-multi-rank-per-node",
+              default=False, show_default=True,
+              help="Allow more than one MPI rank to share a node.  Each rank "
+                   "holds its own copy of the sample, so per-node memory "
+                   "scales as ranks_on_node * sample_size — ensure the node "
+                   "has the headroom.")
+@click.option("--bypass-zarr-sync/--no-bypass-zarr-sync", default=True, show_default=True,
+              help="Route codec dispatch through zarr's async API on per-"
+                   "thread persistent event loops (shared bounded executor "
+                   "as default).  Bypasses zarr 3's sync() loop which "
+                   "otherwise serialises threads.  Required for thread-only "
+                   "topologies (1 rank x N threads); on by default to match "
+                   "the production HPC strategy.  Aborts at startup if "
+                   "zarr.api.asynchronous is not importable.")
+def evaluate_combos(dataset_file,
+                    where_to_write,
+                    field_to_compress, eval_data_size_limit,
+                    threads_per_rank, codec_threads,
+                    inner_chunk_mib,
+                    max_inner_chunk_mib, spatial_split,
+                    oversubscription_check,
+                    memory_threshold,
+                    override_existing_l1_error,
+                    compressor_class, filter_class, serializer_class,
+                    with_lossy, with_numcodecs_wasm, with_ebcc,
+                    resume, max_evals, allow_multi_rank_per_node,
+                    bypass_zarr_sync):
     """
-    dask.config.set(scheduler="single-threaded")
-    dask.config.set(array__chunk_size="512MiB")
+    Sweep compressor x filter x serializer combinations on a representative
+    sample of the field to find the best configuration.
+
+    Parallelism
+    -----------
+    - MPI ranks partition the config space (config_space[rank::size]).
+    - Within each rank, a ThreadPoolExecutor runs N configs concurrently.
+    - Recommended launch:
+        mpirun -n <NODES> --map-by ppr:1:node dc_toolkit evaluate_combos ...
+      (Open MPI syntax; on Slurm: srun --nodes=<N> --ntasks-per-node=1 ...).
+    - 1 MPI rank per node is the default.  Multi-rank-per-node launches are
+      rejected at startup unless --allow-multi-rank-per-node is set; by
+      default, within-node parallelism is provided by threads, not MPI.
+    - Cores-per-rank is auto-detected via sched_getaffinity; override with
+      --threads-per-rank.
+
+    The evaluation runs entirely in memory (MemoryStore) - no disk I/O per
+    combo.  Use `compress_with_optimal` afterwards to materialise the winner
+    against the full field.
+    """
+    # -------------------------------------------------------------------------
+    # Topology + dask config
+    # -------------------------------------------------------------------------
+    _reset_memcheck_state()
     comm = MPI.COMM_WORLD
     rank = comm.Get_rank()
     size = comm.Get_size()
 
-    os.makedirs(where_to_write, exist_ok=True) 
+    node_comm, ranks_on_node, _local_rank = utils.detect_node_topology(comm)
 
-    if rank == 0:
+    # ---- 1 MPI rank per node: opt-in bypass -----------------------------
+    # The original design intends shared-memory threading within each node
+    # (1 Python process / 1 GIL).  In practice the GIL + codec config registry
+    # + glibc malloc arenas serialize so heavily on aarch64 (Grace) that
+    # multiple Python processes per node beat threads despite paying for
+    # sample duplication.  --allow-multi-rank-per-node is the explicit knob
+    # for that case.
+    if ranks_on_node > 1 and not allow_multi_rank_per_node:
+        if rank == 0:
+            click.echo(
+                f"[topology] ERROR: detected {ranks_on_node} MPI rank(s) per node.\n"
+                f"  This toolkit defaults to exactly 1 rank per node; within-node\n"
+                f"  parallelism is provided by threads, not MPI.\n"
+                f"  Relaunch with one of:\n"
+                f"    --ntasks-per-node=1   (default behaviour, threading only)\n"
+                f"    --allow-multi-rank-per-node   (opt in - acknowledges sample\n"
+                f"                                   duplication; recommended on Grace\n"
+                f"                                   when GIL serialization dominates)\n"
+            )
+        comm.Abort(1)
+    if ranks_on_node > 1 and rank == 0:
+        click.echo(
+            f"[topology] NOTE: running {ranks_on_node} MPI rank(s) per node "
+            f"(--allow-multi-rank-per-node is set).  Each rank will hold its "
+            f"own copy of the sample; per-node memory ~ {ranks_on_node} * sample_size."
+        )
+
+    try:
+        node_comm.Free()
+    except Exception:
+        pass
+
+    cores_avail = utils.detect_cores_available()
+    if threads_per_rank is None:
+        threads_per_rank = utils.compute_default_threads_per_rank(ranks_on_node, cores_avail)
+
+    if bypass_zarr_sync:
         try:
-            # Lookup table for valid thresholds
-            # https://docs.google.com/spreadsheets/d/1lHcX-HE2WpVCOeKyDvM4iFqjlWvkd14lJlA-CUoCxMM
-            sheet_id = "1lHcX-HE2WpVCOeKyDvM4iFqjlWvkd14lJlA-CUoCxMM"
-            sheet_url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"
-            thresholds = pd.read_csv(sheet_url)
-        except Exception as e:
-            print(f"[Rank 0] Failed to fetch thresholds: {e}")
-            sys.exit(1)
-        # Convert DataFrame to bytes for broadcasting
-        buffer = io.BytesIO()
-        thresholds.to_parquet(buffer, index=False)
-        data_bytes = buffer.getvalue()
-    else:
-        data_bytes = None
+            utils.AsyncBypass.enable(threads_per_rank=threads_per_rank)
+        except RuntimeError as e:
+            if rank == 0:
+                click.echo(f"[bypass-zarr-sync] ERROR: {e}")
+            comm.Abort(1)
+        if rank == 0:
+            click.echo("[bypass-zarr-sync] enabled.")
 
-    data_bytes = comm.bcast(data_bytes, root=0)
+    _apply_codec_threads(codec_threads, rank=rank)
+    _check_thread_product(threads_per_rank, codec_threads, rank=rank)
+    if int(codec_threads or 1) <= 1:
+        utils.check_thread_oversubscription(
+            abort_if_unsafe=oversubscription_check, rank=rank,
+        )
 
-    if rank != 0:
-        buffer = io.BytesIO(data_bytes)
-        thresholds = pd.read_parquet(buffer)
+    # Create the output directory once on rank 0, then barrier so all ranks
+    # see it before anyone tries to write into it.
+    if rank == 0:
+        os.makedirs(where_to_write, exist_ok=True)
+    comm.Barrier()
+
+    # `array.chunk-size` must be set before any open() that uses chunks="auto".
+    # This outer block governs only the dataset open and the rank-0 sample
+    # .compute(); the inner sweep below opens its own
+    # `with dask.config.set(scheduler="synchronous")` so per-combo threads
+    # don't nest dask thread pools.
+    with dask.config.set({
+        "array.chunk-size": "512MiB",
+        "scheduler": "threads",
+        "num_workers": threads_per_rank,
+    }):
 
-    # This is opened by all MPI processes -lazy evaluation with Dask backend-
-    ds = utils.open_dataset(dataset_file, field_to_compress, field_percentage_to_compress, rank=rank)
+        if rank == 0:
+            click.echo(_version_banner("evaluate_combos"))
 
-    for var in ds.data_vars:
-        if field_to_compress is not None and field_to_compress != var:
-            continue
-        da = ds[var]
-
-        if override_existing_l1_error is None:
-            lookup = var
-            threshold_row = thresholds[thresholds["Short Name"] == lookup]
-            matching_units = threshold_row.iloc[0]["Unit"] == da.attrs.get("units", None) if not threshold_row.empty else None
-            existing_l1_error = threshold_row.iloc[0]["Existing L1 error"] if not threshold_row.empty and matching_units else None
-            existing_l1_error = float(existing_l1_error.replace(",", ".")) if existing_l1_error else None
+        # Fetch threshold table on rank 0, broadcast.  Falls back to a local
+        # CSV if the remote sheet fails; ultimate fallback is the per-call
+        # --override-existing-l1-error value.
+        thresholds = None
+        if rank == 0:
+            thresholds = _load_l1_error_thresholds(rank=rank)
+            if thresholds is None and override_existing_l1_error is None:
+                click.echo(
+                    "[Rank 0] ERROR: could not load the L1 error "
+                    "threshold table from either the remote Google "
+                    "Sheet or the local fallback at "
+                    f"{_LOCAL_L1_THRESHOLDS_PATH}, and no "
+                    "--override-existing-l1-error was supplied as a "
+                    "fallback.\n"
+                    "  Re-run with --override-existing-l1-error <value> "
+                    "to specify a fallback threshold explicitly, or fix "
+                    "connectivity / restore the local CSV before "
+                    "retrying."
+                )
+                comm.Abort(1)
+
+            if thresholds is not None:
+                buffer = io.BytesIO()
+                thresholds.to_parquet(buffer, index=False)
+                data_bytes = buffer.getvalue()
+            else:
+                # Table unavailable but a fallback override is set; broadcast
+                # an empty payload so non-root ranks know to skip the lookup.
+                data_bytes = b""
         else:
-            existing_l1_error = override_existing_l1_error
+            data_bytes = None
 
-        if rank == 0:
-            click.echo(f"Processing variable: {var} (Units: {da.attrs.get('units', 'N/A')}, Existing L1 Error: {existing_l1_error})")
+        data_bytes = comm.bcast(data_bytes, root=0)
+        if rank != 0:
+            thresholds = (
+                pd.read_parquet(io.BytesIO(data_bytes))
+                if data_bytes else None
+            )
 
-        if field_percentage_to_compress is not None:
-            field_percentage_to_compress = float(field_percentage_to_compress)
-            slices = {dim: slice(0, max(1, int(size * (field_percentage_to_compress / 100)))) for dim, size in da.sizes.items()}
-            da = da.isel(**slices)
+        # -------------------------------------------------------------------------
+        # Open dataset (lazy; opened by every rank, each with its own handle)
+        # -------------------------------------------------------------------------
+        ds = utils.open_dataset(dataset_file, field_to_compress, rank=rank)
+
+        # -------------------------------------------------------------------------
+        # Per-variable loop
+        # -------------------------------------------------------------------------
+        for var in ds.data_vars:
+            if field_to_compress is not None and field_to_compress != var:
+                continue
+            da = ds[var]
+
+            # Try the lookup table first.  Fall back to
+            # --override-existing-l1-error only when the variable is missing
+            # from the table, the units don't match, or the cell is empty / NaN.
+            existing_l1_error = None
+            threshold_source = None
+            if thresholds is not None:
+                threshold_row = thresholds[thresholds["Short Name"] == var]
+                matching_units = (
+                    threshold_row.iloc[0]["Unit"] == da.attrs.get("units", None)
+                    if not threshold_row.empty else None
+                )
+                raw_threshold = (
+                    threshold_row.iloc[0]["Existing L1 error"]
+                    if not threshold_row.empty and matching_units else None
+                )
+                if raw_threshold is not None and not (
+                    isinstance(raw_threshold, float) and math.isnan(raw_threshold)
+                ):
+                    if isinstance(raw_threshold, str):
+                        existing_l1_error = float(raw_threshold.replace(",", "."))
+                    else:
+                        existing_l1_error = float(raw_threshold)
+                    threshold_source = "lookup table"
+
+            if existing_l1_error is None and override_existing_l1_error is not None:
+                existing_l1_error = override_existing_l1_error
+                threshold_source = "--override-existing-l1-error fallback"
+
+            # No threshold resolved (no table row + no override): the keep
+            # filter has no basis, so we'd silently keep every combo.  Abort.
+            if existing_l1_error is None:
+                if rank == 0:
+                    var_units = da.attrs.get("units", "N/A")
+                    click.echo(
+                        f"[var] {var} | ERROR: cannot determine an L1 "
+                        f"error threshold for this variable (no row "
+                        f"matching Short Name='{var}' with Unit="
+                        f"'{var_units}' in the threshold table, or the "
+                        f"row's 'Existing L1 error' cell is empty/NaN).\n"
+                        f"  Re-run with --override-existing-l1-error "
+                        f"<value> to specify a threshold explicitly, or "
+                        f"add an entry for '{var}' to the lookup table "
+                        f"(remote Google Sheet, or the local fallback at "
+                        f"{_LOCAL_L1_THRESHOLDS_PATH})."
+                    )
+                comm.Abort(1)
+
+            if rank == 0:
+                click.echo(
+                    f"[var] {var} | units={da.attrs.get('units', 'N/A')} | "
+                    f"Existing L1 error={existing_l1_error} "
+                    f"(source: {threshold_source})"
+                )
 
-        compressors = utils.compressor_space(da, with_lossy, with_numcodecs_wasm, with_ebcc, compressor_class)
-        filters = utils.filter_space(da, with_lossy, with_numcodecs_wasm, with_ebcc, filter_class)
-        serializers = utils.serializer_space(da, with_lossy, with_numcodecs_wasm, with_ebcc, serializer_class)
+            # -------------------------------------------------------------------------
+            # Build representative sample ONCE on rank 0, broadcast to others.
+            # -------------------------------------------------------------------------
+            # Memory guardrail before the sample broadcast.  No post-open
+            # refinement here (unlike the compress commands): evaluate_combos
+            # writes nothing, so there are no shard bytes to re-check against.
+            field_bytes = int(da.dtype.itemsize) * int(np.prod(da.shape))
+
+            # -------------------------------------------------------------------------
+            # Auto-shrink the sample budget so the per-rank steady estimate
+            # fits within the detected node memory budget.  This is what
+            # guards the sweep against OOM regardless of the user's
+            # --threads-per-rank choice: when threads are high the per-thread
+            # working set dominates, so the safe sample shrinks to keep
+            # (1 + threads * factor) * sample_bytes + chunk_overhead bounded
+            # by the cgroup/host budget.  The CLI flag --eval-data-size-limit
+            # acts as a ceiling, not a target.
+            # -------------------------------------------------------------------------
+            ebcc_in_search = bool(with_ebcc) and da.dtype != np.float32
+            node_budget_bytes, node_budget_source = _detect_node_memory_budget()
+            # Under SLURM cgroup-v2 the budget is per-task; without cgroup
+            # all ranks on the node share it.  Mirror the asymmetry from
+            # _check_node_memory_headroom so the auto-shrink uses the
+            # constraint that will actually bind.
+            is_cgroup_budget = node_budget_source.startswith("cgroup")
+            if is_cgroup_budget:
+                effective_budget = node_budget_bytes
+            else:
+                effective_budget = node_budget_bytes // max(1, ranks_on_node)
+            max_safe_sample = _max_sample_bytes_for_threads(
+                budget_bytes=int(effective_budget * memory_threshold),
+                threads_per_rank=threads_per_rank,
+                inner_chunk_mib=inner_chunk_mib,
+                include_ebcc_overhead=ebcc_in_search,
+            )
+            user_limit = int(eval_data_size_limit)
+            effective_sample_limit = min(user_limit, max_safe_sample)
+
+            if effective_sample_limit <= 0:
+                if rank == 0:
+                    click.echo(
+                        f"[memcheck] FATAL: cannot fit any sample. "
+                        f"threads_per_rank={threads_per_rank}, "
+                        f"inner_chunk_mib={inner_chunk_mib}, "
+                        f"node budget "
+                        f"{humanize.naturalsize(node_budget_bytes, binary=True)} "
+                        f"({node_budget_source}) at threshold "
+                        f"{memory_threshold:.2f}.  Reduce --threads-per-rank "
+                        f"or request more RAM (#SBATCH --mem=0)."
+                    )
+                _abort(1)
+
+            if rank == 0 and effective_sample_limit < user_limit:
+                click.echo(
+                    f"[memcheck] auto-shrunk sample budget from "
+                    f"{humanize.naturalsize(user_limit, binary=True)} "
+                    f"(--eval-data-size-limit) to "
+                    f"{humanize.naturalsize(effective_sample_limit, binary=True)} "
+                    f"to stay under {memory_threshold:.2f} x "
+                    f"{humanize.naturalsize(effective_budget, binary=True)} "
+                    f"({node_budget_source}) at "
+                    f"{threads_per_rank} threads.  To increase the safe "
+                    f"sample, reduce --threads-per-rank or request more RAM."
+                )
 
-        num_compressors = len(compressors)
-        num_filters = len(filters)
-        num_serializers = len(serializers)
+            actual_sample_bytes = min(field_bytes, effective_sample_limit)
+            multiplier = 2 if (rank == 0 and size > 1) else 1
+            _check_memory_headroom(
+                multiplier * actual_sample_bytes,
+                label=f"sample for '{var}' on rank {rank} "
+                      f"({humanize.naturalsize(actual_sample_bytes, binary=True)})",
+                threshold=memory_threshold,
+            )
 
-        num_loops = num_compressors * num_filters * num_serializers
-        if rank == 0:
-            click.echo(f"Number of loops: {num_loops} ({num_compressors} compressors, {num_filters} filters, {num_serializers} serializers) -divided across {size} MPI task(s)-")
+            # Node-aggregate guardrail: the per-rank check above doesn't see
+            # other ranks on the same node or SLURM cgroup limits.  Fires only
+            # from rank 0.
+            per_rank_steady = _per_rank_steady_estimate_bytes(
+                sample_bytes=actual_sample_bytes,
+                threads_per_rank=threads_per_rank,
+                inner_chunk_mib=inner_chunk_mib,
+                # EBCC needs a float32 working copy on every rank that runs
+                # an EBCC combo (not just rank 0); upper bound, since we
+                # don't yet know whether EBCC will be selected.
+                include_ebcc_overhead=ebcc_in_search,
+            )
+            _check_node_memory_headroom(
+                per_rank_steady_bytes=per_rank_steady,
+                ranks_on_node=ranks_on_node,
+                rank=rank,
+                label=f"variable '{var}', sample "
+                      f"{humanize.naturalsize(actual_sample_bytes, binary=True)}",
+                threshold=memory_threshold,
+            )
 
-        config_space = list(itertools.product(compressors, filters, serializers))
-        configs_for_rank = config_space[rank::size]
+            if rank == 0:
+                sample_da_local = utils.build_representative_sample(
+                    da, effective_sample_limit, rank=rank,
+                )
+                # .compute() forces the dask read; we want the buffer, not a lazy handle.
+                sample_da_local = sample_da_local.compute()
+                sample_np_local = np.ascontiguousarray(sample_da_local.values)
+                sample_meta = {
+                    "dims": tuple(sample_da_local.dims),
+                    "attrs": dict(sample_da_local.attrs),
+                    "name":  sample_da_local.name,
+                }
+            else:
+                sample_np_local = None
+                sample_meta = None
+
+            # Bcast the numpy buffer via MPI's buffer protocol.  bcast() the small
+            # metadata dict via pickle (dims + attrs are tiny).
+            sample_np  = utils.broadcast_numpy(sample_np_local, comm=comm, root=0)
+            sample_meta = comm.bcast(sample_meta, root=0)
+
+            # Free the rank-0 duplicate ASAP so we fall from 2x transient to 1x
+            # steady state.  The broadcast has already committed the bytes to
+            # every rank's buffer; the local copy is no longer needed.
+            if rank == 0:
+                del sample_np_local
+
+            # Sample reproducibility hash: rank 0 writes
+            # sample_signature_{var}.json so compress_with_optimal can
+            # refuse to reuse results when its sample does not match.
+            if rank == 0:
+                try:
+                    sig = _sample_signature(
+                        dataset_file=dataset_file,
+                        var=str(var),
+                        eval_data_size_limit=int(eval_data_size_limit),
+                        sample_np=sample_np,
+                    )
+                    _signature_path(where_to_write, str(var)).write_text(
+                        json.dumps(sig, indent=2)
+                    )
+                    click.echo(
+                        f"[sample-hash] {var}: sha256={sig['sha256'][:16]}… "
+                        f"shape={tuple(sig['shape'])} dtype={sig['dtype']} -> "
+                        f"{_signature_path(where_to_write, str(var)).name}"
+                    )
+                except Exception as sig_err:
+                    # Non-fatal: continue the sweep even if signature write
+                    # fails.  compress_with_optimal will log a softer warning
+                    # instead of blocking when the signature is absent.
+                    click.echo(
+                        f"[sample-hash] WARNING: could not write signature "
+                        f"for {var}: {sig_err}"
+                    )
+
+            # Reconstruct a DataArray view around the broadcast buffer.  The codec-
+            # space builders only use .dims / .shape / .values and basic arithmetic,
+            # so a thin wrapper without xarray coords is sufficient and avoids a
+            # second compute() on non-root ranks.
+            sample_da = xr.DataArray(
+                sample_np,
+                dims=sample_meta["dims"],
+                attrs=sample_meta["attrs"],
+                name=sample_meta["name"],
+            )
 
-        results = []
-        raw_values_explicit_with_names = []
-        total_configs = len(configs_for_rank)
-        if rank == 0:
-            pd.DataFrame(config_space).to_csv("config_space.csv", index=False)
-        for i, ((comp_idx, compressor), (filt_idx, filt), (ser_idx, serializer)) in enumerate(configs_for_rank):
-            data_to_compress = da
-            if isinstance(serializer, numcodecs.zarr3.ZFPY):
-                data_to_compress = da.stack(flat_dim=da.dims)
-            if isinstance(serializer, AnyNumcodecsArrayBytesCodec) and isinstance(serializer.codec, EBCCZarrFilter):
-                data_to_compress = da.squeeze().astype("float32")
-
-            filters_ = [filt,]
-            compressors_ = [compressor,]
-            serializer_ = serializer
-            
-            if isinstance(serializer_, AnyNumcodecsArrayBytesCodec) or filt is None:
-                filters_ = None  # TODO: fix (?) filter stacking with EBCC & numcodecs-wasm serializers
-                filt = None
-                filt_idx = -1
-            if compressor is None:
-                compressors_ = None
-                comp_idx = -1
-            if serializer is None:
-                serializer_ = "auto"
-                ser_idx = -1
+            # -------------------------------------------------------------------------
+            # Build codec spaces from the SAMPLE (deterministic; compress_with_optimal
+            # must use the same --eval-data-size-limit to reproduce these objects).
+            # -------------------------------------------------------------------------
+            compressors = utils.compressor_space(sample_da, with_lossy, with_numcodecs_wasm,
+                                                 with_ebcc, compressor_class)
+            filters     = utils.filter_space(sample_da, with_lossy, with_numcodecs_wasm,
+                                             with_ebcc, filter_class)
+            serializers = utils.serializer_space(sample_da, with_lossy, with_numcodecs_wasm,
+                                                 with_ebcc, serializer_class)
+
+            num_loops = len(compressors) * len(filters) * len(serializers)
+            config_space = list(itertools.product(compressors, filters, serializers))
+            # --max-evals: optional global cap for quick test runs.  Applied
+            # BEFORE the rank partition so all ranks see the same truncated
+            # space and the partition (configs[rank::size]) divides it evenly.
+            # The full config_space CSV written below is also truncated to
+            # match - that file is the audit trail of what was actually run.
+            if max_evals is not None and max_evals < num_loops:
+                if rank == 0:
+                    click.echo(
+                        f"[max-evals] capping config space at {max_evals} "
+                        f"(of {num_loops} possible) for a quick test run."
+                    )
+                config_space = config_space[:max_evals]
+                num_loops = len(config_space)
+            # Deterministic shuffle before stride partition: breaks up runs
+            # of similar-cost combos (e.g. consecutive EBCC entries) so each
+            # rank gets a representative mix.  Seed depends only on num_loops
+            # so --resume sees the same order across restarts.
+            _rng = np.random.default_rng(seed=int(num_loops) & 0xFFFFFFFF)
+            _perm = _rng.permutation(len(config_space)).tolist()
+            config_space = [config_space[i] for i in _perm]
+            configs_for_rank = config_space[rank::size]
+
+            if rank == 0:
+                # Topology banner: report nodes / ranks-per-node / threads-per-rank
+                # separately, and mark the case where the sweep is smaller than
+                # the theoretical peak parallelism.
+                n_nodes = size // ranks_on_node if ranks_on_node else 1
+                theoretical_peak = size * threads_per_rank
+                effective = min(theoretical_peak, num_loops)
+                trailer = ""
+                if effective < theoretical_peak:
+                    trailer = (
+                        f" (only {effective} will run concurrently; "
+                        f"{num_loops} combos total)"
+                    )
+                click.echo(
+                    f"[topology] {n_nodes} node(s) x {ranks_on_node} rank(s)/node x "
+                    f"{threads_per_rank} thread(s)/rank = {theoretical_peak} parallel "
+                    f"evaluations ({cores_avail} core(s)/rank){trailer}."
+                )
+                # Memory budget banner: the total comes from
+                # _per_rank_steady_estimate_bytes(), so it cannot drift from the
+                # pre-broadcast abort check; the itemised terms are display-only.
+                steady_mib = int(sample_np.nbytes / 2**20)
+                ebcc_overhead_mib = int(sample_np.nbytes / 2**20) if (
+                    any(_is_ebcc_serializer(ser) for (_, ser) in serializers)
+                    and sample_np.dtype != np.float32
+                ) else 0
+                # Per-thread working set: each ThreadPoolExecutor worker
+                # runs evaluate_codec_pipeline concurrently with its own
+                # decoded buffer (~1x sample), encoded MemoryStore
+                # (~0-1x sample) and codec scratch, so the banner reports
+                # threads x PER_THREAD_WORKING_FACTOR rather than a single
+                # "1x decompressed cache".
+                decode_cache_mib = int(
+                    threads_per_rank
+                    * PER_THREAD_WORKING_FACTOR
+                    * sample_np.nbytes
+                    / 2**20
+                )
+                thread_pool_mib = threads_per_rank * max(1, inner_chunk_mib) * 2
+                _per_rank_total_bytes = _per_rank_steady_estimate_bytes(
+                    sample_bytes=int(sample_np.nbytes),
+                    threads_per_rank=threads_per_rank,
+                    inner_chunk_mib=inner_chunk_mib,
+                    include_ebcc_overhead=(ebcc_overhead_mib > 0),
+                )
+                click.echo(
+                    f"[memory] rank-0 transient peak ~= "
+                    f"{int(2 * sample_np.nbytes / 2**20)} MiB (during Bcast); "
+                    f"per-rank steady ~= {steady_mib + ebcc_overhead_mib} MiB "
+                    f"(sample + EBCC copy) + ~{decode_cache_mib} MiB "
+                    f"({threads_per_rank} threads x "
+                    f"{PER_THREAD_WORKING_FACTOR:.1f}x decode/encode cache) "
+                    f"+ ~{thread_pool_mib} MiB "
+                    f"(threads x 2 x inner_chunk_mib) = "
+                    f"{humanize.naturalsize(_per_rank_total_bytes, binary=True)} total."
+                )
+                # If the caller omitted --field-to-compress, flag the per-var
+                # Bcast cost so they're not surprised by 30 variables x 5 GB
+                # on a slow fabric.
+                n_vars_total = sum(
+                    1 for v in ds.data_vars
+                    if field_to_compress is None or v == field_to_compress
+                )
+                if n_vars_total > 1:
+                    click.echo(
+                        f"[topology] sweep will iterate {n_vars_total} variables; "
+                        f"one sample Bcast per variable (~"
+                        f"{humanize.naturalsize(sample_np.nbytes, binary=True)} each "
+                        f"over the interconnect)."
+                    )
+                click.echo(
+                    f"[sweep] {num_loops} combos "
+                    f"({len(compressors)} x {len(filters)} x {len(serializers)}) "
+                    f"split across {size} rank(s); ~{len(configs_for_rank)} per rank, "
+                    f"running {threads_per_rank}-wide."
+                )
+                pd.DataFrame(config_space).to_csv(
+                    os.path.join(where_to_write, f"config_space_{var}.csv"),
+                    index=False,
+                )
 
-            try:
-                compression_ratio, errors, euclidean_distance = utils.compress_with_zarr(
-                    data_to_compress,
-                    dataset_file,
-                    var,
-                    where_to_write,
-                    filters=filters_,
-                    compressors=compressors_,
-                    serializer=serializer_,
-                    verbose=False,
-                    rank=rank,
-                )
-
-                l1_error_rel = errors["Relative_Error_L1"]
-                l2_error_rel = errors["Relative_Error_L2"]
-                linf_error_rel = errors["Relative_Error_Linf"]
-
-                # TODO: refine criteria based on the thresholds table
-                if existing_l1_error:
-                    if l1_error_rel <= existing_l1_error:
-                        results.append(((str(compressor), str(filt), str(serializer), comp_idx, filt_idx, ser_idx), compression_ratio, l1_error_rel, euclidean_distance))
-                        raw_values_explicit_with_names.append((compression_ratio, l1_error_rel, l2_error_rel, linf_error_rel, euclidean_distance, str(compressor), str(filt), str(serializer)))
+            # -------------------------------------------------------------------------
+            # Pre-materialise a float32 view of the sample IF any EBCC combo is present
+            # and the dtype isn't already float32 (avoids per-thread float32 allocations).
+            # -------------------------------------------------------------------------
+            sample_np_ebcc = None
+            any_ebcc_local = any(
+                _is_ebcc_serializer(ser) for (_, ser) in serializers
+            )
+            if any_ebcc_local and sample_np.dtype != np.float32:
+                sample_np_ebcc = np.ascontiguousarray(
+                    np.squeeze(sample_np).astype(np.float32, copy=True)
+                )
+            elif any_ebcc_local:
+                sample_np_ebcc = np.ascontiguousarray(np.squeeze(sample_np))
+
+            # -------------------------------------------------------------------------
+            # Per-combo evaluator (runs inside a thread)
+            # -------------------------------------------------------------------------
+            def _evaluate_one(cfg):
+                (comp_idx, compressor), (filt_idx, filt), (ser_idx, serializer) = cfg
+
+                # Prep data + dims for this serializer's expectations
+                if _is_zfpy_serializer(serializer):
+                    data_np = sample_np.reshape(-1)  # flat view; no copy
+                    dims = ("flat_dim",)
+                elif _is_ebcc_serializer(serializer):
+                    data_np = sample_np_ebcc        # shared across threads
+                    dims = tuple(d for d, s in zip(sample_da.dims, sample_da.shape) if s > 1)
+                    if not dims:
+                        dims = sample_da.dims
                 else:
-                    results.append(((str(compressor), str(filt), str(serializer), comp_idx, filt_idx, ser_idx), compression_ratio, l1_error_rel, euclidean_distance))
-                    raw_values_explicit_with_names.append((compression_ratio, l1_error_rel, l2_error_rel, linf_error_rel, euclidean_distance, str(compressor), str(filt), str(serializer)))
-
-            except:
-                click.echo(f"Failed to compress with {compressor}, {filt}, {serializer} [Indices: {comp_idx}, {filt_idx}, {ser_idx}]")
-                traceback.print_exc(file=sys.stderr)
-                sys.exit(1)
+                    data_np = sample_np
+                    dims = sample_da.dims
+
+                # Pipeline assembly rules (match original semantics)
+                filters_ = [filt]
+                compressors_ = [compressor]
+                serializer_ = serializer
+                local_filt_idx = filt_idx
+                local_comp_idx = comp_idx
+                local_ser_idx = ser_idx
+
+                if isinstance(serializer_, AnyNumcodecsArrayBytesCodec) or filt is None:
+                    filters_ = None
+                    filt = None
+                    local_filt_idx = -1
+                if compressor is None:
+                    compressors_ = None
+                    local_comp_idx = -1
+                if serializer is None:
+                    serializer_ = "auto"
+                    local_ser_idx = -1
+
+                # Chunks for the eval memory store.  Same algorithm as the
+                # persist path (compute_chunk_and_shard_shape) so the measured
+                # compression ratio reflects production conditions.  When
+                # --no-spatial-split is set, chunks may exceed --inner-chunk-mib
+                # (one timestep, full spatial); a warning is emitted once if
+                # they also exceed --max-inner-chunk-mib.
+                eval_chunks = utils.compute_chunk_shape_for_eval(
+                    data_np.shape, data_np.dtype,
+                    target_mib=inner_chunk_mib,
+                    dims=dims,
+                    max_target_mib=max_inner_chunk_mib,
+                    allow_spatial_split=spatial_split,
+                )
+                _eval_chunk_bytes = (int(np.dtype(data_np.dtype).itemsize)
+                                     * int(np.prod(eval_chunks)))
+                if (not spatial_split
+                        and _eval_chunk_bytes > max_inner_chunk_mib * 2**20
+                        and not getattr(_evaluate_one, "_warned_oversize", False)):
+                    click.echo(
+                        f"[chunks] WARNING: --no-spatial-split produced an eval "
+                        f"chunk of "
+                        f"{humanize.naturalsize(_eval_chunk_bytes, binary=True)} "
+                        f"(shape {eval_chunks}), which exceeds "
+                        f"--max-inner-chunk-mib ({max_inner_chunk_mib} MiB).  "
+                        f"Codec internals may misbehave at this size.  "
+                        f"Re-enable spatial splitting or lower the field size."
+                    )
+                    _evaluate_one._warned_oversize = True
+
+                ratio, errors, eucd = utils.evaluate_codec_pipeline(
+                    data_np, dims,
+                    filters=filters_, compressors=compressors_, serializer=serializer_,
+                    chunks=eval_chunks,
+                )
 
-            utils.progress_bar(i, total_configs, print_every=100)
+                return {
+                    "comp_idx":   local_comp_idx,
+                    "filt_idx":   local_filt_idx,
+                    "ser_idx":    local_ser_idx,
+                    "compressor": str(compressor),
+                    "filter":     str(filt),
+                    "serializer": str(serializer),
+                    "ratio":      float(ratio),
+                    "errors":     errors,
+                    "eucd":       float(eucd),
+                }
+
+            # -------------------------------------------------------------------------
+            # Thread pool: submit all combos for this rank, collect as they complete
+            # -------------------------------------------------------------------------
+            results = []
+            raw_values_explicit_with_names = []
+            failures = []
+
+            var_sweep_t0 = time.perf_counter()
+
+            # ----- Resume support --------------------------------------------
+            # Load already-completed (comp_idx, filt_idx, ser_idx) triples from
+            # the per-rank CSV if --resume is set and the CSV exists.  We skip
+            # these configs on submission and APPEND (not overwrite) to the CSV
+            # so the resumed run ends with one complete audit trail.
+            partial_csv_path = os.path.join(
+                where_to_write, f"config_space_{var}_rank{rank}.csv"
+            )
+            failures_csv_path = os.path.join(
+                where_to_write, f"failures_{var}_rank{rank}.csv"
+            )
+            already_done = set()
+            open_mode = "w"
+            if resume and Path(partial_csv_path).is_file():
+                try:
+                    prev = pd.read_csv(partial_csv_path)
+                    already_done = {
+                        (int(a), int(b), int(c))
+                        for a, b, c in zip(
+                            prev["comp_idx"], prev["filt_idx"], prev["ser_idx"]
+                        )
+                    }
+                    open_mode = "a"
+                    if rank == 0:
+                        click.echo(
+                            f"[resume] rank 0 found {len(already_done)} previously-"
+                            f"recorded combo(s) for '{var}'; skipping."
+                        )
+                except Exception as resume_err:
+                    if rank == 0:
+                        click.echo(
+                            f"[resume] WARNING: failed to parse previous "
+                            f"partial CSV {partial_csv_path}: {resume_err}. "
+                            f"Starting from scratch."
+                        )
+                    already_done = set()
+                    open_mode = "w"
+
+            def _cfg_key(cfg):
+                (comp_idx, _), (filt_idx, _), (ser_idx, _) = cfg
+                return (int(comp_idx), int(filt_idx), int(ser_idx))
+
+            configs_pending = [
+                cfg for cfg in configs_for_rank
+                if _cfg_key(cfg) not in already_done
+            ]
+            # Post-resume total: progress bar's 100% line can only fire when
+            # `done == total_local`, so this must reflect what will actually
+            # be submitted, not the pre-resume count.
+            total_local = max(1, len(configs_pending))
+
+            # Streaming CSVs: partial = one row per success, failures = one
+            # row per exception.  Header written iff file is empty (resume-safe).
+            # Batched flush every FLUSH_EVERY rows to keep MDS pressure low.
+            FLUSH_EVERY = 100
+            rows_since_flush = 0
+            failed_rows_since_flush = 0
+
+            partial_exists = (
+                open_mode == "a"
+                and Path(partial_csv_path).is_file()
+                and Path(partial_csv_path).stat().st_size > 0
+            )
+            failures_exists = (
+                open_mode == "a"
+                and Path(failures_csv_path).is_file()
+                and Path(failures_csv_path).stat().st_size > 0
+            )
 
-        results_gather = comm.gather(results, root=0)
-        raw_values_explicit_with_names_gather = comm.gather(raw_values_explicit_with_names, root=0)
+            with open(partial_csv_path, open_mode, newline="") as partial_csv_file, \
+                 open(failures_csv_path, open_mode, newline="") as failures_csv_file:
+                partial_csv_writer = csv.writer(partial_csv_file)
+                failures_csv_writer = csv.writer(failures_csv_file)
+                if not partial_exists:
+                    partial_csv_writer.writerow([
+                        "compressor", "filter", "serializer",
+                        "comp_idx", "filt_idx", "ser_idx",
+                        "ratio", "l1_rel", "l2_rel", "linf_rel", "eucd",
+                        "keep",
+                    ])
+                if not failures_exists:
+                    failures_csv_writer.writerow([
+                        "compressor", "filter", "serializer", "error",
+                    ])
+                # From here on, per-combo threads provide parallelism.  Dask runs
+                # serially inside each thread to avoid nested thread pools.  The
+                # synchronous-scheduler setting is scoped with `with dask.config.set`
+                # so it reverts automatically when we leave the sweep block - it
+                # wouldn't leak in the CLI flow (one process per command), but this
+                # keeps evaluate_combos safe to import into notebooks or compose in
+                # longer-lived processes.
+                with dask.config.set(scheduler="synchronous"):
+                    with ThreadPoolExecutor(max_workers=threads_per_rank) as pool:
+                        future_to_cfg = {
+                            pool.submit(_evaluate_one, cfg): cfg for cfg in configs_pending
+                        }
+
+                        for fut in as_completed(future_to_cfg):
+                            cfg = future_to_cfg[fut]
+
+                            try:
+                                r = fut.result()
+                            except Exception as e:
+                                # Never crash the sweep on a single combo failure.
+                                # Failures are logged to failures_{var}_rank{rank}.csv
+                                # and aggregated across ranks at the end of the sweep.
+                                (_, compressor), (_, filt), (_, serializer) = cfg
+                                failures.append((str(compressor), str(filt), str(serializer), repr(e)))
+                                failures_csv_writer.writerow([
+                                    str(compressor), str(filt), str(serializer), repr(e),
+                                ])
+                                failed_rows_since_flush += 1
+                                if failed_rows_since_flush >= FLUSH_EVERY:
+                                    failures_csv_file.flush()
+                                    failed_rows_since_flush = 0
+                                utils.progress_bar(total_local, print_every=100, key=str(var))
+                                continue
+
+                            l1_rel   = r["errors"]["Relative_Error_L1"]
+                            l2_rel   = r["errors"]["Relative_Error_L2"]
+                            linf_rel = r["errors"]["Relative_Error_Linf"]
+
+                            keep = True
+                            if existing_l1_error is not None:
+                                keep = (l1_rel <= existing_l1_error)
+
+                            # Per-rank streaming audit row.  Written for every
+                            # successful evaluation, including filtered-out ones (the
+                            # `keep` column distinguishes them).  Batched flushes keep
+                            # MDS pressure down, and a mid-sweep crash loses fewer
+                            # than FLUSH_EVERY rows.
+                            partial_csv_writer.writerow([
+                                r["compressor"], r["filter"], r["serializer"],
+                                r["comp_idx"], r["filt_idx"], r["ser_idx"],
+                                r["ratio"], l1_rel, l2_rel, linf_rel, r["eucd"],
+                                keep,
+                            ])
+                            rows_since_flush += 1
+                            if rows_since_flush >= FLUSH_EVERY:
+                                partial_csv_file.flush()
+                                rows_since_flush = 0
+
+                            if keep:
+                                results.append((
+                                    (r["compressor"], r["filter"], r["serializer"],
+                                     r["comp_idx"], r["filt_idx"], r["ser_idx"]),
+                                    r["ratio"], l1_rel, r["eucd"],
+                                ))
+                                raw_values_explicit_with_names.append((
+                                    r["ratio"], l1_rel, l2_rel, linf_rel, r["eucd"],
+                                    r["compressor"], r["filter"], r["serializer"],
+                                ))
+
+                            utils.progress_bar(total_local, print_every=100, key=str(var))
+
+                # Flush any remaining buffered rows before closing the files.
+                partial_csv_file.flush()
+                failures_csv_file.flush()
+
+            # ----- Aggregate failure details across ranks (M2) ---------------
+            # Gather the first few failures from every rank so the user can see
+            # node-local issues (bad EBCC geometry on one host, codec-library
+            # mismatch on another) even when rank 0 is clean.  Limit to 5 per
+            # rank to keep the pickle small.
+            sample_failures = failures[:5]
+            all_failures = comm.gather(sample_failures, root=0)
+            total_failures = comm.reduce(len(failures), op=MPI.SUM, root=0)
+
+            if rank == 0 and total_failures:
+                click.echo(
+                    f"[warning] {total_failures} combo(s) failed total across "
+                    f"{size} rank(s)."
+                )
+                shown = 0
+                for r_idx, batch in enumerate(all_failures):
+                    for compressor, filt, serializer, err in batch:
+                        click.echo(
+                            f"  [rank {r_idx}] {compressor} | {filt} | {serializer}: {err}"
+                        )
+                        shown += 1
+                        if shown >= 30:
+                            break
+                    if shown >= 30:
+                        break
+                if total_failures > shown:
+                    click.echo(
+                        f"  ... and {total_failures - shown} more "
+                        f"(full details in failures_{var}_rank*.csv)."
+                    )
+
+            # Per-variable sweep timing for the run manifest.
+            var_sweep_seconds = time.perf_counter() - var_sweep_t0
+
+            # -------------------------------------------------------------------------
+            # Gather + best-combo selection (rank 0)
+            # -------------------------------------------------------------------------
+            results_gather = comm.gather(results, root=0)
+            raw_gather     = comm.gather(raw_values_explicit_with_names, root=0)
+
+            if rank == 0:
+                click.echo("[sweep] complete. Writing results...")
+                results_gather = list(itertools.chain.from_iterable(results_gather))
+                raw_gather     = list(itertools.chain.from_iterable(raw_gather))
+
+                lossy_option          = "with-lossy" if with_lossy else "without-lossy"
+                numcodecs_wasm_option = "with-numcodecs-wasm" if with_numcodecs_wasm else "without-numcodecs-wasm"
+                ebcc_option           = "with-ebcc" if with_ebcc else "without-ebcc"
+                # `var` is used unconditionally here (was `field_to_compress or "all"`)
+                # so that when the caller omits --field-to-compress and we iterate
+                # over every data_var, each iteration produces a distinct filename.
+                score_tag = [
+                    var,
+                    compressor_class, filter_class, serializer_class,
+                    lossy_option, numcodecs_wasm_option, ebcc_option,
+                ]
+                npy_path = os.path.join(
+                    where_to_write,
+                    os.path.basename(dataset_file) + "_" + "_".join(score_tag)
+                    + "_scored_results_with_names.npy",
+                )
+                np.save(npy_path, np.asarray(pd.DataFrame(raw_gather)))
 
-        if rank == 0:
-            click.echo("Compressors analysis completed. Writing files...")
-            # Flatten list of lists
-            results_gather = list(itertools.chain.from_iterable(results_gather))
-            raw_values_explicit_with_names_gather = list(itertools.chain.from_iterable(raw_values_explicit_with_names_gather))
-
-            # Needed for clustering
-            lossy_option = "with-lossy" if with_lossy else "without-lossy"
-            numcodecs_wasm_option = "with-numcodecs-wasm" if with_numcodecs_wasm else "without-numcodecs-wasm"
-            ebcc_option = "with-ebcc" if with_ebcc else "without-ebcc"
-            score_results_file_name = [field_to_compress, compressor_class, filter_class, serializer_class, lossy_option, numcodecs_wasm_option, ebcc_option]
-            np.save(os.path.basename(dataset_file) + '_' + '_'.join(score_results_file_name) + '_scored_results_with_names.npy', np.asarray(pd.DataFrame(raw_values_explicit_with_names_gather)))
-            best_combo = max(results_gather, key=lambda x: x[1])
-            msg = (
-                "optimal combo: \n"
-                f"compressor : {best_combo[0][0]}\nfilter     : {best_combo[0][1]}\nserializer : {best_combo[0][2]}\n"
-                "corresponding indices in lists of instantiated objects:\n"
-                f"compressor : {best_combo[0][3]}\nfilter     : {best_combo[0][4]}\nserializer : {best_combo[0][5]}\n"
-                f"Compression Ratio: {best_combo[1]:.3f} | Relative L1 Error: {best_combo[2]:.3e} | Euclidean Distance: {best_combo[3]:.3e}"
-            )
-            click.echo(msg)
+                # Consolidate per-rank streaming CSVs into one parquet
+                partial_paths = sorted(
+                    Path(where_to_write).glob(f"config_space_{var}_rank*.csv")
+                )
+                if partial_paths:
+                    consolidated = pd.concat(
+                        [pd.read_csv(p) for p in partial_paths],
+                        ignore_index=True,
+                    )
+                    parquet_path = os.path.join(
+                        where_to_write, f"results_{var}.parquet"
+                    )
+                    consolidated.to_parquet(parquet_path, index=False)
+                    click.echo(
+                        f"[sweep] consolidated {len(partial_paths)} per-rank "
+                        f"CSV(s) -> {parquet_path} "
+                        f"({len(consolidated)} row(s))."
+                    )
+
+                if results_gather:
+                    best = max(results_gather, key=lambda x: x[1])
+                    click.echo(
+                        "optimal combo:\n"
+                        f"compressor : {best[0][0]}\n"
+                        f"filter     : {best[0][1]}\n"
+                        f"serializer : {best[0][2]}\n"
+                        "corresponding indices in lists of instantiated objects:\n"
+                        f"compressor : {best[0][3]}\n"
+                        f"filter     : {best[0][4]}\n"
+                        f"serializer : {best[0][5]}\n"
+                        f"Compression Ratio: {best[1]:.3f} | "
+                        f"Relative L1 Error: {best[2]:.3e} | "
+                        f"Euclidean Distance: {best[3]:.3e}"
+                    )
+                else:
+                    click.echo("[sweep] no combos passed the threshold filter.")
+
+                # -------------------------------------------------------------
+                # Run manifest (machine-readable summary per variable).
+                # Useful for CI / downstream tooling that wants the best combo
+                # without parsing stdout.  Everything is primitive/JSON-safe.
+                # -------------------------------------------------------------
+                manifest = {
+                    "command": "evaluate_combos",
+                    "dataset_file": os.fspath(dataset_file),
+                    "var": str(var),
+                    "where_to_write": os.fspath(where_to_write),
+                    "args": {
+                        "eval_data_size_limit": int(eval_data_size_limit),
+                        "threads_per_rank": int(threads_per_rank),
+                        "codec_threads": int(codec_threads or 1),
+                        "inner_chunk_mib": int(inner_chunk_mib),
+                        "max_inner_chunk_mib": int(max_inner_chunk_mib),
+                        "spatial_split": bool(spatial_split),
+                        "compressor_class": compressor_class,
+                        "filter_class": filter_class,
+                        "serializer_class": serializer_class,
+                        "with_lossy": bool(with_lossy),
+                        "with_numcodecs_wasm": bool(with_numcodecs_wasm),
+                        "with_ebcc": bool(with_ebcc),
+                        "override_existing_l1_error": override_existing_l1_error,
+                        "resume": bool(resume),
+                    },
+                    "topology": {
+                        "size": int(size),
+                        "cores_avail": int(cores_avail),
+                    },
+                    "existing_l1_error": existing_l1_error,
+                    "num_combos": int(num_loops),
+                    "num_passed": int(len(results_gather)),
+                    "num_failed_total": int(total_failures or 0),
+                    "num_filtered": int(
+                        num_loops - len(results_gather) - (total_failures or 0)
+                    ),
+                    "var_sweep_seconds": float(var_sweep_seconds),
+                    "env": {
+                        "zarr": getattr(zarr, "__version__", None),
+                        "numpy": getattr(np, "__version__", None),
+                        "dask": getattr(dask, "__version__", None),
+                    },
+                    "sample_signature_path": os.fspath(
+                        _signature_path(where_to_write, str(var))
+                    ),
+                    "outputs": {
+                        "npy": os.fspath(npy_path),
+                        "parquet": (
+                            os.fspath(parquet_path) if partial_paths else None
+                        ),
+                        "config_space_csv": os.fspath(
+                            Path(where_to_write) / f"config_space_{var}.csv"
+                        ),
+                    },
+                    "best": None,
+                }
+                if results_gather:
+                    manifest["best"] = {
+                        "compressor": best[0][0],
+                        "filter":     best[0][1],
+                        "serializer": best[0][2],
+                        "comp_idx":   int(best[0][3]),
+                        "filt_idx":   int(best[0][4]),
+                        "ser_idx":    int(best[0][5]),
+                        "ratio":      float(best[1]),
+                        "l1_rel":     float(best[2]),
+                        "eucd":       float(best[3]),
+                    }
+
+                manifest_path = os.path.join(
+                    where_to_write, f"manifest_{var}.json"
+                )
+                try:
+                    with open(manifest_path, "w") as mf:
+                        json.dump(manifest, mf, indent=2, default=str)
+                    click.echo(f"[sweep] wrote manifest -> {manifest_path}")
+                except Exception as manifest_err:
+                    click.echo(
+                        f"[sweep] WARNING: could not write manifest "
+                        f"{manifest_path}: {manifest_err}"
+                    )
 
 
 @cli.command("compress_with_optimal")
@@ -259,230 +1625,1467 @@ def evaluate_combos(dataset_file: str, where_to_write: str,
 @click.argument("comp_idx", type=int)
 @click.argument("filt_idx", type=int)
 @click.argument("ser_idx", type=int)
-@click.option("--compressor-class", default="all", help="Same as in evaluate_combos.")
-@click.option("--filter-class", default="all", help="Same as in evaluate_combos.")
-@click.option("--serializer-class", default="all", help="Same as in evaluate_combos.")
-@click.option("--with-lossy/--without-lossy", default=True, show_default=True, help="Same as in evaluate_combos.")
-@click.option("--with-numcodecs-wasm/--without-numcodecs-wasm", default=True, show_default=True, help="Same as in evaluate_combos.")
-@click.option("--with-ebcc/--without-ebcc", default=True, show_default=True, help="Same as in evaluate_combos.")
-def compress_with_optimal(dataset_file, where_to_write, field_to_compress, 
-                          comp_idx, filt_idx, ser_idx, 
-                          compressor_class: str = "all", filter_class: str = "all", serializer_class: str = "all",
-                          with_lossy: bool = True, with_numcodecs_wasm: bool = True, with_ebcc: bool = True):
+@click.option("--eval-data-size-limit", default="5GB", callback=_size_option_callback,
+              show_default=True,
+              help="Size budget for the sample used to build the codec space "
+                   "(i.e. to compute data-derived codec parameters such as "
+                   "Asinh.linear_width, FixedOffsetScale.offset/scale, and EBCC "
+                   "chunk geometry).  The FULL FIELD is always compressed - this "
+                   "flag does NOT control what gets written.  "
+                   "Must match the value used in evaluate_combos so the codec-space "
+                   "indices (comp_idx, filt_idx, ser_idx) resolve to identical codec "
+                   "objects.")
+@click.option("--inner-chunk-mib", type=int, default=16, show_default=True,
+              help="Target size of a zarr inner chunk, in MiB. "
+                   "For the measured compression ratio in evaluate_combos to reflect "
+                   "production conditions, pass the same value here.")
+@click.option("--max-inner-chunk-mib", type=int, default=256, show_default=True,
+              help="Hard ceiling on inner chunk size (in MiB).  Only enforced "
+                   "with --no-spatial-split: when one timestep already exceeds "
+                   "--inner-chunk-mib and spatial splitting is disabled, the "
+                   "resulting chunk may be very large; if it also exceeds this "
+                   "ceiling we emit a warning.")
+@click.option("--spatial-split/--no-spatial-split", default=True, show_default=True,
+              help="When one timestep already exceeds --inner-chunk-mib, split "
+                   "spatial dims (horizontal/cell first, vertical last -- "
+                   "hiopy approach) until the chunk fits the target.  Disable "
+                   "with --no-spatial-split to keep one timestep per chunk "
+                   "with full spatial extent (matches the hiopy on-disk "
+                   "layout, but produces oversized chunks AND skips sharding "
+                   "since one shard would only bundle one chunk).")
+@click.option("--shard-mib", type=int, default=512, show_default=True,
+              help="Target shard size in MiB. Each shard contains an integer "
+                   "number of inner chunks.  When one inner chunk already "
+                   "meets or exceeds this target, sharding is skipped "
+                   "automatically (a shard bundling <= 1 chunk would add "
+                   "only index overhead).")
+@click.option("--threads", type=int, default=None,
+              help="Number of dask workers used for the parallel write. "
+                   "Default: auto-detected from visible cores. "
+                   "Peak memory use during the write is roughly threads * shard_mib MiB.")
+@click.option("--codec-threads", type=int, default=1, show_default=True,
+              help="Internal threads per codec call (Blosc set live; for "
+                   "OpenMP/MKL/OpenBLAS export the matching env vars in the "
+                   "shell BEFORE running). --threads * --codec-threads must "
+                   "be <= physical cores; the oversubscription check is "
+                   "skipped when --codec-threads is > 1.")
+@click.option("--oversubscription-check/--no-oversubscription-check", default=True,
+              show_default=True,
+              help="At startup, warn/abort if OMP/BLOSC/MKL thread vars aren't pinned to 1.")
+@click.option("--memory-threshold", type=click.FloatRange(0.05, 0.95), default=0.80,
+              show_default=True,
+              help="Fraction of currently-available RAM that any single tracked "
+                   "allocation is allowed to occupy before the run is aborted.  "
+                   "Defaults to 0.80; values above 0.80 emit a one-time warning "
+                   "because the documented 1.5-2x rechunk transient can exceed "
+                   "the remaining buffer.  Hard upper bound 0.95.")
+@click.option("--verify/--no-verify", default=True, show_default=True,
+              help="After the write finishes, re-read the persisted store and "
+                   "recompute the relative L1/L2/Linf error norms and Euclidean "
+                   "distance against the in-memory original.  On by default as a "
+                   "safety net (catches silent codec bugs and I/O corruption).  "
+                   "Cost: a full second pass of the dataset through the reader, "
+                   "which roughly doubles the wall time of compress_with_optimal. "
+                   "Pass --no-verify for routine production runs where the "
+                   "(compressor, filter, serializer) combo is already trusted - "
+                   "e.g. re-compressing sibling fields with a combo vetted on a "
+                   "prior run - and re-reading the shared Zarr store is the "
+                   "bottleneck.")
+@click.option("--compressor-class", default="all")
+@click.option("--filter-class", default="all")
+@click.option("--serializer-class", default="all")
+@click.option("--with-lossy/--without-lossy", default=True, show_default=True)
+@click.option("--with-numcodecs-wasm/--without-numcodecs-wasm", default=False, show_default=True)
+@click.option("--with-ebcc/--without-ebcc", default=False, show_default=True)
+@click.option("--force/--no-force", default=False, show_default=True,
+              help="Suppress the warning emitted when the (comp_idx, filt_idx, "
+                   "ser_idx) you pass does not match the best combo recorded in "
+                   "manifest_{field}.json by evaluate_combos.  Default is to "
+                   "warn (non-fatal) so typos and stale indices get flagged.  "
+                   "Pass --force when you deliberately want to write a non-best "
+                   "combo (e.g. exploring the Pareto front, testing a fallback).")
+def compress_with_optimal(dataset_file, where_to_write, field_to_compress,
+                          comp_idx, filt_idx, ser_idx,
+                          eval_data_size_limit,
+                          inner_chunk_mib, max_inner_chunk_mib,
+                          spatial_split, shard_mib,
+                          threads, codec_threads,
+                          oversubscription_check, memory_threshold,
+                          verify,
+                          compressor_class, filter_class, serializer_class,
+                          with_lossy, with_numcodecs_wasm, with_ebcc,
+                          force):
     """
-    Compress a field with the optimal combination of 
-    compressor, filter, and serializer as generated by the evaluate_combos command.
+    Compress a single field with the combo chosen by evaluate_combos, streaming
+    directly into the shared {where_to_write}/{dataset}.zarr store under
+    component=field_to_compress.
+
+    Run this command once per field; all invocations write into the same store.
+    After all fields are compressed, call `merge_compressed_fields` to
+    consolidate the metadata.
+
+    What gets compressed
+    --------------------
+    The FULL FIELD.  Sampling is not used to decide what to write - it is used
+    only to parameterize the codec space (see below).
+
+    Codec-space reproducibility
+    ---------------------------
+    Several codecs have parameters derived from data statistics: Asinh's
+    linear_width comes from a quantile of |da|; FixedOffsetScale's offset and
+    scale come from da.mean/std/min/max; EBCC's chunk geometry comes from the
+    field shape.  These are computed inside compressor_space / filter_space /
+    serializer_space to produce a list of pre-instantiated codec objects, and
+    comp_idx / filt_idx / ser_idx index into those lists.
+
+    For an index produced by evaluate_combos to resolve to the SAME codec
+    object here, both commands must build the codec space the same way - which
+    means computing those statistics on the same data.  We do this by sampling
+    the field with build_representative_sample and feeding the sample into the
+    space builders.  Because the sampling is deterministic (np.linspace), the
+    two commands produce identical samples as long as they see the same
+    --eval-data-size-limit value.
+
+    Mismatch failure mode when no sample signature exists to check against:
+    no crash, no warning - just a codec with slightly different parameters
+    than the sweep winner, and a worse compression ratio than reported.
+
+    Parallelism
+    -----------
+    This command runs as a single MPI process, but the write itself is
+    parallelised by dask's threaded scheduler.  Use `--threads N` to cap the
+    number of dask workers; the default auto-detects from visible cores.
+    Peak in-flight memory during the write is roughly `threads * shard_mib`
+    MiB - reduce `--threads` if memory-constrained.
+    """
+    comm = MPI.COMM_WORLD
+    rank = comm.Get_rank()
+    size = comm.Get_size()
+    if size > 1:
+        if rank == 0:
+            click.echo("compress_with_optimal is not meant to run in parallel. "
+                       "Launch it with a single process.")
+        # Collective abort: if we got here, the user launched this with
+        # `mpirun -n >1`; sys.exit on rank 0 alone would leave ranks 1..N
+        # blocking at the next collective.
+        comm.Abort(1)
 
-    Make sure to provide the same --[compressor/filter/serializer]-class and the same --with/without-[lossy/numcodecs-wasm/ebcc] flags as in evaluate_combos,
-    such that the same lists of instantiated objects are generated.
-    
-    Note on passing -1 as index:
-    dc_toolkit compress_with_optimal ... --compressor-class X ... --- -1 -1 -1
+    os.makedirs(where_to_write, exist_ok=True)
 
-    \b
-    Args:
-        dataset_file (str): Path to the input dataset file.
-        where_to_write (str): Directory where the compressed output will be written.
-        field_to_compress (str): Name of the field to compress.
-        comp_idx (int): Index of the compressor to use.
-        filt_idx (int): Index of the filter to use.
-        ser_idx (int): Index of the serializer to use.
-        compressor_class: --compressor-class
-        filter_class: --filter-class
-        serializer_class: --serializer-class
-        with_lossy: --with-lossy/--without-lossy
-        with_numcodecs_wasm: --with-numcodecs-wasm/--without-numcodecs-wasm
-        with_ebcc: --with-ebcc/--without-ebcc
+    click.echo(_version_banner("compress_with_optimal"))
+
+    # Manifest cross-checks: warn if the user's (comp_idx,filt_idx,ser_idx)
+    # differs from the recorded best, or if zarr/numpy/dask versions changed
+    # since the sweep.  --force suppresses both warnings.
+    manifest_path = Path(where_to_write) / f"manifest_{field_to_compress}.json"
+    if manifest_path.is_file() and not force:
+        try:
+            manifest = json.loads(manifest_path.read_text())
+            # ---- best-combo check ----
+            best = manifest.get("best")
+            if best is not None:
+                best_triple = (int(best["comp_idx"]),
+                               int(best["filt_idx"]),
+                               int(best["ser_idx"]))
+                user_triple = (int(comp_idx), int(filt_idx), int(ser_idx))
+                if user_triple != best_triple:
+                    click.echo(
+                        f"[manifest] WARNING: {manifest_path.name} says best "
+                        f"is {best_triple}; you passed {user_triple}."
+                    )
+                    click.echo(
+                        f"  Best combo per manifest:  "
+                        f"compressor={best['compressor']}  "
+                        f"filter={best['filter']}  "
+                        f"serializer={best['serializer']}  "
+                        f"ratio={best['ratio']:.3f}"
+                    )
+                    click.echo(
+                        "  Proceeding anyway.  Pass --force to suppress this "
+                        "warning, or re-run with the manifest triple to use "
+                        "the sweep's best combo."
+                    )
+            # ---- library-version check ----
+            # Patch-level version differences (e.g. dask 2026.3.0 -> 2026.3.1) are
+            # usually harmless but decode paths in xarray / netCDF4 can shift
+            # bytes across version upgrades, which would trip the sample-
+            # signature hash check below.  We report differences here so the
+            # user can connect a hash mismatch to a library upgrade rather
+            # than hunting for a flag they didn't change.
+            sweep_env = manifest.get("env", {}) or {}
+            current_env = {
+                "zarr":  getattr(zarr, "__version__", None),
+                "numpy": getattr(np,   "__version__", None),
+                "dask":  getattr(dask, "__version__", None),
+            }
+            env_deltas = [
+                (pkg, sweep_env.get(pkg), current_env.get(pkg))
+                for pkg in ("zarr", "numpy", "dask")
+                if sweep_env.get(pkg) is not None
+                and sweep_env.get(pkg) != current_env.get(pkg)
+            ]
+            if env_deltas:
+                click.echo(
+                    f"[manifest] WARNING: library versions differ from the "
+                    f"sweep that wrote {manifest_path.name}:"
+                )
+                for pkg, sweep_ver, now_ver in env_deltas:
+                    click.echo(f"  {pkg}: sweep={sweep_ver}  now={now_ver}")
+                click.echo(
+                    "  If the sample-signature check below reports a hash "
+                    "mismatch, a decode-path change across these versions is "
+                    "a likely cause.  Either rerun evaluate_combos in the "
+                    "current environment, or switch back to the sweep's "
+                    "environment.  Pass --force to suppress this warning."
+                )
+        except Exception as manifest_err:
+            # Don't block the run on an unparseable manifest - users may have
+            # hand-edited it, or it may be from an older toolkit version.
+            click.echo(
+                f"[manifest] WARNING: could not parse {manifest_path.name}: "
+                f"{manifest_err}.  Skipping remaining manifest checks."
+            )
+
+    # -------------------------------------------------------------------------
+    # Thread & dask configuration
+    #
+    # The write goes through dask.array.to_zarr, which will parallelise the
+    # codec pipeline across shards.  We make the worker count explicit so the
+    # user can control peak memory (roughly: threads * shard_mib).
+    # -------------------------------------------------------------------------
+    _reset_memcheck_state()
+    cores_avail = utils.detect_cores_available()
+    if threads is None:
+        threads = cores_avail
+    _apply_codec_threads(codec_threads)
+    _check_thread_product(threads, codec_threads)
+    if int(codec_threads or 1) <= 1:
+        utils.check_thread_oversubscription(
+            abort_if_unsafe=oversubscription_check, rank=rank,
+        )
+
+    # Both memory guardrails (write peak + codec-space sample) are deferred
+    # until after the dataset is opened, so we can check against the ACTUAL
+    # data size rather than a configuration upper bound.  Checking the
+    # raw `threads * shard_mib` here would spuriously abort tiny fields
+    # whose total bytes are smaller than a single shard.
+
+    # Scope the scheduler + worker-count settings to this function so they
+    # don't leak if compress_with_optimal is imported and called from a
+    # notebook or longer-lived process.  This makes no difference for the
+    # single-command CLI flow (the process exits right after), but it
+    # matches the pattern used in evaluate_combos.
+    with dask.config.set(scheduler="threads", num_workers=int(threads)):
+        click.echo(
+            f"[topology] {cores_avail} core(s) visible; "
+            f"dask will use {threads} worker(s) for the write. "
+            f"Peak working set ~= {threads} x shard_mib ({shard_mib} MiB) = "
+            f"{threads * shard_mib} MiB (documented; rechunk transients may "
+            f"push 1.5-2x this on fields with adverse source chunking)."
+        )
+
+        # Open dataset + field (lazy).  The FULL FIELD `da` is what gets
+        # compressed; the sample below is used only to build the codec space.
+        ds = utils.open_dataset(dataset_file, field_to_compress)
+        da = ds[field_to_compress]
+        field_bytes = int(da.dtype.itemsize) * int(np.prod(da.shape))
+
+        # Memory guardrail for the write: documented peak is threads * shard_mib,
+        # but capped by field_bytes - you cannot have more transient working
+        # memory than there is data to process.  For a field smaller than one
+        # shard, the real peak is ~field_bytes; for a multi-GB field, it
+        # saturates at threads * shard_mib.  Rechunk transients can push
+        # 1.5-2x above that; we check against the documented peak as a floor.
+        write_peak_bytes = min(int(threads) * int(shard_mib) * 2**20, field_bytes)
+        _check_memory_headroom(
+            write_peak_bytes,
+            label=f"compress_with_optimal write peak for '{field_to_compress}' "
+                  f"(min(threads x shard_mib, field bytes) = "
+                  f"{humanize.naturalsize(write_peak_bytes, binary=True)})",
+            threshold=memory_threshold,
+        )
+
+        # Now that `da` is known, check the ACTUAL sample allocation size
+        # against available RAM (not the budget-as-upper-bound).
+        actual_sample_bytes = min(field_bytes, int(eval_data_size_limit))
+        _check_memory_headroom(
+            actual_sample_bytes,
+            label=f"codec-space sample for '{field_to_compress}' "
+                  f"({humanize.naturalsize(actual_sample_bytes, binary=True)})",
+            threshold=memory_threshold,
+        )
+
+        # Sample for codec-space construction.  Same --eval-data-size-limit
+        # as the sweep -> identical pre-instantiated codec objects, so
+        # comp_idx/filt_idx/ser_idx resolve consistently.  This sample is
+        # NOT what gets compressed (we compress `da` below).
+        sample_for_codec_space = utils.build_representative_sample(
+            da, eval_data_size_limit,
+        ).compute()
+
+        # Sample reproducibility hash: if evaluate_combos wrote a signature,
+        # recompute and compare.  Mismatch means codec-space indices resolve
+        # to different objects than the sweep measured, so refuse to continue.
+        # Absent signature: warn once and proceed (older sweep, manual indices).
+        sig_path = _signature_path(where_to_write, str(field_to_compress))
+        if sig_path.is_file():
+            try:
+                expected = json.loads(sig_path.read_text())
+                sample_np_view = np.ascontiguousarray(
+                    sample_for_codec_space.values
+                )
+                observed = _sample_signature(
+                    dataset_file=dataset_file,
+                    var=str(field_to_compress),
+                    eval_data_size_limit=int(eval_data_size_limit),
+                    sample_np=sample_np_view,
+                )
+                fields_to_check = (
+                    "dataset_stem", "var", "eval_data_size_limit",
+                    "shape", "dtype", "nbytes", "sha256",
+                )
+                mismatches = [
+                    f for f in fields_to_check
+                    if expected.get(f) != observed.get(f)
+                ]
+                if mismatches:
+                    click.echo(
+                        f"[sample-hash] MISMATCH vs {sig_path.name}: "
+                        f"differing fields = {mismatches}"
+                    )
+                    for f in mismatches:
+                        click.echo(
+                            f"  {f}: expected={expected.get(f)} "
+                            f"observed={observed.get(f)}"
+                        )
+                    click.echo(
+                        "  Most common cause: different --eval-data-size-limit "
+                        "between sweep and reuse.  Re-run with matching flag."
+                    )
+                    sys.exit(1)
+                else:
+                    click.echo(
+                        f"[sample-hash] OK, matches {sig_path.name} "
+                        f"(sha256={observed['sha256'][:16]}…)."
+                    )
+                # Drop the local view (a copy only if the sample was not already
+                # C-contiguous); sample_for_codec_space keeps its own buffer.
+                del sample_np_view
+            except Exception as sig_err:
+                click.echo(
+                    f"[sample-hash] WARNING: could not verify signature "
+                    f"{sig_path.name}: {sig_err}.  Proceeding without check."
+                )
+        else:
+            click.echo(
+                f"[sample-hash] no {sig_path.name} found - proceeding on "
+                f"trust.  (For the full safety net, run evaluate_combos "
+                f"first with the same --where-to-write.)"
+            )
+
+        compressors = utils.compressor_space(sample_for_codec_space, with_lossy,
+                                             with_numcodecs_wasm, with_ebcc, compressor_class)
+        filters     = utils.filter_space(sample_for_codec_space, with_lossy,
+                                         with_numcodecs_wasm, with_ebcc, filter_class)
+        serializers = utils.serializer_space(sample_for_codec_space, with_lossy,
+                                             with_numcodecs_wasm, with_ebcc, serializer_class)
+
+        # Index validation
+        for name, idx, arr in [("comp_idx", comp_idx, compressors),
+                               ("filt_idx", filt_idx, filters),
+                               ("ser_idx",  ser_idx,  serializers)]:
+            if not (-1 <= idx < len(arr)):
+                click.echo(f"Invalid {name}: {idx} (must be in [-1, {len(arr) - 1}])")
+                sys.exit(1)
+
+        optimal_compressor = compressors[comp_idx][1] if comp_idx != -1 else None
+        optimal_filter     = filters[filt_idx][1]     if filt_idx != -1 else None
+        optimal_serializer = serializers[ser_idx][1]  if ser_idx  != -1 else None
+
+        # Per-serializer data shaping (on the FULL field, not the sample).
+        # zfpy and EBCC have incompatible shape expectations; at most one branch
+        # fires (elif documents that intent).
+        data_to_persist = da
+        if _is_zfpy_serializer(optimal_serializer):
+            data_to_persist = da.stack(flat_dim=da.dims)
+        elif _is_ebcc_serializer(optimal_serializer):
+            data_to_persist = da.squeeze().astype("float32")
+
+        # Pipeline assembly rules (same semantics as the original)
+        filters_ = [optimal_filter]
+        compressors_ = [optimal_compressor]
+        serializer_ = optimal_serializer
+        if isinstance(serializer_, AnyNumcodecsArrayBytesCodec) or optimal_filter is None:
+            filters_ = None
+        if optimal_compressor is None:
+            compressors_ = None
+        if optimal_serializer is None:
+            serializer_ = "auto"
+
+        # Compute sharding geometry for the FULL field.  Passes dim names so
+        # vertical-like dims are kept whole when spatial splitting is needed
+        # (hiopy approach).  shards may come back as None -- that signals
+        # "skip sharding" because one inner chunk already meets the shard
+        # target (a shard would bundle <= 1 chunk and add only index overhead).
+        inner_chunks, shards = utils.compute_chunk_and_shard_shape(
+            data_to_persist.shape, data_to_persist.dtype,
+            inner_mib=inner_chunk_mib, shard_mib=shard_mib,
+            dims=tuple(data_to_persist.dims),
+            max_inner_mib=max_inner_chunk_mib,
+            allow_spatial_split=spatial_split,
+        )
+
+        _itemsize = int(data_to_persist.dtype.itemsize)
+        _inner_bytes = _itemsize * int(np.prod(inner_chunks))
+        _shard_bytes = (_itemsize * int(np.prod(shards))
+                        if shards is not None else _inner_bytes)
+        if (not spatial_split
+                and _inner_bytes > max_inner_chunk_mib * 2**20
+                and rank == 0):
+            click.echo(
+                f"[chunks] WARNING: --no-spatial-split produced an inner chunk of "
+                f"{humanize.naturalsize(_inner_bytes, binary=True)} "
+                f"(shape {inner_chunks}), exceeding --max-inner-chunk-mib "
+                f"({max_inner_chunk_mib} MiB).  Codec internals (zstd block "
+                f"limit, blosc memory) may misbehave at this size.  Consider "
+                f"re-enabling spatial splitting."
+            )
+
+        # Open (or create) the shared merged store.  mode='a' means new fields are
+        # added alongside any fields previously written.
+        merged_path = _merged_store_path(where_to_write, dataset_file)
+        os.makedirs(Path(merged_path).parent, exist_ok=True)
+        store = zarr.storage.LocalStore(merged_path, read_only=False)
+        # Ensure a root group exists.  If the store is corrupted or the path is
+        # unwritable we want to surface that now, not deep inside persist_with_codec_pipeline.
+        try:
+            zarr.open_group(store, mode="a", zarr_format=3)
+        except Exception as e:
+            click.echo(
+                f"[persist] ERROR: cannot open or create zarr group at {merged_path}: {e}"
+            )
+            raise
+
+        if shards is None:
+            click.echo(
+                f"[persist] {field_to_compress} -> {merged_path} "
+                f"(inner chunks={inner_chunks}, "
+                f"{humanize.naturalsize(_inner_bytes, binary=True)}; "
+                f"sharding skipped -- one chunk >= shard target)"
+            )
+        else:
+            click.echo(
+                f"[persist] {field_to_compress} -> {merged_path} "
+                f"(inner chunks={inner_chunks}, "
+                f"{humanize.naturalsize(_inner_bytes, binary=True)}; "
+                f"shards={shards}, "
+                f"{humanize.naturalsize(_shard_bytes, binary=True)})"
+            )
+
+        # Refined memory guardrail using ACTUAL write-unit bytes (one task =
+        # one shard if sharded, one chunk otherwise).  The earlier check at
+        # the top of the dask context used `threads * shard_mib` as a
+        # rough estimate, but that can under-count when chunks are
+        # oversized (--no-spatial-split + huge timestep) or over-count when
+        # the field is small.  Now that we know the real geometry, re-check.
+        _write_unit_bytes = _shard_bytes  # == _inner_bytes when shards is None
+        _real_write_peak = min(int(threads) * int(_write_unit_bytes), field_bytes)
+        _check_memory_headroom(
+            _real_write_peak,
+            label=f"compress_with_optimal real write peak for "
+                  f"'{field_to_compress}' (threads x write-unit-bytes = "
+                  f"{humanize.naturalsize(_real_write_peak, binary=True)})",
+            threshold=memory_threshold,
+        )
+
+        persist_t0 = time.perf_counter()
+        ratio, errors, eucd = utils.persist_with_codec_pipeline(
+            data_to_persist, store,
+            component=field_to_compress,
+            filters=filters_, compressors=compressors_, serializer=serializer_,
+            inner_chunks=inner_chunks, shards=shards,
+            verify=verify, verbose=False, rank=rank,
+        )
+        persist_seconds = time.perf_counter() - persist_t0
+
+        # Compose the summary.  Error metrics are only defined when --verify is on
+        # (persist_with_codec_pipeline returns errors=None, eucd=None otherwise),
+        # so we gate that tail of the message rather than crashing on None indexing.
+        summary = (
+            "optimal combo:\n"
+            f"compressor : {optimal_compressor}\n"
+            f"filter     : {optimal_filter}\n"
+            f"serializer : {optimal_serializer}\n"
+            "corresponding indices in lists of instantiated objects:\n"
+            f"compressor : {comp_idx}\n"
+            f"filter     : {filt_idx}\n"
+            f"serializer : {ser_idx}\n"
+            f"Compression Ratio: {ratio:.3f}"
+        )
+        if verify:
+            summary += (
+                f" | Relative L1 Error: {errors['Relative_Error_L1']:.3e}"
+                f" | Euclidean Distance: {eucd:.3e}"
+            )
+        else:
+            summary += "  (error metrics skipped: --no-verify)"
+        click.echo(summary)
+
+        # ------------------------------------------------------------------
+        # Per-field persist manifest (machine-readable).
+        # Mirrors the evaluate_combos manifest so a downstream tool can pick
+        # up the exact combo that was written, which shards were produced,
+        # and how long it took.
+        # ------------------------------------------------------------------
+        persist_manifest = {
+            "command": "compress_with_optimal",
+            "dataset_file": os.fspath(dataset_file),
+            "var": str(field_to_compress),
+            "where_to_write": os.fspath(where_to_write),
+            "merged_store": merged_path,
+            "args": {
+                "comp_idx": int(comp_idx),
+                "filt_idx": int(filt_idx),
+                "ser_idx":  int(ser_idx),
+                "eval_data_size_limit": int(eval_data_size_limit),
+                "inner_chunk_mib": int(inner_chunk_mib),
+                "max_inner_chunk_mib": int(max_inner_chunk_mib),
+                "spatial_split": bool(spatial_split),
+                "shard_mib": int(shard_mib),
+                "threads": int(threads),
+                "verify": bool(verify),
+                "force": bool(force),
+                "compressor_class": compressor_class,
+                "filter_class": filter_class,
+                "serializer_class": serializer_class,
+            },
+            "inner_chunks": list(inner_chunks),
+            "inner_chunk_bytes": int(_inner_bytes),
+            "shards": (list(shards) if shards is not None else None),
+            "shard_bytes": (int(_shard_bytes) if shards is not None else None),
+            "sharding_skipped": bool(shards is None),
+            "compressor": str(optimal_compressor),
+            "filter":     str(optimal_filter),
+            "serializer": str(optimal_serializer),
+            "ratio": float(ratio),
+            "errors": {k: float(v) for k, v in (errors or {}).items()},
+            "eucd": (float(eucd) if eucd is not None else None),
+            "persist_seconds": float(persist_seconds),
+            "env": {
+                "zarr": getattr(zarr, "__version__", None),
+                "numpy": getattr(np, "__version__", None),
+                "dask": getattr(dask, "__version__", None),
+            },
+        }
+        persist_manifest_path = os.path.join(
+            where_to_write, f"persist_manifest_{field_to_compress}.json"
+        )
+        try:
+            with open(persist_manifest_path, "w") as pmf:
+                json.dump(persist_manifest, pmf, indent=2, default=str)
+            click.echo(f"[persist] wrote manifest -> {persist_manifest_path}")
+        except Exception as persist_manifest_err:
+            click.echo(
+                f"[persist] WARNING: could not write manifest "
+                f"{persist_manifest_path}: {persist_manifest_err}"
+            )
+
+        # Release the LocalStore handles.  From the CLI this is cosmetic (the
+        # process exits next), but it keeps the function well-behaved when
+        # imported and called from a longer-lived process.
+        close = getattr(store, "close", None)
+        if callable(close):
+            try:
+                close()
+            except Exception:
+                pass
+
+
+@cli.command("compress_fields_from_results")
+@click.argument("dataset_file", type=click.Path(exists=True, dir_okay=True, file_okay=True))
+@click.argument("where_to_write", type=click.Path(dir_okay=True, file_okay=False, exists=True))
+@click.option("--vars", "vars_filter", default=None,
+              help="Comma-separated list of variable names to process. "
+                   "Default: every variable for which a results_{var}.parquet "
+                   "or manifest_{var}.json exists in WHERE_TO_WRITE.")
+@click.option("--eval-data-size-limit", default="5GB", callback=_size_option_callback,
+              show_default=True,
+              help="Must match the value used in the prior evaluate_combos run.")
+@click.option("--inner-chunk-mib", type=int, default=16, show_default=True)
+@click.option("--max-inner-chunk-mib", type=int, default=256, show_default=True,
+              help="Hard ceiling on inner chunk size (MiB).  Triggers a warning "
+                   "if exceeded under --no-spatial-split.")
+@click.option("--spatial-split/--no-spatial-split", default=True, show_default=True,
+              help="When one timestep already exceeds --inner-chunk-mib, split "
+                   "spatial dims (horizontal/cell first, vertical last) until "
+                   "the chunk fits the target.  Pass --no-spatial-split to keep "
+                   "one timestep per chunk regardless of size (hiopy-style "
+                   "layout; sharding is then skipped automatically).")
+@click.option("--shard-mib", type=int, default=512, show_default=True)
+@click.option("--threads", type=int, default=None,
+              help="Dask workers for the write.  Default: auto-detected.")
+@click.option("--codec-threads", type=int, default=1, show_default=True,
+              help="Internal threads per codec call (Blosc set live; for "
+                   "OpenMP/MKL/OpenBLAS export the matching env vars in the "
+                   "shell BEFORE running). --threads * --codec-threads must "
+                   "be <= physical cores; the oversubscription check is "
+                   "skipped when --codec-threads > 1.")
+@click.option("--oversubscription-check/--no-oversubscription-check", default=True,
+              show_default=True)
+@click.option("--memory-threshold", type=click.FloatRange(0.05, 0.95), default=0.80,
+              show_default=True,
+              help="Fraction of currently-available RAM that any single tracked "
+                   "allocation is allowed to occupy before the run is aborted.  "
+                   "Defaults to 0.80; values above 0.80 emit a one-time warning "
+                   "because the documented 1.5-2x rechunk transient can exceed "
+                   "the remaining buffer.  Hard upper bound 0.95.")
+@click.option("--verify/--no-verify", default=True, show_default=True)
+@click.option("--compressor-class", default="all")
+@click.option("--filter-class", default="all")
+@click.option("--serializer-class", default="all")
+@click.option("--with-lossy/--without-lossy", default=True, show_default=True)
+@click.option("--with-numcodecs-wasm/--without-numcodecs-wasm", default=False, show_default=True)
+@click.option("--with-ebcc/--without-ebcc", default=False, show_default=True)
+@click.option("--skip-existing/--no-skip-existing", default=True, show_default=True,
+              help="If a field is already present in the merged store, skip it. "
+                   "Disable with --no-skip-existing to force re-compression.")
+@click.option("--continue-on-error/--no-continue-on-error", default=True, show_default=True,
+              help="If compressing one field fails, log and continue with the "
+                   "rest (default).  Disable to fail the whole run on first error.")
+def compress_fields_from_results(dataset_file, where_to_write, vars_filter,
+                                  eval_data_size_limit, inner_chunk_mib,
+                                  max_inner_chunk_mib, spatial_split, shard_mib,
+                                  threads, codec_threads,
+                                  oversubscription_check, memory_threshold,
+                                  verify,
+                                  compressor_class, filter_class, serializer_class,
+                                  with_lossy, with_numcodecs_wasm, with_ebcc,
+                                  skip_existing, continue_on_error):
+    """
+    Batch wrapper around compress_with_optimal.
+
+    Reads the best (comp_idx, filt_idx, ser_idx) per variable from
+    `manifest_{var}.json` (preferred) or `results_{var}.parquet` (fallback),
+    then compresses each variable into the shared `{dataset}.zarr` store.
+    The dataset is opened once and re-used across variables.
+
+    This is the command most production pipelines want after an
+    `evaluate_combos` run - it closes the loop without forcing the user to
+    glue together N per-field invocations by hand.
+
+    Launch as a SINGLE process (no mpirun).  Parallelism inside the write is
+    provided by dask's threaded scheduler, same as compress_with_optimal.
     """
     comm = MPI.COMM_WORLD
     rank = comm.Get_rank()
     size = comm.Get_size()
     if size > 1:
         if rank == 0:
-            click.echo("This command is not meant to be run in parallel. Please run it with a single process.")
+            click.echo("compress_fields_from_results is not meant to run in parallel. "
+                       "Launch it with a single process.")
+        comm.Abort(1)
+
+    click.echo(_version_banner("compress_fields_from_results"))
+
+    # Thread + dask config (same pattern as compress_with_optimal)
+    _reset_memcheck_state()
+    cores_avail = utils.detect_cores_available()
+    if threads is None:
+        threads = cores_avail
+    _apply_codec_threads(codec_threads)
+    _check_thread_product(threads, codec_threads)
+    if int(codec_threads or 1) <= 1:
+        utils.check_thread_oversubscription(
+            abort_if_unsafe=oversubscription_check, rank=rank,
+        )
+    # Per-variable memory checks (write peak AND codec-space sample) happen
+    # inside the loop below, once we know each variable's actual size.
+    # A single up-front `threads * shard_mib` check would spuriously abort
+    # on small fields (e.g. tigge files where the field is smaller than
+    # one shard) on memory-constrained nodes; a sum-based pre-check would
+    # miss that variables are processed sequentially, not concurrently.
+
+    # Resolve (var, comp_idx, filt_idx, ser_idx) from where_to_write.  Prefer
+    # manifest_{var}.json; fall back to best-ratio in results_{var}.parquet.
+    # Capture sweep env from the first manifest for a one-time version warning.
+    wtw = Path(where_to_write)
+    candidates = []
+    sweep_env = None
+    for mpath in sorted(wtw.glob("manifest_*.json")):
+        var_name = mpath.stem.removeprefix("manifest_")
+        try:
+            m = json.loads(mpath.read_text())
+            best = m.get("best")
+            if best is None:
+                click.echo(f"[batch] {var_name}: manifest has no best combo; skipping.")
+                continue
+            candidates.append({
+                "var": var_name,
+                "comp_idx": int(best["comp_idx"]),
+                "filt_idx": int(best["filt_idx"]),
+                "ser_idx":  int(best["ser_idx"]),
+                "source": f"manifest {mpath.name}",
+            })
+            if sweep_env is None and m.get("env"):
+                sweep_env = m["env"]
+        except Exception as e:
+            click.echo(f"[batch] WARNING: failed to parse {mpath}: {e}")
+
+    # One-shot library-version cross-check.  Same rationale as the one in
+    # compress_with_optimal: if zarr/numpy/dask differ between the sweep
+    # and now, the sample bytes may shift (decode path) and the per-var
+    # signature checks in the loop below may trip for environmental rather
+    # than user reasons.  Reporting here connects the two for the user.
+    if sweep_env:
+        current_env = {
+            "zarr":  getattr(zarr, "__version__", None),
+            "numpy": getattr(np,   "__version__", None),
+            "dask":  getattr(dask, "__version__", None),
+        }
+        env_deltas = [
+            (pkg, sweep_env.get(pkg), current_env.get(pkg))
+            for pkg in ("zarr", "numpy", "dask")
+            if sweep_env.get(pkg) is not None
+            and sweep_env.get(pkg) != current_env.get(pkg)
+        ]
+        if env_deltas:
+            click.echo(
+                "[batch] WARNING: library versions differ from the sweep "
+                "that wrote these manifests:"
+            )
+            for pkg, sweep_ver, now_ver in env_deltas:
+                click.echo(f"  {pkg}: sweep={sweep_ver}  now={now_ver}")
+            click.echo(
+                "  If per-variable signature checks below report hash "
+                "mismatches, a decode-path change across these versions "
+                "is a likely cause.  Either rerun evaluate_combos in the "
+                "current environment, or switch back to the sweep's "
+                "environment."
+            )
+
+    # Any parquet files without a companion manifest? Take best-ratio from them.
+    known_vars = {c["var"] for c in candidates}
+    for ppath in sorted(wtw.glob("results_*.parquet")):
+        var_name = ppath.stem.removeprefix("results_")
+        if var_name in known_vars:
+            continue
+        try:
+            dfp = pd.read_parquet(ppath)
+            kept = dfp[dfp["keep"] == True] if "keep" in dfp.columns else dfp
+            if len(kept) == 0:
+                click.echo(f"[batch] {var_name}: no kept rows in {ppath.name}; skipping.")
+                continue
+            best_row = kept.sort_values("ratio", ascending=False).iloc[0]
+            candidates.append({
+                "var": var_name,
+                "comp_idx": int(best_row["comp_idx"]),
+                "filt_idx": int(best_row["filt_idx"]),
+                "ser_idx":  int(best_row["ser_idx"]),
+                "source": f"parquet {ppath.name}",
+            })
+        except Exception as e:
+            click.echo(f"[batch] WARNING: failed to parse {ppath}: {e}")
+
+    if vars_filter:
+        wanted = set(v.strip() for v in vars_filter.split(",") if v.strip())
+        candidates = [c for c in candidates if c["var"] in wanted]
+        missing = wanted - {c["var"] for c in candidates}
+        if missing:
+            click.echo(
+                f"[batch] WARNING: --vars specified {sorted(missing)} "
+                f"but no manifest/parquet was found for those."
+            )
+
+    if not candidates:
+        click.echo(
+            "[batch] ERROR: no variables to compress. Did evaluate_combos run "
+            "against the same WHERE_TO_WRITE?"
+        )
         sys.exit(1)
 
-    os.makedirs(where_to_write, exist_ok=True)
+    click.echo(
+        f"[batch] will compress {len(candidates)} field(s): "
+        f"{', '.join(c['var'] for c in candidates)}"
+    )
 
-    ds = utils.open_dataset(dataset_file, field_to_compress)
-    da = ds[field_to_compress]
+    # Open the dataset ONCE; each iteration selects its own da from ds.
+    ds = utils.open_dataset(dataset_file, field_to_compress=None, rank=rank)
 
-    compressors = utils.compressor_space(da, with_lossy, with_numcodecs_wasm, with_ebcc, compressor_class)
-    filters = utils.filter_space(da, with_lossy, with_numcodecs_wasm, with_ebcc, filter_class)
-    serializers = utils.serializer_space(da, with_lossy, with_numcodecs_wasm, with_ebcc, serializer_class)
+    merged_path = _merged_store_path(where_to_write, dataset_file)
+    os.makedirs(Path(merged_path).parent, exist_ok=True)
 
-    if -1 <= comp_idx < len(compressors):
-        pass
-    else:
-        click.echo(f"Invalid comp_idx: {comp_idx}")
-        sys.exit(1)
-    if -1 <= filt_idx < len(filters):
-        pass
-    else:
-        click.echo(f"Invalid filt_idx: {filt_idx}")
-        sys.exit(1)
-    if -1 <= ser_idx < len(serializers):
-        pass
-    else:
-        click.echo(f"Invalid ser_idx: {ser_idx}")
-        sys.exit(1)
+    # Inspect the merged store (if any) to honor --skip-existing.
+    existing_arrays = set()
+    if Path(merged_path).is_dir():
+        try:
+            store_ro = zarr.storage.LocalStore(merged_path, read_only=True)
+            g_ro = zarr.open_group(store_ro, mode="r")
+            existing_arrays = set(g_ro.array_keys())
+            close_ro = getattr(store_ro, "close", None)
+            if callable(close_ro):
+                try:
+                    close_ro()
+                except Exception:
+                    pass
+        except Exception:
+            pass
+
+    # ---- per-field loop ----
+    results_by_var = {}
+    any_error = False
+
+    with dask.config.set(scheduler="threads", num_workers=int(threads)):
+        for idx, c in enumerate(candidates, start=1):
+            var = c["var"]
+            click.echo(
+                f"\n[batch] ({idx}/{len(candidates)}) {var} from {c['source']}: "
+                f"comp={c['comp_idx']} filt={c['filt_idx']} ser={c['ser_idx']}"
+            )
+            if skip_existing and var in existing_arrays:
+                click.echo(f"[batch] {var} already in {merged_path}; skipping.")
+                results_by_var[var] = {"status": "skipped-existing"}
+                continue
+            if var not in ds.data_vars:
+                msg = f"variable '{var}' not in dataset"
+                if continue_on_error:
+                    click.echo(f"[batch] WARNING: {msg}; skipping.")
+                    results_by_var[var] = {"status": "missing-from-dataset"}
+                    continue
+                else:
+                    click.echo(f"[batch] ERROR: {msg}; aborting (use "
+                               f"--continue-on-error to skip).")
+                    sys.exit(1)
+
+            try:
+                field_t0 = time.perf_counter()
+                da = ds[var]
+
+                # Per-variable memory guards: check the ACTUAL allocation
+                # size against available RAM, not configuration upper bounds.
+                # Both the write peak (threads * shard_mib) and the codec-
+                # space sample (eval_data_size_limit) are upper bounds; for
+                # fields smaller than those bounds, the real allocation is
+                # capped by field_bytes.  Aborting on the upper bound would
+                # spuriously trip on tiny fields (e.g. tigge dx=2) on
+                # memory-constrained nodes.
+                field_bytes = int(da.dtype.itemsize) * int(np.prod(da.shape))
+
+                write_peak_bytes = min(
+                    int(threads) * int(shard_mib) * 2**20,
+                    field_bytes,
+                )
+                _check_memory_headroom(
+                    write_peak_bytes,
+                    label=f"write peak for '{var}' "
+                          f"(min(threads x shard_mib, field bytes) = "
+                          f"{humanize.naturalsize(write_peak_bytes, binary=True)})",
+                    threshold=memory_threshold,
+                )
+
+                actual_sample_bytes = min(field_bytes, int(eval_data_size_limit))
+                _check_memory_headroom(
+                    actual_sample_bytes,
+                    label=f"codec-space sample for '{var}' "
+                          f"({humanize.naturalsize(actual_sample_bytes, binary=True)})",
+                    threshold=memory_threshold,
+                )
 
-    optimal_compressor = compressors[comp_idx][1] if comp_idx != -1 else None
-    optimal_filter = filters[filt_idx][1] if filt_idx != -1 else None
-    optimal_serializer = serializers[ser_idx][1] if ser_idx != -1 else None
+                # Build codec space from the sample (same contract as
+                # compress_with_optimal).  Verify against signature when present.
+                sample_for_codec_space = utils.build_representative_sample(
+                    da, eval_data_size_limit,
+                ).compute()
+
+                sig_path = _signature_path(where_to_write, str(var))
+                if sig_path.is_file():
+                    try:
+                        expected = json.loads(sig_path.read_text())
+                        sample_np_view = np.ascontiguousarray(
+                            sample_for_codec_space.values
+                        )
+                        observed = _sample_signature(
+                            dataset_file=dataset_file,
+                            var=str(var),
+                            eval_data_size_limit=int(eval_data_size_limit),
+                            sample_np=sample_np_view,
+                        )
+                        mismatches = [
+                            f for f in (
+                                "dataset_stem", "var", "eval_data_size_limit",
+                                "shape", "dtype", "nbytes", "sha256",
+                            )
+                            if expected.get(f) != observed.get(f)
+                        ]
+                        del sample_np_view
+                        if mismatches:
+                            click.echo(
+                                f"[sample-hash] MISMATCH for {var}: "
+                                f"differing fields = {mismatches}"
+                            )
+                            if continue_on_error:
+                                click.echo(
+                                    f"[batch] skipping {var} (use matching "
+                                    f"--eval-data-size-limit to fix)."
+                                )
+                                results_by_var[var] = {"status": "signature-mismatch"}
+                                continue
+                            sys.exit(1)
+                    except Exception as sig_err:
+                        click.echo(
+                            f"[sample-hash] WARNING {var}: {sig_err}; proceeding."
+                        )
+
+                compressors = utils.compressor_space(
+                    sample_for_codec_space, with_lossy, with_numcodecs_wasm,
+                    with_ebcc, compressor_class,
+                )
+                filters_space = utils.filter_space(
+                    sample_for_codec_space, with_lossy, with_numcodecs_wasm,
+                    with_ebcc, filter_class,
+                )
+                serializers = utils.serializer_space(
+                    sample_for_codec_space, with_lossy, with_numcodecs_wasm,
+                    with_ebcc, serializer_class,
+                )
 
-    if isinstance(optimal_serializer, numcodecs.zarr3.ZFPY):
-        da = da.stack(flat_dim=da.dims)
-    if isinstance(optimal_serializer, AnyNumcodecsArrayBytesCodec) and isinstance(optimal_serializer.codec, EBCCZarrFilter):
-        da = da.squeeze().astype("float32")
+                comp_idx, filt_idx, ser_idx = c["comp_idx"], c["filt_idx"], c["ser_idx"]
+                for name, idx2, arr in [("comp_idx", comp_idx, compressors),
+                                        ("filt_idx", filt_idx, filters_space),
+                                        ("ser_idx",  ser_idx,  serializers)]:
+                    if not (-1 <= idx2 < len(arr)):
+                        raise IndexError(
+                            f"Invalid {name}: {idx2} (must be in [-1, {len(arr)-1}]) for {var}"
+                        )
+
+                optimal_compressor = compressors[comp_idx][1] if comp_idx != -1 else None
+                optimal_filter     = filters_space[filt_idx][1] if filt_idx != -1 else None
+                optimal_serializer = serializers[ser_idx][1]  if ser_idx  != -1 else None
+
+                data_to_persist = da
+                if _is_zfpy_serializer(optimal_serializer):
+                    data_to_persist = da.stack(flat_dim=da.dims)
+                elif _is_ebcc_serializer(optimal_serializer):
+                    data_to_persist = da.squeeze().astype("float32")
+
+                filters_ = [optimal_filter]
+                compressors_ = [optimal_compressor]
+                serializer_ = optimal_serializer
+                if isinstance(serializer_, AnyNumcodecsArrayBytesCodec) or optimal_filter is None:
+                    filters_ = None
+                if optimal_compressor is None:
+                    compressors_ = None
+                if optimal_serializer is None:
+                    serializer_ = "auto"
+
+                inner_chunks, shards = utils.compute_chunk_and_shard_shape(
+                    data_to_persist.shape, data_to_persist.dtype,
+                    inner_mib=inner_chunk_mib, shard_mib=shard_mib,
+                    dims=tuple(data_to_persist.dims),
+                    max_inner_mib=max_inner_chunk_mib,
+                    allow_spatial_split=spatial_split,
+                )
 
-    filters_ = [optimal_filter,]
-    compressors_ = [optimal_compressor,]
-    serializer_ = optimal_serializer
+                _itemsize = int(data_to_persist.dtype.itemsize)
+                _inner_bytes = _itemsize * int(np.prod(inner_chunks))
+                _shard_bytes = (_itemsize * int(np.prod(shards))
+                                if shards is not None else _inner_bytes)
+                if (not spatial_split
+                        and _inner_bytes > max_inner_chunk_mib * 2**20
+                        and rank == 0):
+                    click.echo(
+                        f"[chunks] WARNING: --no-spatial-split for '{var}' "
+                        f"produced an inner chunk of "
+                        f"{humanize.naturalsize(_inner_bytes, binary=True)} "
+                        f"(shape {inner_chunks}), exceeding "
+                        f"--max-inner-chunk-mib ({max_inner_chunk_mib} MiB).  "
+                        f"Codec internals may misbehave at this size."
+                    )
+
+                # Refined memory guardrail using ACTUAL write-unit bytes.
+                _write_unit_bytes = _shard_bytes
+                _real_write_peak = min(int(threads) * int(_write_unit_bytes),
+                                       field_bytes)
+                _check_memory_headroom(
+                    _real_write_peak,
+                    label=f"real write peak for '{var}' "
+                          f"(threads x write-unit-bytes = "
+                          f"{humanize.naturalsize(_real_write_peak, binary=True)})",
+                    threshold=memory_threshold,
+                )
 
-    if isinstance(serializer_, AnyNumcodecsArrayBytesCodec) or optimal_filter is None:
-        filters_ = None
-    if optimal_compressor is None:
-        compressors_ = None
-    if optimal_serializer is None:
-        serializer_ = "auto"
+                store = zarr.storage.LocalStore(merged_path, read_only=False)
+                try:
+                    try:
+                        zarr.open_group(store, mode="a", zarr_format=3)
+                    except Exception as e:
+                        click.echo(
+                            f"[persist] ERROR opening group at {merged_path}: {e}"
+                        )
+                        raise
+                    if shards is None:
+                        click.echo(
+                            f"[persist] {var} -> {merged_path} "
+                            f"(inner chunks={inner_chunks}, "
+                            f"{humanize.naturalsize(_inner_bytes, binary=True)}; "
+                            f"sharding skipped)"
+                        )
+                    else:
+                        click.echo(
+                            f"[persist] {var} -> {merged_path} "
+                            f"(inner chunks={inner_chunks}, "
+                            f"{humanize.naturalsize(_inner_bytes, binary=True)}; "
+                            f"shards={shards}, "
+                            f"{humanize.naturalsize(_shard_bytes, binary=True)})"
+                        )
+                    ratio, errors, eucd = utils.persist_with_codec_pipeline(
+                        data_to_persist, store,
+                        component=var,
+                        filters=filters_, compressors=compressors_, serializer=serializer_,
+                        inner_chunks=inner_chunks, shards=shards,
+                        verify=verify, verbose=False, rank=rank,
+                    )
+                finally:
+                    close = getattr(store, "close", None)
+                    if callable(close):
+                        try:
+                            close()
+                        except Exception:
+                            pass
+
+                field_seconds = time.perf_counter() - field_t0
+                summary = f"{var}: ratio={ratio:.3f}"
+                if verify:
+                    summary += (
+                        f" L1_rel={errors['Relative_Error_L1']:.3e} "
+                        f"eucd={eucd:.3e}"
+                    )
+                summary += f"  ({field_seconds:.1f}s)"
+                click.echo(f"[batch] {summary}")
+                results_by_var[var] = {
+                    "status": "ok",
+                    "ratio": float(ratio),
+                    "errors": {k: float(v) for k, v in (errors or {}).items()},
+                    "eucd": (float(eucd) if eucd is not None else None),
+                    "seconds": float(field_seconds),
+                    "comp_idx": int(comp_idx),
+                    "filt_idx": int(filt_idx),
+                    "ser_idx":  int(ser_idx),
+                }
+
+            except Exception as field_err:
+                any_error = True
+                click.echo(f"[batch] ERROR on {var}: {field_err!r}")
+                results_by_var[var] = {"status": "error", "error": repr(field_err)}
+                if not continue_on_error:
+                    raise
+
+    # Summary manifest for the whole batch.
+    batch_manifest_path = os.path.join(where_to_write, "batch_manifest.json")
+    try:
+        with open(batch_manifest_path, "w") as bmf:
+            json.dump(
+                {
+                    "command": "compress_fields_from_results",
+                    "dataset_file": os.fspath(dataset_file),
+                    "where_to_write": os.fspath(where_to_write),
+                    "merged_store": merged_path,
+                    "results": results_by_var,
+                    "any_error": any_error,
+                },
+                bmf, indent=2, default=str,
+            )
+        click.echo(f"\n[batch] wrote summary -> {batch_manifest_path}")
+    except Exception as bmf_err:
+        click.echo(f"[batch] WARNING: could not write batch manifest: {bmf_err}")
 
-    compression_ratio, errors, euclidean_distance = utils.compress_with_zarr(
-        da,
-        dataset_file,
-        field_to_compress,
-        where_to_write,
-        filters=filters_,
-        compressors=compressors_,
-        serializer=serializer_,
-        verbose=False,
-    )
+    if any_error and not continue_on_error:
+        sys.exit(1)
 
-    msg = (
-        "optimal combo: \n"
-        f"compressor : {optimal_compressor}\nfilter     : {optimal_filter}\nserializer : {optimal_serializer}\n"
-        "corresponding indices in lists of instantiated objects:\n"
-        f"compressor : {comp_idx}\nfilter     : {filt_idx}\nserializer : {ser_idx}\n"
-        f"Compression Ratio: {compression_ratio:.3f} | Relative L1 Error: {errors['Relative_Error_L1']:.3e} | Euclidean Distance: {euclidean_distance:.3e}"
-    )
-    click.echo(msg)
 
 @cli.command("merge_compressed_fields")
 @click.argument("dataset_file", type=click.Path(exists=True, dir_okay=True, file_okay=True))
 @click.argument("compressed_files_location", type=click.Path(dir_okay=True, file_okay=False, exists=False))
 def merge_compressed_fields(dataset_file: str, compressed_files_location: str):
     """
-    Once all fields have been compressed, this command merges them into a single Zarr Zipped file.
+    Consolidate metadata on the shared {dataset}.zarr store.
+
+    Under the new design, `compress_with_optimal` writes each field directly
+    into a shared LocalStore, so the old unzip+copy+rezip merge is unnecessary.
+    This command just runs `zarr.consolidate_metadata` so downstream readers
+    can open the store quickly without scanning every array.
 
-    \b
     Args:
-        dataset_file (str): Path to the input dataset file.
-        compressed_files_location (str): Directory where the compressed files are located.
+        dataset_file (str): Path to the original dataset file (used to
+            derive the name of the merged .zarr store).
+        compressed_files_location (str): Directory containing
+            {dataset_basename}.zarr.
     """
     comm = MPI.COMM_WORLD
     rank = comm.Get_rank()
     size = comm.Get_size()
     if size > 1:
         if rank == 0:
-            click.echo("This command is not meant to be run in parallel. Please run it with a single process.")
+            click.echo("merge_compressed_fields is not meant to run in parallel.")
+        # Collective abort: sys.exit on rank 0 alone would leave ranks 1..N
+        # blocking at the next collective.
+        comm.Abort(1)
+
+    merged_path = _merged_store_path(compressed_files_location, dataset_file)
+    if not Path(merged_path).is_dir():
+        click.echo(f"Expected merged store not found: {merged_path}")
+        click.echo("Did compress_with_optimal run at least once with the same "
+                   "`where_to_write`?")
         sys.exit(1)
 
-    # populate this folder with the compressed fields
-    dataset_filename = Path(dataset_file).name
-    merged_folder = Path(compressed_files_location) / f"{dataset_filename}.zarr"
-    if Path(merged_folder).exists():
-        shutil.rmtree(merged_folder)
-    os.makedirs(merged_folder)
-
-    for var in utils.open_dataset(dataset_file).data_vars:
-        compressed_field = f"{Path(compressed_files_location) / dataset_filename}.=.field_{var}.=.rank_{rank}.zarr.zip"
-        if not Path(compressed_field).exists():
-            click.echo("All fields must be compressed first.")
-            sys.exit(1)
-        extract_to = utils.unzip_file(compressed_field)
-        utils.copy_folder_contents(extract_to, merged_folder)
-        shutil.rmtree(extract_to)
+    # Open in a try/finally so the LocalStore handles are released even if
+    # consolidate_metadata or the subsequent array listing raises.  Zarr v3's
+    # LocalStore holds open file descriptors; when run as a CLI the OS reaps
+    # them on process exit, but calling this as an imported function (notebook
+    # or other long-lived process) would leak them without an explicit close.
+    store = zarr.storage.LocalStore(merged_path, read_only=False)
+    try:
+        zarr.consolidate_metadata(store)
+        click.echo(f"[merge] consolidated metadata on {merged_path}")
 
-    zipped_merged_folder = str(merged_folder) + ".zip"
-    if Path(zipped_merged_folder).exists():
-        os.remove(zipped_merged_folder)
-    shutil.make_archive(merged_folder, 'zip', merged_folder)
+        # Report what's inside
+        g = zarr.open_group(store, mode="r")
+        arr_names = list(g.array_keys())
+        click.echo(f"[merge] arrays in store ({len(arr_names)}): {', '.join(arr_names) or '<none>'}")
+    finally:
+        close = getattr(store, "close", None)
+        if callable(close):
+            try:
+                close()
+            except Exception:
+                pass
 
-    if Path(merged_folder).exists():
-        shutil.rmtree(merged_folder)
 
+@cli.command("open_zarr_and_inspect")
+@click.argument("zarr_path", type=click.Path(exists=True, dir_okay=True, file_okay=False))
+@click.option("--head", type=int, default=4, show_default=True,
+              help="Per-array head slice size (across each dim) for a tiny preview. "
+                   "Set to 0 to skip reading any data.")
+def open_zarr_and_inspect(zarr_path: str, head: int):
+    """
+    Inspect a zarr v3 LocalStore without materialising full arrays.
 
-@cli.command("open_zarr_zip_file_and_inspect")
-@click.argument("zarr_zip_file", type=click.Path(exists=True, dir_okay=False))
-def open_zarr_zip_file_and_inspect(zarr_zip_file: str):
+    Shows: group tree, per-array metadata (shape, dtype, codecs, sharding,
+    compression ratio from info_complete), and a tiny head slice.
     """
-    Open a Zarr Zipped file and inspect its contents.
+    comm = MPI.COMM_WORLD
+    rank = comm.Get_rank()
+    size = comm.Get_size()
+    if size > 1:
+        if rank == 0:
+            click.echo("open_zarr_and_inspect is not meant to run in parallel.")
+        # Collective abort: sys.exit on rank 0 alone would leave ranks 1..N
+        # blocking at the next collective.
+        comm.Abort(1)
 
-    \b
-    Args:
-        zarr_zip_file (str): Path to the Zarr file.
+    group, store = utils.open_zarr_localstore(zarr_path, read_only=True)
+    click.echo(group.tree())
+    click.echo("-" * 80)
+
+    for array_name in group.array_keys():
+        z = group[array_name]
+        click.echo(f"Array: {array_name}")
+        click.echo(z.info_complete())
+        if head > 0:
+            slicer = tuple(slice(0, min(head, s)) for s in z.shape)
+            click.echo(f"Head slice {slicer}:")
+            click.echo(z[slicer])
+        click.echo("-" * 80)
+
+
+@cli.command("from_nc_to_zarr")
+@click.argument("nc_path", type=click.Path(exists=True, dir_okay=False, file_okay=True))
+@click.option("--out", "out_zarr", type=click.Path(dir_okay=True, file_okay=False), default=None,
+              help="Output .zarr directory. Defaults to INPUT with .zarr extension.")
+@click.option("--overwrite/--no-overwrite", default=False, show_default=True,
+              help="If set, remove the output directory before writing.")
+@click.option("--consolidated/--no-consolidated", default=True, show_default=True,
+              help="Write consolidated metadata so xr.open_zarr can do a fast open.")
+@click.option("--preserve-source-chunks/--no-preserve-source-chunks",
+              default=True, show_default=True,
+              help="If set (default), open the netCDF with chunks={} so each "
+                   "dask chunk maps 1:1 to an HDF5 chunk in the source and to "
+                   "a single chunk-file in the output zarr.  This is the most "
+                   "faithful per-chunk mapping for filesystem dedup.  Pass "
+                   "--no-preserve-source-chunks to use xarray's chunks='auto' "
+                   "instead - only useful for netCDF-3 sources or contiguous "
+                   "netCDF-4 variables, where there is no native chunk "
+                   "geometry to preserve.")
+@click.option("--mask-and-scale/--no-mask-and-scale",
+              default=False, show_default=True,
+              help="Whether to apply CF mask_and_scale decoding (scale_factor, "
+                   "add_offset, _FillValue) at read time.  Default: OFF for "
+                   "this command (xarray's normal default is ON), because "
+                   "decoding promotes packed int8/int16 variables to float "
+                   "and inflates their byte count 2-8x, which confounds both the "
+                   "absolute-storage and the dedup-ratio numbers in the VAST "
+                   "experiment.  The encoding attrs ride along in var.attrs "
+                   "regardless, so a downstream reader that opens the output "
+                   "zarr with mask_and_scale=True (xarray's default) still "
+                   "gets the decoded floats - no information is lost, the "
+                   "values are just stored on disk in their packed form.")
+@click.option("--decode-times/--no-decode-times",
+              default=False, show_default=True,
+              help="Whether to apply CF time decoding (units like 'days since "
+                   "1970-01-01', calendar) at read time.  Default: OFF for "
+                   "this command (xarray's normal default is ON), for "
+                   "symmetry with --mask-and-scale: every on-disk numeric "
+                   "form is preserved regardless of what CF says it "
+                   "represents.  Effect on dedup is tiny (the time coord is "
+                   "usually a single 1-D array of a few KB), but flipping it "
+                   "off also sidesteps cftime/datetime64 round-trip variance "
+                   "across xarray versions for non-standard calendars.  "
+                   "Encoding attrs ride along in var.attrs, so a downstream "
+                   "reader passing decode_times=True (xarray default) still "
+                   "gets datetime64/cftime objects with no information loss.")
+@click.option("--threads", type=int, default=None,
+              help="Dask workers for parallel HDF5 reads + zarr writes. "
+                   "Default: auto-detected.")
+def from_nc_to_zarr(nc_path: str, out_zarr: str | None,
+                    overwrite: bool, consolidated: bool,
+                    preserve_source_chunks: bool,
+                    mask_and_scale: bool,
+                    decode_times: bool,
+                    threads: int | None):
+    """
+    Convert a NetCDF file (.nc) to a zarr v3 LocalStore (.zarr directory)
+    with NO compression, NO filters, and NO sharding.  Intended for
+    filesystem-level deduplication experiments (e.g. VAST FS).
+
+    Every data variable AND every coordinate is written with
+    `compressors=None, filters=None`; the only codec left in the pipeline
+    is the default bytes serializer, which is just an identity
+    dtype/endianness step (not a compressor).  Coordinate arrays are
+    explicitly included because lat/lon/time are usually identical across
+    the files in a series, and a default-compressed coord would mask the
+    dedup signal we're trying to measure on the storage side.
+
+    Sharding is intentionally NOT applied (zarr v3's default when no
+    `shards` key is passed): each chunk lands in its own file, so VAST
+    sees chunk-level granularity.  Sharding would bundle multiple chunks
+    per file with chunk offsets that depend on neighboring chunks, which
+    would degrade chunk-level dedup into FS-block-level dedup.
+
+    CF mask_and_scale decoding is OFF by default for this command
+    (xarray's normal default is ON).  Packed integer dtypes (int8/int16
+    with scale_factor/add_offset) stay packed on disk, which avoids both
+    the int->float byte-count inflation and the float-bit fragility
+    where two chunks with identical packed values could produce slightly
+    different decoded floats if scale_factor/add_offset attrs drift across
+    the file series.  Encoding attrs ride along in var.attrs, so a
+    downstream reader passing mask_and_scale=True (xarray default) still
+    gets the decoded floats with no information loss.
+
+    CF decode_times is OFF by default for the same family of reasons:
+    every on-disk numeric form is preserved regardless of what CF says it
+    represents.  Effect on dedup is tiny (time coords are typically a few
+    KB), but flipping it off also sidesteps cftime/datetime64 round-trip
+    variance across xarray versions for non-standard calendars.
+
+    Caveats
+    -------
+    - Any compression that was applied INSIDE the netCDF source file is
+      undone at read time by the netCDF reader.  We never see the on-disk
+      compressed bytes; we see the decoded array.  So "without any
+      compression" here means: nothing on the zarr write side, regardless
+      of how the netCDF was authored.
+    - With --preserve-source-chunks (default), the output zarr's chunk
+      structure mirrors the source's HDF5 chunk structure exactly.  For
+      netCDF-3 or contiguous netCDF-4 variables this still works (xarray
+      picks a single chunk covering the whole variable) but the per-chunk
+      dedup story becomes less interesting.
+    - --preserve-source-chunks gives chunk-level dedup ONLY when every
+      file in the series shares the same HDF5 chunk shape.  Differing
+      chunk shapes across the series would need a forced canonical
+      rechunk; not implemented here.
     """
     comm = MPI.COMM_WORLD
     rank = comm.Get_rank()
     size = comm.Get_size()
     if size > 1:
         if rank == 0:
-            click.echo("This command is not meant to be run in parallel. Please run it with a single process.")
+            click.echo("from_nc_to_zarr is not meant to run in parallel.")
+        # Collective abort: sys.exit on rank 0 alone would leave ranks 1..N
+        # blocking at the next collective.
+        comm.Abort(1)
+
+    if Path(nc_path).suffix.lower() != ".nc":
+        click.echo(
+            f"Expected a .nc file, got {nc_path}.  This command only "
+            f"handles netCDF input; use from_zarr_to_netcdf for the "
+            f"reverse direction."
+        )
         sys.exit(1)
 
-    zarr_group, store = utils.open_zarr_zipstore(zarr_zip_file)
+    if out_zarr is None:
+        out_zarr = str(Path(nc_path).with_suffix(".zarr"))
 
-    click.echo(zarr_group.tree())
+    out_path = Path(out_zarr)
+    if out_path.exists():
+        if overwrite:
+            import shutil
+            click.echo(f"[nc->zarr] removing existing {out_zarr} (--overwrite).")
+            shutil.rmtree(out_zarr)
+        else:
+            click.echo(
+                f"Output already exists: {out_zarr}.  "
+                f"Pass --overwrite to replace, or pick a different --out."
+            )
+            sys.exit(1)
 
-    click.echo(80* "-")
-    for array_name in zarr_group.array_keys():
-        click.echo(f"Array: {array_name}")
-        click.echo(zarr_group[array_name].info_complete())
-        click.echo(zarr_group[array_name][:])
-        click.echo(80* "-")
+    if threads is None:
+        threads = utils.detect_cores_available()
+
+    click.echo(f"[nc->zarr] reading {nc_path} ...")
+    # chunks={} -> dask chunks track HDF5 chunks 1:1 (the default for this
+    # command).  chunks="auto" -> dask picks a chunking, used as a fallback
+    # for non-chunked sources.  We never use chunks=None because that would
+    # eagerly materialise the whole field in RAM, and there's no need: we
+    # always want lazy reads paired with the streaming to_zarr write.
+    chunks = {} if preserve_source_chunks else "auto"
+    # mask_and_scale=False keeps packed int dtypes packed; decode_times=False
+    # keeps time coords as raw numerics.  See the docstring and the option
+    # help text for why these are the dedup-friendly defaults.
+    with dask.config.set(scheduler="threads", num_workers=int(threads)):
+        ds = xr.open_dataset(
+            nc_path,
+            chunks=chunks,
+            mask_and_scale=mask_and_scale,
+            decode_times=decode_times,
+        )
+        logical_bytes = int(ds.nbytes)
+        click.echo(
+            f"[nc->zarr] logical size = {humanize.naturalsize(logical_bytes, binary=True)} "
+            f"| chunks = {'source-native' if preserve_source_chunks else 'auto'} "
+            f"| mask_and_scale = {mask_and_scale} "
+            f"| decode_times = {decode_times} "
+            f"| dask workers = {threads}"
+        )
 
-    store.close()
+        # Per-variable encoding override.  Two layers of defense:
+        # 1. Clear .encoding on every variable so any netCDF-side encoding keys
+        #    (zlib, shuffle, chunksizes, _FillValue, ...) inherited from
+        #    xr.open_dataset don't leak into xarray's encoding-translation layer.
+        # 2. Pass an explicit `compressors=None, filters=None` per variable to
+        #    `to_zarr`, which wins over anything still residual.
+        # We iterate over ds.variables (data_vars + coords) so coordinate arrays
+        # are included; see the docstring for why.
+        encoding = {}
+        for name in ds.variables:
+            ds[name].encoding = {}
+            encoding[name] = {
+                "compressors": None,
+                "filters": None,
+            }
+
+        click.echo(
+            f"[nc->zarr] writing {out_zarr} (compressors=None, filters=None, "
+            f"{len(encoding)} variable(s)) ..."
+        )
+        # mode="w-" = create-only; we already short-circuited on the
+        # exists-and-not-overwrite path above, so this just guards against a
+        # race with another process between the check and the write.
+        ds.to_zarr(
+            out_zarr,
+            mode="w-",
+            encoding=encoding,
+            zarr_format=3,
+            consolidated=consolidated,
+        )
+    click.echo(f"[nc->zarr] wrote {out_zarr}")
 
 
-@cli.command("from_zarr_zip_to_netcdf")
-@click.argument("zarr_zip_file", type=click.Path(exists=True, dir_okay=False))
+@cli.command("from_zarr_to_netcdf")
+@click.argument("zarr_path", type=click.Path(exists=True, dir_okay=True, file_okay=False))
 @click.option("--out", "out_nc", type=click.Path(dir_okay=False), default=None,
               help="Output NetCDF file. Defaults to INPUT with .nc extension.")
-def from_zarr_zip_to_netcdf(zarr_zip_file: str, out_nc: str | None):
+@click.option("--max-size", default="50GB", callback=_size_option_callback,
+              show_default=True,
+              help="Refuse to write if logical output would exceed this size. "
+                   "NetCDF4 is not a great container for very large data; "
+                   "for >50GB consider keeping the .zarr as-is.")
+@click.option("--compression", default="zlib", show_default=True,
+              help="NetCDF variable compression (zlib/none).")
+@click.option("--complevel", default=4, show_default=True, help="zlib compression level.")
+@click.option("--threads", type=int, default=None,
+              help="Dask workers for parallel zarr reads + netCDF writes. "
+                   "Default: auto-detected.")
+@click.option("--codec-threads", type=int, default=1, show_default=True,
+              help="Internal threads per codec call (Blosc decode is set "
+                   "live; for OpenMP/MKL/OpenBLAS export the matching env "
+                   "vars in the shell BEFORE running). --threads * "
+                   "--codec-threads must be <= physical cores.")
+def from_zarr_to_netcdf(zarr_path: str, out_nc: str | None,
+                        max_size: int, compression: str, complevel: int,
+                        threads: int | None, codec_threads: int):
     """
-    Convert a Zarr Zipped file to netcdf.
-
-    \b
-    Args:
-        zarr_zip_file (str): Path to the Zarr file.
-        out_nc (str): Output NetCDF file.
+    Convert a zarr v3 LocalStore (.zarr directory) to a NetCDF4 file.
+    Writes are streamed via dask so the full dataset is never held in memory.
     """
     comm = MPI.COMM_WORLD
     rank = comm.Get_rank()
     size = comm.Get_size()
     if size > 1:
         if rank == 0:
-            click.echo("This command is not meant to be run in parallel. Please run it with a single process.")
-        sys.exit(1)
+            click.echo("from_zarr_to_netcdf is not meant to run in parallel.")
+        # Collective abort: sys.exit on rank 0 alone would leave ranks 1..N
+        # blocking at the next collective.
+        comm.Abort(1)
 
     if out_nc is None:
-        out_nc = os.path.splitext(zarr_zip_file)[0] + ".nc"
+        out_nc = str(Path(zarr_path).with_suffix(".nc"))
+
+    if threads is None:
+        threads = utils.detect_cores_available()
+    _apply_codec_threads(codec_threads)
+    _check_thread_product(threads, codec_threads)
+
+    # Load via xarray so dims/coords are preserved, consolidated or not.
+    # The previous heuristic (Path(zarr_path)/"zarr.json").exists() was wrong:
+    # every zarr v3 store has a zarr.json, consolidated or not.  Consolidation
+    # in v3 is a `consolidated_metadata` field *inside* that zarr.json.  We try
+    # consolidated first (fast path) and fall back to a metadata scan if the
+    # store wasn't processed by `merge_compressed_fields`.
+    with dask.config.set(scheduler="threads", num_workers=int(threads)):
+        try:
+            ds = xr.open_zarr(zarr_path, chunks="auto", consolidated=True)
+        except Exception:
+            ds = xr.open_zarr(zarr_path, chunks="auto", consolidated=False)
+
+        logical_bytes = int(ds.nbytes)
+        click.echo(
+            f"[zarr->nc] logical size = "
+            f"{humanize.naturalsize(logical_bytes, binary=True)} "
+            f"| dask workers = {threads} | codec-threads = {codec_threads}"
+        )
+        if logical_bytes > max_size:
+            click.echo(
+                f"Refusing to write: logical size exceeds --max-size "
+                f"({humanize.naturalsize(max_size, binary=True)}). "
+                f"Raise --max-size to proceed, or keep the data in .zarr."
+            )
+            sys.exit(1)
 
-    zgroup, store = utils.open_zarr_zipstore(zarr_zip_file)
-    try:
-        names = list(zgroup.array_keys())
-        if not names:
-            raise click.ClickException("No arrays found in the Zarr store.")
-        ds = xr.Dataset({
-            n: xr.DataArray(dask.array.from_zarr(zgroup[n]),
-                            dims=[f"{n}_d{i}" for i in range(zgroup[n].ndim)],
-                            name=n)
-            for n in names
-        })
-        ds.to_netcdf(out_nc, engine="h5netcdf")
-        click.echo(f"Wrote NetCDF: {out_nc}")
-    finally:
-        store.close()
+        # Per-variable encoding: preserve dask chunks as NetCDF chunks, add compression.
+        encoding = {}
+        for name, var in ds.data_vars.items():
+            enc = {}
+            if isinstance(var.data, dask.array.Array):
+                # Use one dask chunk per netcdf chunk; max(b) takes the largest
+                # block per dim, since edge blocks can be smaller than the rest.
+                enc["chunksizes"] = tuple(max(b) for b in var.data.chunks)
+            if compression == "zlib":
+                enc["zlib"] = True
+                enc["complevel"] = int(complevel)
+            encoding[name] = enc
+
+        click.echo(f"[zarr->nc] writing {out_nc} ...")
+        ds.to_netcdf(out_nc, engine="h5netcdf", encoding=encoding)
+    click.echo(f"[zarr->nc] wrote {out_nc}")
 
 
 @cli.command("perform_clustering")
@@ -502,6 +3105,15 @@ def perform_clustering(npy_file: str, l_error: str):
         npy_file (str): npy file with L-errors and compression ratios results for each combination of compressor, filter, and serializer
         l_error (str): choose between "L1", "L2", "LInf" to generate the plot
     """
+    # Lazy imports: kept out of the module top-level so `evaluate_combos` /
+    # `compress_with_optimal` don't pay the matplotlib+sklearn+tqdm import
+    # cost on every invocation.  See the comment block near the top of this
+    # file for the rationale.
+    from tqdm import tqdm
+    from sklearn.cluster import KMeans
+    from sklearn.metrics import silhouette_score
+    import matplotlib.pyplot as plt
+
     scored_results = np.load(npy_file, allow_pickle=True)
 
     scored_results_pd = pd.DataFrame(scored_results)
@@ -543,7 +3155,18 @@ def perform_clustering(npy_file: str, l_error: str):
 
 @cli.command("analyze_clustering")
 @click.argument("npy_file", type=click.Path(exists=True, dir_okay=False))
-def analyze_clustering(npy_file: str):
+@click.option("--where-to-write", "where_to_write", required=True,
+              type=click.Path(exists=True, dir_okay=True, file_okay=False),
+              help="Directory containing the `config_space_{var}.csv` written by "
+                   "evaluate_combos.  Must be the same directory passed as "
+                   "--where-to-write to evaluate_combos for this run.")
+@click.option("--var", "var", required=True, type=str,
+              help="Variable (field) name to analyse.  Must match the `var` used "
+                   "in the evaluate_combos run that produced the .npy and the "
+                   "config_space_{var}.csv file (so for a field named 't' it's "
+                   "`--var t`, and the tool will read "
+                   "`{where_to_write}/config_space_t.csv`).")
+def analyze_clustering(npy_file: str, where_to_write: str, var: str):
     """
     Performs clustering on all 3 L-errors, can be executed only after evaluate_combos.
     It can be executed only after evaluate_combos.
@@ -556,8 +3179,35 @@ def analyze_clustering(npy_file: str):
     \b
     Args:
         npy_file (str): npy file with L-errors and compression ratios results for each combination of compressor, filter, and serializer
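+        where_to_write (str): directory holding config_space_{var}.csv, as written by evaluate_combos (--where-to-write)
+        var (str): field name used in the evaluate_combos run that produced the .npy file (--var)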
     """
-    config_idxs = pd.read_csv("config_space.csv")
+    # Lazy imports: see the comment near the top of this file.
+    from sklearn.cluster import KMeans
+    import plotly.io as pio
+    import plotly.express as px
+    import plotly.graph_objects as go
+    from plotly.subplots import make_subplots
+
+    # evaluate_combos now writes config_space_{var}.csv into {where_to_write}
+    # (renamed from the old cwd-relative `config_space.csv`).  Resolve it
+    # explicitly from the required flags so analyze_clustering can be run
+    # from any working directory.  Both flags are required — no magic
+    # fallback to cwd — because guessing would reintroduce the same
+    # footgun: the old `pd.read_csv("config_space.csv")` silently picked
+    # up whatever happened to be in cwd (possibly a stale file from a
+    # different run).
+    config_csv_path = Path(where_to_write) / f"config_space_{var}.csv"
+    if not config_csv_path.is_file():
+        raise click.FileError(
+            str(config_csv_path),
+            hint=(
+                f"Expected `config_space_{var}.csv` in {where_to_write}. "
+                f"Run `dc_toolkit evaluate_combos ... --where-to-write {where_to_write}` "
+                f"first, and confirm --var matches the field-to-compress used there."
+            ),
+        )
+    config_idxs = pd.read_csv(config_csv_path)
     scored_results = np.load(str(npy_file), allow_pickle=True)
 
     scored_results_pd = pd.DataFrame(scored_results)
@@ -619,79 +3269,79 @@ def analyze_clustering(npy_file: str):
     for trace in fig_l1.data:
         fig.add_trace(trace, row=1, col=1)
 
-        # L2 clustering
-        clean_arr_l2_filtered = np.column_stack((clean_arr_l2[:, 0].astype(float), clean_arr_l2[:, 1].astype(float)))
-
-        df_l2 = pd.DataFrame(clean_arr_l2_filtered, columns=["Ratio", "L2"])
-        df_l2["compressor"] = clean_arr_l2[:, 2]
-        df_l2["filter"] = clean_arr_l2[:, 3]
-        df_l2["serializer"] = clean_arr_l2[:, 4]
-        df_l2["compressor_idx"] = utils.get_indexes(clean_arr_l2[:, 2], config_idxs['0'])
-        df_l2["filter_idx"] = utils.get_indexes(clean_arr_l2[:, 3], config_idxs['1'])
-        df_l2["serializer_idx"] = utils.get_indexes(clean_arr_l2[:, 4], config_idxs['2'])
-
-        y_kmeans = kmeans.fit_predict(pd.DataFrame(df_l2, columns=["Ratio", "L2"]))
-        color = np.ones(y_kmeans.shape) if len(np.unique(y_kmeans)) == 1 else y_kmeans
-
-        fig_l2 = px.scatter(df_l2, x="Ratio", y="L2", color=color,
-                            title="L2 VS Ratio KMeans Clustering",
-                            hover_data=["compressor", "filter", "serializer", "compressor_idx", "filter_idx", "serializer_idx"])
-
-        fig.add_trace(
-            go.Scatter(
-                x=kmeans.cluster_centers_[:, 0],
-                y=kmeans.cluster_centers_[:, 1],
-                mode="markers+text",
-                marker=dict(color="black", size=12, symbol="x"),
-                textposition="top center",
-                name="Centroids",
-                showlegend=False
-            ),
-            row=2,
-            col=1
-        )
-        fig.update_xaxes(title_text="Ratio", row=2, col=1)
-        fig.update_yaxes(title_text="L2", row=2, col=1)
-        for trace in fig_l2.data:
-            fig.add_trace(trace, row=2, col=1)
-
-        # LInf clustering
-        clean_arr_linf_filtered = np.column_stack(
-            (clean_arr_linf[:, 0].astype(float), clean_arr_linf[:, 1].astype(float)))
-
-        df_linf = pd.DataFrame(clean_arr_linf_filtered, columns=["Ratio", "LInf"])
-        df_linf["compressor"] = clean_arr_linf[:, 2]
-        df_linf["filter"] = clean_arr_linf[:, 3]
-        df_linf["serializer"] = clean_arr_linf[:, 4]
-        df_linf["compressor_idx"] = utils.get_indexes(clean_arr_linf[:, 2], config_idxs['0'])
-        df_linf["filter_idx"] = utils.get_indexes(clean_arr_linf[:, 3], config_idxs['1'])
-        df_linf["serializer_idx"] = utils.get_indexes(clean_arr_linf[:, 4], config_idxs['2'])
-
-        y_kmeans = kmeans.fit_predict(pd.DataFrame(df_linf, columns=["Ratio", "LInf"]))
-        color = np.ones(y_kmeans.shape) if len(np.unique(y_kmeans)) == 1 else y_kmeans
-
-        fig_linf = px.scatter(df_linf, x="Ratio", y="LInf", color=color,
-                              title="LInf VS Ratio KMeans Clustering",
-                              hover_data=["compressor", "filter", "serializer", "compressor_idx", "filter_idx",
-                                          "serializer_idx"])
-
-        fig.add_trace(
-            go.Scatter(
-                x=kmeans.cluster_centers_[:, 0],
-                y=kmeans.cluster_centers_[:, 1],
-                mode="markers+text",
-                marker=dict(color="black", size=12, symbol="x"),
-                textposition="top center",
-                name="Centroids",
-                showlegend=False
-            ),
-            row=3,
-            col=1
-        )
-        fig.update_xaxes(title_text="Ratio", row=3, col=1)
-        fig.update_yaxes(title_text="LInf", row=3, col=1)
-        for trace in fig_linf.data:
-            fig.add_trace(trace, row=3, col=1)
+    # L2 clustering
+    clean_arr_l2_filtered = np.column_stack((clean_arr_l2[:, 0].astype(float), clean_arr_l2[:, 1].astype(float)))
+
+    df_l2 = pd.DataFrame(clean_arr_l2_filtered, columns=["Ratio", "L2"])
+    df_l2["compressor"] = clean_arr_l2[:, 2]
+    df_l2["filter"] = clean_arr_l2[:, 3]
+    df_l2["serializer"] = clean_arr_l2[:, 4]
+    df_l2["compressor_idx"] = utils.get_indexes(clean_arr_l2[:, 2], config_idxs['0'])
+    df_l2["filter_idx"] = utils.get_indexes(clean_arr_l2[:, 3], config_idxs['1'])
+    df_l2["serializer_idx"] = utils.get_indexes(clean_arr_l2[:, 4], config_idxs['2'])
+
+    y_kmeans = kmeans.fit_predict(pd.DataFrame(df_l2, columns=["Ratio", "L2"]))
+    color = np.ones(y_kmeans.shape) if len(np.unique(y_kmeans)) == 1 else y_kmeans
+
+    fig_l2 = px.scatter(df_l2, x="Ratio", y="L2", color=color,
+                        title="L2 VS Ratio KMeans Clustering",
+                        hover_data=["compressor", "filter", "serializer", "compressor_idx", "filter_idx", "serializer_idx"])
+
+    fig.add_trace(
+        go.Scatter(
+            x=kmeans.cluster_centers_[:, 0],
+            y=kmeans.cluster_centers_[:, 1],
+            mode="markers+text",
+            marker=dict(color="black", size=12, symbol="x"),
+            textposition="top center",
+            name="Centroids",
+            showlegend=False
+        ),
+        row=2,
+        col=1
+    )
+    fig.update_xaxes(title_text="Ratio", row=2, col=1)
+    fig.update_yaxes(title_text="L2", row=2, col=1)
+    for trace in fig_l2.data:
+        fig.add_trace(trace, row=2, col=1)
+
+    # LInf clustering
+    clean_arr_linf_filtered = np.column_stack(
+        (clean_arr_linf[:, 0].astype(float), clean_arr_linf[:, 1].astype(float)))
+
+    df_linf = pd.DataFrame(clean_arr_linf_filtered, columns=["Ratio", "LInf"])
+    df_linf["compressor"] = clean_arr_linf[:, 2]
+    df_linf["filter"] = clean_arr_linf[:, 3]
+    df_linf["serializer"] = clean_arr_linf[:, 4]
+    df_linf["compressor_idx"] = utils.get_indexes(clean_arr_linf[:, 2], config_idxs['0'])
+    df_linf["filter_idx"] = utils.get_indexes(clean_arr_linf[:, 3], config_idxs['1'])
+    df_linf["serializer_idx"] = utils.get_indexes(clean_arr_linf[:, 4], config_idxs['2'])
+
+    y_kmeans = kmeans.fit_predict(pd.DataFrame(df_linf, columns=["Ratio", "LInf"]))
+    color = np.ones(y_kmeans.shape) if len(np.unique(y_kmeans)) == 1 else y_kmeans
+
+    fig_linf = px.scatter(df_linf, x="Ratio", y="LInf", color=color,
+                          title="LInf VS Ratio KMeans Clustering",
+                          hover_data=["compressor", "filter", "serializer", "compressor_idx", "filter_idx",
+                                      "serializer_idx"])
+
+    fig.add_trace(
+        go.Scatter(
+            x=kmeans.cluster_centers_[:, 0],
+            y=kmeans.cluster_centers_[:, 1],
+            mode="markers+text",
+            marker=dict(color="black", size=12, symbol="x"),
+            textposition="top center",
+            name="Centroids",
+            showlegend=False
+        ),
+        row=3,
+        col=1
+    )
+    fig.update_xaxes(title_text="Ratio", row=3, col=1)
+    fig.update_yaxes(title_text="LInf", row=3, col=1)
+    for trace in fig_linf.data:
+        fig.add_trace(trace, row=3, col=1)
 
     fig.update_layout(
         title="",
@@ -715,12 +3365,12 @@ def analyze_clustering(npy_file: str):
 @click.option("--filter-class", default="all", help="Same as in evaluate_combos.")
 @click.option("--serializer-class", default="all", help="Same as in evaluate_combos.")
 @click.option("--with-lossy/--without-lossy", default=True, show_default=True, help="Same as in evaluate_combos.")
-@click.option("--with-numcodecs-wasm/--without-numcodecs-wasm", default=True, show_default=True, help="Same as in evaluate_combos.")
-@click.option("--with-ebcc/--without-ebcc", default=True, show_default=True, help="Same as in evaluate_combos.")
+@click.option("--with-numcodecs-wasm/--without-numcodecs-wasm", default=False, show_default=True, help="Same as in evaluate_combos.")
+@click.option("--with-ebcc/--without-ebcc", default=False, show_default=True, help="Same as in evaluate_combos.")
 def plot_compression_errors(dataset_file: str, where_to_write: str, field_to_compress: str,
                             comp_idx: int, filt_idx: int, ser_idx: int, 
                             compressor_class: str = "all", filter_class: str = "all", serializer_class: str = "all",
-                            with_lossy: bool = True, with_numcodecs_wasm: bool = True, with_ebcc: bool = True):
+                            with_lossy: bool = True, with_numcodecs_wasm: bool = False, with_ebcc: bool = False):
     """
     Plot the absolute errors arising from compression+decompression of a field
     with the desired combination of compressor, filter, and serializer.
@@ -754,6 +3404,8 @@ def plot_compression_errors(dataset_file: str, where_to_write: str, field_to_com
         with_numcodecs_wasm: --with-numcodecs-wasm/--without-numcodecs-wasm
         with_ebcc: --with-ebcc/--without-ebcc
     """
+    # Lazy import: see the comment near the top of this file.
+    import matplotlib.pyplot as plt
 
     #############
     # GET COMBO #
@@ -765,7 +3417,9 @@ def plot_compression_errors(dataset_file: str, where_to_write: str, field_to_com
     if size > 1:
         if rank == 0:
             click.echo("This command is not meant to be run in parallel. Please run it with a single process.")
-        sys.exit(1)
+        # Collective abort: sys.exit on rank 0 alone would leave ranks 1..N
+        # blocking at the next collective.
+        comm.Abort(1)
 
     os.makedirs(where_to_write, exist_ok=True)
 
@@ -777,13 +3431,13 @@ def plot_compression_errors(dataset_file: str, where_to_write: str, field_to_com
 
     if not utils.is_lat_lon(da):
         click.echo(f"Field {field_to_compress} should be in lat-lon form, i.e. dimensions (lat, lon)! It currently has dimensions: {da.dims}.")
-        sys.exit(1)
+        comm.Abort(1)
 
     mem_threshold = 2.5  # GiB
     if da_memsize / (1024 ** 3) > mem_threshold:
         click.echo(f"Field {field_to_compress} is too large ({humanize.naturalsize(da_memsize, binary=True)}). "
                    f"To avoid high memory usage we only support fields up to {mem_threshold} GiB.")
-        sys.exit(1)
+        comm.Abort(1)
 
     compressors = utils.compressor_space(da, with_lossy, with_numcodecs_wasm, with_ebcc, compressor_class)
     filters = utils.filter_space(da, with_lossy, with_numcodecs_wasm, with_ebcc, filter_class)
@@ -793,17 +3447,17 @@ def plot_compression_errors(dataset_file: str, where_to_write: str, field_to_com
         pass
     else:
         click.echo(f"Invalid comp_idx: {comp_idx}")
-        sys.exit(1)
+        comm.Abort(1)
     if -1 <= filt_idx < len(filters):
         pass
     else:
         click.echo(f"Invalid filt_idx: {filt_idx}")
-        sys.exit(1)
+        comm.Abort(1)
     if -1 <= ser_idx < len(serializers):
         pass
     else:
         click.echo(f"Invalid ser_idx: {ser_idx}")
-        sys.exit(1)
+        comm.Abort(1)
 
     selected_compressor = compressors[comp_idx][1] if comp_idx != -1 else None
     selected_filter = filters[filt_idx][1] if filt_idx != -1 else None
@@ -844,8 +3498,13 @@ def plot_compression_errors(dataset_file: str, where_to_write: str, field_to_com
 
     # Flatten data for ZFPY serializer
     if isinstance(selected_serializer, numcodecs.zarr3.ZFPY):
-        da = da.stack(flat_dim=da.dims)
-        shifted_da = shifted_da.stack(flat_dim=da.dims)
+        # Save the original dims BEFORE mutating `da`.  Reading `da.dims` for the
+        # second stack call, after the first has run, would return the stacked
+        # dims (`("flat_dim",)`), so xarray would try to stack `shifted_da`
+        # on a dimension it doesn't have and raise ValueError.
+        orig_dims = da.dims
+        da = da.stack(flat_dim=orig_dims)
+        shifted_da = shifted_da.stack(flat_dim=orig_dims)
 
     ############
     # COMPRESS #
diff --git a/src/dc_toolkit/compression_analysis_ui_local.py b/src/dc_toolkit/compression_analysis_ui_local.py
index 70bc23d..da114f2 100644
--- a/src/dc_toolkit/compression_analysis_ui_local.py
+++ b/src/dc_toolkit/compression_analysis_ui_local.py
@@ -36,7 +36,7 @@
 import zipfile
 
 def load_scored_results(file_name: str, params_str: list[str]):
-    return np.load(file_name + params_str + "_scored_results_with_names.npy", allow_pickle=True)
+    return np.load("./out/" + file_name + params_str + "_scored_results_with_names.npy", allow_pickle=True)
 
 def create_cluster_plots(clean_arr_l1, clean_arr_l2, clean_arr_linf, n_clusters):
     config_idxs = pd.read_csv("config_space.csv")
@@ -419,15 +419,18 @@ def analyze_compressors(self):
         with_options_ls.append("--with-numcodecs-wasm") if self.options_numcodecs_wasm.currentText() == "with" else with_options_ls.append("--without-numcodecs-wasm")
         with_options_ls.append("--with-ebcc") if self.options_ebcc.currentText() == "with" else with_options_ls.append("--without-ebcc")
 
+        # create ./out dir if it doesn't exist, to place all generated files there
+        if not os.path.exists("out"):
+            os.makedirs("out")
         if self.predefined_l1.isChecked():
             cmd = [
                 "mpirun",
                 "-n",
-                "8",
+                "1",
                 "dc_toolkit",
                 "evaluate_combos",
                 self.modified_file_path,
-                os.getcwd(),
+                "--where-to-write=out",
                 "--field-to-compress=" + selected_var,
                 "--compressor-class=" + compressor_class,
                 "--filter-class=" + filter_class,
diff --git a/src/dc_toolkit/compression_analysis_ui_vcluster.py b/src/dc_toolkit/compression_analysis_ui_vcluster.py
index b94d82f..c15c1db 100644
--- a/src/dc_toolkit/compression_analysis_ui_vcluster.py
+++ b/src/dc_toolkit/compression_analysis_ui_vcluster.py
@@ -75,7 +75,7 @@ def find_file(base_path, file_name):
 
 @st.cache_data
 def load_scored_results(file_name: str, params_str: list[str]):
-    return np.load(file_name + params_str + "_scored_results_with_names.npy", allow_pickle=True)
+    return np.load("./out/" + file_name + params_str + "_scored_results_with_names.npy", allow_pickle=True)
 
 
 @st.cache_resource
@@ -312,6 +312,10 @@ def load_and_resize_netcdf(file_content, original_name, max_size_bytes=1e7):
     with_lossy_option = "--with-lossy" if lossy_class == "with" else "--without-lossy"
     with_numcodecs_option = "--with-numcodecs-wasm" if numcodecs_wasm_class == "with" else "--without-numcodecs-wasm"
     with_ebcc_option = "--with-ebcc" if ebcc_class == "with" else "--without-ebcc"
+
+    # create ./out dir if it doesn't exist, to place all generated files there
+    if not os.path.exists("out"):
+        os.makedirs("out")
     if st.button("Analyze compressors"):
         if predefined_l1:
             cmd_compress = [
@@ -326,7 +330,7 @@ def load_and_resize_netcdf(file_content, original_name, max_size_bytes=1e7):
                 "dc_toolkit",
                 "evaluate_combos",
                 display_file_name,
-                os.getcwd(),
+                "--where-to-write=out",
                 "--field-to-compress=" + field_to_compress,
                 "--compressor-class=" + compressor_class,
                 "--filter-class=" + filter_class,
@@ -346,7 +350,7 @@ def load_and_resize_netcdf(file_content, original_name, max_size_bytes=1e7):
                 "dc_toolkit",
                 "evaluate_combos",
                 parse_args().uploaded_file,
-                os.getcwd(),
+                "--where-to-write=out",
                 "--field-to-compress=" + field_to_compress,
                 "--compressor-class=" + compressor_class,
                 "--filter-class=" + filter_class,
diff --git a/src/dc_toolkit/compression_analysis_ui_web.py b/src/dc_toolkit/compression_analysis_ui_web.py
index 7c93c79..f3b53f6 100644
--- a/src/dc_toolkit/compression_analysis_ui_web.py
+++ b/src/dc_toolkit/compression_analysis_ui_web.py
@@ -70,7 +70,7 @@ def find_file(base_path, file_name):
 
 @st.cache_data
 def load_scored_results(file_name: str, params_str: list[str]):
-    return np.load(file_name + params_str + "_scored_results_with_names.npy", allow_pickle=True)
+    return np.load("./out/" + file_name + params_str + "_scored_results_with_names.npy", allow_pickle=True)
 
 
 @st.cache_resource
@@ -301,16 +301,20 @@ def load_and_resize_netcdf(file_content, original_name, max_size_bytes=1e7):
     with_lossy_option = "--with-lossy" if lossy_class == "with" else "--without-lossy"
     with_numcodecs_option = "--with-numcodecs-wasm" if numcodecs_wasm_class == "with" else "--without-numcodecs-wasm"
     with_ebcc_option = "--with-ebcc" if ebcc_class == "with" else "--without-ebcc"
+
+    # create ./out dir if it doesn't exist, to place all generated files there
+    if not os.path.exists("out"):
+        os.makedirs("out")
     if st.button("Analyze compressors"):
         if predefined_l1:
             cmd_compress = [
                 "mpirun",
                 "-n",
-                "8",
+                "1",
                 "dc_toolkit",
                 "evaluate_combos",
                 tmp.name,
-                os.getcwd(),
+                "--where-to-write=out",
                 "--field-to-compress="+field_to_compress,
                 "--compressor-class="+compressor_class,
                 "--filter-class="+filter_class,
@@ -321,11 +325,11 @@ def load_and_resize_netcdf(file_content, original_name, max_size_bytes=1e7):
             cmd_compress = [
                 "mpirun",
                 "-n",
-                "8",
+                "1",
                 "dc_toolkit",
                 "evaluate_combos",
                 tmp.name,
-                os.getcwd(),
+                "--where-to-write=out",
                 "--field-to-compress=" + field_to_compress,
                 "--compressor-class=" + compressor_class,
                 "--filter-class=" + filter_class,
diff --git a/src/dc_toolkit/data/l1_error_thresholds.csv b/src/dc_toolkit/data/l1_error_thresholds.csv
new file mode 100644
index 0000000..38e5293
--- /dev/null
+++ b/src/dc_toolkit/data/l1_error_thresholds.csv
@@ -0,0 +1,245 @@
+Field Name,Short Name,grib paramID,Unit,Level,Minimum value,Maximum value,Existing L1 error,Distribution,Absolute L1 error,Absolute L2 error,Absolute Linfty error,Relative L1 error,Relative L2 error,Relative Linfty error,decimal error,Important for budget,Additional Requirements,Comments
+Potential vorticity,pv,60,K m**2 kg**-1 s**-1,Pressure+Model,"-0,00028479728","0,00017763405","0,000000007056142337",Other,,,"1,00E-07",,,"0,1",,Energy,,"0,1 PVU = 1e-7"
+Specific rain water content,crwc,75,kg kg**-1,Pressure+Model,0,"0,001519829","0,00000002319075065",Other,,,,,,"0,01",,Water,,
+Specific snow water content,cswc,76,kg kg**-1,Pressure+Model,0,"0,0042871237","0,00000006541631592",,,,,,,"0,01",,Water,,
+Geopotential,z,129,m**2 s**-2,Pressure+Model,"-5969,504","208893,06","3,278542519",Normal,5,,10,,,,,,,
+Temperature,t,130,K,Pressure+Model,"179,20174","325,61295","0,002234057756",Normal,,,"0,05",,,"0,01",,,,Peter B. recommends 0.01K for T fields for long integrations to make budgets
+U component of wind,u,131,m s**-1,Pressure+Model,"-62,229965","91,23659","0,002341713756",Normal,,,"0,5",,,,,,"calm is defined below 0,5 m/s","Peter B. and Jean suggest to go lower, e.g. 0,01m/s or even 0.001m/s for wind fields"
+V component of wind,v,132,m s**-1,Pressure+Model,"-67,543396","69,395035","0,00208951463",Normal,,,"0,5",,,,,,,"Peter B. and Jean suggest to go lower, e.g. 0,01m/s or even 0.001m/s for wind fields"
+Specific humidity,q,133,kg kg**-1,Pressure+Model,"-0,000022094682","0,030683426","0,0000004685290662",Log-Normal,,,,,,"0,01",,Water,,
+Vertical velocity,w,135,Pa s**-1,Pressure+Model,"-7,017746","8,326787","0,0002341389918",Normal,,,"0,01",,,"0,05",,,,
+Vorticity (relative),vo,138,s**-1,Pressure+Model,"-0,00080912816","0,0012993631","0,00000003217302336",,,,,,,"0,05",,,,
+Divergence,d,155,s**-1,Pressure+Model,"-0,00042861514","0,00043540553","0,00000001318390908",,,,,,,"0,05",,,,
+Relative humidity,r,157,%,Pressure+Model,"-5,4819574","160,14786","0,002527310513",Log-Normal,,,,,,"0,01",,Energy,,
+Ozone mass mixing ratio,o3,203,kg kg**-1,Pressure+Model,"-0,0000000030346343","0,000008376514","0,0000000001278617767",Normal,,,,,,"0,01",,,,Johannes Flemming
+Specific cloud liquid water content,clwc,246,kg kg**-1,Pressure+Model,0,"0,0029973984","0,000000731786713",,,,,,,"0,01",,Water,,
+Specific cloud ice water content,ciwc,247,kg kg**-1,Pressure+Model,0,"0,0005021095","0,0000001225853339",,,,,,,"0,01",,Water,,
+Fraction of cloud cover,cc,248,dimensionless,Pressure+Model,0,1,"0,00390625",Normal,,,,,,"0,01",,,,Richard Forbes
+Surface runoff,sro,8,m,Single/Surface,0,"0,040673256","0,0000006206246326",Other,"0,00001",,,"0,01",,,,Water,,"The maximum error should be the smaller of either indicated absolute/relative errors
+Once this is implemented, please communicate with Jasper Denissen/Michel Wortmann on the issues they solved for runoff"
+Sub-surface runoff,ssro,9,m,Single/Surface,0,"0,0268569","0,0000004098037607",Other,"0,00001",,,"0,01",,,,Water,,The maximum error should be the smaller of either indicated absolute/relative errors
+UV visible albedo for direct radiation (climatological),aluvp,15,dimensionless,Single/Surface,"0,011121053","0,91759235","0,00001383165454",Normal,"0,01",,,,,,,,,There is no point in archiving these four components because they are straight from a climatology and are likely to be misinterpreted by users
+UV visible albedo for diffuse radiation (climatological),aluvd,16,dimensionless,Single/Surface,"0,012219813","0,9175915","0,00001381487618",Normal,"0,01",,,,,,,,,
+Near IR albedo for direct radiation (climatological),alnip,17,dimensionless,Single/Surface,"0,059094552","0,66608745","0,000009261976629",Normal,"0,01",,,,,,,,,
+Near IR albedo for diffuse radiation (climatological),alnid,18,dimensionless,Single/Surface,"0,059999995","0,6812921","0,000009480165318",Normal,"0,01",,,,,,,,,
+Lake cover,cl,26,dimensionless,Single/Surface,0,1,"0,00000005960464478",Other,"0,00001",,,"0,01",,,,Energy,,Margarita Choulga
+Low vegetation cover,cvl,27,dimensionless,Single/Surface,0,1,"0,00000005960464478",Other,"0,00001",,,"0,01",,,,Energy,,
+High vegetation cover,cvh,28,dimensionless,Single/Surface,0,1,"0,00000005960464478",Other,"0,00001",,,"0,01",,,,Energy,,
+Type of low vegetation,tvl,29,(Code table 4.234),Single/Surface,0,17,"0,0002593994141",Other,,,,,,,,,,The vegetation type values are integers - there shouldn't be an error associated with them.
+Type of high vegetation,tvh,30,(Code table 4.234),Single/Surface,0,19,"0,0002899169922",Other,,,,,,,,,,The vegetation type values are integers - there shouldn't be an error associated with them.
+Sea ice area fraction,ci,31,dimensionless,Single/Surface,0,1,"0,00001525878906",Other,"0,05",,,"0,01",,"0,05",,,,
+Snow albedo,asn,32,dimensionless,Single/Surface,"0,5199999","0,8800079","0,000005493286153",Normal,"0,001",,,"0,001",,,,Energy,Both Energy and water budget,Ask Gabriele Arduini
+Snow density,rsn,33,kg m**-3,Single/Surface,"99,999985",450,"0,005340576172",Normal,"0,1",,,"0,01",,,,,,
+Sea surface temperature,sst,34,K,Single/Surface,"267,95508","311,38647","0,0006627105176",Normal,"0,01",,"0,01",,,,,,,Peter B. recommends 0.01K for T fields for long integrations to make budgets
+Soil type,slt,43,(Code table 4.213),Single/Surface,0,7,"0,0000004172325134",Other,,,,,,,,,,The soil type values are integers - there shouldn't be an error associated with them.
+Snow evaporation,es,44,m of water equivalent,Single/Surface,"-0,0006759828","0,00046391226","0,00000001739341826",Other,"0,00001",,,"0,01",,,,,,
+Snowmelt,smlt,45,m of water equivalent,Single/Surface,0,"0,014152765","0,00000021595406",,"0,00001",,,"0,01",,,,,,
+Maximum 10 metre wind gust since previous post-processing,10fg,49,m s**-1,Single/Surface,"0,41986078","54,995937","0,0008327648393",,"0,5",,"0,5",,,,,,,
+Large-scale precipitation fraction,lspf,50,s,Single/Surface,0,10800,"0,1647949219",Normal,,,,,,"0,01",,Energy,,
+Surface downward UV radiation,uvb,57,J m**-2,Single/Surface,0,1626784,"24,82275391",,3600,,,,,,,,,"For all the accumulated radiative fluxes I am going for 1 W/m2 error in a 1-hr mean, implying a 3600 J/m2 error in the accumulation. Note that you can't easily convert this to a relative error, because accumulations can be a year long, and you might want the same error in hourly average or daily average flux for the final day as the beginning."
+Convective available potential energy,cape,59,J kg**-1,Single/Surface,0,"12867,5","0,1963424683",,,,,"0,01",,,,,,
+"Leaf area index, low vegetation",lai_lv,66,m**2 m**-2,Single/Surface,0,"4,6418457","0,00007082894444",,"0,0001",,,"0,01",,,,Energy,,
+"Leaf area index, high vegetation",lai_hv,67,m**2 m**-2,Single/Surface,0,"6,9730225","0,0001063998789",,"0,0001",,,"0,01",,,,Energy,,
+Standard deviation of filtered subgrid orography (climatological),sdfor,74,m,Single/Surface,0,"624,2425","0,00003720775203",,,,,"0,05",,"0,05",,,,
+Total column cloud liquid water,tclw,78,kg m**-2,Single/Surface,0,"2,956787","0,0000451169908",,"0,01",,,,,,,,,
+Total column cloud ice water,tciw,79,kg m**-2,Single/Surface,0,"1,9159546","0,00002923514694",,"0,01",,,,,,,,,
+Geopotential,z,129,m**2 s**-2,Single/Surface,"-1641,3057","53883,785","0,003309553256",,,,10,,,,,,,
+Surface pressure,sp,134,Pa,Single/Surface,"50471,63","105994,32","0,8472090364",,1,,1,"0,01",,,,,,
+Total column water,tcw,136,kg m**-2,Single/Surface,"0,065148175","101,19361","0,001543097897",,"0,1",,,,,,,,,
+Total column vertically-integrated water vapour,tcwv,137,kg m**-2,Single/Surface,"0,065148175","89,08228","0,001358293695",,"0,1",,,,,,,,,
+Snow depth,sd,141,m of water equivalent,Single/Surface,0,10,"0,0000005960464478",,"0,00001",,,"0,01",,,,,,
+Large-scale precipitation,lsp,142,m,Single/Surface,0,"0,050883293","0,000000776417437",,"0,00001",,,,,,,,,
+Convective precipitation,cp,143,m,Single/Surface,0,"0,035523415","0,0000005420442903",,"0,00001",,,,,,,,,
+Snowfall,sf,144,m of water equivalent,Single/Surface,0,"0,015250921","0,0000002327105904",,"0,00001",,,,,,,,,
+Time-integrated boundary layer dissipation,bld,145,J m**-2,Single/Surface,"7,0979776",5843611,"89,1663208",,1000,,,,,,,,,
+Time-integrated surface sensible heat net flux,sshf,146,J m**-2,Single/Surface,-9014216,4343472,"203,8221436",,1000,,,,,,,,,
+Time-integrated surface latent heat net flux,slhf,147,J m**-2,Single/Surface,-12757454,2789919,"237,2340851",,1000,,,,,,,,,
+Charnock,chnk,148,Numeric,Single/Surface,"0,0075299256","0,09239451","0,000001294930826",,,,,"0,01",,,,,,
+Mean sea level pressure,msl,151,Pa,Single/Surface,"92725,56","106307,375","0,207242012",,1,,1,,,,,,,
+Boundary layer height,blh,159,m,Single/Surface,"7,399293","6199,293","0,09448079765",,10,,10,,,,,,,
+Standard deviation of sub-gridscale orography,sdor,160,m,Single/Surface,0,"1033,3016","0,00006158957694",,1,,,"0,01",,,,,,
+Anisotropy of sub-gridscale orography,isor,161,Numeric,Single/Surface,0,"0,9807435","0,00000005845686957",,"0,05",,,"0,01",,,,,,
+Angle of sub-gridscale orography,anor,162,radians,Single/Surface,"-1,5593841","1,5629416","0,0000001861051118",,"0,1",,,"0,01",,,,,,
+Slope of sub-gridscale orography,slor,163,Numeric,Single/Surface,"0,000099999976","0,1350887","0,000000008045953237",,,,,"0,01",,,,,,
+Total cloud cover,tcc,164,dimensionless,Single/Surface,0,1,"0,00001525878906",,"0,05",,,"0,01",,,,,,
+10 metre U wind component,10u,165,m s**-1,Single/Surface,"-35,57274","29,245392","0,0009890461806",,"0,5",,,"0,5",,,,,,
+10 metre V wind component,10v,166,m s**-1,Single/Surface,"-30,720917","33,44426","0,0009790828917",,"0,5",,,"0,5",,,,,,
+2 metre temperature,2t,167,K,Single/Surface,"198,04323","324,60345","0,001931155799",,,,"0,05",,,,,,,
+2 metre dewpoint temperature,2d,168,K,Single/Surface,"194,528","305,39026","0,001691623824",,,,"0,05",,,,,,,
+Surface short-wave (solar) radiation downwards,ssrd,169,J m**-2,Single/Surface,0,13807872,"210,6914063",,3600,,,,,,,Energy,,The maximum/minimum values quoted for this and all the accumulated values are not likely to be representative as it depends on the accumulation time and I think a 1-day accumulation has been used.  In seasonal forecast and climate experiments we accumulate over 13 months so the magnitudes can be nearly 400 times more.
+Land-sea mask,lsm,172,dimensionless,Single/Surface,0,1,"0,00000005960464478",Other,,,,,,,,,,It is important that threshold of 0.5 can still be used to decide whether land or ocean
+Surface long-wave (thermal) radiation downwards,strd,175,J m**-2,Single/Surface,"529710,56","5709221,5","79,0330658",,3600,,,,,,,Energy,,
+Surface net short-wave (solar) radiation,ssr,176,J m**-2,Single/Surface,0,11775744,"179,6835938",,3600,,,,,,,Energy,,
+Surface net long-wave (thermal) radiation,str,177,J m**-2,Single/Surface,-4739560,739601,"83,60536194",,3600,,,,,,,Energy,,
+Top net short-wave (solar) radiation,tsr,178,J m**-2,Single/Surface,0,14224384,"217,046875",,3600,,,,,,,Energy,,
+Top net long-wave (thermal) radiation,ttr,179,J m**-2,Single/Surface,-4492495,-813580,"56,13578796",,3600,,,,,,,Energy,,
+Time-integrated eastward turbulent surface stress,ewss,180,N m**-2 s,Single/Surface,"-145328,5","98530,336","3,720990658",,100,,,,,,,,,
+Time-integrated northward turbulent surface stress,nsss,181,N m**-2 s,Single/Surface,"-98271,81","88200,69","2,845344543",,100,,,,,,,,,
+Evaporation,e,182,m of water equivalent,Single/Surface,"-0,0051013567","0,0011155764","0,00000009486286956",,,,,"0,01",,,,,,
+Low cloud cover,lcc,186,dimensionless,Single/Surface,0,1,"0,00001525878906",,,,,"0,01",,,,,,
+Medium cloud cover,mcc,187,dimensionless,Single/Surface,0,1,"0,00001525878906",,,,,"0,01",,,,,,
+High cloud cover,hcc,188,dimensionless,Single/Surface,0,1,"0,00001525878906",,,,,"0,01",,,,,,
+Eastward gravity wave surface stress,lgws,195,N m**-2 s,Single/Surface,"-94455,5","128995,21","3,409587383",,100,,,,,,,,,
+Northward gravity wave surface stress,mgws,196,N m**-2 s,Single/Surface,"-131455,62","135554,84","4,07425642",,100,,,,,,,,,
+Gravity wave dissipation,gwd,197,J m**-2,Single/Surface,"-1996,469","2915707,8","44,5206337",,1000,,,,,,,,,
+Skin reservoir content,src,198,m of water equivalent,Single/Surface,0,"0,0011899769","0,00000001815760697",,"0,00001",,,"0,01",,,,,,
+Maximum temperature at 2 metres since previous post-processing,mx2t,201,K,Single/Surface,"199,34102","324,73157","0,001913307933",,,,"0,05",,,,,,,
+Minimum temperature at 2 metres since previous post-processing,mn2t,202,K,Single/Surface,"199,18436","324,6175","0,001913957763",,,,"0,05",,,,,,,
+Runoff,ro,205,m,Single/Surface,0,"0,053337097","0,0000008138595149",,"0,00001",,,"0,01",,,,,,"It could be considered to remove this value, as ro=sro+ssro"
+Total column ozone,tco3,206,kg m**-2,Single/Surface,"0,0026929623","0,012685288","0,0000001524707898",,,,,"0,01",,,,Energy,,
+"Top net short-wave (solar) radiation, clear sky",tsrc,208,J m**-2,Single/Surface,0,14047232,"214,34375",,3600,,,,,,,Energy,,
+"Top net long-wave (thermal) radiation, clear sky",ttrc,209,J m**-2,Single/Surface,-4431609,-1013443,"52,15707397",,3600,,,,,,,Energy,,
+"Surface net short-wave (solar) radiation, clear sky",ssrc,210,J m**-2,Single/Surface,0,11730432,"178,9921875",,3600,,,,,,,Energy,,
+"Surface net long-wave (thermal) radiation, clear sky",strc,211,J m**-2,Single/Surface,-4597057,399221,"76,2371521",,3600,,,,,,,Energy,,
+TOA incident short-wave (solar) radiation,tisr,212,J m**-2,Single/Surface,0,14874880,"226,9726563",,3600,,,,,,,Energy,,
+Vertically integrated moisture divergence,vimd,213,kg m**-2,Single/Surface,"-89,64993","38,31877","0,001952647464",,,,,"0,01",,,,,,
+Total precipitation,tp,228,m,Single/Surface,0,"0,058294296","0,0000008895003702",,"0,00001",,,,,,,,,
+Instantaneous eastward turbulent surface stress,iews,229,N m**-2,Single/Surface,"-12,641745","9,264997","0,0003342703567",,,,,"0,01",,,,,,
+Instantaneous northward turbulent surface stress,inss,230,N m**-2,Single/Surface,"-8,780683","9,028752","0,0002717504103",,,,,"0,01",,,,,,
+Instantaneous surface sensible heat net flux,ishf,231,W m**-2,Single/Surface,"-755,95483","429,8999","0,01809470728",,"0,1",,,,,,,,,
+Instantaneous moisture flux,ie,232,kg m**-2 s**-1,Single/Surface,"-0,00042742398","0,00016970992","0,000000009111539967",,,,,"0,01",,,,,,
+Skin temperature,skt,235,K,Single/Surface,"195,50932","343,44434","0,002257309156",,,,"0,05",,,,,,,Peter B. recommends 0.01K for T fields for long integrations to make budgets
+Temperature of snow layer,tsn,238,K,Single/Surface,"198,55476","308,8557","0,001683058916",,,,"0,05",,,,,,,
+Convective snowfall,csf,239,m of water equivalent,Single/Surface,0,"0,009302139","0,0000001419393811",,"0,00001",,,,,,,,,
+Large-scale snowfall,lsf,240,m of water equivalent,Single/Surface,0,"0,015250921","0,0000002327105904",,"0,00001",,,,,,,,,
+Forecast albedo,fal,243,dimensionless,Single/Surface,"0,049640384","0,8779686","0,00001263928607",,"0,01",,,,,,,,,
+Forecast surface roughness,fsr,244,m,Single/Surface,"0,000024328765","1,9848","0,0000001183018483",,,,,"0,01",,,,,,
+Forecast logarithm of surface roughness for heat,flsr,245,Numeric,Single/Surface,"-12,180499","0,6854992","0,0001963195537",,,,,"0,01",,,,,,
+Convective inhibition,cin,228001,J kg**-1,Single/Surface,"0,000000014166922","1000,0008","0,01525880117",,1,,,,,,,,,
+Friction velocity,zust,228003,m s**-1,Single/Surface,"0,0032943622","2,359058","0,00003594609734",,,,,"0,01",,,,,,
+Lake total depth,dl,228007,m,Single/Surface,"0,49999994","8032,337","0,0004787347862",,"0,1",,,,,,,,,
+Lake mix-layer temperature,lmlt,228008,K,Single/Surface,"273,14966","312,5061","0,0006005316973",,"0,0001",,,"0,01",,,,,,
+Lake mix-layer depth,lmld,228009,m,Single/Surface,0,50,"0,0007629394531",,"0,1",,,"0,01",,,,,,
+Lake bottom temperature,lblt,228010,K,Single/Surface,"273,15967","308,8501","0,0005445927382",,"0,0001",,,"0,01",,,,,,
+Lake total layer temperature,ltlt,228011,K,Single/Surface,"273,15967","308,8501","0,0005445927382",,"0,0001",,,"0,01",,,,,,
+Lake shape factor,lshf,228012,dimensionless,Single/Surface,"0,5656386","0,80000216","0,000003576104064",,"0,00001",,,"0,01",,,,,,
+Lake ice surface temperature,lict,228013,K,Single/Surface,"197,45952","273,16208","0,001155129401",,"0,0001",,,"0,01",,,,,,
+Lake ice total depth,licd,228014,m,Single/Surface,0,"2,9902344","0,00004562735558",,"0,001",,,"0,01",,,,,,
+Minimum vertical gradient of refractivity inside trapping layer,dndzn,228015,m**-1,Single/Surface,"-1,000001","5,016967","0,00009181164205",,,,,"0,01",,,,,,
+Mean vertical gradient of refractivity inside trapping layer,dndza,228016,m**-1,Single/Surface,"-1,000001","2,4447622","0,00005256291479",,,,,"0,01",,,,,,
+Duct base height,dctb,228017,m,Single/Surface,"-1,000001","2203,1875","0,03363323212",,,,,"0,01",,,,,,
+Trapping layer base height,tplb,228018,m,Single/Surface,"-1,000001","2289,6875","0,03495311737",,,,,"0,01",,,,,,
+Trapping layer top height,tplt,228019,m,Single/Surface,"-1,000001","2487,1875","0,03796672821",,,,,"0,01",,,,,,
+Surface direct short-wave (solar) radiation,fdir,228021,J m**-2,Single/Surface,0,13122560,"200,234375",,3600,,,,,,,Energy,,
+"Surface direct short-wave radiation, clear sky",cdir,228022,J m**-2,Single/Surface,0,12769280,"194,84375",,3600,,,,,,,Energy,,
+Cloud base height,cbh,228023,m,Single/Surface,"21,805069","17760,72","0,2706743777",,1,,,,,,,,,
+0 degrees C isothermal level (atm),deg0l,228024,m,Single/Surface,0,"6815,875","0,1040019989",,1,,,,,,,,,
+Instantaneous 10 metre wind gust,i10fg,228029,m s**-1,Single/Surface,"0,30877757","52,88197","0,0008022032562",,"0,01",,,,,,,,,
+Total column supercooled liquid water,tcslw,228088,kg m**-2,Single/Surface,0,"1,4591064","0,00002226419747",,"0,01",,,,,,,,,
+Total column rain water,tcrw,228089,kg m**-2,Single/Surface,0,"6,3325195","0,00009662657976",,"0,01",,,,,,,,,
+Total column snow water,tcsw,228090,kg m**-2,Single/Surface,0,"15,666992","0,000239059329",,"0,01",,,,,,,,,
+Surface short-wave (solar) radiation downward clear-sky,ssrdc,228129,J m**-2,Single/Surface,0,13626112,"207,9179688",,3600,,,,,,,Energy,,
+Surface long-wave (thermal) radiation downward clear-sky,strdc,228130,J m**-2,Single/Surface,"519817,06",5706949,"79,14935303",,3600,,,,,,,Energy,,
+10 metre u-component of neutral wind,u10n,228131,m s**-1,Single/Surface,"-35,526276","29,296982","0,0009891244117",,"0,01",,,,,,,,,
+10 metre v-component of neutral wind,v10n,228132,m s**-1,Single/Surface,"-30,75679","33,482727","0,0009802172426",,"0,01",,,,,,,,,
+Instantaneous large-scale precipitation fraction,ilspf,228217,Proportion,Single/Surface,0,1,"0,00001525878906",,,,,"0,01",,,,,,
+Convective rain rate,crr,228218,kg m**-2 s**-1,Single/Surface,0,"0,004797101","0,00000007319795259",,,,,"0,01",,,,,,
+Large scale rain rate,lsrr,228219,kg m**-2 s**-1,Single/Surface,0,"0,0059996843","0,00000009154791769",,,,,"0,01",,,,,,
+Convective snowfall rate water equivalent,csfr,228220,kg m**-2 s**-1,Single/Surface,0,"0,0009200573","0,00000001403896022",,,,,"0,01",,,,,,
+Large scale snowfall rate water equivalent,lssfr,228221,kg m**-2 s**-1,Single/Surface,0,"0,0015558004","0,00000002373963071",,,,,"0,01",,,,,,
+Maximum total precipitation rate since previous post-processing,mxtpr,228226,kg m**-2 s**-1,Single/Surface,0,"0,0070943832","0,0000001082516974",,,,,"0,01",,,,,,
+Minimum total precipitation rate since previous post-processing,mntpr,228227,kg m**-2 s**-1,Single/Surface,0,"0,0060921907","0,00000009295945347",,,,,"0,01",,,,,,
+100 metre U wind component,100u,228246,m s**-1,Single/Surface,"-45,964752","39,560516","0,001305012032",,"0,01",,,,,,,,,
+100 metre V wind component,100v,228247,m s**-1,Single/Surface,"-41,978714","45,65535","0,001337189693",,"0,01",,,,,,,,,
+Potential evaporation,pev,228251,m,Single/Surface,"-0,005537186","0,0004110483","0,00000009076285323",,,,,"0,01",,,,,,
+Time-mean surface runoff rate,avg_surfror,235020,kg m**-2 s**-1,Single/Surface,0,"0,0037660003","0,00000005746460374",,"0,000001",,,"0,01",,,,,,
+Time-mean sub-surface runoff rate,avg_ssurfror,235021,kg m**-2 s**-1,Single/Surface,0,"0,0024867654","0,00000003794502845",,"0,000001",,,"0,01",,,,,,
+Time-mean snow evaporation rate water equivalent,avg_esrwe,235023,kg m**-2 s**-1,Single/Surface,"-0,000062590785","0,000042954052","0,000000001610486411",,"0,0000001",,,"0,01",,,,,,
+Time-mean snow melt rate,avg_smr,235024,kg m**-2 s**-1,Single/Surface,0,"0,0013104677","2,00E-08",,"0,000001",,,"0,01",,,,,,
+Time-mean large-scale precipitation fraction,avg_ilspf,235026,Proportion,Single/Surface,0,1,"1,53E-05",,,,,"0,01",,,,,,
+Time-mean surface downward UV radiation flux,avg_sduvrf,235027,W m**-2,Single/Surface,0,"150,6289","0,002298414707",,0.1,,,"0,01",,,,Energy,,
+Time-mean large-scale precipitation rate,avg_lsprate,235029,kg m**-2 s**-1,Single/Surface,0,"0,0047112703","7,19E-08",,,,,"0,01",,,,,,
+Time-mean convective precipitation rate,avg_cpr,235030,kg m**-2 s**-1,Single/Surface,0,"0,0032892227","5,02E-08",,,,,"0,01",,,,,,
+Time-mean total snowfall rate water equivalent,avg_tsrwe,235031,kg m**-2 s**-1,Single/Surface,0,"0,0014121234","2,15E-08",,,,,"0,01",,,,,,
+Time-mean boundary layer dissipation,avg_ibld,235032,W m**-2,Single/Surface,"0,00068298145","541,065","0,008255986497",,"0,1",,,,,,,,,
+Time-mean surface sensible heat flux,avg_ishf,235033,W m**-2,Single/Surface,"-834,6482","402,1775","0,01887246221",,"0,1",,,,,,,,,
+Time-mean surface latent heat flux,avg_slhtf,235034,W m**-2,Single/Surface,"-1181,2463","258,31226","0,02196592093",,"0,1",,,,,,,,,
+Time-mean surface downward short-wave radiation flux,avg_sdswrf,235035,W m**-2,Single/Surface,0,"1278,5","0,01950836182",,"0,1",,,,,,,Energy,,
+Time-mean surface downward long-wave radiation flux,avg_sdlwrf,235036,W m**-2,Single/Surface,"49,048218","528,6343","0,00731790252",,"0,1",,,,,,,Energy,,
+Time-mean surface net short-wave radiation flux,avg_snswrf,235037,W m**-2,Single/Surface,"1,00E-15","1090,3438","0,01663732529",,"0,1",,,,,,,Energy,,
+Time-mean surface net long-wave radiation flux,avg_snlwrf,235038,W m**-2,Single/Surface,"-438,84912","68,48364","0,007741283625",,"0,1",,,,,,,Energy,,
+Time-mean top net short-wave radiation flux,avg_tnswrf,235039,W m**-2,Single/Surface,0,"1317,0938","0,02009725571",,"0,1",,,,,,,Energy,,
+Time-mean top net long-wave radiation flux,avg_tnlwrf,235040,W m**-2,Single/Surface,"-415,9712","-75,33008","0,005197770894",,"0,1",,,,,,,Energy,,
+Time-mean eastward turbulent surface stress,avg_iews,235041,N m**-2,Single/Surface,"-13,456354","9,1230955","0,0003445350449",,,,,"0,01",,,,,,
+Time-mean northward turbulent surface stress,avg_inss,235042,N m**-2,Single/Surface,"-9,099227","8,166232","0,0002634499979",,,,,"0,01",,,,,,
+Time-mean moisture flux,avg_ie,235043,kg m**-2 s**-1,Single/Surface,"-0,00047234795","0,00010329019","8,78E-09",,,,,"0,01",,,,,,
+Time-mean eastward gravity wave surface stress,avg_iegwss,235045,N m**-2,Single/Surface,"-8,745907","11,944297","0,000315707468",,,,,"0,01",,,,,,
+Time-mean northward gravity wave surface stress,avg_ingwss,235046,N m**-2,Single/Surface,"-12,171822","12,55106","0,000377241231",,,,,"0,01",,,,,,
+Time-mean gravity wave dissipation,avg_igwd,235047,W m**-2,Single/Surface,"-0,1849078","269,9758","0,004122325219",,,,,"0,01",,,,,,
+Time-mean runoff rate water equivalent (surface plus subsurface),avg_rorwe,235048,kg m**-2 s**-1,Single/Surface,0,"0,0049386024","7,54E-08",,,,,"0,01",,,,,,
+"Time-mean top net short-wave radiation flux, clear sky",avg_tnswrfcs,235049,W m**-2,Single/Surface,0,"1300,6875","0,0198469162",,"0,1",,,,,,,Energy,,
+"Time-mean top net long-wave radiation flux, clear sky",avg_tnlwrfcs,235050,W m**-2,Single/Surface,"-410,3352","-93,833984","0,004829425365",,"0,1",,,,,,,Energy,,
+"Time-mean surface net short-wave radiation flux, clear sky",avg_snswrfcs,235051,W m**-2,Single/Surface,0,"1086,1562","0,01657342911",,"0,1",,,,,,,Energy,,
+"Time-mean surface net long-wave radiation flux, clear sky",avg_snlwrfcs,235052,W m**-2,Single/Surface,"-425,64917","36,967926","0,007058976684",,"0,1",,,,,,,Energy,,
+Time-mean top downward short-wave radiation flux,avg_tdswrf,235053,W m**-2,Single/Surface,0,"1377,3125","0,02101612091",,"0,1",,,,,,,Energy,,
+Time-mean total column vertically-integrated moisture divergence flux,avg_vimdf,235054,kg m**-2 s**-1,Single/Surface,"-0,008300897","0,0035479628","1,81E-07",,,,,"0,01",,,,,,
+Time-mean total precipitation rate,avg_tprate,235055,kg m**-2 s**-1,Single/Surface,0,"0,0053976774","8,24E-08",,,,,"0,01",,,,,,
+Time-mean convective snowfall rate water equivalent,avg_csfr,235056,kg m**-2 s**-1,Single/Surface,0,"0,0008613169","1,31E-08",,,,,"0,01",,,,,,
+Time-mean large scale snowfall rate water equivalent,avg_lssfr,235057,kg m**-2 s**-1,Single/Surface,0,"0,0014121234","2,15E-08",,,,,"0,01",,,,,,
+Time-mean surface direct short-wave radiation flux,avg_sdirswrf,235058,W m**-2,Single/Surface,0,"1215,0312","0,01853990555",,"0,1",,,,,,,Energy,,
+"Time-mean surface direct short-wave radiation flux, clear sky",avg_sdirswrfcs,235059,W m**-2,Single/Surface,0,"1182,3438","0,01804113388",,"0,1",,,,,,,Energy,,
+"Time-mean surface downward short-wave radiation flux, clear sky",avg_sdswrfcs,235068,W m**-2,Single/Surface,0,"1261,6875","0,01925182343",,"0,1",,,,,,,Energy,,
+"Time-mean surface downward long-wave radiation flux, clear sky",avg_sdlwrfcs,235069,W m**-2,Single/Surface,"48,131348","528,412","0,007328500971",,"0,1",,,,,,,Energy,,
+Time-mean potential evaporation rate,avg_pevr,235070,kg m**-2 s**-1,Single/Surface,"-0,0005127026","3,81E-05","8,40E-09",,,,,"0,01",,,,,,
+Precipitation type,ptype,260015,(Code table 4.201),Single/Surface,0,8,"0,0001220703125",Other,,,,,,,,,,
+K index,kx,260121,K,Single/Surface,"-147,42314","45,75621","0,00294768298",,"0,1",,,,,,,,,
+Total totals index,totalx,260123,K,Single/Surface,"-58,761272","63,22233","0,001861321973",,"0,1",,,,,,,,,
+Significant wave height of first swell partition,swh1,140121,m,Single/Surface,0,"9,668502","0,0001475296303",,0.01,,,,,,,,,
+Mean wave direction of first swell partition,mwd1,140122,degrees,Single/Surface,"1,08E-05","359,99997","0,005493163597",,0.1,,,"0,01",,,,,,
+Mean wave period of first swell partition,mwp1,140123,s,Single/Surface,0,"28,96636","0,0004419915786",,0.01,,,"0,01",,,,,,
+Significant wave height of second swell partition,swh2,140124,m,Single/Surface,0,"6,2371826","9,52E-05",,0.01,,,"0,01",,,,,,
+Mean wave direction of second swell partition,mwd2,140125,degrees,Single/Surface,"5,69E-06",360,"0,005493164063",,0.1,,,"0,01",,,,,,
+Mean wave period of second swell partition,mwp2,140126,s,Single/Surface,0,"28,96669","0,0004419966135",,0.01,,,"0,01",,,,,,
+Significant wave height of third swell partition,swh3,140127,m,Single/Surface,0,"4,755615","7,26E-05",,0.01,,,,,,,,,
+Mean wave direction of third swell partition,mwd3,140128,degrees,Single/Surface,"4,48E-06","359,99997","0,005493163597",,0.1,,,"0,01",,,,,,
+Mean wave period of third swell partition,mwp3,140129,s,Single/Surface,0,"28,966309","0,0004419907928",,0.01,,,,,,,,,
+Wave Spectral Skewness,wss,140207,Numeric,Single/Surface,"0,00011390794","0,101444766","1,55E-06",,"1,00E-05",,,"0,01",,,,,,
+Free convective velocity over the oceans,wstar,140208,m s**-1,Single/Surface,0,"2,8270264","4,31E-05",,0.01,,,,,,,,,
+Air density over the oceans,rhoao,140209,kg m**-3,Single/Surface,"0,9978549","1,5407505","8,28E-06",,0,,,"0,01",,,,,,
+Normalized energy flux into waves,phiaw,140211,Numeric,Single/Surface,"-103,2204","6,4314117","0,001673153834",,0.01,,,"0,01",,,,,,
+Normalized energy flux into ocean,phioc,140212,Numeric,Single/Surface,"-2435,9817","-0,19750977","0,037167117",,0.03,,,"0,01",,,,,,
+Normalized stress into ocean,tauoc,140214,Numeric,Single/Surface,"0,72867155","20,141323","0,0002962135477",,0,,,"0,01",,,,,,
+U-component surface stokes drift,ust,140215,m s**-1,Single/Surface,"-0,8038976","0,93108517","2,65E-05",,0.01,,,"0,01",,,,,,
+V-component surface stokes drift,vst,140216,m s**-1,Single/Surface,"-0,78493565","0,89954257","2,57E-05",,0.01,,,"0,01",,,,,,
+Period corresponding to maximum individual wave height,tmax,140217,s,Single/Surface,"1,9529505","15,253711","0,0002029534953",,"0,01",,,,,,,,,
+Envelop-maximum individual wave height,hmax,140218,m,Single/Surface,"0,032169748","28,994614","0,0004419318284",,"0,01",,,,,,,,,
+Model bathymetry,wmb,140219,m,Single/Surface,5,999,"0,01516723633",,"0,1",,,,,,,,,
+Mean wave period based on first moment,mp1,140220,s,Single/Surface,"1,495059","15,01979","0,0002063710126",,"0,01",,,,,,,,,
+Mean zero-crossing wave period,mp2,140221,s,Single/Surface,"1,4050522","13,9297695","0,0001911120198",,"0,01",,,,,,,,,
+Wave spectral directional width,wdw,140222,radians,Single/Surface,"0,19933546","1,3744593","1,79E-05",,0.01,,,,,,,,,
+Mean wave period based on first moment for wind waves,p1ww,140223,s,Single/Surface,"1,4243002","13,101718","0,0001781832543",,"0,01",,,,,,,,,
+Mean wave period based on second moment for wind waves,p2ww,140224,s,Single/Surface,0,"11,68335","0,0001782737672",,"0,01",,,,,,,,,
+Wave spectral directional width for wind waves,dwww,140225,radians,Single/Surface,0,"1,4142455","2,16E-05",,"0,01",,,,,,,,,
+Mean wave period based on first moment for swell,p1ps,140226,s,Single/Surface,"1,495059","16,368708","0,0002269538672",,"0,01",,,,,,,,,
+Mean wave period based on second moment for swell,p2ps,140227,s,Single/Surface,"1,4050522","15,511257","0,0002152436064",,"0,01",,,,,,,,,
+Wave spectral directional width for swell,dwps,140228,radians,Single/Surface,"0,06416768","1,4095137","2,05E-05",,"0,01",,,,,,,,,
+Significant height of combined wind waves and swell,swh,140229,m,Single/Surface,"0,031689517","15,400932","0,0002345160319",,"0,01",,,,,,,,,
+Mean wave direction,mwd,140230,Degree true,Single/Surface,"0,00012719269","360,00363","0,005493217614",,"0,1",,,,,,,,,
+Peak wave period,pp1d,140231,s,Single/Surface,"1,8260269","21,098976","0,0002940818667",,"0,01",,,,,,,,,
+Mean wave period,mwp,140232,s,Single/Surface,"1,6203156","16,45109","0,0002262996568",,"0,01",,,,,,,,,
+Significant height of wind waves,shww,140234,m,Single/Surface,"2,51E-16","15,387207","0,0002347901464",,"0,01",,,,,,,,,
+Mean direction of wind waves,mdww,140235,degrees,Single/Surface,"1,21E-05","360,0035","0,005493217614",,"0,1",,,,,,,,,
+Mean period of wind waves,mpww,140236,s,Single/Surface,"1,5170059","14,739394","0,0002017576335",,"0,01",,,,,,,,,
+Significant height of total swell,shts,140237,m,Single/Surface,0,"9,903525","0,0001511158043",,"0,01",,,,,,,,,
+Mean direction of total swell,mdts,140238,degrees,Single/Surface,"1,59E-06","360,00372","0,005493220873",,"0,1",,,,,,,,,
+Mean period of total swell,mpts,140239,s,Single/Surface,"1,6203156","17,534609","0,0002428328444",,"0,01",,,,,,,,,
+Mean square slope of waves,msqs,140244,dimensionless,Single/Surface,"2,09E-05","0,0608462","9,28E-07",,0,,,"0,01",,,,,,
+Wave spectral kurtosis,wsk,140252,dimensionless,Single/Surface,"-0,33000004","0,9993343","2,03E-05",,0.01,,,"0,01",,,,,,
+Benjamin-Feir index,bfi,140253,dimensionless,Single/Surface,-10,"9,995083","0,0003051007516",,0.01,,,"0,01",,,,,,
+Wave spectral peakedness,wsp,140254,dimensionless,Single/Surface,"1,1663694","43,99974","0,0006535853609",,0.01,,,"0,01",,,,,,
+Coefficient of drag with waves,cdww,140233,dimensionless,Single/Surface,"0,0006759383","0,0065243226","8,92E-08",,0,,,"0,01",,,,,,
+10 metre wind speed,wind,140245,m s**-1,Single/Surface,2,"34,67285","0,0004985481501",,0.01,,,,,,,,,
+10 metre wind direction,dwi,140249,degrees,Single/Surface,"1,63E-06","360,00375","0,005493221339",,0.1,,,,,,,,,
+Ice temperature layer 1,istl1,35,K,Single/Surface,"230,60287","273,161","0,0006493856199",,"0,002",,,"0,01",,,,,,
+Ice temperature layer 2,istl2,36,K,Single/Surface,"235,49886","273,16095","0,0005746779498",,"0,002",,,"0,01",,,,,,
+Ice temperature layer 3,istl3,37,K,Single/Surface,"249,33615","272,6903","0,0003563561477",,"0,002",,,"0,01",,,,,,
+Ice temperature layer 4,istl4,38,K,Single/Surface,"261,85376","272,01196","0,0001550018787",,"0,002",,,"0,01",,,,,,
+Volumetric soil water layer 1,swvl1,39,m**3 m**-3,Single/Surface,"-0,014633857","0,76602054","1,19E-05",Other,"0,00001",,,"0,01",,,,Water,,I am a bit concerned that a bounded variable can have negative values (CR)
+Volumetric soil water layer 2,swvl2,40,m**3 m**-3,Single/Surface,"-0,0005851451","0,76594377","1,17E-05",Log-Normal,"0,00001",,,"0,01",,,,Water,,I am a bit concerned that a bounded variable can have negative values (CR)
+Volumetric soil water layer 3,swvl3,41,m**3 m**-3,Single/Surface,"-0,00057547074","0,765307","1,17E-05",Normal,"0,00001",,,"0,01",,,,Water,,I am a bit concerned that a bounded variable can have negative values (CR)
+Volumetric soil water layer 4,swvl4,42,m**3 m**-3,Single/Surface,0,"0,7555084","1,15E-05",Normal,"0,00001",,,"0,01",,,,Water,,I am a bit concerned that a bounded variable can have negative values (CR)
+Soil temperature level 1,stl1,139,K,Single/Surface,"212,61618","337,20554","0,001901082695",Normal,"0,002",,,"0,01",,,,Energy,,
+Soil temperature level 2,stl2,170,K,Single/Surface,"212,99513","320,13095","0,001634762855",Normal,"0,002",,,"0,01",,,,Energy,,
+Soil temperature level 3,stl3,183,K,Single/Surface,"213,98178","316,1906","0,001559583005",Normal,"0,002",,,"0,01",,,,Energy,,
+Soil temperature level 4,stl4,236,K,Single/Surface,"214,95973","313,81348","0,001508388435",Normal,"0,002",,,"0,01",,,,Energy,,
+lightning flash densities,,"228050, 228051, 228052, 228053, 228057, 228058, 228059, 228060",flashes/km2/day,,0,400,,,,,,0.01,,,,,,Feedback from Philippe Lopez
diff --git a/src/dc_toolkit/utils.py b/src/dc_toolkit/utils.py
index c665d8e..3560077 100644
--- a/src/dc_toolkit/utils.py
+++ b/src/dc_toolkit/utils.py
@@ -5,16 +5,15 @@
 #
 # Please, refer to the LICENSE file in the root directory.
 # SPDX-License-Identifier: BSD-3-Clause
-
-import sys
 import os
-import shutil
 import math
-import zipfile
 import click
 import humanize
-import uuid
+import threading
+import asyncio
 from pathlib import Path
+from typing import Tuple, Optional, List
+
 import numpy as np
 import dask
 import dask.array
@@ -30,6 +29,7 @@
 from mpi4py import MPI
 import time
 from collections import defaultdict
+from itertools import product
 import atexit
 import re
 
@@ -42,327 +42,852 @@
 os.environ["EBCC_LOG_LEVEL"] = "4"  # ERROR (suppress WARN and below)
 
 
-def get_indexes(arr, indices):
-    codec_to_id = []
-    for ind in indices:
-        codec_to_id.append(ind[1:-1].split(', ', 1))
-    id_ls = []
-    codec_id_dict = {key: val for val, key in codec_to_id}
-    for item in arr:
-        if item == "None":
-            id_ls.append(-1)
-        elif item in list(codec_id_dict.keys()):
-            id_ls.append(codec_id_dict[item])
-        else:
-            if "EBCC" in item:
-                fetch_new_idx = [value for key, value in codec_id_dict.items() if "EBCC" in key][0]
-                id_ls.append(fetch_new_idx)
-            else:
-                return IndexError(f'{item} not in list {list(codec_id_dict.keys())}')
-    return np.asarray(id_ls)
+class CombinationProducedNonFiniteError(Exception):
+    """
+    Raised when a codec combination's decoded sample contains NaN or
+    +/-inf values, signalling that the (compressor, filter, serializer)
+    triple is unsuitable for this field's value range.
+
+    Caught by the per-combo try/except in `cli.evaluate_combos`, which
+    routes it to `failures_<var>_rank<n>.csv` with a clear reason
+    string.  Without this exception, the float64 cast in the metrics
+    loop would raise a `RuntimeWarning: invalid value encountered in
+    cast` per non-finite chunk - the combo would still be filtered out
+    by the L1 threshold downstream, but the warning floods the SLURM
+    log (191 occurrences in production job 843234) and the failure
+    reason wouldn't be recorded explicitly.
+    """
+    pass
 
 
+class SampleTooLargeError(Exception):
+    """
+    Raised by `build_representative_sample` when a field's irreducible
+    spatial footprint exceeds the requested size limit.
+
+    "Irreducible" means: after striding every available time-like and
+    vertical-like dim down to a single index, a single horizontal slab
+    of the field still exceeds the byte limit.  Subsampling horizontal
+    dims is not permitted because it changes the spatial structure
+    codecs exploit during compression scoring.
+
+    Callers should surface this with a clear remedy (raise the limit,
+    lower --threads-per-rank, or move to a larger node).  The exception
+    carries the irreducible byte count and the limit it failed for so
+    the caller can format a precise message.
+    """
+    def __init__(self, message, irreducible_bytes=None, size_limit_bytes=None,
+                 dims=None, spatial_dims=None):
+        super().__init__(message)
+        self.irreducible_bytes = irreducible_bytes
+        self.size_limit_bytes  = size_limit_bytes
+        self.dims              = dims
+        self.spatial_dims      = spatial_dims
+
+
+# =============================================================================
+# SIZE PARSING
+# =============================================================================
+
+_SIZE_UNITS = {
+    "B":   1,
+    "KB":  10**3,  "MB":  10**6,  "GB":  10**9,  "TB":  10**12,
+    "KIB": 2**10,  "MIB": 2**20,  "GIB": 2**30,  "TIB": 2**40,
+}
+
+
+def parse_size(size_str: str | int | float) -> int:
+    """Parse sizes like '5GB', '500MiB', '10GiB' (or a plain number of bytes) to bytes."""
+    if isinstance(size_str, (int, float)):
+        return int(size_str)
+    s = str(size_str).strip().upper().replace(" ", "")
+    for unit in sorted(_SIZE_UNITS.keys(), key=lambda u: -len(u)):
+        if s.endswith(unit):
+            return int(float(s[: -len(unit)]) * _SIZE_UNITS[unit])
+    # No unit -> treat as bytes
+    return int(float(s))
+
+
+# =============================================================================
+# DATASET OPENING
+# =============================================================================
+
 def open_zarr_memstore():
-    store = zarr.storage.MemoryStore()
-    return store
-
-def open_zarr_zipstore(zarr_zipstore_file: str):
-    store = zarr.storage.ZipStore(zarr_zipstore_file, read_only=True)
-    return zarr.open_group(store, mode='r'), store
-
-def open_zarr_localstore(zarr_localstore_file: str):
-    store = zarr.storage.LocalStore(zarr_localstore_file, read_only=True)
-    return zarr.open_group(store, mode='r'), store
-
-
-def open_dataset(dataset_file: str, field_to_compress: str | None = None, field_percentage_to_compress: float | None = None, rank: int = 0):
-    dataset_filepath = Path(dataset_file)
-    suffixes = dataset_filepath.suffixes
-    if dataset_filepath.suffix == ".nc":
-        ds = xr.open_dataset(dataset_file, chunks="auto")  # auto for Dask backend
-    elif dataset_filepath.suffix == ".grib":
-        ds = xr.open_dataset(dataset_file, chunks="auto", engine="cfgrib", backend_kwargs={"indexpath": ""})
-    elif suffixes[-2:] == [".zarr", ".zip"]:
-        store = open_zarr_zipstore(dataset_file)[1]
-        ds = xr.open_zarr(store, chunks="auto", consolidated=False)
-        store.close()
-    elif dataset_filepath.suffix == ".zarr":
-        store = open_zarr_localstore(dataset_file)[1]
-        ds = xr.open_zarr(store, chunks="auto", consolidated=False)
-        store.close()
+    return zarr.storage.MemoryStore()
+
+
+def open_zarr_localstore(path: str, read_only: bool = True):
+    """Open a zarr v3 LocalStore.  Returns (group, store).
+
+    IMPORTANT: do NOT close the store before you finish reading through the
+    returned group.  Keep both alive for the lifetime of any dask graph
+    that derives from it.
+    """
+    store = zarr.storage.LocalStore(path, read_only=read_only)
+    return zarr.open_group(store, mode="r" if read_only else "a"), store
+
+
+def open_dataset(
+    dataset_file: str,
+    field_to_compress: Optional[str] = None,
+    rank: int = 0,
+):
+    """
+    Open a dataset lazily with a dask backend.
+
+    Supports: .nc, .grib, .zarr (pure LocalStore).
+    .zarr.zip is NO LONGER SUPPORTED.
+
+    Note: the store handle is kept alive inside the xarray Dataset.  Do not
+    close it externally.  Zarr v3 LocalStore doesn't require explicit close
+    for reads, but we let xarray own the lifecycle regardless.
+    """
+    p = Path(dataset_file)
+    suffix = p.suffix.lower()
+
+    if suffix == ".nc":
+        ds = xr.open_dataset(dataset_file, chunks="auto")
+    elif suffix == ".grib":
+        ds = xr.open_dataset(
+            dataset_file, chunks="auto", engine="cfgrib",
+            backend_kwargs={"indexpath": ""},
+        )
+    elif suffix == ".zarr":
+        # LocalStore: xarray keeps a reference through the dask graph.
+        # consolidated=None -> auto-detect; uses consolidated metadata if the
+        # store was processed by merge_compressed_fields, otherwise falls back
+        # to a full metadata scan.
+        ds = xr.open_zarr(dataset_file, chunks="auto", consolidated=None)
     else:
         if rank == 0:
-            click.echo(f"Unsupported file format: {dataset_filepath.suffix}. Only .nc/.grib/.zarr/.zarr.zip are supported.")
+            click.echo(
+                f"Unsupported file format: {suffix}. "
+                "Only .nc / .grib / .zarr are supported "
+                "(.zarr.zip was removed in the refactor)."
+            )
             click.echo("Aborting...")
-        sys.exit(1)
+        # Collective abort: sys.exit on rank 0 alone would hang siblings at
+        # the next collective.  Abort(1) tears down the whole MPI world.
+        MPI.COMM_WORLD.Abort(1)
 
     if field_to_compress is not None and field_to_compress not in ds.data_vars:
         if rank == 0:
-            click.echo(f"Field {field_to_compress} not found in NetCDF file.")
-            click.echo(f"Available fields in the dataset: {list(ds.data_vars.keys())}.")
+            click.echo(f"Field {field_to_compress} not found in dataset.")
+            click.echo(f"Available fields: {list(ds.data_vars.keys())}.")
             click.echo("Aborting...")
-        sys.exit(1)
+        MPI.COMM_WORLD.Abort(1)
 
     if rank == 0:
-        click.echo(f"dataset_file.nbytes = {humanize.naturalsize(ds.nbytes, binary=True)}")
+        click.echo(f"dataset.nbytes = {humanize.naturalsize(ds.nbytes, binary=True)}")
         if field_to_compress is not None:
-            nbytes = ds[field_to_compress].nbytes * (field_percentage_to_compress / 100) if field_percentage_to_compress else ds[field_to_compress].nbytes
-            click.echo(f"field_to_compress.nbytes = {humanize.naturalsize(nbytes, binary=True)}")
+            click.echo(
+                f"{field_to_compress}.nbytes = "
+                f"{humanize.naturalsize(ds[field_to_compress].nbytes, binary=True)}"
+            )
 
     return ds
 
 
 def is_lat_lon(da):
-
-    lat_pattern = r'lat'
-    lon_pattern = r'lon'
-
     dims = da.dims
-
-    if len(dims) == 2 and re.search(lat_pattern, dims[0]) and re.search(lon_pattern, dims[1]):
-        return True
-
-    return False
+    return (
+        len(dims) == 2
+        and re.search(r"lat", dims[0]) is not None
+        and re.search(r"lon", dims[1]) is not None
+    )
 
 
-def _zip_zarr_dir(src_dir: str, dst_zip: str) -> None:
-    """
-    From a zarr LocalStore directory to a Zarr ZipStore file.
+# =============================================================================
+# REPRESENTATIVE SAMPLING  (replaces the old corner-slice strategy)
+# =============================================================================
+
+# Time-like dim name patterns; FALLBACK when CF metadata is unavailable.
+# Primary detection uses CF-conventions attributes on the coord variable.
+_TIME_LIKE_DIM_RE = re.compile(
+    r'^(?:time|.*_time|t|step|forecast_reference_time|forecast_period|'
+    r'ensemble|realization|member|reftime|valid_time|epoch)$',
+    re.IGNORECASE,
+)
+
+# CF time-units pattern: "<unit> since <reference time>".
+_CF_TIME_UNITS_RE = re.compile(r'^\s*\w+\s+since\s+', re.IGNORECASE)
+
+# CF standard_name values that mark a time-related axis.
+_CF_TIME_STANDARD_NAMES = frozenset({
+    'time',
+    'forecast_reference_time',
+    'forecast_period',
+})
+
+
+def _is_time_like_coord(da: xr.DataArray, dim_name: str) -> bool:
+    """Decide whether `dim_name` of `da` is a time axis.
+
+    Order of evidence (most authoritative first):
+      1. Corresponding coord variable's `units` attr starts with a CF
+         time-units pattern like 'days since', 'hours since'; this is
+         the canonical CF time signature; no other coordinate type uses
+         this format.
+      2. `standard_name` is one of the CF time-axis standard names.
+      3. `axis` attr is 'T' (CF specifies four axis-type codes:
+         X, Y, Z, T).
+      4. `calendar` attr is present.  Only time vars have calendars,
+         so this is a strong positive signal even when other CF attrs
+         are missing.
+      5. Fallback: dim NAME matches `_TIME_LIKE_DIM_RE`.
+
+    Both `coord.attrs` and `coord.encoding` are checked because
+    xarray's `decode_times=True` (the default) moves CF time attributes
+    from `attrs` to `encoding` after parsing — we have to look in both
+    to be robust to the caller's open_dataset choices.
+
+    Dims without a corresponding coord variable (which happens in
+    netCDF when a dim isn't backed by a same-name 1-D variable) skip
+    straight to the name-regex fallback.
+
+    Returns True on the first positive signal.
     """
-    with zipfile.ZipFile(dst_zip, mode="w", compression=zipfile.ZIP_STORED) as zf:
-        for root, _, files in os.walk(src_dir):
-            for fn in files:
-                abs_path = os.path.join(root, fn)
-                arcname = os.path.relpath(abs_path, src_dir)
-                zf.write(abs_path, arcname)
-
+    coord = da.coords.get(dim_name)
+    if coord is not None:
+        # Merge attrs and encoding; encoding wins on conflicts because
+        # decoded time vars store the original units/calendar in
+        # encoding rather than attrs.
+        merged = {**dict(coord.attrs), **dict(coord.encoding)}
 
-def compress_with_zarr(data, dataset_file, field_to_compress, where_to_write, filters, compressors, serializer, verbose=True, rank=0):
-    assert isinstance(data.data, dask.array.Array)
+        units = merged.get('units')
+        if isinstance(units, str) and _CF_TIME_UNITS_RE.match(units):
+            return True
 
-    basename = os.path.basename(dataset_file)
-    base_name_str = f"{basename}.=.field_{field_to_compress}.=.rank_{rank}"
-    
-    # Generate a unique ID for the intermediate directory to avoid File System issues
-    unique_id = uuid.uuid4().hex[:8]
-    temp_zarr_dir = os.path.join(where_to_write, f"{base_name_str}.={unique_id}.zarr")
-    zarr_zip_file = os.path.join(where_to_write, f"{base_name_str}.zarr.zip")
-    
-    dtype_zarr_parsed = zarr.core.dtype.parse_dtype(data.dtype, zarr_format=3)
+        std_name = merged.get('standard_name')
+        if isinstance(std_name, str) and std_name in _CF_TIME_STANDARD_NAMES:
+            return True
 
-    codecs = []
+        if merged.get('axis') == 'T':
+            return True
 
-    if filters is not None:
-        codecs.extend(filters)
+        if 'calendar' in merged:
+            return True
 
-    if serializer != "auto":
-        codecs.append(serializer)
-    else:
-        codecs.append(zarr.core.array.default_serializer_v3(dtype_zarr_parsed))
+    return bool(_TIME_LIKE_DIM_RE.match(dim_name))
 
-    if compressors is not None:
-        codecs.extend(compressors)
 
-    if filters is None and compressors is None and serializer == "auto":
-        codecs = None
+def _find_time_like_dim(da: xr.DataArray) -> Tuple[Optional[int], Optional[str]]:
+    """First time-like dim's (index, name) in `da.dims`, or (None, None)
+    if no dim qualifies under either CF metadata or the name fallback."""
+    for i, name in enumerate(da.dims):
+        if _is_time_like_coord(da, name):
+            return i, name
+    return None, None
 
-    with Timer("dask.array.to_zarr"):
-        store = zarr.storage.LocalStore(temp_zarr_dir, read_only=False)
-        # Avoid zarr.create_array, because it does not work with data that come from zarr!
-        # TODO: Update to the latest API
-        dask.array.to_zarr(
-            data.data,
-            store,
-            component=field_to_compress,
-            overwrite=True,
-            **{
-                "zarr_format": 3,
-                "dimension_names": data.dims,
-                "codecs": codecs,
-            },
-        )
-        store.close()
-        
-        # Explicitly remove the zip file if it exists from a previous loop iteration.
-        # This forces the filesystem to allocate a new inode, preventing stale read caches.
-        if os.path.exists(zarr_zip_file):
-            try:
-                os.remove(zarr_zip_file)
-            except OSError:
-                pass
 
-        _zip_zarr_dir(temp_zarr_dir, zarr_zip_file)
-        
-        # Clean up the unique temp dir
-        shutil.rmtree(temp_zarr_dir, ignore_errors=True)
+def _is_vertical_like_coord(da: xr.DataArray, dim_name: str) -> bool:
+    """
+    Classify a dim as vertical (level/height/depth-like).
+
+    Order of evidence (most authoritative first), mirroring
+    `_is_time_like_coord`:
+      1. Coord's `axis` attr is 'Z' (CF standard for vertical).
+      2. Coord's `standard_name` matches a vertical-axis CF name
+         (height, altitude, depth, atmosphere_*_coordinate, ...).
+      3. Coord's `positive` attr is set ('up' or 'down') — CF vertical
+         marker that's allowed even without axis=Z.
+      4. Name fallback via `_is_vertical_like_dim` (the existing name
+         heuristic used by `_shrink_order`).
+    """
+    coord = da.coords.get(dim_name)
+    if coord is not None:
+        merged = {**dict(coord.attrs), **dict(coord.encoding)}
+        if merged.get('axis') == 'Z':
+            return True
+        sn = merged.get('standard_name')
+        if isinstance(sn, str) and sn.lower() in {
+            'height', 'altitude', 'depth', 'air_pressure', 'pressure',
+            'model_level_number', 'atmosphere_hybrid_sigma_pressure_coordinate',
+            'atmosphere_hybrid_height_coordinate',
+            'atmosphere_sigma_coordinate',
+            'atmosphere_ln_pressure_coordinate',
+            'atmosphere_sleve_coordinate',
+        }:
+            return True
+        if 'positive' in merged and str(merged['positive']).lower() in ('up', 'down'):
+            return True
+
+    return _is_vertical_like_dim(dim_name)
+
+
+def _classify_sample_dims(
+    da: xr.DataArray,
+) -> Tuple[List[Tuple[int, str]], List[Tuple[int, str]], List[Tuple[int, str]]]:
+    """
+    Classify every dim of `da` into one of:
+      - time-like (CF time units, standard_name, axis=T or calendar; else name regex)
+      - vertical-like (CF axis=Z, standard_name or `positive`; else name heuristic)
+      - spatial (everything else; preserved during sampling)
+
+    Returns (time_dims, vertical_dims, spatial_dims), each as a list of
+    (position-in-da.dims, name).  Time and vertical are STRIDE dims;
+    spatial dims are preserved whole so codec scoring sees the real
+    spatial structure of the field.
+
+    Note: the time check runs first.  If a coord is both T- and Z-like
+    (impossible under CF, but defensive), it's classified as time.
+    """
+    time_dims: List[Tuple[int, str]]     = []
+    vertical_dims: List[Tuple[int, str]] = []
+    spatial_dims: List[Tuple[int, str]]  = []
+    for i, name in enumerate(da.dims):
+        if _is_time_like_coord(da, name):
+            time_dims.append((i, name))
+        elif _is_vertical_like_coord(da, name):
+            vertical_dims.append((i, name))
+        else:
+            spatial_dims.append((i, name))
+    return time_dims, vertical_dims, spatial_dims
 
-    group, store = open_zarr_zipstore(zarr_zip_file)
-    z = group[field_to_compress]
-    z_dask = dask.array.from_zarr(z)
 
-    info_array = z.info_complete()
-    compression_ratio = info_array._count_bytes / info_array._count_bytes_stored
-    if verbose and rank == 0:
-        click.echo(80* "-")
-        click.echo(info_array)
+def build_representative_sample(
+    da: xr.DataArray,
+    size_limit_bytes: int,
+    rank: int = 0,
+) -> xr.DataArray:
+    """
+    Return a subset of `da` no larger than `size_limit_bytes`
+    (`nbytes <= limit`), built to be representative of the full field.
+
+    Strategy
+    --------
+    - If the whole field fits under the limit: return it unchanged.
+    - Otherwise: classify dims into time / vertical / spatial.  Spatial
+      dims (horizontal grid: lat, lon, cell, ncells, x, y, ...) are
+      preserved whole so codecs still see the real spatial structure
+      they exploit during scoring.  Time and vertical dims are
+      stride-sampled.
+    - Reduction across multiple stride dims is distributed in log-space
+      so each dim contributes proportional variety.  Small stride dims
+      that don't need full reduction clamp early and free budget for
+      the larger ones.
+    - Within each strided dim, evenly-spaced indices are picked via
+      `np.linspace`, mirroring the single-dim behaviour:
+      deterministic, edge-inclusive, reproducible across
+      `evaluate_combos` and `compress_with_optimal`.
+    - If a field's irreducible spatial footprint (1 element along every
+      stride dim) exceeds the limit, this raises `SampleTooLargeError`
+      rather than silently violating the budget.  Pre-patch the
+      function used `max(1, …)` and accepted the budget violation; that
+      under-budgeted memory model is what caused the production OOMs
+      on R02B10 out_15 (300 GiB fields, single time slice = 37.5 GiB
+      against a 5 GB budget).
+
+    Memory contract
+    ---------------
+    On success `sampled.nbytes <= size_limit_bytes`, unless `da` has no
+    time-like or vertical dim (then it is returned whole with a warning).
+    The per-rank steady estimate in `cli._per_rank_steady_estimate_bytes`
+    and the cgroup headroom check both rely on this bound.
+
+    Errors
+    ------
+    SampleTooLargeError: irreducible spatial footprint > size_limit_bytes.
+      Carries `irreducible_bytes`, `size_limit_bytes`, `dims`,
+      `spatial_dims` for caller-side message formatting.
+    """
+    nbytes = int(da.dtype.itemsize) * int(np.prod(da.shape))
+    if nbytes <= size_limit_bytes:
+        if rank == 0:
+            click.echo(
+                f"[sample] field fits under limit "
+                f"({humanize.naturalsize(nbytes, binary=True)} "
+                f"<= {humanize.naturalsize(size_limit_bytes, binary=True)}); "
+                f"evaluating on full field."
+            )
+        return da
+
+    time_dims, vertical_dims, spatial_dims = _classify_sample_dims(da)
+    # Stride priority: time first (captures temporal variation, the most
+    # informative axis for codec scoring across forecast steps), then vertical.
+    stride_dims: List[Tuple[int, str]] = list(time_dims) + list(vertical_dims)
+
+    if not stride_dims:
+        # Nothing safe to thin — return whole field with a warning.  In
+        # practice the variables that hit this branch are small bookkeeping
+        # arrays (CF coord-bounds, SCRIP remap weights) where being a few
+        # GiB over isn't a problem.  A LARGE variable without any time or
+        # vertical axis would deserve human review; surface that via the
+        # warning so it doesn't pass silently.
+        if rank == 0:
+            click.echo(
+                f"[sample] WARNING: variable '{da.name}' has dims {da.dims} "
+                f"with no time-like or vertical axis; cannot stride-sample.  "
+                f"Returning the full field "
+                f"({humanize.naturalsize(nbytes, binary=True)}), which "
+                f"exceeds the "
+                f"{humanize.naturalsize(size_limit_bytes, binary=True)} cap.  "
+                f"Normal for small bookkeeping arrays; investigate if the "
+                f"variable is large and downstream OOMs."
+            )
+        return da
 
-    with Timer("compute_errors_distances"):
-        pprint_, errors, euclidean_distance, normalized_euclidean_distance = compute_errors_distances(z_dask, data.data)
-    if verbose and rank == 0:
-        click.echo(80* "-")
-        click.echo(pprint_)
-        click.echo(80* "-")
-        click.echo(f"Euclidean Distance: {euclidean_distance}")
-        click.echo(80* "-")
+    # Irreducible footprint = bytes at 1 index along every stride dim.
+    spatial_sizes = [da.sizes[d] for _, d in spatial_dims]
+    irreducible_bytes = int(da.dtype.itemsize) * int(
+        np.prod(spatial_sizes) if spatial_sizes else 1
+    )
 
-    store.close()
+    if irreducible_bytes > size_limit_bytes:
+        spatial_names = [d for _, d in spatial_dims]
+        msg = (
+            f"variable '{da.name}' has irreducible spatial footprint "
+            f"{humanize.naturalsize(irreducible_bytes, binary=True)} "
+            f"(spatial dims {spatial_names}) which exceeds the "
+            f"{humanize.naturalsize(size_limit_bytes, binary=True)} budget.  "
+            f"Spatial dims must be preserved for codec representativeness.  "
+            f"Remedies (any one):\n"
+            f"  - raise --eval-data-size-limit (memory permitting)\n"
+            f"  - reduce --threads-per-rank to free memory for a larger sample\n"
+            f"  - request more RAM (#SBATCH --mem=0) or a larger node"
+        )
+        if rank == 0:
+            click.echo(f"[sample] FATAL: {msg}")
+        raise SampleTooLargeError(
+            msg,
+            irreducible_bytes=irreducible_bytes,
+            size_limit_bytes=int(size_limit_bytes),
+            dims=tuple(da.dims),
+            spatial_dims=tuple(spatial_names),
+        )
 
-    return compression_ratio, errors, euclidean_distance
+    # Distribute reduction across stride dims via greedy log-space
+    # distribution.  Sort by size ascending so dims that don't need
+    # full reduction clamp early and pass their freed budget upward.
+    max_product = float(size_limit_bytes) / float(irreducible_bytes)
+    stride_info = [(i, name, int(da.sizes[name])) for i, name in stride_dims]
+    by_size_asc = sorted(stride_info, key=lambda t: t[2])
+
+    plan: dict = {}  # dim_name -> n_keep
+    remaining_budget = max_product
+    remaining_dims = len(stride_info)
+    for _, name, size in by_size_asc:
+        if remaining_dims > 0:
+            target = remaining_budget ** (1.0 / remaining_dims)
+        else:
+            target = 1.0
+        n_keep = max(1, min(size, int(target)))
+        plan[name] = n_keep
+        remaining_budget = remaining_budget / max(1, n_keep)
+        remaining_dims -= 1
+
+    # Build isel dict; only set indices for dims we actually thinned.
+    indices_isel: dict = {}
+    for _, name, size in stride_info:
+        n_keep = plan[name]
+        if n_keep < size:
+            idx = np.linspace(0, size - 1, num=n_keep, dtype=int)
+            indices_isel[name] = np.unique(idx).tolist()
+
+    sampled = da.isel(indices_isel) if indices_isel else da
 
+    if rank == 0:
+        # Compact plan summary, in stride-priority order (time dims, then vertical):
+        plan_parts = []
+        for _, name in stride_dims:
+            plan_parts.append(f"{name}={plan[name]}/{da.sizes[name]}")
+        spatial_part = (
+            f" | preserved spatial: {', '.join(d for _, d in spatial_dims)}"
+            if spatial_dims else ""
+        )
+        click.echo(
+            f"[sample] field is "
+            f"{humanize.naturalsize(nbytes, binary=True)} > limit "
+            f"{humanize.naturalsize(size_limit_bytes, binary=True)}; "
+            f"strided {', '.join(plan_parts)}"
+            f"{spatial_part} -> "
+            f"{humanize.naturalsize(int(sampled.nbytes), binary=True)}."
+        )
 
-def compute_errors_distances(da_compressed, da):
-    da_error = da_compressed - da
+    return sampled
+
+
+# =============================================================================
+# CHUNK & SHARD SIZING
+# =============================================================================
+
+# -----------------------------------------------------------------------------
+# Vertical-dim recognition for hiopy-style shrink order.
+# -----------------------------------------------------------------------------
+_VERTICAL_DIM_NAMES = {
+    "lev", "level", "levels", "plev", "plevs", "pressure", "pressure_level",
+    "height", "altitude", "alt", "depth", "z",
+    "model_level", "model_level_number", "ml",
+    "vertical", "vert",
+    "bottom_top", "bottom_top_stag",
+    "mlev", "ilev", "lev_p", "lev_l", "soil_layers_stag",
+    "isobaric", "isobaric1", "isobaric2",
+    "sigma", "sigma_level",
+    "hybrid", "hybrid_level",
+}
+
+
+def _is_vertical_like_dim(name) -> bool:
+    """True if `name` looks like a vertical (level/height/depth) dim."""
+    if name is None:
+        return False
+    n = str(name).lower().strip()
+    if n in _VERTICAL_DIM_NAMES:
+        return True
+    if n.startswith(("lev", "plev", "ilev", "mlev")):
+        return True
+    if n.endswith(("_lev", "_level", "_levels")):
+        return True
+    return False
 
-    # These are still lazy Dask computations
-    norm_L1_error = np.abs(da_error).sum()
-    norm_L2_error = np.sqrt((da_error**2).sum())
-    norm_Linf_error = np.abs(da_error).max()
 
-    norm_L1_original = np.abs(da).sum()
-    norm_L2_original = np.sqrt((da**2).sum())
-    norm_Linf_original = np.abs(da).max()
+def _shrink_order(shape, dims) -> list:
+    """
+    Return axis indices in the order they should be shrunk when the chunk
+    is over target.  Skips axis 0 (the leading dim, handled separately).
 
-    # Group all into one call to compute for efficiency
-    computed = dask.compute(
-        norm_L1_error,
-        norm_L1_original,
-        norm_L2_error,
-        norm_L2_original,
-        norm_Linf_error,
-        norm_Linf_original,
+    hiopy approach: shrink non-vertical (horizontal/cell) dims first, last
+    spatial dim first (C-order); shrink vertical dims last.  When `dims`
+    is None or its length differs from `shape`, fall back to last-dim-first.
+    """
+    ndim = len(shape)
+    if ndim <= 1:
+        return []
+    if dims is None or len(dims) != ndim:
+        return list(range(ndim - 1, 0, -1))
+
+    horizontal, vertical = [], []
+    for i, name in enumerate(dims):
+        if i == 0:
+            continue
+        (vertical if _is_vertical_like_dim(name) else horizontal).append(i)
+    # Within each group, shrink the LAST (fastest-varying) dim first.
+    return sorted(horizontal, reverse=True) + sorted(vertical, reverse=True)
+
+
+def _compute_inner_chunk_shape(
+    shape,
+    dtype,
+    dims,
+    target_bytes: int,
+    allow_spatial_split: bool = True,
+) -> Tuple[Tuple[int, ...], str]:
+    """
+    Core inner-chunk sizing algorithm.  Returns (inner_chunk_shape, mode).
+
+    `mode` is a short string describing which branch was taken, suitable
+    for logging:
+      - "leading-fits"   : one leading slice fits in target; chunked along leading
+      - "spatial-split"  : leading slice exceeds target; spatial dims also shrunk
+      - "temporal-only"  : leading slice exceeds target but split was disabled;
+                           one timestep per chunk, full spatial (may exceed target)
+
+    Algorithm:
+      1. If one slice along the leading dim fits in target, pack as many
+         leading slices as fit.
+      2. Else set leading=1.  If allow_spatial_split is False, return now
+         (caller is responsible for any oversize warning).
+      3. Else walk spatial dims in `_shrink_order` and reduce each until
+         the chunk fits in target_bytes.
+    """
+    itemsize = int(np.dtype(dtype).itemsize)
+    shape = tuple(int(s) for s in shape)
+    ndim = len(shape)
+    if ndim == 0:
+        return (), "leading-fits"
+
+    inner = list(shape)
+
+    # Bytes for one leading slice (the full trailing tile).
+    trailing = int(np.prod(shape[1:])) if ndim > 1 else 1
+    bytes_per_leading = itemsize * trailing
+    if bytes_per_leading == 0:
+        return tuple(inner), "leading-fits"
+
+    if bytes_per_leading <= target_bytes:
+        leading = max(1, target_bytes // bytes_per_leading)
+        inner[0] = int(min(shape[0], leading))
+        return tuple(int(x) for x in inner), "leading-fits"
+
+    # One leading slice already exceeds target.
+    inner[0] = 1
+
+    if not allow_spatial_split:
+        return tuple(int(x) for x in inner), "temporal-only"
+
+    # Walk spatial dims in hiopy shrink order, reducing each until we fit.
+    for axis in _shrink_order(shape, dims):
+        chunk_bytes = itemsize * int(np.prod(inner))
+        if chunk_bytes <= target_bytes:
+            break
+        per_row = chunk_bytes // inner[axis] if inner[axis] > 0 else chunk_bytes
+        if per_row <= 0:
+            inner[axis] = 1
+            continue
+        new_size = max(1, target_bytes // per_row)
+        inner[axis] = int(min(inner[axis], new_size))
+
+    return tuple(int(x) for x in inner), "spatial-split"
+
+
+def compute_chunk_shape_for_eval(
+    shape,
+    dtype,
+    target_mib: int = 16,
+    dims=None,
+    max_target_mib: int = 256,
+    allow_spatial_split: bool = True,
+):
+    """
+    Pick a chunk shape for in-memory evaluation.  Same algorithm as the
+    persist path so the measured compression ratio reflects production.
+
+    Parameters
+    ----------
+    shape, dtype : array geometry.
+    target_mib : soft target chunk size in MiB.
+    dims : tuple of dim names (used to shrink vertical-like dims last when
+           spatial splitting is needed).  Pass None to fall back to plain
+           last-dim-first ordering.
+    max_target_mib : hard ceiling in MiB.  Not enforced here; it is only
+           relevant when allow_spatial_split=False, where the returned
+           chunk may exceed target_mib.  The caller compares the chunk
+           against this ceiling and is responsible for any warning.
+    allow_spatial_split : if False, never split spatial dims; keep
+           (1, ...full spatial...) and accept oversized chunks.
+    """
+    target_bytes = int(target_mib) * 2**20
+    inner, _mode = _compute_inner_chunk_shape(
+        shape, dtype, dims, target_bytes,
+        allow_spatial_split=allow_spatial_split,
     )
+    return inner
+
+
+def compute_chunk_and_shard_shape(
+    shape,
+    dtype,
+    inner_mib: int = 16,
+    shard_mib: int = 512,
+    dims=None,
+    max_inner_mib: int = 256,
+    allow_spatial_split: bool = True,
+) -> Tuple[Tuple[int, ...], Optional[Tuple[int, ...]]]:
+    """
+    Auto-compute (inner_chunk_shape, shard_shape) for zarr v3 sharding.
+
+    Returns
+    -------
+    (inner, shards) where:
+      - `inner` is the inner chunk shape.
+      - `shards` is the shard shape, OR `None` to signal "skip sharding"
+        (the inner chunk already meets/exceeds the shard target, so a
+        shard would bundle <= 1 chunk and add only index overhead).
+
+    Parameters
+    ----------
+    inner_mib : soft target for inner chunk (MiB).
+    shard_mib : soft target for shard (MiB).  Each shard must contain an
+                integer number of inner chunks on every axis.
+    dims : optional tuple of dim names.  When provided, vertical-like
+           dims are shrunk last (hiopy approach).
+    max_inner_mib : hard ceiling on inner chunk size in MiB.  Used by the
+           caller to decide whether to warn; the algorithm itself respects
+           inner_mib when allow_spatial_split=True.
+    allow_spatial_split : if True (default), spatial dims are split when a
+           single timestep already exceeds inner_mib.  If False, chunks
+           remain (1, ...full spatial...) and may exceed the target.
+    """
+    itemsize = int(np.dtype(dtype).itemsize)
+    inner_target_bytes = int(inner_mib) * 2**20
+    shard_target_bytes = int(shard_mib) * 2**20
 
-    (
-        norm_L1_error_val,
-        norm_L1_original_val,
-        norm_L2_error_val,
-        norm_L2_original_val,
-        norm_Linf_error_val,
-        norm_Linf_original_val,
-    ) = computed
-
-    relative_error_L1 = norm_L1_error_val / norm_L1_original_val
-    relative_error_L2 = norm_L2_error_val / norm_L2_original_val
-    relative_error_Linf = norm_Linf_error_val / norm_Linf_original_val
-
-    euclidean_distance = norm_L2_error_val
-    normalized_euclidean_distance = relative_error_L2
-
-    errors = {
-        "Relative_Error_L1": relative_error_L1,
-        "Relative_Error_L2": relative_error_L2,
-        "Relative_Error_Linf": relative_error_Linf,
-    }
-
-    errors_ = {k: f"{v:.3e}" for k, v in errors.items()}
-    return "\n".join(f"{k:20s}: {v}" for k, v in errors_.items()), errors, euclidean_distance, normalized_euclidean_distance
-
+    inner, _mode = _compute_inner_chunk_shape(
+        shape, dtype, dims, inner_target_bytes,
+        allow_spatial_split=allow_spatial_split,
+    )
 
+    # ---- shard shape ----
+    # Skip sharding when one chunk already fills (or exceeds) a shard:
+    # bundling a single chunk in a shard buys nothing and costs index bytes.
+    inner_bytes = itemsize * int(np.prod(inner))
+    if inner_bytes == 0 or inner_bytes >= shard_target_bytes:
+        return inner, None
+
+    multiplier = shard_target_bytes // inner_bytes
+    if multiplier <= 1:
+        return inner, None
+
+    shard = list(inner)
+    shard[0] = min(int(shape[0]), inner[0] * int(multiplier))
+    # Must be an integer multiple of inner[0].
+    shard[0] = (shard[0] // inner[0]) * inner[0]
+    shard[0] = max(shard[0], inner[0])
+    shard_t = tuple(int(x) for x in shard)
+    if shard_t == inner:
+        return inner, None
+    return inner, shard_t
+
+
+# -----------------------------------------------------------------------------
+# Kept from original: EBCC-specific chunk search
+# -----------------------------------------------------------------------------
 def compute_chunks(data, min_height=0, max_height=None, min_width=0, max_width=None):
     lat_dim = data.shape[0]
     lon_dim = data.shape[1]
     height = lat_dim
     width = lon_dim
 
-    if max_height is None: max_height = lat_dim
-    if max_width is None: max_width = lat_dim
+    if max_height is None:
+        max_height = lat_dim
+    if max_width is None:
+        max_width = lon_dim
 
     keep_searching_height = True
     keep_searching_width = True
 
     for n in [2, 3, 5]:
         for m in range(10):
-            d = n * (m+1)
+            d = n * (m + 1)
             for p in range(7, -1, -1):
-
                 if keep_searching_height or keep_searching_width:
-
                     if keep_searching_height:
-                        n_chunks_height = d**p
-
+                        n_chunks_height = d ** p
                         if height % n_chunks_height == 0:
                             new_height = height / n_chunks_height
-
                             if (new_height >= min_height) and (new_height <= max_height):
                                 height = new_height
                                 keep_searching_height = False
-
                     if keep_searching_width and p > 0:
-                        n_chunks_width = d**(p-1)
-
+                        n_chunks_width = d ** (p - 1)
                         if width % n_chunks_width == 0:
                             new_width = width / n_chunks_width
-
                             if (new_width >= min_width) and (new_width <= max_width):
                                 width = new_width
                                 keep_searching_width = False
-
                 else:
                     return (height, width, n_chunks_height, n_chunks_width)
 
+    # All loops exhausted without both dims finding a valid factoring.
+    # Raise rather than fall off the end (callers unpack into 4 names).
+    raise ValueError(
+        f"compute_chunks: no valid EBCC chunking found for shape "
+        f"({lat_dim}, {lon_dim}) under constraints "
+        f"height in [{min_height}, {max_height}], "
+        f"width in [{min_width}, {max_width}]. "
+        f"EBCC needs each dim divisible by (n*(m+1))^p for small n/m/p so the "
+        f"resulting block fits in the allowed range.  Resolved so far: "
+        f"height_done={not keep_searching_height}, "
+        f"width_done={not keep_searching_width}.  "
+        f"Workarounds: widen the min/max bounds, pad the field to a "
+        f"factor-friendly shape, or pick a non-EBCC serializer."
+    )
 
-def compressor_space(da, with_lossy=True, with_numcodecs_wasm=True, with_ebcc=True, compressor_class="all"):
-    # https://numcodecs.readthedocs.io/en/stable/zarr3.html#compressors-bytes-to-bytes-codecs
-
-    compressor_space = []
 
-    _COMPRESSORS = [numcodecs.zarr3.Blosc, numcodecs.zarr3.LZ4, numcodecs.zarr3.Zstd, numcodecs.zarr3.Zlib, numcodecs.zarr3.GZip, numcodecs.zarr3.BZ2, numcodecs.zarr3.LZMA]
+# =============================================================================
+# CODEC SPACES
+# =============================================================================
+
+# ---- Compressor parameter grids (bytes -> bytes, data-independent) ----------
+_BLOSC_CNAMES      = ("lz4", "lz4hc", "zstd")               # dropped blosclz
+_BLOSC_CLEVELS     = (1, 5, 9)
+_BLOSC_SHUFFLES    = (0, 1)                                 # dropped -1 (auto) and 2 (bitshuffle)
+_LZ4_ACCELERATIONS = (1, 10, 100)
+_ZSTD_LEVELS       = (6, 12, 22)                            # dropped 1
+_ZLIB_LEVELS       = (3, 6, 9)                              # dropped 1
+_BZ2_LEVELS        = (3, 6, 9)                              # dropped 1
+_LZMA_PRESETS      = (3, 6, 9)                              # dropped 1
+
+# ---- Filter parameter grids (array -> array) --------------------------------
+# The top value in each tuple is effectively lossless
+# and acts as the upper-bound reference point.
+_BITROUND_KEEPBITS_F32 = (3, 7, 11, 13, 17, 23)             # dropped 5, 9, 15; 23 -> lossless
+_BITROUND_KEEPBITS_F64 = (3, 7, 11, 17, 23, 30, 37, 44, 52) # dropped 5, 9, 13 (mirroring f32 logic); 52 -> lossless
+_QUANTIZE_DIGITS_F32   = (1, 3, 4, 5, 6, 7)                 # dropped 2 (adjacent-redundant); 7 -> ~lossless
+_QUANTIZE_DIGITS_F64   = (1, 3, 4, 5, 6, 7, 9, 11, 13, 15)  # dropped 2 (adjacent-redundant); 15 -> ~lossless
+_ASINH_QUANTILE        = 0.01
+
+# ---- Serializer parameter grids (array -> bytes) ----------------------------
+_PCODEC_LEVELS         = (6, 8, 10, 12)      # dropped 4; dropped 0 ("no compression")
+_PCODEC_DELTA_ORDERS   = (0, 7)              # dropped 3 (middle); endpoints cover delta-mode space
+_ZFPY_K_GRID           = (0, 1, 2, 3)        # k -> compute_fixed_*_param(k);
+                                             # fixed-rate / fixed-precision: 8/16/32/64 bits
+                                             # fixed-accuracy:               0.5/0.25/0.0625/0.0039
+_EBCC_ATOLS            = (1e-2, 1e-3, 1e-6, 1e-9)
+# TODO(EBCC): atol is absolute and not scaled to the data range, so its
+# meaning changes wildly with variable units (1e-3 is fine-grained for
+# temperature in K, near-lossless for surface pressure in Pa, useless for
+# precipitation in kg/m^2/s).  Rescale to a relative tolerance or to a
+# fraction of the data's natural range before re-enabling EBCC in
+# production.  EBCC is gated off by --without-ebcc (the CLI default).
+
+
+def compressor_space(da, with_lossy=True, with_numcodecs_wasm=False,
+                     with_ebcc=False, compressor_class="all"):
+    """
+    Bytes->bytes compressor space.  Data-independent: the `da` argument is
+    accepted only for signature symmetry with filter_space / serializer_space,
+    and the lossy / wasm / ebcc flags are also ignored here (every codec in
+    this space is lossless).  Returns [(index, codec), ...].
+
+    Note: standalone GZip has been removed from the space.  GZip and Zlib
+    both run DEFLATE and differ only in their wrapping header/trailer, so
+    they give near-identical CR for identical input; keeping both was redundant.
+    """
+    _COMPRESSORS = [
+        numcodecs.zarr3.Blosc, numcodecs.zarr3.LZ4, numcodecs.zarr3.Zstd,
+        numcodecs.zarr3.Zlib, numcodecs.zarr3.BZ2, numcodecs.zarr3.LZMA,
+    ]
     _COMPRESSOR_MAP = {cls.__name__.lower(): cls for cls in _COMPRESSORS}
 
+    space = []
     if compressor_class.lower() == "all":
-        pass  # use all compressors
+        pass
     elif compressor_class.lower() in _COMPRESSOR_MAP:
         _COMPRESSORS = [_COMPRESSOR_MAP[compressor_class.lower()]]
     elif compressor_class.lower() == "none":
         _COMPRESSORS = []
-        compressor_space.append(None)
-    else:
-        pass  # use all compressors
+        space.append(None)
 
     for compressor in _COMPRESSORS:
-        if compressor == numcodecs.zarr3.Blosc:
-            for cname in numcodecs.blosc.list_compressors():
-                for clevel in [1, 6, 9]:
-                    for shuffle in [-1, 0, 1, 2]:
-                        compressor_space.append(compressor(cname=cname, clevel=clevel, shuffle=shuffle))
-        elif compressor == numcodecs.zarr3.LZ4:
-            for acceleration in [1, 10, 100]:
-                compressor_space.append(compressor(acceleration=acceleration))
-        elif compressor == numcodecs.zarr3.Zstd:
-            for level in [0, 1, 9, 22]:
-                compressor_space.append(compressor(level=level))
-        elif compressor == numcodecs.zarr3.Zlib:
-            for level in [1, 6, 9]:
-                compressor_space.append(compressor(level=level))
-        elif compressor == numcodecs.zarr3.GZip:
-            for level in [1, 6, 9]:
-                compressor_space.append(compressor(level=level))
-        elif compressor == numcodecs.zarr3.BZ2:
-            for level in [1, 6, 9]:
-                compressor_space.append(compressor(level=level))
-        elif compressor == numcodecs.zarr3.LZMA:
-            for preset in [1, 6, 9]:
-                compressor_space.append(compressor(preset=preset))
-
-    return list(zip(range(len(compressor_space)), compressor_space))
-
-
-def filter_space(da, with_lossy=True, with_numcodecs_wasm=True, with_ebcc=True, filter_class="all"):
-    # https://numcodecs.readthedocs.io/en/stable/zarr3.html#filters-array-to-array-codecs
-    # https://numcodecs-wasm.readthedocs.io/en/latest/
-
-    filter_space = []
-
+        if compressor is numcodecs.zarr3.Blosc:
+            for cname in _BLOSC_CNAMES:
+                for clevel in _BLOSC_CLEVELS:
+                    for shuffle in _BLOSC_SHUFFLES:
+                        space.append(compressor(cname=cname, clevel=clevel, shuffle=shuffle))
+        elif compressor is numcodecs.zarr3.LZ4:
+            for acceleration in _LZ4_ACCELERATIONS:
+                space.append(compressor(acceleration=acceleration))
+        elif compressor is numcodecs.zarr3.Zstd:
+            for level in _ZSTD_LEVELS:
+                space.append(compressor(level=level))
+        elif compressor is numcodecs.zarr3.Zlib:
+            for level in _ZLIB_LEVELS:
+                space.append(compressor(level=level))
+        elif compressor is numcodecs.zarr3.BZ2:
+            for level in _BZ2_LEVELS:
+                space.append(compressor(level=level))
+        elif compressor is numcodecs.zarr3.LZMA:
+            for preset in _LZMA_PRESETS:
+                space.append(compressor(preset=preset))
+
+    return list(enumerate(space))
+
+
+def filter_space(da, with_lossy=True, with_numcodecs_wasm=False,
+                 with_ebcc=False, filter_class="all"):
+    """
+    Array->array filter space.  Some filters are data-dependent:
+      * Asinh's linear_width comes from a quantile of |da|
+      * FixedOffsetScale's offset/scale come from sample mean/std/min/max
+    For integer dtypes only Delta is meaningful (BitRound/Quantize are
+    float-only; Asinh/FixedOffsetScale produce float output).  If the user
+    asks for a filter class that is incompatible with the dtype, we warn
+    explicitly rather than silently honouring the dtype override.
+
+    Returns [(index, codec), ...].
+    """
+    is_int = da.dtype.kind in "iu"  # signed and unsigned integers
     _FILTERS = [numcodecs.zarr3.Delta]
     if with_lossy:
         _FILTERS += [numcodecs.zarr3.BitRound, numcodecs.zarr3.Quantize]
@@ -370,43 +895,53 @@ def filter_space(da, with_lossy=True, with_numcodecs_wasm=True, with_ebcc=True,
         if with_lossy:
             _FILTERS.append(Asinh)
         _FILTERS.append(FixedOffsetScale)
-    if da.dtype.kind == 'i':
+    if is_int:
+        # Integer fields: only Delta is algorithmically meaningful.  Surface
+        # the override to the user instead of silently dropping their
+        # --filter-class request.
+        if filter_class.lower() not in ("all", "delta", "none"):
+            click.echo(
+                f"[filter_space] integer dtype {da.dtype}: only Delta is "
+                f"available; ignoring --filter-class={filter_class}.",
+                err=True,
+            )
         _FILTERS = [numcodecs.zarr3.Delta]
 
     _FILTER_MAP = {cls.__name__.lower(): cls for cls in _FILTERS}
 
+    space = []
     if filter_class.lower() == "all":
-        pass  # use all filters
+        pass
     elif filter_class.lower() in _FILTER_MAP:
         _FILTERS = [_FILTER_MAP[filter_class.lower()]]
     elif filter_class.lower() == "none":
         _FILTERS = []
-        filter_space.append(None)
-    else:
-        pass  # use all filters
+        space.append(None)
 
     for filt in _FILTERS:
-        if filt == numcodecs.zarr3.Delta:
+        if filt is numcodecs.zarr3.Delta:
             if np.issubdtype(da.dtype, np.number):
-                filter_space.append(filt(dtype=str(da.dtype)))
-        elif filt == numcodecs.zarr3.BitRound:
-            for keepbits in valid_keepbits_for_bitround(da, step=9):
-                filter_space.append(filt(keepbits=keepbits))
-        elif filt == numcodecs.zarr3.Quantize:
-            for digits in valid_digits_for_quantize(da, step=4):
-                filter_space.append(filt(digits=digits, dtype=str(da.dtype)))
-        elif filt == Asinh:
-            filter_space.append(AnyNumcodecsArrayArrayCodec(filt(linear_width=compute_linear_width(da, quantile=0.01, compute=True))))
-        elif filt == FixedOffsetScale:
-            # Compute required stats ONCE (global reductions; memory-light but do trigger a compute)
+                space.append(filt(dtype=str(da.dtype)))
+        elif filt is numcodecs.zarr3.BitRound:
+            for keepbits in valid_keepbits_for_bitround(da):
+                space.append(filt(keepbits=keepbits))
+        elif filt is numcodecs.zarr3.Quantize:
+            for digits in valid_digits_for_quantize(da):
+                space.append(filt(digits=digits, dtype=str(da.dtype)))
+        elif filt is Asinh:
+            space.append(
+                AnyNumcodecsArrayArrayCodec(
+                    filt(linear_width=compute_linear_width(
+                        da, quantile=_ASINH_QUANTILE, compute=True
+                    ))
+                )
+            )
+        elif filt is FixedOffsetScale:
             mean_val, std_val, min_val, max_val = dask.compute(
-                da.mean(skipna=True),
-                da.std(skipna=True),
-                da.min(skipna=True),
-                da.max(skipna=True),
+                da.mean(skipna=True), da.std(skipna=True),
+                da.min(skipna=True), da.max(skipna=True),
             )
 
-            # Helper to validate finite, nonzero scale (avoid divide-by-zero / NaNs)
             def _safe_scale(x, min_eps=1e-12):
                 if not np.isfinite(x):
                     return None
@@ -414,28 +949,36 @@ def _safe_scale(x, min_eps=1e-12):
                     return None
                 return float(x)
 
-            # Normalize: (x - mean) / std
             std_safe = _safe_scale(std_val)
             if np.isfinite(mean_val) and std_safe is not None:
-                filter_space.append(AnyNumcodecsArrayArrayCodec(filt(offset=float(mean_val), scale=std_safe)))
+                space.append(AnyNumcodecsArrayArrayCodec(
+                    filt(offset=float(mean_val), scale=std_safe)
+                ))
 
-            # Standardize to [0,1]-like: (x - min) / (max - min)
             rng = max_val - min_val
             rng_safe = _safe_scale(rng)
             if np.isfinite(min_val) and rng_safe is not None:
-                filter_space.append(AnyNumcodecsArrayArrayCodec(filt(offset=float(min_val), scale=rng_safe)))
+                space.append(AnyNumcodecsArrayArrayCodec(
+                    filt(offset=float(min_val), scale=rng_safe)
+                ))
 
-    return list(zip(range(len(filter_space)), filter_space))
+    return list(enumerate(space))
 
 
-def serializer_space(da, with_lossy=True, with_numcodecs_wasm=True, with_ebcc=True, serializer_class="all"):
-    # https://numcodecs.readthedocs.io/en/stable/zarr3.html#serializers-array-to-bytes-codecs
-    # https://numcodecs-wasm.readthedocs.io/en/latest/
-    rank = MPI.COMM_WORLD.Get_rank()
-    is_int = (da.dtype.kind == "i")
+def serializer_space(da, with_lossy=True, with_numcodecs_wasm=False,
+                     with_ebcc=False, serializer_class="all"):
+    """
+    Array->bytes serializer space.  PCodec is always present.  ZFPY is added
+    when with_lossy=True.  EBCC and wasm-Zfp are gated on additional flags;
+    both default off in the CLI and are retained here only for explicit
+    reactivation (see EBCC TODO at the top of this section).
 
-    serializer_space = []
+    For integer dtypes only ZFPY's fixed-rate mode is meaningful; the other
+    two modes are skipped.
 
+    Returns [(index, codec), ...].
+    """
+    is_int = da.dtype.kind in "iu"  # signed and unsigned integers
     _SERIALIZERS = [numcodecs.zarr3.PCodec]
     if with_lossy:
         _SERIALIZERS.append(numcodecs.zarr3.ZFPY)
@@ -446,30 +989,24 @@ def serializer_space(da, with_lossy=True, with_numcodecs_wasm=True, with_ebcc=Tr
 
     _SERIALIZER_MAP = {cls.__name__.lower(): cls for cls in _SERIALIZERS}
 
+    space = []
     if serializer_class.lower() == "all":
-        pass  # use all serializers
+        pass
     elif serializer_class.lower() in _SERIALIZER_MAP:
         _SERIALIZERS = [_SERIALIZER_MAP[serializer_class.lower()]]
     elif serializer_class.lower() == "none":
         _SERIALIZERS = []
-        serializer_space.append(None)
-    else:
-        pass  # use all serializers
+        space.append(None)
 
     for serializer in _SERIALIZERS:
-        if serializer == numcodecs.zarr3.PCodec:
-            for level in [0, 4, 8, 12]:
-                for delta_encoding_order in [0, 3, 7]:
-                    serializer_space.append(serializer(
-                            level=level,
-                            mode_spec="auto",
-                            delta_spec="auto",
-                            delta_encoding_order=delta_encoding_order
-                        )
-                    )
+        if serializer is numcodecs.zarr3.PCodec:
+            for level in _PCODEC_LEVELS:
+                for delta_encoding_order in _PCODEC_DELTA_ORDERS:
+                    space.append(serializer(
+                        level=level, mode_spec="auto",
+                        delta_spec="auto", delta_encoding_order=delta_encoding_order,
+                    ))
         elif serializer in (numcodecs.zarr3.ZFPY, Zfp):
-            # https://github.com/LLNL/zfp/tree/develop/python
-            # https://github.com/LLNL/zfp/blob/develop/tests/python/test_numpy.py
             _ZFP_MODES = [
                 ("fixed-accuracy",  zfpy.mode_fixed_accuracy,  "tolerance", compute_fixed_accuracy_param),
                 ("fixed-precision", zfpy.mode_fixed_precision, "precision", compute_fixed_precision_param),
@@ -478,164 +1015,818 @@ def serializer_space(da, with_lossy=True, with_numcodecs_wasm=True, with_ebcc=Tr
             if is_int:
                 _ZFP_MODES = [m for m in _ZFP_MODES if m[0] == "fixed-rate"]
             for mode_str, zfpy_mode, param_name, param_fn in _ZFP_MODES:
-                for k in range(3):
+                for k in _ZFPY_K_GRID:
                     val = param_fn(k)
                     if serializer is numcodecs.zarr3.ZFPY:
-                        serializer_space.append(serializer(mode=zfpy_mode, **{param_name: val}))
+                        space.append(serializer(mode=zfpy_mode, **{param_name: val}))
                     else:
                         codec = serializer(mode=mode_str, **{param_name: val})
-                        serializer_space.append(AnyNumcodecsArrayBytesCodec(codec))
-        elif serializer == EBCCZarrFilter:
-            # https://github.com/spcl/ebcc
+                        space.append(AnyNumcodecsArrayBytesCodec(codec))
+        elif serializer is EBCCZarrFilter:
             data = da.squeeze()  # TODO: add more checks on the shape of the data
-
-            height, width, n_chunks_height, n_chunks_width = compute_chunks( data,
-                                                                             min_height=32,
-                                                                             max_height=2047,
-                                                                             min_width=32,
-                                                                             max_width=2047 )
-
-            # if rank == 0:
-            #     click.echo(f"Using (lat_chunks * lon_chunks) = ({n_chunks_height} * {n_chunks_width}) = {n_chunks_height*n_chunks_width} chunks for EBCC serializers.")
-
-            for atol in [1e-2, 1e-3, 1e-6, 1e-9]:
+            height, width, n_chunks_height, n_chunks_width = compute_chunks(
+                data, min_height=32, max_height=2047, min_width=32, max_width=2047
+            )
+            for atol in _EBCC_ATOLS:
                 ebcc_filter = EBCC_Filter(
-                        base_cr=2,
-                        height=height,
-                        width=width,
-                        data_dim=len(data.shape),
-                        residual_opt=("max_error_target", atol)
-                    )
+                    base_cr=2, height=height, width=width,
+                    data_dim=len(data.shape),
+                    residual_opt=("max_error_target", atol),
+                )
                 zarr_filter = serializer(ebcc_filter.hdf_filter_opts)
-                serializer_space.append(AnyNumcodecsArrayBytesCodec(zarr_filter))
+                space.append(AnyNumcodecsArrayBytesCodec(zarr_filter))
 
-    return list(zip(range(len(serializer_space)), serializer_space))
+    return list(enumerate(space))
 
 
-def valid_keepbits_for_bitround(xr_dataarray, step=1):
+def valid_keepbits_for_bitround(xr_dataarray):
+    """Return the BitRound keepbits grid for the dtype of `xr_dataarray`."""
     dtype = xr_dataarray.dtype
     if np.issubdtype(dtype, np.float64):
-        return inclusive_range(1, 52, step)  # float64 mantissa is 52 bits
+        return _BITROUND_KEEPBITS_F64
     elif np.issubdtype(dtype, np.float32):
-        return inclusive_range(1, 23, step)  # float32 mantissa is 23 bits
+        return _BITROUND_KEEPBITS_F32
     else:
-        raise TypeError(f"Unsupported dtype '{dtype}'. BitRound only supports float32 and float64.")
+        raise TypeError(
+            f"Unsupported dtype '{dtype}'. BitRound only supports float32 and float64."
+        )
 
 
-def valid_digits_for_quantize(xr_dataarray, step=1):
+def valid_digits_for_quantize(xr_dataarray):
+    """Return the Quantize digits grid for the dtype of `xr_dataarray`."""
     dtype = xr_dataarray.dtype
     if np.issubdtype(dtype, np.float64):
-        return inclusive_range(1, 15, step)  # ~15–16 significant digits for float64
+        return _QUANTIZE_DIGITS_F64
     elif np.issubdtype(dtype, np.float32):
-        return inclusive_range(1, 7, step)   # ~7 significant digits for float32
+        return _QUANTIZE_DIGITS_F32
     else:
-        raise TypeError(f"Unsupported dtype '{dtype}'. Quantize only supports float32 and float64.")
-
-
-def compute_fixed_precision_param(param: int) -> int:
-    # https://github.com/LLNL/zfp/tree/develop/tests/python
-    return 1 << (param + 3)
+        raise TypeError(
+            f"Unsupported dtype '{dtype}'. Quantize only supports float32 and float64."
+        )
 
-def compute_fixed_rate_param(param: int) -> int:
-    # https://github.com/LLNL/zfp/tree/develop/tests/python
-    return 1 << (param + 3)
 
-def compute_fixed_accuracy_param(param: int) -> float:
-    # https://github.com/LLNL/zfp/tree/develop/tests/python
-    return math.ldexp(1.0, -(1 << param))
+def compute_fixed_precision_param(param: int) -> int:  return 1 << (param + 3)
+def compute_fixed_rate_param(param: int) -> int:       return 1 << (param + 3)
+def compute_fixed_accuracy_param(param: int) -> float: return math.ldexp(1.0, -(1 << param))
 
 
 def inclusive_range(start, end, step=1):
     if step == 0:
         raise ValueError("step must not be zero")
-
     values = []
     i = start
     if step > 0:
         while i <= end:
-            values.append(i)
-            i += step
+            values.append(i); i += step
         if values[-1] != end:
             values.append(end)
     else:
         while i >= end:
-            values.append(i)
-            i += step
+            values.append(i); i += step
         if values[-1] != end:
             values.append(end)
-
     return values
 
 
-def validate_percentage(ctx, param, value):
-    if value is None:
-        return None
-    try:
-        value = float(value)
-    except ValueError:
-        raise click.BadParameter("Percentage must be a number.")
-    if not (1 <= value <= 99):
-        raise click.BadParameter("Percentage must be between 1 and 99.")
-    return value
+def compute_linear_width(
+    da: xr.DataArray, *,
+    quantile: float = 0.01, skipna: bool = True,
+    floor: float | None = None, cap: float | None = None,
+    compute: bool = False,
+) -> float | xr.DataArray:
+    finite = xr.apply_ufunc(np.isfinite, da, dask="parallelized")
+    abs_da = xr.apply_ufunc(np.abs, da.where(finite), dask="parallelized")
+    lw = abs_da.quantile(quantile, skipna=skipna)
+    if "quantile" in lw.dims:
+        lw = lw.squeeze("quantile", drop=True)
+    if floor is not None or cap is not None:
+        lw = lw.clip(
+            min=floor if floor is not None else None,
+            max=cap   if cap   is not None else None,
+        )
+    return float(lw.compute()) if compute else lw
 
 
-def slice_array(arr: pd.array, indices_ls: list) -> np.ndarray:
-    arr_ls = []
-    for ind in indices_ls:
-        arr_ls.append(arr[[ind]])
+# =============================================================================
+# CODEC PIPELINE ASSEMBLY
+# =============================================================================
+
+def _codec_kwargs(filters, compressors, serializer):
+    """
+    Build the keyword arguments that describe a codec pipeline for
+    zarr.create_array / Group.create_array (forwarded through
+    dask.array.to_zarr as **zarr_array_kwargs).
+
+    Zarr v3's user-facing API takes the three components as SEPARATE kwargs:
+      - filters=    list of array->array codecs, or None
+      - serializer= a single array->bytes codec, or "auto" to use the default
+      - compressors=list of bytes->bytes codecs, or None
+
+    We only include a key when the caller has a value for it, so zarr's own
+    defaults apply when a component is missing.
+
+    `serializer == "auto"` is a sentinel from the CLI meaning "let zarr pick";
+    we pass it through literally because zarr accepts "auto" as a valid value.
+    """
+    kwargs = {}
+    if filters is not None:
+        kwargs["filters"] = filters
+    if compressors is not None:
+        kwargs["compressors"] = compressors
+    if serializer is not None:
+        kwargs["serializer"] = serializer
+    return kwargs
+
 
-    sliced_arr = np.hstack(
-        tuple(arr_ls)
+# =============================================================================
+# CODEC PIPELINE - SHARED HELPERS (byte counts, chunk slicing)
+# =============================================================================
+
+def _info_bytes(info):
+    """
+    Return (count_bytes, count_bytes_stored) from a zarr ArrayInfo.
+
+    Prefers the public attributes exposed by recent zarr versions, but falls
+    back to the private underscore-prefixed names for older ones.  Protects
+    against a future zarr upgrade renaming / removing the private fields.
+    """
+    count = getattr(info, "count_bytes", None)
+    if count is None:
+        count = info._count_bytes
+    stored = getattr(info, "count_bytes_stored", None)
+    if stored is None:
+        stored = info._count_bytes_stored
+    return int(count), int(stored)
+
+
+def _iter_chunk_slices(shape, chunk_shape):
+    """Yield tuples of slice() covering `shape` in `chunk_shape` steps."""
+    ranges = [range(0, s, c) for s, c in zip(shape, chunk_shape)]
+    for start in product(*ranges):
+        yield tuple(
+            slice(st, min(st + c, s))
+            for st, c, s in zip(start, chunk_shape, shape)
+        )
+
+
+# =============================================================================
+# ZARR SYNC-API BYPASS  (opt-in via cli --bypass-zarr-sync)
+# =============================================================================
+# zarr 3's sync wrapper (zarr.core.sync.sync) runs every coroutine on a
+# process-global event loop, serialising codec calls from concurrent worker
+# threads down to ~1 effective core (measured 5x slowdown at 1x32 vs 32x1).
+# We bypass it by calling zarr.api.asynchronous.create_array directly, with
+# a persistent event loop per worker thread.
+#
+# A single shared bounded ThreadPoolExecutor is wired as the default
+# executor on every per-thread loop.  Without that, asyncio.to_thread()
+# inside zarr's native codecs lazily creates a 32-worker default executor
+# per loop -> 32 user threads x 32 workers = ~1024 OS threads (validation
+# job 844391: 30 GB RAM, AveCPU/wall = 2.1).  The shared executor caps
+# total OS threads at user_threads + shared_workers.
+
+try:
+    from zarr.api.asynchronous import create_array as _zarr_async_create_array
+    _ASYNC_BYPASS_AVAILABLE = True
+except ImportError:
+    _zarr_async_create_array = None
+    _ASYNC_BYPASS_AVAILABLE = False
+
+
+_thread_local_loops = threading.local()
+_shared_executor = None
+_shared_executor_lock = threading.Lock()
+
+
+def _get_or_create_shared_executor(max_workers: int):
+    """Lazy, thread-safe singleton ThreadPoolExecutor."""
+    global _shared_executor
+    if _shared_executor is None:
+        with _shared_executor_lock:
+            if _shared_executor is None:
+                from concurrent.futures import ThreadPoolExecutor
+                _shared_executor = ThreadPoolExecutor(
+                    max_workers=max(1, int(max_workers)),
+                    thread_name_prefix="bypass_codec",
+                )
+    return _shared_executor
+
+
+def _shutdown_shared_executor() -> None:
+    global _shared_executor
+    with _shared_executor_lock:
+        if _shared_executor is not None:
+            _shared_executor.shutdown(wait=False, cancel_futures=False)
+            _shared_executor = None
+
+
+def _get_thread_event_loop() -> asyncio.AbstractEventLoop:
+    """
+    Return this thread's persistent asyncio loop, creating it on first call.
+    The shared bounded executor is set as the loop's default executor on
+    creation; without this, asyncio.to_thread() inside zarr's native codecs
+    lazily creates a separate 32-worker default executor for EACH loop.
+    """
+    loop = getattr(_thread_local_loops, "loop", None)
+    if loop is None or loop.is_closed():
+        loop = asyncio.new_event_loop()
+        asyncio.set_event_loop(loop)
+        if _shared_executor is not None:
+            loop.set_default_executor(_shared_executor)
+        _thread_local_loops.loop = loop
+    return loop
+
+
+class _AsyncBypass:
+    """Toggle for the async-direct codec dispatch (cli --bypass-zarr-sync)."""
+    enabled: bool = False
+    threads_per_rank: int = 1
+
+    @classmethod
+    def enable(cls, threads_per_rank: int = 1) -> None:
+        if not _ASYNC_BYPASS_AVAILABLE:
+            raise RuntimeError(
+                "Cannot enable --bypass-zarr-sync: "
+                "zarr.api.asynchronous.create_array is not importable. "
+                "Upgrade zarr or run without the flag."
+            )
+        cls.enabled = True
+        cls.threads_per_rank = max(1, int(threads_per_rank))
+        _get_or_create_shared_executor(cls.threads_per_rank)
+
+    @classmethod
+    def disable(cls) -> None:
+        cls.enabled = False
+        _shutdown_shared_executor()
+
+
+AsyncBypass = _AsyncBypass
+
+
+async def _zarr_pipeline_async(sample_np, dims, codec_kwargs, chunks):
+    """async create + encode + info + decode."""
+    store = zarr.storage.MemoryStore()
+
+    with Timer("eval.create_array"):
+        z = await _zarr_async_create_array(
+            store=store,
+            name="_tmp_eval",
+            shape=sample_np.shape,
+            dtype=sample_np.dtype,
+            chunks=chunks,
+            zarr_format=3,
+            dimension_names=tuple(dims),
+            **codec_kwargs,
+        )
+
+    with Timer("eval.encode"):
+        await z.setitem(Ellipsis, sample_np)
+
+    with Timer("eval.info_complete"):
+        info = await z.info_complete()
+        count_bytes, count_bytes_stored = _info_bytes(info)
+        ratio = count_bytes / count_bytes_stored
+
+    with Timer("eval.decode"):
+        decomp_full = await z.getitem(Ellipsis)
+
+    return decomp_full, ratio
+
+
+def _zarr_pipeline_sync(sample_np, dims, codec_kwargs, chunks):
+    """sync create + encode + info + decode."""
+    store = zarr.storage.MemoryStore()
+
+    with Timer("eval.create_array"):
+        z = zarr.create_array(
+            store=store,
+            name="_tmp_eval",
+            shape=sample_np.shape,
+            dtype=sample_np.dtype,
+            chunks=chunks,
+            zarr_format=3,
+            dimension_names=tuple(dims),
+            **codec_kwargs,
+        )
+
+    with Timer("eval.encode"):
+        z[...] = sample_np
+
+    with Timer("eval.info_complete"):
+        info = z.info_complete()
+        count_bytes, count_bytes_stored = _info_bytes(info)
+        ratio = count_bytes / count_bytes_stored
+
+    with Timer("eval.decode"):
+        decomp_full = z[...]
+
+    return decomp_full, ratio
+
+
+# =============================================================================
+# CODEC PIPELINE - EVALUATION (no persistence, thread-safe)
+# =============================================================================
+
+
+def evaluate_codec_pipeline(
+    sample_np: np.ndarray,
+    dims,
+    filters,
+    compressors,
+    serializer,
+    chunks,
+):
+    """
+    Measure (compression_ratio, errors, euclidean_distance) for a codec
+    pipeline against `sample_np`.  In-memory zarr store, no I/O.  Safe to
+    call concurrently from multiple threads.
+
+    The full sample is decoded once via z[...]; the metrics loop slices
+    that buffer chunk-wise to bound the float64 promotion peak.  Peak
+    per-rank memory is ~2 * sample_nbytes (original + decoded) — caller
+    must size NTASKS_PER_NODE accordingly.
+
+    Dispatch: if AsyncBypass is enabled, the zarr operations route through
+    a per-thread persistent event loop (sidesteps zarr 3's process-global
+    sync() loop that otherwise serialises codec dispatch from concurrent
+    threads).  The metrics phase is identical in both paths.
+    """
+    codec_kwargs = _codec_kwargs(filters, compressors, serializer)
+
+    if _AsyncBypass.enabled:
+        loop = _get_thread_event_loop()
+        decomp_full, ratio = loop.run_until_complete(
+            _zarr_pipeline_async(sample_np, dims, codec_kwargs, chunks)
+        )
+    else:
+        decomp_full, ratio = _zarr_pipeline_sync(
+            sample_np, dims, codec_kwargs, chunks
+        )
+
+    # Chunk-wise error accumulation: bounds the float64 promotion peak at
+    # one chunk's worth.  np.errstate silences NaN-cast RuntimeWarnings;
+    # we detect non-finite results explicitly below and surface them via
+    # CombinationProducedNonFiniteError so the caller can record the combo
+    # cleanly in failures_*.csv.
+    with Timer("eval.metrics"):
+        l1_err = 0.0; l2_err_sq = 0.0; linf_err = 0.0
+        l1_ori = 0.0; l2_ori_sq = 0.0; linf_ori = 0.0
+
+        with np.errstate(invalid="ignore"):
+            for sl in _iter_chunk_slices(sample_np.shape, chunks):
+                orig = sample_np[sl]
+                decomp = decomp_full[sl]
+                err = decomp.astype(np.float64, copy=False) - orig.astype(np.float64, copy=False)
+                ori_abs = np.abs(orig, dtype=np.float64) if orig.dtype.kind in "fc" \
+                          else np.abs(orig.astype(np.float64))
+                err_abs = np.abs(err)
+
+                l1_err     += float(err_abs.sum())
+                l2_err_sq  += float((err * err).sum())
+                linf_err    = max(linf_err, float(err_abs.max(initial=0.0)))
+
+                l1_ori     += float(ori_abs.sum())
+                l2_ori_sq  += float((ori_abs * ori_abs).sum())
+                linf_ori    = max(linf_ori, float(ori_abs.max(initial=0.0)))
+
+        del decomp_full
+
+        # NaN/inf in the decoded sample (or already in the original) propagates
+        # into the sum accumulators, so one scalar isfinite() check is enough.
+        if not (math.isfinite(l1_err) and math.isfinite(l2_err_sq)
+                and math.isfinite(linf_err)):
+            raise CombinationProducedNonFiniteError(
+                f"decoded sample contains non-finite values "
+                f"(l1_err={l1_err}, l2_err_sq={l2_err_sq}, linf_err={linf_err}); "
+                f"codec config produced NaN or +/-inf and is unsuitable for "
+                f"this field's value range"
+            )
+
+    l2_err = math.sqrt(l2_err_sq)
+    l2_ori = math.sqrt(l2_ori_sq)
+
+    def _safe_div(a, b):
+        return float(a) / float(b) if b != 0 else (0.0 if a == 0 else float("inf"))
+
+    errors = {
+        "Relative_Error_L1":   _safe_div(l1_err,  l1_ori),
+        "Relative_Error_L2":   _safe_div(l2_err,  l2_ori),
+        "Relative_Error_Linf": _safe_div(linf_err, linf_ori),
+    }
+
+    return ratio, errors, l2_err
+
+
+# =============================================================================
+# CODEC PIPELINE - PERSISTENCE (writes to a shared LocalStore, with sharding)
+# =============================================================================
+
+def persist_with_codec_pipeline(
+    da,                           # xarray DataArray (dask-backed)
+    store,                        # zarr.storage.LocalStore
+    component: str,               # array name inside the store (== field name)
+    filters,
+    compressors,
+    serializer,
+    inner_chunks=None,
+    shards=None,
+    verify: bool = True,
+    verbose: bool = True,
+    rank: int = 0,
+):
+    """
+    Write `da` into `store` at `component` using the codec pipeline.
+    Returns (compression_ratio, errors, euclidean_distance).
+
+    - If `shards` is given: Dask chunks are rechunked to the shard shape so
+      each Dask task writes exactly one shard (no write-amplification).
+    - If `shards` is None: sharding is skipped entirely (the `shards=` kwarg
+      is NOT passed to zarr.create_array).  This happens automatically when
+      one inner chunk already meets or exceeds the shard target -- a shard
+      would bundle <= 1 chunk and add only index overhead.  Dask is then
+      rechunked to the inner chunk shape (each task writes one chunk).
+    - Uses `overwrite=True` as the dask-level kwarg (replaces the deprecated
+      v2-era `mode='w'` form).  `chunks`, `shards`, codec kwargs,
+      `dimension_names`, `zarr_format=3` are passed through **zarr_array_kwargs
+      and forwarded by dask to zarr.create_array.  `mode=` is NOT accepted by
+      zarr v3's create_array -- it's a storage-level concept, not an array one.
+    """
+    assert isinstance(da.data, dask.array.Array), \
+        "persist_with_codec_pipeline expects a dask-backed xr.DataArray"
+
+    # Auto-size chunks/shards if not provided.
+    if inner_chunks is None or shards is None:
+        auto_inner, auto_shard = compute_chunk_and_shard_shape(
+            da.shape, da.dtype, dims=tuple(da.dims),
+        )
+        if inner_chunks is None:
+            inner_chunks = auto_inner
+        if shards is None:
+            shards = auto_shard  # may itself be None -> skip sharding
+
+    codec_kwargs = _codec_kwargs(filters, compressors, serializer)
+
+    # Pick the dask write unit:
+    #   - With sharding: one task per shard (avoids partial-shard rewrites).
+    #   - Without sharding: one task per inner chunk.
+    write_unit = shards if shards is not None else inner_chunks
+    dask_arr = da.data.rechunk(write_unit)
+
+    zarr_kwargs = dict(
+        zarr_format=3,
+        dimension_names=tuple(da.dims),
+        chunks=inner_chunks,
+        **codec_kwargs,
     )
-    return sliced_arr
+    if shards is not None:
+        zarr_kwargs["shards"] = shards
+
+    with Timer("dask.array.to_zarr"):
+        dask.array.to_zarr(
+            dask_arr,
+            store,
+            component=component,
+            overwrite=True,     # dask-level kwarg; NOT forwarded to zarr.create_array
+            compute=True,
+            **zarr_kwargs,
+        )
+
+    # Reopen the written array and compute stats.  Read-only by design:
+    # only info_complete() and the verification read follow, neither writes.
+    group = zarr.open_group(store, mode="r")
+    z = group[component]
+    info = z.info_complete()
+    count_bytes, count_bytes_stored = _info_bytes(info)
+    ratio = count_bytes / count_bytes_stored
+    if verbose and rank == 0:
+        click.echo("-" * 80)
+        click.echo(info)
+
+    errors = None
+    euclidean_distance = None
+    if verify:
+        with Timer("compute_errors_distances"):
+            # Load back with shard-aligned (or inner-aligned) chunks for
+            # efficient reads.
+            z_dask = dask.array.from_zarr(z, chunks=write_unit)
+            _pprint, errors, euclidean_distance, _nrm = \
+                compute_errors_distances(z_dask, da.data)
+        if verbose and rank == 0:
+            click.echo("-" * 80)
+            click.echo(_pprint)
+            click.echo("-" * 80)
+            click.echo(f"Euclidean Distance: {euclidean_distance}")
+            click.echo("-" * 80)
+
+    return ratio, errors, euclidean_distance
+
+
+# =============================================================================
+# ERROR METRICS  (used by the persist path; dask-lazy)
+# =============================================================================
+
+def compute_errors_distances(da_compressed, da):
+    da_error = da_compressed - da
+
+    norm_L1_error    = np.abs(da_error).sum()
+    norm_L2_error    = np.sqrt((da_error ** 2).sum())
+    norm_Linf_error  = np.abs(da_error).max()
 
+    norm_L1_original   = np.abs(da).sum()
+    norm_L2_original   = np.sqrt((da ** 2).sum())
+    norm_Linf_original = np.abs(da).max()
+
+    computed = dask.compute(
+        norm_L1_error, norm_L1_original,
+        norm_L2_error, norm_L2_original,
+        norm_Linf_error, norm_Linf_original,
+    )
+    (l1e, l1o, l2e, l2o, linfe, linfo) = computed
+
+    def _safe_rel(err, ori):
+        """Relative error with sane behavior for zero-norm originals.
+        Returns 0.0 when both error and original are zero (trivially correct),
+        inf when error is non-zero but original is zero."""
+        if ori == 0:
+            return 0.0 if err == 0 else float("inf")
+        return float(err) / float(ori)
+
+    relative_error_L1   = _safe_rel(l1e,    l1o)
+    relative_error_L2   = _safe_rel(l2e,    l2o)
+    relative_error_Linf = _safe_rel(linfe,  linfo)
+
+    euclidean_distance = l2e
+    normalized_euclidean_distance = relative_error_L2
+
+    errors = {
+        "Relative_Error_L1":   relative_error_L1,
+        "Relative_Error_L2":   relative_error_L2,
+        "Relative_Error_Linf": relative_error_Linf,
+    }
+    errors_ = {k: f"{v:.3e}" for k, v in errors.items()}
+    return (
+        "\n".join(f"{k:20s}: {v}" for k, v in errors_.items()),
+        errors,
+        euclidean_distance,
+        normalized_euclidean_distance,
+    )
+
+
+# =============================================================================
+# PARALLELISM & TOPOLOGY
+# =============================================================================
+
+def detect_node_topology(comm=None):
+    """
+    Return (node_comm, ranks_on_node, local_rank).
+
+    Uses the MPI-3 standard COMM_TYPE_SHARED split, with a hostname fallback
+    for old MPI implementations that don't support it.  Works generically on
+    any system without assumptions about cores, RAM, or cluster shape.
+    """
+    comm = comm or MPI.COMM_WORLD
+    try:
+        node_comm = comm.Split_type(MPI.COMM_TYPE_SHARED, key=comm.Get_rank())
+    except Exception:
+        import socket
+        node_name = socket.gethostname()
+        all_names = comm.allgather(node_name)
+        color_map = {n: i for i, n in enumerate(sorted(set(all_names)))}
+        node_comm = comm.Split(color_map[node_name], key=comm.Get_rank())
+    return node_comm, node_comm.Get_size(), node_comm.Get_rank()
 
-def unzip_file(zip_path: str, extract_to: str = None):
-    if extract_to is None:
-        extract_to = os.path.splitext(zip_path)[0]  # default: same name as zip
 
-    # Remove the extract_to path if it exists
-    if os.path.exists(extract_to):
-        if os.path.isfile(extract_to):
-            os.remove(extract_to)
+def detect_cores_available() -> int:
+    """Respect cgroups / Slurm cpusets where possible."""
+    if hasattr(os, "sched_getaffinity"):
+        try:
+            return max(1, len(os.sched_getaffinity(0)))
+        except Exception:
+            pass
+    return max(1, os.cpu_count() or 1)
+
+
+def compute_default_threads_per_rank(ranks_on_node: int, cores_available: int | None = None) -> int:
+    if cores_available is None:
+        cores_available = detect_cores_available()
+    return max(1, cores_available // max(1, ranks_on_node))
+
+
+def broadcast_numpy(arr, comm=None, root: int = 0) -> np.ndarray:
+    """
+    Broadcast a numpy array from `root` to all ranks using the MPI buffer
+    protocol (Bcast, uppercase).  For large payloads this avoids the pickle
+    serialization and extra copies incurred by the lowercase `comm.bcast`.
+
+    Usage
+    -----
+    Rank `root` passes the real array; non-root ranks pass None and receive
+    the broadcast array as the return value:
+
+        if rank == 0:
+            payload = expensive_read()
         else:
-            shutil.rmtree(extract_to)
+            payload = None
+        payload = broadcast_numpy(payload, comm=comm, root=0)
+
+    Implementation
+    --------------
+    Done in two phases:
+      1. Broadcast (shape, dtype) via pickle — a few bytes, cheap.
+      2. Allocate a matching buffer on non-root ranks and Bcast the raw bytes.
+
+    This keeps the fast path (step 2) off of pickle, which matters once the
+    payload is >O(100 MB) - the whole point of using Bcast instead of bcast.
+
+    Notes
+    -----
+    - Non-contiguous inputs are made contiguous on root before broadcast;
+      the returned array is always C-contiguous.
+    - The dtype is round-tripped through str(dtype) / np.dtype(str), which
+      covers all the standard numpy dtypes used in this toolkit.  Custom
+      structured dtypes are not supported.
+    - Payload ceiling: MPI_Bcast's `count` argument is a C `int`
+      (max 2^31 - 1).  We let mpi4py pick the MPI type from the numpy
+      dtype (MPI.DOUBLE for float64, MPI.FLOAT for float32, etc.) instead
+      of wrapping as MPI.BYTE.  This means `count` is `numel`, not
+      `numel * itemsize`, so the ceiling scales with the element size:
+      ~16 GB for float64, ~8 GB for float32.  An earlier implementation
+      used `[buf, MPI.BYTE]` which silently capped at ~2 GB and would
+      overflow or corrupt payloads at the toolkit's default 5 GB sample
+      budget.  The wire protocol is identical; only the MPI type handle
+      changes.
+    """
+    comm = comm or MPI.COMM_WORLD
+    rank = comm.Get_rank()
 
-    # Extract zip file
-    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
-        zip_ref.extractall(extract_to)
+    # Phase 1: metadata (tiny; pickle is fine).
+    if rank == root:
+        if arr is None:
+            raise ValueError(
+                "broadcast_numpy: root rank must provide a numpy array, got None."
+            )
+        meta = (tuple(arr.shape), str(arr.dtype))
+    else:
+        meta = None
+    shape, dtype_str = comm.bcast(meta, root=root)
+
+    # Phase 2: contiguous buffer + Bcast on the buffer protocol.
+    # Passing `buf` bare lets mpi4py infer the MPI type from the numpy
+    # dtype.  Do NOT wrap as `[buf, MPI.BYTE]` - that reduces the effective
+    # payload ceiling by a factor of itemsize (see docstring).
+    if rank == root:
+        buf = np.ascontiguousarray(arr)
+    else:
+        buf = np.empty(shape, dtype=np.dtype(dtype_str))
+    comm.Bcast(buf, root=root)
+    return buf
 
-    return extract_to
 
+def check_thread_oversubscription(abort_if_unsafe: bool = True, rank: int = 0, comm=None) -> None:
+    """
+    Warn if codec-internal thread counts aren't pinned to 1.  Pin zarr v3's
+    internal thread pool only when ranks share a node and no async bypass is on.
 
-def copy_folder_contents(src_folder: str, dst_folder: str):
-    os.makedirs(dst_folder, exist_ok=True)
+    If abort_if_unsafe is True (the default) and any env var is misconfigured,
+    all ranks are killed collectively via comm.Abort(1).  This is intentional:
+    sys.exit on rank 0 alone would hang the siblings at the next collective.
+    """
+    comm = comm or MPI.COMM_WORLD
+    env_vars = [
+        "OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
+        "BLOSC_NTHREADS", "NUMBA_NUM_THREADS",
+        "VECLIB_MAXIMUM_THREADS", "OMP_THREAD_LIMIT",
+    ]
+    problems = []
+    for v in env_vars:
+        val = os.environ.get(v)
+        if val is None:
+            problems.append(f"{v}=<unset>")
+        else:
+            try:
+                if int(val) != 1:
+                    problems.append(f"{v}={val}")
+            except ValueError:
+                problems.append(f"{v}={val}")
 
-    for item in os.listdir(src_folder):
-        src_path = os.path.join(src_folder, item)
-        dst_path = os.path.join(dst_folder, item)
+    if problems:
+        if rank == 0:
+            click.echo(
+                "[oversubscription-check] WARNING: codec-internal thread "
+                "variables not pinned to 1:"
+            )
+            for p in problems:
+                click.echo(f"  - {p}")
+            click.echo(
+                "  With thread-per-combo parallelism this causes "
+                "N_threads x M_internal oversubscription."
+            )
+            click.echo(
+                "  Suggested: export OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 "
+                "OPENBLAS_NUM_THREADS=1 BLOSC_NTHREADS=1 NUMBA_NUM_THREADS=1 "
+                "VECLIB_MAXIMUM_THREADS=1 OMP_THREAD_LIMIT=1"
+            )
+            if abort_if_unsafe:
+                click.echo("  Aborting (use --no-oversubscription-check to override).")
+        if abort_if_unsafe:
+            # Collective abort: all ranks die, not just rank 0.  sys.exit on
+            # rank 0 alone would leave siblings hanging at the next collective.
+            comm.Abort(1)
+
+    # Pin zarr v3's internal thread pool only when oversubscription is a real
+    # risk: multi-rank-per-node (each rank's process otherwise spawns its own
+    # default executor of ~32 workers, giving N_ranks * 32 threads on a
+    # N-core node).  With 1 rank-per-node, no pin is needed: a single rank
+    # uses one ~32-worker pool, which matches the 32 cores it's been given.
+    # The bypass case has its own bounded shared executor — also no pin.
+    if not _AsyncBypass.enabled:
+        try:
+            _, ranks_on_node, _ = detect_node_topology(MPI.COMM_WORLD)
+            if ranks_on_node > 1:
+                zarr.config.set({"threading.max_workers": 1})
+        except Exception:
+            pass
+
+
+# =============================================================================
+# MISC UTILITIES  (kept from original)
+# =============================================================================
 
-        if os.path.isdir(src_path):
-            shutil.copytree(src_path, dst_path, dirs_exist_ok=True)
+def get_indexes(arr, indices):
+    codec_to_id = []
+    for ind in indices:
+        codec_to_id.append(ind[1:-1].split(", ", 1))
+    id_ls = []
+    codec_id_dict = {key: val for val, key in codec_to_id}
+    for item in arr:
+        if item == "None":
+            id_ls.append(-1)
+        elif item in codec_id_dict:
+            id_ls.append(codec_id_dict[item])
         else:
-            shutil.copy2(src_path, dst_path)
+            if "EBCC" in item:
+                fetch_new_idx = [value for key, value in codec_id_dict.items() if "EBCC" in key][0]
+                id_ls.append(fetch_new_idx)
+            else:
+                raise IndexError(f"{item} not in list {list(codec_id_dict.keys())}")
+    return np.asarray(id_ls)
+
+
+def slice_array(arr: pd.array, indices_ls: list) -> np.ndarray:
+    arr_ls = [arr[[ind]] for ind in indices_ls]
+    return np.hstack(tuple(arr_ls))
 
 
-def progress_bar(i, total_configs, print_every=100, bar_width=40):
+def validate_percentage(ctx, param, value):
+    if value is None:
+        return None
+    try:
+        value = float(value)
+    except (TypeError, ValueError):
+        raise click.BadParameter("Percentage must be a number.")
+    if not (1 <= value <= 99):
+        raise click.BadParameter("Percentage must be between 1 and 99.")
+    return value
+
+
+# =============================================================================
+# PROGRESS BAR  (thread-safe)
+# =============================================================================
+
+_PROGRESS_LOCK = threading.Lock()
+_PROGRESS_COUNTERS = defaultdict(int)
+
+
+def progress_bar(total_configs, print_every=100, bar_width=40, key: str = "default"):
+    """
+    Thread-safe progress bar.  Rank 0 only.  `key` distinguishes concurrent
+    progress streams if ever needed.  Call once per completed unit of work;
+    the counter is tracked internally via `_PROGRESS_COUNTERS[key]`.
+    """
     rank = MPI.COMM_WORLD.Get_rank()
     if rank != 0:
         return
+    with _PROGRESS_LOCK:
+        _PROGRESS_COUNTERS[key] += 1
+        done = _PROGRESS_COUNTERS[key]
+        percent = done / total_configs
+        filled = int(bar_width * percent)
+        bar = "*" * filled + "-" * (bar_width - filled)
+        if done % print_every == 0 or done == total_configs:
+            click.echo(
+                f"[Rank {rank}] Progress: |{bar}| {percent*100:6.2f}% "
+                f"({done}/{total_configs})"
+            )
 
-    percent = (i + 1) / total_configs
-    filled = int(bar_width * percent)
-    bar = "*" * filled + "-" * (bar_width - filled)
-    if int(i + 1) % print_every == 0 or (i + 1) == total_configs:
-        click.echo(f"[Rank {rank}] Progress: |{bar}| {percent*100:6.2f}% ({i+1}/{total_configs}) [{total_configs} loops per MPI task]")
 
+# =============================================================================
+# TIMING  (thread-safe)
+# =============================================================================
 
-# Global registry of all timings
+_TIMINGS_LOCK = threading.Lock()
 _TIMINGS = defaultdict(list)
 
+
 class Timer:
     def __init__(self, label):
         self.label = label
@@ -645,64 +1836,35 @@ def __enter__(self):
         return self
 
     def __exit__(self, *args):
-        duration = time.perf_counter() - self.start
-        _TIMINGS[self.label].append(duration)
+        with _TIMINGS_LOCK:
+            _TIMINGS[self.label].append(time.perf_counter() - self.start)
+
 
 @atexit.register
 def print_profile_summary():
     if not _TIMINGS:
         return
+    if MPI.COMM_WORLD.Get_rank() != 0:
+        return
 
-    comm = MPI.COMM_WORLD
-    rank = comm.Get_rank()
-    
-    if rank != 0:
-        return # Only the root process prints the summary
-
-    print("\n=== compress_with_zarr Timing Summary ===")
-
-    # Determine max label width for formatting
+    print("\n=== Timing Summary (rank 0; ranks balanced via deterministic shuffle) ===")
+    print("Sum of Total = thread-seconds inside the eval pipeline (excludes bcast,")
+    print("file I/O, dask graph setup, and result-write overhead).")
+    print()
     label_width = max(len(label) for label in _TIMINGS.keys())
-
-    header = (
-        f"{'Label':<{label_width}} | {'Calls':>5} | {'Avg (s)':>10} | {'Total (s)':>10}"
-    )
+    # Per-label totals; grand total is the sum across all labels and is the
+    # denominator for the % column ("how much of the eval pipeline did this
+    # phase consume?").  Edge case: if no time was recorded, show 0% to
+    # avoid a ZeroDivisionError.
+    totals = {label: sum(durations) for label, durations in _TIMINGS.items()}
+    grand_total = sum(totals.values()) or 1.0
+    header = (f"{'Label':<{label_width}} | {'Calls':>5} | {'Avg (s)':>10} | "
+              f"{'Total (s)':>12} | {'% total':>7}")
     print(header)
     print("-" * len(header))
-
     for label, durations in sorted(_TIMINGS.items()):
-        total = sum(durations)
-        count = len(durations)
-        avg = total / count
-        print(f"{label:<{label_width}} | {count:>5} | {avg:>10.6f} | {total:>10.6f}")
-
+        total = totals[label]; count = len(durations); avg = total / count
+        pct = 100.0 * total / grand_total
+        print(f"{label:<{label_width}} | {count:>5} | {avg:>10.6f} | "
+              f"{total:>12.6f} | {pct:>6.2f}%")
     print("=" * len(header))
-
-
-def compute_linear_width(
-    da: xr.DataArray,
-    *,
-    quantile: float = 0.01,
-    skipna: bool = True,
-    floor: float | None = None,
-    cap: float | None = None,
-    compute: bool = False,
-) -> float | xr.DataArray:
-    """Lazy, Dask-enabled estimate of Asinh.linear_width via small-quantile(|da|)."""
-    # mask to finite values (lazy)
-    finite = xr.apply_ufunc(np.isfinite, da, dask="parallelized")
-    abs_da = xr.apply_ufunc(np.abs, da.where(finite), dask="parallelized")
-
-    # small quantile of |da| (lazy)
-    lw = abs_da.quantile(quantile, skipna=skipna)
-
-    # drop the 'quantile' coord that xarray adds
-    if "quantile" in lw.dims:
-        lw = lw.squeeze("quantile", drop=True)
-
-    # Dask-safe bounds without apply_ufunc
-    if floor is not None or cap is not None:
-        lw = lw.clip(min=floor if floor is not None else None,
-                     max=cap if cap is not None else None)
-
-    return float(lw.compute()) if compute else lw