refactor: simplify internal chunk representation #3899
d-v-b wants to merge 34 commits into zarr-developers:main
Previously rectilinear chunk grids and regular chunk grids normalized chunks inconsistently. This change ensures that chunk specifications are always normalized by the same routines in all cases. This change also ensures that chunks=(-1, ...) consistently normalizes to a full length chunk along that axis.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3899 +/- ##
==========================================
- Coverage 92.98% 92.70% -0.28%
==========================================
Files 87 87
Lines 11246 11261 +15
==========================================
- Hits 10457 10440 -17
- Misses 789 821 +32
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #3899 +/- ##
==========================================
+ Coverage 93.23% 93.26% +0.03%
==========================================
Files 87 87
Lines 11695 11719 +24
==========================================
+ Hits 10904 10930 +26
+ Misses 791 789 -2
maxrjones
left a comment
I really like the direction of the refactor.
I found the description of the PR somewhat misleading. The bug fix (make chunk normalization properly handle -1) is totally unrelated to the addition of rectilinear chunk support; the bug report showed the issue on prior releases. The rectilinear chunk addition made the pre-existing jank related to duplicated normalization logic worse.
There are a few cases in the deprecated Array.create() method that possibly regress in this PR:

import warnings

import zarr


def _create_deprecated(**kwargs):
    """Call the deprecated Array.create(), suppressing the deprecation warning."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        return zarr.Array.create(**kwargs)


def test_deprecated_underspecified_chunks_padded():
    """Fewer chunk dims than shape dims: missing dims padded from shape."""
    arr = _create_deprecated(store={}, shape=(100, 20, 10), chunks=(30,), dtype="uint8")
    assert arr.metadata.chunk_grid.chunk_shape == (30, 20, 10)


def test_deprecated_underspecified_chunks_with_none():
    """Partial chunks with None: padded from shape."""
    arr = _create_deprecated(store={}, shape=(100, 20, 10), chunks=(30, None), dtype="uint8")
    assert arr.metadata.chunk_grid.chunk_shape == (30, 20, 10)


def test_deprecated_none_per_dimension_sentinel():
    """None inside chunks tuple means 'span the full axis'."""
    arr = _create_deprecated(store={}, shape=(100, 10), chunks=(10, None), dtype="uint8")
    assert arr.metadata.chunk_grid.chunk_shape == (10, 10)

I'm not sure if these were intentional API design choices, versus quirks in the old API. It may be a good time to remove deprecated functions, as a separate PR, first to reduce the surface area for potential regressions when fixing/adding functionality to the new API.
good catch, the change that broke -1 normalization was this one: #2761. We basically forked array creation routines and didn't reach feature / testing parity with the new one 🤦 I don't see value in supporting cases like this, other than backwards compatibility.
Are there any non-deprecated functions that supported this? |
💯 |
I couldn't find any non-deprecated cases of supporting underspecified chunks (fewer than the number of dims) or using None as a sentinel value like |
#3903 removes the deprecated methods |
the latest changes make the representation of chunks recursive, in order to express nested sharding. This is future-proofing the design here against the possibility that we give our high-level routines a simple way to declare nested sharding. |
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch normalize_chunks_1d to return np.ndarray[tuple[int], np.dtype[np.int64]] instead of tuple[int, ...]. The uniform-chunks branch now constructs in O(1) via np.full, recovering the single-allocation fast path that regressed when the canonical ChunksTuple representation was introduced. Update create_chunk_grid_metadata in v3.py to convert arrays to tuples of ints before passing to is_regular_nd and RectilinearChunkGridMetadata, keeping those downstream functions' signatures unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
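To make the `np.full` fast path in this commit message concrete, here is a minimal sketch. The function name `uniform_chunk_sizes` is hypothetical, and the choice to keep the nominal size for a trailing partial chunk is an assumption, not something this thread confirms about `normalize_chunks_1d`.

```python
import numpy as np


def uniform_chunk_sizes(chunk_size: int, extent: int) -> np.ndarray:
    """Build the per-chunk sizes for one axis as a 1D int64 array.

    Hypothetical sketch: -1 means "one chunk spanning the axis", and every
    chunk (including a trailing partial one) is reported at the nominal size.
    Zero-length axes are outside the scope of this sketch.
    """
    if chunk_size == -1:
        chunk_size = extent
    if chunk_size < 1:
        raise ValueError("Chunk size must be positive")
    n_chunks = -(-extent // chunk_size)  # ceiling division
    # A single allocation, with no Python-level loop over chunks.
    return np.full(n_chunks, chunk_size, dtype=np.int64)


print(uniform_chunk_sizes(30, 100).tolist())  # [30, 30, 30, 30]
print(uniform_chunk_sizes(-1, 7).tolist())    # [7]
```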
The previous commit 14788aa was meant to only touch chunk_grids.py (Tasks 2+3 of the ChunksTuple → int64-array refactor). It also modified create_chunk_grid_metadata in v3.py — that change belongs to a later task with a different approach (widen annotations rather than materialize tuples) and a better perf profile. Restoring v3.py to its pre-14788aa state. The proper v3.py change will land in a follow-up commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…unk_grid_metadata Accept ndarray[int64] in is_regular_1d/is_regular_nd alongside Sequence[int]. Cast only the first element per axis on the regular path so D ints are allocated rather than N*D. Materialize fully on the rectilinear path because _validate_chunk_shapes checks isinstance(dim_spec, int) which rejects np.int64. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
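The `isinstance(dim_spec, int)` point above can be reproduced with plain NumPy. This snippet is only an illustration of why the rectilinear path has to materialize Python ints; the variable names are made up for the example.

```python
import numbers

import numpy as np

sizes = np.full(4, 30, dtype=np.int64)
first = sizes[0]

# np.int64 is not a subclass of Python's int, so a strict int check rejects it.
print(isinstance(first, int))               # False
print(isinstance(first, numbers.Integral))  # True

# Regular path: cast only the first element of the axis.
regular_size = int(sizes[0])

# Rectilinear path: materialize every element as a Python int.
explicit_sizes = tuple(sizes.tolist())
print(explicit_sizes)  # (30, 30, 30, 30)
```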
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erization Use _assert_chunks_equal for the three call sites that compared a tuple of int64 arrays against a tuple of int tuples. Add three small tests asserting that normalize_chunks_1d returns a 1D int64 ndarray for uniform, explicit-list, and -1 sentinel inputs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-element Python iteration over a 100K-element int64 array dominated create_array runtime on the (10**8,) chunks=(1000,) regression case (~6 ms in is_regular_nd, downstream of ChunksTuple normalization). Dispatch on np.ndarray and use a single vectorized comparison instead. End-to-end create_array on the regression case: ~7.9 ms -> ~0.6 ms. Promote numpy to a runtime import (was TYPE_CHECKING-only) for the isinstance dispatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
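A minimal sketch of the dispatch this commit describes, assuming a uniformity check over one axis. The name `is_uniform` is hypothetical, and treating an empty axis as uniform is an assumption of the sketch rather than documented behavior of `is_regular_1d`.

```python
import numpy as np


def is_uniform(sizes) -> bool:
    """Return True if every chunk size along one axis is identical."""
    if isinstance(sizes, np.ndarray):
        if sizes.size == 0:
            return True  # assumption: an empty axis counts as uniform
        # One vectorized comparison instead of a Python-level loop.
        return bool((sizes == sizes[0]).all())
    iterator = iter(sizes)
    try:
        first = next(iterator)
    except StopIteration:
        return True
    # Generic sequences: short-circuit on the first mismatch.
    return all(size == first for size in iterator)
```

For a 100K-element array, the ndarray branch does a single comparison in compiled code, which is the kind of change the commit message credits for the speedup.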
Names the structure (a layout) rather than the operation that produced it. Reads cleanly for both sharded and unsharded cases, fits the recursive inner-layout pattern, and is what one reaches for when reading the code cold. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merging this PR will not alter performance
@maxrjones your example from #3946 produces the following error message on this branch: |
That's awesome, thank you! Is this branch ready for a review? I can take a look tomorrow AM if so |
yes it's ready |
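flagging one behavioral change here: in main, chunks=True is accepted and produces size-1 chunks, e.g. (1, 1, …). I don't think we ever intended this. Given the choice between canonicalizing the odd behavior in main and rejecting it as an error, this PR makes chunks=True a ValueError. |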
OOC was this a consequence of the addition of rectilinear chunk grids or always the case? It was indeed not intended by my PR, if the former |
before the rectilinear chunks PR, we converted |
maxrjones
left a comment
Thanks for continuing to work on this, I really like the direction.
I don't think the chunks=True fix is working, below is an MVCE:
# /// script
# requires-python = ">=3.11"
# dependencies = [
# "zarr @ git+https://github.com/d-v-b/zarr-python@refactor/simplify-internal-chunk-representation",
# "numpy",
# ]
# ///
"""
MVCE: PR #3899 — `chunks=True` was claimed to raise ValueError but silently
produces size-1 chunks.
Author's claim (PR comment):
"Given the choice between canonicalizing the odd behavior in `main` and
rejecting it as an error, this PR makes `chunks=True` a `ValueError`."
Actual behavior on the PR branch: no error — the array is created with
chunks=(1, 1, ...), one element per chunk on every dimension. On a large
shape this is a serious footgun (millions of objects in the store).
"""
import zarr
# Case 1: trivial shape — author's exact example. Silently produces (1,).
arr = zarr.create_array(store={}, shape=(1,), chunks=True, dtype="int8")
print(f"shape=(1,) chunks={arr.chunks} write_chunk_sizes={arr.write_chunk_sizes}")
assert arr.chunks == (1,), f"expected ValueError, got chunks={arr.chunks}"
# Case 2: realistic shape — silently produces a million size-1 chunks.
arr = zarr.create_array(store={}, shape=(1_000_000,), chunks=True, dtype="int8")
print(f"shape=(1_000_000,) chunks={arr.chunks} nchunks={arr.nchunks}")
assert arr.chunks == (1,)
assert arr.nchunks == 1_000_000
# Case 3: multi-dim — every axis gets size-1 chunks.
arr = zarr.create_array(store={}, shape=(100, 100, 100), chunks=True, dtype="int8")
print(f"shape=(100,100,100) chunks={arr.chunks} nchunks={arr.nchunks}")
assert arr.chunks == (1, 1, 1)
assert arr.nchunks == 1_000_000
# Direct call into the internal function — same path, no auto-chunking guard
# in init_array catches the bool, so it falls into the Integral branch.
from zarr.core.chunk_grids import normalize_chunks_nd
result = normalize_chunks_nd(True, (5,))
print(f"normalize_chunks_nd(True, (5,)) -> {[a.tolist() for a in result]}")
assert [a.tolist() for a in result] == [[1, 1, 1, 1, 1]]
print("\nAll asserts passed: `chunks=True` does NOT raise ValueError on this branch.")
print("Root cause: `bool` is a subclass of `int`, so:")
print(" - _is_rectilinear_chunks(True) returns False (caught by isinstance(_, int))")
print(" - init_array's `chunks is None or chunks == 'auto'` guard does not match True")
print(" - normalize_chunks_nd's `isinstance(chunks, numbers.Integral)` matches True -> int(True) == 1")

A minor thing, but it'd be nice if you could get Claude to update the bugfix.md and PR description to be up-to-date in case we need to refer back to this PR later on.
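The root cause named in the MVCE (`bool` is a subclass of `int`) suggests checking for `bool` before the integer branch. The sketch below is one way to do that; `parse_scalar_chunk` and its error messages are hypothetical and are not the code on this branch.

```python
import numbers


def parse_scalar_chunk(chunks):
    """Validate a scalar chunks argument, rejecting booleans explicitly."""
    # bool must be tested first: isinstance(True, numbers.Integral) is True.
    if isinstance(chunks, bool):
        raise ValueError('chunks must not be a bool; use chunks="auto" for auto-chunking')
    if isinstance(chunks, numbers.Integral):
        return int(chunks)
    raise TypeError(f"unsupported scalar chunks value: {chunks!r}")


print(isinstance(True, int))  # True
print(parse_scalar_chunk(5))  # 5
```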
…into refactor/simplify-internal-chunk-representation
…ps://github.com/d-v-b/zarr-python into refactor/simplify-internal-chunk-representation
I hadn't pushed my local changes 🤦 now that example errors when
Definitely! |
@maxrjones here's claude's summary (I prefer an addition over editing the opening post in the PR)

PR #3899 - refactor: simplify internal chunk representation

Why this PR exists

While investigating #3898 ( This PR does that consolidation, and folds in several related fixes that

Internal architecture

A user-supplied chunk specification now flows through three explicit stages:
| Input | Before | After |
|---|---|---|
| chunks=-1 | Not handled (#3898) | Works: full extent of axis |
| chunks=True | Produced (1, 1, …) chunks | ValueError pointing at "auto" |
| chunks=(30,) for a 3D shape | Padded with shape[len(chunks):] | ValueError: dimension mismatch |
| chunks=(30, None, None) | None interpreted as full-extent sentinel | ValueError: per-dim None rejected |
| chunks=[[3, 3], 1] (RLE form) | TypeError from a nested comparison | TypeError reporting offending indices |
| chunks=0 or list containing 0 | Error raised at later validation step | ValueError: "Chunk size must be positive" |
| Array with a 0-length dimension | Inconsistent behavior across input forms | Handled uniformly |
The "padding short tuples" and "None per-dim" forms are not exercised by
the test suite or the documented public API. After this PR, init_array
expects the caller to choose between an explicit per-dimension spec, a scalar,
or auto-chunking.
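The stricter input grammar summarized above can be sketched as a small validator. The name `check_chunk_spec` and the exact messages (other than "Chunk size must be positive", which the table quotes) are hypothetical; the sketch covers only a flat per-dimension tuple of ints.

```python
def check_chunk_spec(chunks, shape):
    """Validate a flat per-dimension chunk spec against an array shape."""
    if len(chunks) != len(shape):
        # Short tuples are no longer padded from shape.
        raise ValueError(
            f"dimension mismatch: chunks has {len(chunks)} entries, shape has {len(shape)}"
        )
    for axis, chunk in enumerate(chunks):
        if chunk is None:
            # Per-dimension None is no longer a full-extent sentinel.
            raise ValueError(f"None is not allowed for axis {axis}; use -1 for the full axis")
        if chunk != -1 and chunk < 1:
            raise ValueError("Chunk size must be positive")
    return tuple(chunks)


print(check_chunk_spec((30, -1, 10), (100, 20, 10)))  # (30, -1, 10)
```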
Files changed
| File | What changed |
|---|---|
| src/zarr/core/chunk_grids.py | New: ChunksTuple, ChunkLayout, SHARDED_INNER_CHUNK_MAX_BYTES, normalize_chunks_1d, normalize_chunks_nd, guess_chunks, resolve_outer_and_inner_chunks. Removed: normalize_chunks, _auto_partition, _guess_chunks (renamed to _guess_regular_chunks). |
| src/zarr/core/metadata/v3.py | New: is_regular_1d, is_regular_nd, create_chunk_grid_metadata. Removed: resolve_chunks (its responsibilities split between normalize and create). |
| src/zarr/core/array.py | init_array chunk/shard resolution simplified to a normalize → resolve → materialize pipeline. The legacy AsyncArray._create path uses the same primitives. About 50 lines of interleaved conditionals collapsed. |
| tests/conftest.py | create_array_metadata helper rewritten on top of the new primitives. |
| tests/test_chunk_grids.py | Tests reorganized around the new functions. Parametrized cases covering both happy paths (sentinels, explicit specs, rectilinear) and the new error paths (zero/negative chunks, True, length mismatch, RLE rejection, None rejection). Added a _assert_chunks_equal helper that compares ChunksTuple against tuples of int tuples. |
| tests/test_array.py | Shard auto-partition tests updated to call resolve_outer_and_inner_chunks. End-to-end test for auto-chunking + sharding using SHARDED_INNER_CHUNK_MAX_BYTES. |
| tests/test_metadata/test_v3.py | New: 21 parametrized tests for is_regular_1d (sequence + ndarray paths) and is_regular_nd. |
The non-source-code files in the diff (.github/workflows/*, pyproject.toml,
tests/test_codecs/test_sharding.py, tests/test_store/test_fsspec.py) come
from main merges that have happened during the PR's lifetime and are not
part of the chunk-representation work.
Performance
After the vectorization commits (14788aa, 33ee8a1), create_array on a
1D shape with ~100K chunks is at parity with main. Other shapes were never
measurably affected. The CodSpeed report on the PR shows no regressions
across the existing benchmark suite.
Out-of-scope follow-ups identified during review
These came up in the review thread and were deferred to keep this PR focused.
- The legacy Array.create / AsyncArray.create paths still accept the looser v2-era input grammar that this PR tightens elsewhere. #3903 (chore: remove .create methods from arrays) removes those deprecated methods entirely; once that lands, the legacy _create branch can be deleted.
- is_regular_1d lives in metadata/v3.py but operates on the runtime ChunksTuple defined in chunk_grids.py. Hoisting create_chunk_grid_metadata into chunk_grids.py would tighten the module boundary, but isn't done here to keep the diff manageable.
Migration notes for downstream code
Nothing publicly exported from the zarr namespace changed. The removed
helpers (zarr.core.chunk_grids.normalize_chunks, zarr.core.metadata.v3.resolve_chunks)
lived under zarr.core.*, the documented internal namespace. If a downstream
project happens to import them, the equivalent calls are:
# old
from zarr.core.chunk_grids import normalize_chunks
chunks_t = normalize_chunks(chunks, shape, item_size)

# new
from zarr.core.chunk_grids import normalize_chunks_nd, guess_chunks
if chunks is None:
    chunks_t = guess_chunks(shape, item_size)
else:
    chunks_t = normalize_chunks_nd(chunks, shape)
# chunks_t is a ChunksTuple: one np.int64 array per axis.

# old
from zarr.core.metadata.v3 import resolve_chunks
grid = resolve_chunks(raw_chunks, shape, item_size)

# new
from zarr.core.metadata.v3 import create_chunk_grid_metadata
grid = create_chunk_grid_metadata(chunks_t)
The addition of rectilinear chunks left us with some jank in our internal chunk normalization logic. We had a lot of redundant chunk normalization routines, and we also weren't handling user input correctly, e.g. #3898. We need some internal changes to ensure that user input is consistently handled regardless of whether we are generating regular chunks or irregular chunks. That's what this PR does. Also, this PR closes #3898
I will give my summary, then a summary generated by claude.
My summary
ChunksTuple

This PR addresses this by introducing a canonical internal representation of the fully normalized chunk layout for an array, which is a tuple called ChunksTuple. Feel free to suggest better names.

ChunksTuple is just tuple[tuple[int, ...], ...], i.e. a representation compatible with regular or irregular chunks, but I wrap this type in NewType.

I use NewType because tuples of tuples of ints can be very easily confused with tuples of ints (regular chunks), or tuples of tuples of tuples of ints (e.g., rectilinear chunking with RLE). So I think it's helpful to be defensive here and reduce ambiguity.

There are 2 functions that produce ChunksTuple:

- normalize_chunks_nd, which converts a user-friendly request for specific chunks into an explicit chunk layout
- guess_chunks, which converts a user's request for auto chunking into a specific layout. Auto chunking depends on configuration, data type, etc so this is a separate routine.

ResolvedChunking

ChunksTuple is used in ResolvedChunking (bad name, I would rather use ChunkSpec but that's in use already), which is this:

ResolvedChunking is what you get when you jointly normalize the chunks and shards keyword arguments to create_array.

I introduce some new terminology here for internal purposes. outer_chunks denotes the shape of the chunks qua stored objects, and inner_chunks denotes the shape of the subchunks inside an outer chunk, if that outer chunk uses sharding. If the outer chunk doesn't use sharding, then inner_chunks is None.

These two data types are used to consolidate our chunk normalization routines.
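For readers skimming the thread, here is a sketch of the two data types as this description states them. It follows the field names and types given in the PR text, but it is not copied from the branch, and later commits in this thread change the representation to int64 arrays and rename things, so treat it as illustrative only.

```python
from typing import NamedTuple, NewType, Optional

# One tuple of chunk sizes per axis; regular grids are the uniform special case.
ChunksTuple = NewType("ChunksTuple", tuple[tuple[int, ...], ...])


class ResolvedChunking(NamedTuple):
    outer_chunks: ChunksTuple            # chunks as stored objects
    inner_chunks: Optional[ChunksTuple]  # sub-chunks inside a shard, or None


# Unsharded: a (100, 20) array stored as 30-wide chunks along axis 0.
unsharded = ResolvedChunking(
    outer_chunks=ChunksTuple(((30, 30, 30, 30), (20,))),
    inner_chunks=None,
)

# Sharded: one shard per axis, split internally into smaller sub-chunks.
sharded = ResolvedChunking(
    outer_chunks=ChunksTuple(((100,), (20,))),
    inner_chunks=ChunksTuple(((25, 25, 25, 25), (10, 10))),
)
```

NewType has no runtime cost: `ChunksTuple(x)` returns `x` unchanged, so the distinction exists only for the type checker.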
Claude's Summary
Refactors chunk and shard handling during array creation to fix a naming ambiguity where chunk_shape meant "outer grid partition" without sharding but "inner sub-chunk" with sharding, silently changing meaning based on context.

Introduces a three-layer architecture for chunk resolution:

- Normalization: normalize_chunks_nd and guess_chunks convert raw user input into ChunksTuple, a NewType-branded tuple[tuple[int, ...], ...] that represents both regular and rectilinear chunks uniformly. This is the only boundary between untyped user input and the internal representation.
- Resolution: resolve_outer_and_inner_chunks takes a ChunksTuple (the user's chunks=) and raw shard input (shards=), and returns a ResolvedChunking NamedTuple with two unambiguous fields:
  - outer_chunks: ChunksTuple, the chunk sizes for the chunk grid metadata (shard sizes when sharding, chunk sizes otherwise)
  - inner_chunks: ChunksTuple | None, the sub-chunk sizes for ShardingCodec, or None when sharding is not active
- Metadata construction: create_chunk_grid_metadata takes a ChunksTuple and dispatches to RegularChunkGridMetadata or RectilinearChunkGridMetadata based on whether the chunks are uniform.

Key design decisions

- ChunksTuple as a NewType: Zero runtime cost, but the type checker prevents accidentally passing raw user input where normalized chunks are expected. Both regular and rectilinear chunks use the same representation; regular is just the case where each inner tuple has uniform values.
- inner_chunks: None models capability, not configuration: An unsharded chunk is opaque (read the whole thing or nothing). A shard has internal structure (an index that enables sub-chunk addressing). None means "this chunk has no internal structure". It's not a flag you toggle, it's the absence of a capability.
- normalize_chunks_nd rejects None: Top-level None means "auto" everywhere else in the codebase. Having normalize_chunks_nd silently treat it as "span all" would be a bug waiting to happen. Callers must use guess_chunks for auto-chunking.
- Rectilinear shard detection absorbed into resolve_outer_and_inner_chunks: The function handles all shard input forms (None, "auto", dict, flat tuple, nested sequence) internally, eliminating the shards_for_partition / rectilinear_shard_meta dance that callers previously had to manage.

Changes by file

src/zarr/core/chunk_grids.py

- SHARDED_INNER_CHUNK_MAX_BYTES constant (1 MiB): replaces magic number used as the auto-chunking ceiling when sharding is active
- ChunksTuple NewType: branded tuple[tuple[int, ...], ...]
- ResolvedChunking NamedTuple: (outer_chunks, inner_chunks)
- normalize_chunks_nd now returns ChunksTuple, rejects None
- guess_chunks now returns ChunksTuple (normalizes via normalize_chunks_nd)
- Replaced resolve_shard_shape (returned flat tuple | None) with resolve_outer_and_inner_chunks (returns ResolvedChunking, absorbs rectilinear shard detection)
- Removed resolve_chunk_shape (was a lossy flattening wrapper)
- Removed guess_chunks_and_shards (was dead code)

src/zarr/core/metadata/v3.py

- create_chunk_grid_metadata now accepts ChunksTuple (no longer normalizes internally, no shape parameter)
- is_regular_1d rewritten to short-circuit on first mismatch instead of building a full set

src/zarr/core/array.py

- init_array: chunk/shard resolution reduced from ~50 lines of interleaved conditionals to a clean pipeline: normalize → resolve → build metadata. Variables chunk_shape_parsed, shard_shape_parsed, chunks_out, shards_for_partition, and rectilinear_shard_meta eliminated in favor of outer_chunks and inner_chunks.
- _create (legacy API): same normalize-then-build pattern, consistent outer_chunks naming

tests/conftest.py

- create_array_metadata updated to use resolve_outer_and_inner_chunks and create_chunk_grid_metadata instead of manually constructing grid metadata dicts

tests/test_chunk_grids.py

- normalize_chunks_nd tests updated: None moved to error cases, typesize parameter removed

tests/test_array.py

- Shard auto-partition tests updated to call resolve_outer_and_inner_chunks
- Uses SHARDED_INNER_CHUNK_MAX_BYTES instead of magic 1048576
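The "dispatch on uniformity" step in the metadata-construction layer can be sketched as follows. `describe_grid` is a hypothetical stand-in that returns a label instead of constructing zarr's RegularChunkGridMetadata or RectilinearChunkGridMetadata, and it assumes every axis has at least one chunk.

```python
def describe_grid(chunks):
    """Classify normalized per-axis chunk sizes as regular or rectilinear.

    Hypothetical sketch: `chunks` is a tuple with one non-empty tuple of
    chunk sizes per axis, as in the ChunksTuple described above.
    """
    # all() short-circuits on the first mismatching size.
    regular = all(all(size == axis[0] for size in axis) for axis in chunks)
    if regular:
        # A regular grid is fully described by one chunk size per axis.
        return "regular", tuple(axis[0] for axis in chunks)
    return "rectilinear", chunks


print(describe_grid(((30, 30, 30, 30), (20,))))  # ('regular', (30, 20))
print(describe_grid(((3, 3, 1), (5, 5))))        # ('rectilinear', ((3, 3, 1), (5, 5)))
```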