refactor: simplify internal chunk representation #3899

Open
d-v-b wants to merge 34 commits into zarr-developers:main from d-v-b:refactor/simplify-internal-chunk-representation
Conversation

@d-v-b
Contributor

@d-v-b d-v-b commented Apr 12, 2026

The addition of rectilinear chunks left us with some jank in our internal chunk normalization logic. We had a lot of redundant chunk normalization routines, and we also weren't handling user input correctly, e.g. #3898. We need some internal changes to ensure that user input is handled consistently regardless of whether we are generating regular chunks or irregular chunks. That's what this PR does. It also closes #3898.

I will give my summary, then a summary generated by claude.

My summary

ChunksTuple

This PR addresses this by introducing a canonical internal representation of the fully normalized chunk layout for an array, which is a tuple called ChunksTuple. Feel free to suggest better names.

ChunksTuple is just tuple[tuple[int, ...], ...], i.e. a representation compatible with regular or irregular chunks, but I wrap this type in NewType.

I use NewType because tuples of tuples of ints can be very easily confused with tuples of ints (regular chunks), or tuples of tuples of tuples of ints (e.g., rectilinear chunking with RLE). So I think it's helpful to be defensive here and reduce ambiguity.
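A minimal sketch of the branding idea described above, using only the standard library. The alias matches the description in this comment; the sample values and the helper `n_chunks_per_axis` are made up for illustration and are not part of the PR.

```python
from typing import NewType

# Branded alias: one tuple of chunk sizes per array axis.
ChunksTuple = NewType("ChunksTuple", tuple[tuple[int, ...], ...])

# At runtime NewType is an identity function, so the values stay plain tuples.
regular = ChunksTuple(((10, 10, 10), (5, 5)))      # same size along each axis
rectilinear = ChunksTuple(((10, 20), (5, 1, 4)))   # sizes vary along an axis


def n_chunks_per_axis(chunks: ChunksTuple) -> tuple[int, ...]:
    """Count the chunks along each axis (hypothetical helper)."""
    return tuple(len(axis) for axis in chunks)


print(n_chunks_per_axis(regular))      # (3, 2)
print(n_chunks_per_axis(rectilinear))  # (2, 3)
```

A static type checker would flag `n_chunks_per_axis(((10, 10), (5,)))` because the argument is an unbranded tuple, even though the call works at runtime.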

There are 2 functions that produce ChunksTuple:

  1. normalize_chunks_nd, which converts a user-friendly request for specific chunks into an explicit chunk layout
  2. guess_chunks, which converts a user's request for auto chunking into a specific layout. Auto chunking depends on configuration, data type, etc so this is a separate routine.
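To make "explicit chunk layout" concrete, here is a rough per-axis sketch. `normalize_axis` is a hypothetical name, not a function from the PR, and the handling of the trailing boundary chunk is an assumption based on the discussion in this thread.

```python
def normalize_axis(chunk: int, length: int) -> tuple[int, ...]:
    """Expand one per-axis chunk size into explicit chunk lengths (illustrative only)."""
    if isinstance(chunk, bool) or not isinstance(chunk, int):
        raise TypeError(f"chunk size must be an integer, got {chunk!r}")
    if chunk == -1:
        # Sentinel: a single chunk spanning the whole axis.
        return (length,)
    if chunk <= 0:
        raise ValueError("Chunk size must be positive")
    n_full, remainder = divmod(length, chunk)
    # Full-size chunks, followed by a smaller boundary chunk when needed (assumed).
    return (chunk,) * n_full + ((remainder,) if remainder else ())


print(normalize_axis(30, 100))  # (30, 30, 30, 10)
print(normalize_axis(-1, 100))  # (100,)
```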

ResolvedChunking

ChunksTuple is used in ResolvedChunking (bad name, I would rather use ChunkSpec but that's in use already), which is this:

class ResolvedChunking(NamedTuple):
    outer_chunks: ChunksTuple
    inner_chunks: ChunksTuple | None

ResolvedChunking is what you get when you jointly normalize the chunks and shards keyword arguments to create_array.

I introduce some new terminology here for internal purposes. outer_chunks denotes the shape of the chunks qua stored objects, and inner_chunks denotes the shape of the subchunks inside an outer chunk, if that outer chunk uses sharding. If the outer chunk doesn't use sharding, then inner_chunks is None.
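As an illustration of the two cases, the sketch below builds one unsharded and one sharded value. The class mirrors the definition above; the concrete sizes, the `uses_sharding` helper, and the convention that `inner_chunks` describes the subdivision of a single outer chunk are assumptions made for the example.

```python
from typing import NamedTuple, NewType, Optional

ChunksTuple = NewType("ChunksTuple", tuple[tuple[int, ...], ...])


class ResolvedChunking(NamedTuple):
    outer_chunks: ChunksTuple
    inner_chunks: Optional[ChunksTuple]


# Unsharded: a 100x100 array stored as four 50x50 chunks.
unsharded = ResolvedChunking(
    outer_chunks=ChunksTuple(((50, 50), (50, 50))),
    inner_chunks=None,
)

# Sharded: one 100x100 stored object split into 25x25 subchunks (assumed layout).
sharded = ResolvedChunking(
    outer_chunks=ChunksTuple(((100,), (100,))),
    inner_chunks=ChunksTuple(((25, 25, 25, 25), (25, 25, 25, 25))),
)


def uses_sharding(resolved: ResolvedChunking) -> bool:
    """Hypothetical helper: sharding is active exactly when inner_chunks is set."""
    return resolved.inner_chunks is not None


print(uses_sharding(unsharded), uses_sharding(sharded))  # False True
```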

These two data types are used to consolidate our chunk normalization routines.

Claude's Summary

Refactors chunk and shard handling during array creation to fix a naming ambiguity where chunk_shape meant "outer grid partition" without sharding but "inner sub-chunk" with sharding, silently changing meaning based on context.

Introduces a three-layer architecture for chunk resolution:

  1. Normalization: normalize_chunks_nd and guess_chunks convert raw user input into ChunksTuple, a NewType-branded tuple[tuple[int, ...], ...] that represents both regular and rectilinear chunks uniformly. This is the only boundary between untyped user input and the internal representation.

  2. Resolution: resolve_outer_and_inner_chunks takes a ChunksTuple (the user's chunks=) and raw shard input (shards=), and returns a ResolvedChunking NamedTuple with two unambiguous fields:

    • outer_chunks: ChunksTuple — chunk sizes for the chunk grid metadata (shard sizes when sharding, chunk sizes otherwise)
    • inner_chunks: ChunksTuple | None — sub-chunk sizes for ShardingCodec, or None when sharding is not active

  3. Metadata construction: create_chunk_grid_metadata takes a ChunksTuple and dispatches to RegularChunkGridMetadata or RectilinearChunkGridMetadata based on whether the chunks are uniform.

Key design decisions

  • ChunksTuple as a NewType: Zero runtime cost, but the type checker prevents accidentally passing raw user input where normalized chunks are expected. Both regular and rectilinear chunks use the same representation — regular is just the case where each inner tuple has uniform values.

  • inner_chunks: None models capability, not configuration: An unsharded chunk is opaque (read the whole thing or nothing). A shard has internal structure (an index that enables sub-chunk addressing). None means "this chunk has no internal structure" — it's not a flag you toggle, it's the absence of a capability.

  • normalize_chunks_nd rejects None: Top-level None means "auto" everywhere else in the codebase. Having normalize_chunks_nd silently treat it as "span all" would be a bug waiting to happen. Callers must use guess_chunks for auto-chunking.

  • Rectilinear shard detection absorbed into resolve_outer_and_inner_chunks: The function handles all shard input forms (None, "auto", dict, flat tuple, nested sequence) internally, eliminating the shards_for_partition / rectilinear_shard_meta dance that callers previously had to manage.

Changes by file

src/zarr/core/chunk_grids.py

  • Added SHARDED_INNER_CHUNK_MAX_BYTES constant (1 MiB) — replaces magic number used as the auto-chunking ceiling when sharding is active
  • Added ChunksTuple NewType — branded tuple[tuple[int, ...], ...]
  • Added ResolvedChunking NamedTuple — (outer_chunks, inner_chunks)
  • normalize_chunks_nd now returns ChunksTuple, rejects None
  • guess_chunks now returns ChunksTuple (normalizes via normalize_chunks_nd)
  • Replaced resolve_shard_shape (returned flat tuple | None) with resolve_outer_and_inner_chunks (returns ResolvedChunking, absorbs rectilinear shard detection)
  • Removed resolve_chunk_shape (was a lossy flattening wrapper)
  • Removed guess_chunks_and_shards (was dead code)

src/zarr/core/metadata/v3.py

  • create_chunk_grid_metadata now accepts ChunksTuple (no longer normalizes internally, no shape parameter)
  • is_regular_1d rewritten to short-circuit on first mismatch instead of building a full set
  • RST-style docstring syntax replaced with markdown

src/zarr/core/array.py

  • init_array: chunk/shard resolution reduced from ~50 lines of interleaved conditionals to a clean pipeline: normalize → resolve → build metadata. Variables chunk_shape_parsed, shard_shape_parsed, chunks_out, shards_for_partition, and rectilinear_shard_meta eliminated in favor of outer_chunks and inner_chunks.
  • _create (legacy API): same normalize-then-build pattern, consistent outer_chunks naming

tests/conftest.py

  • create_array_metadata updated to use resolve_outer_and_inner_chunks and create_chunk_grid_metadata instead of manually constructing grid metadata dicts

tests/test_chunk_grids.py

  • normalize_chunks_nd tests updated: None moved to error cases, typesize parameter removed
  • Tests use the new function signatures

tests/test_array.py

  • Shard auto-partition tests updated to use resolve_outer_and_inner_chunks
  • Auto-chunk-with-sharding test exercises the full pipeline (guess → resolve → verify divisibility)
  • Uses SHARDED_INNER_CHUNK_MAX_BYTES instead of magic 1048576
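
The "verify divisibility" step mentioned for the auto-chunk-with-sharding test can be pictured roughly as follows. This is a sketch under assumptions: `check_shard_partition` is a hypothetical helper, and treating the 1 MiB constant as an upper bound on the inner chunk's byte size is an interpretation of the description above, not code from the PR.

```python
import math

SHARDED_INNER_CHUNK_MAX_BYTES = 1024 * 1024  # 1 MiB, i.e. 1048576


def check_shard_partition(
    shard_shape: tuple[int, ...], inner_shape: tuple[int, ...], itemsize: int
) -> bool:
    """True when inner chunks tile the shard evenly and stay under the size ceiling."""
    if len(shard_shape) != len(inner_shape):
        raise ValueError("shard and inner chunk shapes must have the same rank")
    divisible = all(s % c == 0 for s, c in zip(shard_shape, inner_shape))
    inner_nbytes = math.prod(inner_shape) * itemsize
    return divisible and inner_nbytes <= SHARDED_INNER_CHUNK_MAX_BYTES


print(check_shard_partition((1000, 1000), (250, 500), 1))  # True
print(check_shard_partition((1000,), (300,), 1))           # False: 1000 % 300 != 0
```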

d-v-b added 3 commits April 10, 2026 18:26
Previously rectilinear chunk grids and regular chunk grids normalized chunks inconsistently.
This change ensures that chunk specifications are always normalized by the same routines in all cases.

This change also ensures that chunks=(-1, ...) consistently normalizes to a full length chunk along that axis.
@github-actions github-actions Bot added the needs release notes label (automatically applied to PRs which haven't added release notes) Apr 12, 2026
@github-actions github-actions Bot removed the needs release notes label Apr 12, 2026
@d-v-b d-v-b requested a review from maxrjones April 12, 2026 12:29
@codecov

codecov Bot commented Apr 12, 2026

Codecov Report

❌ Patch coverage is 61.53846% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.70%. Comparing base (9681cf9) to head (c40c5ff).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/zarr/core/chunk_grids.py | 55.17% | 26 Missing ⚠️ |
| src/zarr/core/array.py | 61.11% | 7 Missing ⚠️ |
| src/zarr/core/metadata/v3.py | 86.66% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3899      +/-   ##
==========================================
- Coverage   92.98%   92.70%   -0.28%     
==========================================
  Files          87       87              
  Lines       11246    11261      +15     
==========================================
- Hits        10457    10440      -17     
- Misses        789      821      +32     

| Files with missing lines | Coverage Δ |
|---|---|
| src/zarr/core/metadata/v3.py | 92.39% <86.66%> (-0.38%) ⬇️ |
| src/zarr/core/array.py | 97.15% <61.11%> (-0.49%) ⬇️ |
| src/zarr/core/chunk_grids.py | 89.08% <55.17%> (-7.18%) ⬇️ |

@codecov

codecov Bot commented Apr 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.26%. Comparing base (3d354a8) to head (424439b).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3899      +/-   ##
==========================================
+ Coverage   93.23%   93.26%   +0.03%     
==========================================
  Files          87       87              
  Lines       11695    11719      +24     
==========================================
+ Hits        10904    10930      +26     
+ Misses        791      789       -2     

| Files with missing lines | Coverage Δ |
|---|---|
| src/zarr/core/array.py | 97.87% <100.00%> (+0.15%) ⬆️ |
| src/zarr/core/chunk_grids.py | 96.81% <100.00%> (+0.54%) ⬆️ |
| src/zarr/core/metadata/v3.py | 94.06% <100.00%> (+0.22%) ⬆️ |

... and 1 file with indirect coverage changes


Member

@maxrjones maxrjones left a comment

I really like the direction of the refactor.

I found the description of the PR somewhat misleading. The bug fix (make chunk normalization properly handle -1) is totally unrelated to the addition of rectilinear chunk support; the bug report showed the issue on prior releases. The rectilinear chunk addition made the pre-existing jank related to duplicated normalization logic worse.

there are a few cases in the deprecated Array.create() method that possibly regress in this PR:

import warnings

import zarr


def _create_deprecated(**kwargs):
    """Call the deprecated Array.create(), suppressing the deprecation warning."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        return zarr.Array.create(**kwargs)


def test_deprecated_underspecified_chunks_padded():
    """Fewer chunk dims than shape dims — missing dims padded from shape."""
    arr = _create_deprecated(store={}, shape=(100, 20, 10), chunks=(30,), dtype="uint8")
    assert arr.metadata.chunk_grid.chunk_shape == (30, 20, 10)


def test_deprecated_underspecified_chunks_with_none():
    """Partial chunks with None — padded from shape."""
    arr = _create_deprecated(store={}, shape=(100, 20, 10), chunks=(30, None), dtype="uint8")
    assert arr.metadata.chunk_grid.chunk_shape == (30, 20, 10)


def test_deprecated_none_per_dimension_sentinel():
    """None inside chunks tuple means 'span the full axis'."""
    arr = _create_deprecated(store={}, shape=(100, 10), chunks=(10, None), dtype="uint8")
    assert arr.metadata.chunk_grid.chunk_shape == (10, 10)

I'm not sure if these were intentional API design choices, versus quirks in the old API. It may be a good time to remove deprecated functions, as a separate PR, first to reduce the surface area for potential regressions when fixing/adding functionality to the new API.

@d-v-b
Contributor Author

d-v-b commented Apr 12, 2026

> I found the description of the PR somewhat misleading. The bug fix (make chunk normalization properly handle -1) is totally unrelated to the addition of rectilinear chunk support; the bug report showed the issue on prior releases. The rectilinear chunk addition made the pre-existing jank related to duplicated normalization logic worse.

good catch, the change that broke -1 normalization was this one: #2761. We basically forked array creation routines and didn't reach feature / testing parity with the new one 🤦

I don't see value in supporting cases like this, other than backwards compatibility.

shape=(100, 20, 10), chunks=(30,)
shape=(100, 20, 10), chunks=(30, None)

Are there any non-deprecated functions that supported this?

@d-v-b
Contributor Author

d-v-b commented Apr 12, 2026

> It may be a good time to remove deprecated functions, as a separate PR, first to reduce the surface area for potential regressions when fixing/adding functionality to the new API.

💯

@maxrjones
Member

> I don't see value in supporting cases like this, other than backwards compatibility. Are there any non-deprecated functions that supported this?

I couldn't find any non-deprecated cases of supporting underspecified chunks (fewer than the number of dims) or using None as a sentinel value like -1.

@d-v-b d-v-b changed the title refactor/simplify internal chunk representation refactor: simplify internal chunk representation Apr 13, 2026
@d-v-b
Contributor Author

d-v-b commented Apr 13, 2026

#3903 removes the deprecated methods

@d-v-b
Contributor Author

d-v-b commented Apr 13, 2026

the latest changes make the representation of chunks recursive, in order to express nested sharding. This is future-proofing the design here against the possibility that we give our high-level routines a simple way to declare nested sharding.

Comment thread src/zarr/core/chunk_grids.py Outdated
d-v-b and others added 6 commits April 20, 2026 19:50
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch normalize_chunks_1d to return np.ndarray[tuple[int], np.dtype[np.int64]]
instead of tuple[int, ...]. The uniform-chunks branch now constructs in O(1)
via np.full, recovering the single-allocation fast path that regressed when
the canonical ChunksTuple representation was introduced.

Update create_chunk_grid_metadata in v3.py to convert arrays to tuples of ints
before passing to is_regular_nd and RectilinearChunkGridMetadata, keeping those
downstream functions' signatures unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit 14788aa was meant to only touch chunk_grids.py
(Tasks 2+3 of the ChunksTuple → int64-array refactor). It also
modified create_chunk_grid_metadata in v3.py — that change belongs
to a later task with a different approach (widen annotations rather
than materialize tuples) and a better perf profile.

Restoring v3.py to its pre-14788aa state. The proper v3.py change
will land in a follow-up commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…unk_grid_metadata

Accept ndarray[int64] in is_regular_1d/is_regular_nd alongside Sequence[int].
Cast only the first element per axis on the regular path so D ints are
allocated rather than N*D. Materialize fully on the rectilinear path because
_validate_chunk_shapes checks isinstance(dim_spec, int) which rejects np.int64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
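
The isinstance point in the commit message above can be checked directly with NumPy. This small example only demonstrates standard Python and NumPy behavior; it does not reproduce `_validate_chunk_shapes`.

```python
import numbers

import numpy as np

size = np.int64(3)

print(isinstance(size, int))               # False: np.int64 does not subclass int
print(isinstance(size, numbers.Integral))  # True: NumPy registers its integer types
print(isinstance(int(size), int))          # True after an explicit cast

# Converting a whole array with tolist() yields plain Python ints.
as_python = tuple(np.array([3, 3, 1], dtype=np.int64).tolist())
print(all(isinstance(v, int) for v in as_python))  # True
```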
d-v-b and others added 4 commits April 20, 2026 23:16
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erization

Use _assert_chunks_equal for the three call sites that compared a tuple of
int64 arrays against a tuple of int tuples. Add three small tests asserting
that normalize_chunks_1d returns a 1D int64 ndarray for uniform, explicit-list,
and -1 sentinel inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-element Python iteration over a 100K-element int64 array dominated
create_array runtime on the (10**8,) chunks=(1000,) regression case
(~6 ms in is_regular_nd, downstream of ChunksTuple normalization).
Dispatch on np.ndarray and use a single vectorized comparison instead.

End-to-end create_array on the regression case: ~7.9 ms -> ~0.6 ms.

Promote numpy to a runtime import (was TYPE_CHECKING-only) for the
isinstance dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Names the structure (a layout) rather than the operation that produced it.
Reads cleanly for both sharded and unsharded cases, fits the recursive
inner-layout pattern, and is what one reaches for when reading the code cold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@d-v-b d-v-b added the benchmark label (code will be benchmarked in a CI job) Apr 21, 2026
@codspeed-hq

codspeed-hq Bot commented Apr 21, 2026

Merging this PR will not alter performance

✅ 66 untouched benchmarks
⏩ 6 skipped benchmarks (see footnote 1)


Comparing d-v-b:refactor/simplify-internal-chunk-representation (4bc4678) with main (dd5a321)

Footnotes

  1. 6 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@d-v-b
Contributor Author

d-v-b commented May 5, 2026

@maxrjones your example from #3946 produces the following error message on this branch:

TypeError: Each chunk size must be an integer; got non-integer element(s) ([3, 3],) at indices (0,). Chunk sizes must be declared as a flat sequence of positive integers (e.g. [3, 3, 1]).
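
For readers curious how a message of this shape can be produced, here is a rough sketch. `check_flat_chunk_sizes` is a hypothetical validator written for illustration; it is not the code on the branch and only mimics the first sentence of the message.

```python
import numbers
from collections.abc import Sequence


def check_flat_chunk_sizes(sizes: Sequence[object]) -> tuple[int, ...]:
    """Validate a flat sequence of chunk sizes, reporting every offending position."""
    bad = [(i, v) for i, v in enumerate(sizes) if not isinstance(v, numbers.Integral)]
    if bad:
        indices = tuple(i for i, _ in bad)
        elements = tuple(v for _, v in bad)
        raise TypeError(
            "Each chunk size must be an integer; got non-integer element(s) "
            f"{elements} at indices {indices}."
        )
    return tuple(int(v) for v in sizes)


print(check_flat_chunk_sizes([3, 3, 1]))  # (3, 3, 1)

try:
    check_flat_chunk_sizes([[3, 3], 1])  # nested (RLE-style) input
except TypeError as exc:
    print(exc)
```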

@maxrjones
Member

> @maxrjones your example from #3946 produces the following error message on this branch:
>
> TypeError: Each chunk size must be an integer; got non-integer element(s) ([3, 3],) at indices (0,). Chunk sizes must be declared as a flat sequence of positive integers (e.g. [3, 3, 1]).

That's awesome, thank you! Is this branch ready for a review? I can take a look tomorrow AM if so

@d-v-b
Contributor Author

d-v-b commented May 5, 2026

yes it's ready

@d-v-b
Contributor Author

d-v-b commented May 5, 2026

flagging one behavioral change here: in main, chunks=True was accepted, as we interpreted True as the number 1:

import zarr
arr = zarr.create_array(store="memory://test", shape=(1,), chunks=True, dtype="int8")
print(arr.chunks)
# (True,)

I don't think we ever intended this. Given the choice between canonicalizing the odd behavior in main and rejecting it as an error, this PR makes chunks=True a ValueError.

@maxrjones
Member

> flagging one behavioral change here: in main, chunks=True was accepted, as we interpreted True as the number 1:
>
> import zarr
> arr = zarr.create_array(store="memory://test", shape=(1,), chunks=True, dtype="int8")
> print(arr.chunks)
> # (True,)
>
> I don't think we ever intended this. Given the choice between canonicalizing the odd behavior in main and rejecting it as an error, this PR makes chunks=True a ValueError.

OOC was this a consequence of the addition of rectilinear chunk grids or always the case? It was indeed not intended by my PR, if the former

@d-v-b
Contributor Author

d-v-b commented May 5, 2026

before the rectilinear chunks PR, we converted True to the integer value 1 in the chunks attribute. After that PR, we kept True as a boolean, but practically interpreted it as the integer 1. Both of these are weird and not really intended.
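
The underlying Python behavior is easy to demonstrate: `bool` is a subclass of `int`, so an integer check alone lets `True` through. The validator below is a hypothetical sketch of one way to reject it, not the code in this PR.

```python
import numbers

print(isinstance(True, int))               # True
print(isinstance(True, numbers.Integral))  # True
print(int(True))                           # 1


def parse_chunk_size(value: object) -> int:
    """Accept integers but reject bools (hypothetical validator)."""
    # bool must be tested first because it also passes the Integral check.
    if isinstance(value, bool):
        raise ValueError(
            f"chunk size must be an integer, got {value!r}; use 'auto' for automatic chunking"
        )
    if not isinstance(value, numbers.Integral):
        raise TypeError(f"chunk size must be an integer, got {value!r}")
    return int(value)


print(parse_chunk_size(3))  # 3
```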

Member

@maxrjones maxrjones left a comment

Thanks for continuing to work on this, I really like the direction.

I don't think the chunks=True fix is working, below is an MVCE:

# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "zarr @ git+https://github.com/d-v-b/zarr-python@refactor/simplify-internal-chunk-representation",
#     "numpy",
# ]
# ///
"""
MVCE: PR #3899 — `chunks=True` was claimed to raise ValueError but silently
produces size-1 chunks.

Author's claim (PR comment):
    "Given the choice between canonicalizing the odd behavior in `main` and
    rejecting it as an error, this PR makes `chunks=True` a `ValueError`."

Actual behavior on the PR branch: no error — the array is created with
chunks=(1, 1, ...), one element per chunk on every dimension. On a large
shape this is a serious footgun (millions of objects in the store).
"""

import zarr

# Case 1: trivial shape — author's exact example. Silently produces (1,).
arr = zarr.create_array(store={}, shape=(1,), chunks=True, dtype="int8")
print(f"shape=(1,)        chunks={arr.chunks}    write_chunk_sizes={arr.write_chunk_sizes}")
assert arr.chunks == (1,), f"expected ValueError, got chunks={arr.chunks}"

# Case 2: realistic shape — silently produces a million size-1 chunks.
arr = zarr.create_array(store={}, shape=(1_000_000,), chunks=True, dtype="int8")
print(f"shape=(1_000_000,) chunks={arr.chunks}  nchunks={arr.nchunks}")
assert arr.chunks == (1,)
assert arr.nchunks == 1_000_000

# Case 3: multi-dim — every axis gets size-1 chunks.
arr = zarr.create_array(store={}, shape=(100, 100, 100), chunks=True, dtype="int8")
print(f"shape=(100,100,100) chunks={arr.chunks} nchunks={arr.nchunks}")
assert arr.chunks == (1, 1, 1)
assert arr.nchunks == 1_000_000

# Direct call into the internal function — same path, no auto-chunking guard
# in init_array catches the bool, so it falls into the Integral branch.
from zarr.core.chunk_grids import normalize_chunks_nd

result = normalize_chunks_nd(True, (5,))
print(f"normalize_chunks_nd(True, (5,)) -> {[a.tolist() for a in result]}")
assert [a.tolist() for a in result] == [[1, 1, 1, 1, 1]]

print("\nAll asserts passed: `chunks=True` does NOT raise ValueError on this branch.")
print("Root cause: `bool` is a subclass of `int`, so:")
print("  - _is_rectilinear_chunks(True) returns False (caught by isinstance(_, int))")
print("  - init_array's `chunks is None or chunks == 'auto'` guard does not match True")
print("  - normalize_chunks_nd's `isinstance(chunks, numbers.Integral)` matches True -> int(True) == 1")

A minor thing, but it'd be nice if you could get Claude to update the bugfix.md and PR description to be up-to-date in case we need to refer back to this PR later on.

@d-v-b
Contributor Author

d-v-b commented May 6, 2026

> I don't think the chunks=True fix is working, below is an MVCE:

I hadn't pushed my local changes 🤦 now that example errors when chunks=True.

> A minor thing, but it'd be nice if you could get Claude to update the bugfix.md and PR description to be up-to-date in case we need to refer back to this PR later on.

Definitely!

@d-v-b
Contributor Author

d-v-b commented May 6, 2026

@maxrjones here's claude's summary (I prefer an addition over editing the opening post in the PR)

PR #3899 — refactor: simplify internal chunk representation

Why this PR exists

init_array's chunk/shard resolution path had accumulated branches for
auto-chunking, sharding, and (most recently) rectilinear chunks/shards, with
helper locals such as chunks_flat, shards_for_partition, rectilinear_meta,
chunk_shape_parsed, and shard_shape_parsed running in parallel. Within a
single function chunk_shape could refer to the outer grid partition or the
inner sub-chunk size depending on which branch you were in, which made the
code hard to follow.

While investigating #3898 (chunks=-1 no longer behaving as expected — the
underlying regression came from #2761), it became clear that consolidating
the resolution pipeline behind a single canonical representation would fix
the reported bug and make init_array easier to reason about going forward.

This PR does that consolidation, and folds in several related fixes that
surfaced during the refactor.

Internal architecture

A user-supplied chunk specification now flows through three explicit stages:

  1. Normalize raw user input → ChunksTuple
    (normalize_chunks_nd for explicit specs, guess_chunks for auto-chunking).
  2. Resolve chunks + shards → ChunkLayout
    (resolve_outer_and_inner_chunks).
  3. Materialize grid metadata
    (create_chunk_grid_metadata, dispatching to RegularChunkGridMetadata
    or RectilinearChunkGridMetadata).

init_array is now a composition of these three steps. The legacy
AsyncArray._create path uses the same normalize → grid pipeline.

ChunksTuple

ChunksTuple = NewType(
    "ChunksTuple", tuple[np.ndarray[tuple[int], np.dtype[np.int64]], ...]
)

One 1D int64 array per axis. Regular chunks and rectilinear chunks share this
shape — regular grids are simply the case where every value in each per-axis
array is identical (with an optional smaller boundary chunk).

The NewType brand prevents passing raw user input where validated chunks are
expected. The numpy-array element type was chosen after benchmarking. An
earlier iteration of this PR used plain tuple[int, ...] per axis, which on
shape=(10**8,) chunks=(1000,) ended up materializing 100K Python ints per
axis and ran ~16× slower end-to-end than main. Switching to np.int64
arrays restored parity, and the regularity helpers below are vectorized to
keep it that way.
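
A sketch of the uniform-chunks fast path on the case quoted above. The function name `uniform_axis_chunks` is hypothetical, and the smaller trailing boundary chunk is an assumption taken from the description of regular grids; the PR's actual normalize_chunks_1d may differ.

```python
import numpy as np


def uniform_axis_chunks(chunk: int, length: int) -> np.ndarray:
    """Per-axis chunk sizes as a 1D int64 array (illustrative only)."""
    n_full, remainder = divmod(length, chunk)
    # One allocation for all chunks instead of building n_full Python ints.
    sizes = np.full(n_full + (1 if remainder else 0), chunk, dtype=np.int64)
    if remainder:
        sizes[-1] = remainder  # smaller boundary chunk (assumed convention)
    return sizes


sizes = uniform_axis_chunks(1000, 10**8)
print(sizes.shape, sizes.dtype)  # (100000,) int64
print(uniform_axis_chunks(30, 100).tolist())  # [30, 30, 30, 10]
```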

ChunkLayout

class ChunkLayout(NamedTuple):
    outer_chunks: ChunksTuple
    inner: ChunkLayout | None = None

outer_chunks is the chunk grid as stored. inner is the sub-structure inside
each chunk: None means no sharding (the chunk is opaque), and a nested
ChunkLayout means the chunk is a shard with its own sub-grid. The recursion
is intentional — it leaves room for a future high-level API that exposes
nested sharding without another schema break.
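
To illustrate the recursion, the sketch below builds layouts with zero, one, and two levels of nesting. Plain tuples stand in for ChunksTuple, and `nesting_depth` plus all sizes are invented for the example.

```python
from typing import NamedTuple, Optional


class ChunkLayout(NamedTuple):
    outer_chunks: tuple  # stands in for ChunksTuple in this sketch
    inner: Optional["ChunkLayout"] = None


def nesting_depth(layout: ChunkLayout) -> int:
    """Number of sharding levels below the stored chunk grid (hypothetical helper)."""
    depth = 0
    while layout.inner is not None:
        depth += 1
        layout = layout.inner
    return depth


plain = ChunkLayout(outer_chunks=((50, 50),))
sharded = ChunkLayout(outer_chunks=((100,),), inner=ChunkLayout(((25, 25, 25, 25),)))
nested = ChunkLayout(
    outer_chunks=((100,),),
    inner=ChunkLayout(((50, 50),), inner=ChunkLayout(((25, 25),))),
)

print(nesting_depth(plain), nesting_depth(sharded), nesting_depth(nested))  # 0 1 2
```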

is_regular_1d / is_regular_nd

Vectorized predicates that decide between RegularChunkGridMetadata and
RectilinearChunkGridMetadata at metadata-construction time. The numpy path
short-circuits on the first mismatch; the sequence fallback iterates.
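
A vectorized regularity check could look roughly like this. `looks_regular_1d` is a hypothetical stand-in, and its definition of "regular" (all sizes equal, with the last chunk allowed to be smaller) is an assumption drawn from the ChunksTuple description above rather than from the PR's source.

```python
import numpy as np


def looks_regular_1d(sizes: np.ndarray) -> bool:
    """True when all chunks match the first one, allowing a smaller final chunk."""
    if sizes.size <= 1:
        return True
    first = sizes[0]
    # A single array comparison replaces a per-element Python loop.
    return bool(np.all(sizes[:-1] == first) and sizes[-1] <= first)


print(looks_regular_1d(np.array([30, 30, 30, 10], dtype=np.int64)))  # True
print(looks_regular_1d(np.array([10, 20], dtype=np.int64)))          # False
```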

User-visible behavior changes

The chunk-spec input grammar is tightened to a single canonical form per
dimension. The full set of changes is documented in changes/3899.bugfix.md.

| Input | Before | After |
|---|---|---|
| chunks=-1 | Not handled (#3898) | Works: full extent of axis |
| chunks=True | Produced (1, 1, …) chunks | ValueError pointing at "auto" |
| chunks=(30,) for a 3D shape | Padded with shape[len(chunks):] | ValueError: dimension mismatch |
| chunks=(30, None, None) | None interpreted as full-extent sentinel | ValueError: per-dim None rejected |
| chunks=[[3, 3], 1] (RLE form) | TypeError from a nested comparison | TypeError reporting offending indices |
| chunks=0 or list containing 0 | Error raised at later validation step | ValueError: "Chunk size must be positive" |
| Array with a 0-length dimension | Inconsistent behavior across input forms | Handled uniformly |

The "padding short tuples" and "None per-dim" forms are not exercised by
the test suite or the documented public API. After this PR, init_array
expects the caller to choose between an explicit per-dimension spec, a scalar,
or auto-chunking.

Files changed

| File | What changed |
|---|---|
| src/zarr/core/chunk_grids.py | New: ChunksTuple, ChunkLayout, SHARDED_INNER_CHUNK_MAX_BYTES, normalize_chunks_1d, normalize_chunks_nd, guess_chunks, resolve_outer_and_inner_chunks. Removed: normalize_chunks, _auto_partition, _guess_chunks (renamed to _guess_regular_chunks). |
| src/zarr/core/metadata/v3.py | New: is_regular_1d, is_regular_nd, create_chunk_grid_metadata. Removed: resolve_chunks (its responsibilities split between normalize and create). |
| src/zarr/core/array.py | init_array chunk/shard resolution simplified to a normalize → resolve → materialize pipeline. The legacy AsyncArray._create path uses the same primitives. About 50 lines of interleaved conditionals collapsed. |
| tests/conftest.py | create_array_metadata helper rewritten on top of the new primitives. |
| tests/test_chunk_grids.py | Tests reorganized around the new functions. Parametrized cases covering both happy paths (sentinels, explicit specs, rectilinear) and the new error paths (zero/negative chunks, True, length mismatch, RLE rejection, None rejection). Added a _assert_chunks_equal helper that compares ChunksTuple against tuples of int tuples. |
| tests/test_array.py | Shard auto-partition tests updated to call resolve_outer_and_inner_chunks. End-to-end test for auto-chunking + sharding using SHARDED_INNER_CHUNK_MAX_BYTES. |
| tests/test_metadata/test_v3.py | New: 21 parametrized tests for is_regular_1d (sequence + ndarray paths) and is_regular_nd. |

The non-source-code files in the diff (.github/workflows/*, pyproject.toml,
tests/test_codecs/test_sharding.py, tests/test_store/test_fsspec.py) come
from main merges that have happened during the PR's lifetime and are not
part of the chunk-representation work.

Performance

After the vectorization commits (14788aa, 33ee8a1), create_array on a
1D shape with ~100K chunks is at parity with main. Other shapes were never
measurably affected. The CodSpeed report on the PR shows no regressions
across the existing benchmark suite.

Out-of-scope follow-ups identified during review

These came up in the review thread and were deferred to keep this PR focused.

  • The legacy Array.create / AsyncArray.create paths still accept the
    looser v2-era input grammar that this PR tightens elsewhere. #3903
    (chore: remove .create methods from arrays) removes those deprecated
    methods entirely; once that lands, the legacy _create branch can be
    deleted.
  • is_regular_1d lives in metadata/v3.py but operates on the runtime
    ChunksTuple defined in chunk_grids.py. Hoisting
    create_chunk_grid_metadata into chunk_grids.py would tighten the
    module boundary, but isn't done here to keep the diff manageable.

Migration notes for downstream code

Nothing publicly exported from the zarr namespace changed. The removed
helpers (zarr.core.chunk_grids.normalize_chunks, zarr.core.metadata.v3.resolve_chunks)
lived under zarr.core.*, the documented internal namespace. If a downstream
project happens to import them, the equivalent calls are:

# old
from zarr.core.chunk_grids import normalize_chunks
chunks_t = normalize_chunks(chunks, shape, item_size)

# new
from zarr.core.chunk_grids import normalize_chunks_nd, guess_chunks
if chunks is None:
    chunks_t = guess_chunks(shape, item_size)
else:
    chunks_t = normalize_chunks_nd(chunks, shape)
# chunks_t is a ChunksTuple: one np.int64 array per axis.

# old
from zarr.core.metadata.v3 import resolve_chunks
grid = resolve_chunks(raw_chunks, shape, item_size)

# new
from zarr.core.metadata.v3 import create_chunk_grid_metadata
grid = create_chunk_grid_metadata(chunks_t)

Labels

benchmark Code will be benchmarked in a CI job.

Development

Successfully merging this pull request may close these issues.

given shape=(s,) chunks=(-1,) should mean (s,)

2 participants