Rechunker
Rechunker is a Python package designed for efficient and scalable manipulation of the chunk structure of chunked array formats, such as Zarr and TileDB. It takes an input array (or group of arrays) from persistent storage and writes out a new array with the same data but a different chunking scheme, often utilizing an intermediate temporary store. It is currently at version 0.5.4 and is actively maintained by the Pangeo community, with regular releases addressing compatibility and bug fixes.
Common errors
- cannot open result of rechunker with xarray
  cause: Compatibility issues between `rechunker` and specific `xarray` or `zarr` versions, or incorrect metadata handling during the rechunking process.
  fix: Update `rechunker`, `xarray`, and `zarr` to their latest compatible versions. Use `consolidated=True` when writing to Zarr if applicable, and ensure all attributes are properly copied (addressed in `v0.5.1`).
- rechunker object has no attribute 'persist'
  cause: Misunderstanding of the `Rechunked` object's API. The `Rechunked` object returned by `rechunk()` is a plan that must be explicitly executed; it is not a Dask array that can be persisted.
  fix: After creating the plan, call its `.execute()` method to perform the rechunking operation: `rechunked_plan = rechunk(...); result_array = rechunked_plan.execute()`.
- ZeroDivisionError in L70 of api.py
  cause: Likely an issue in internal calculations of chunk sizes or memory allocation, for example when `rechunker` determines the optimal number of chunks or copy operations.
  fix: Check that `target_chunks` and `max_mem` are valid and sensible for the input array's dimensions and dtype. Search the GitHub issues for similar reports and workarounds, especially for edge cases with very small or very large dimensions.
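Before calling `rechunk`, it can help to sanity-check `target_chunks` against the memory budget. The helpers below are a hypothetical sketch (not part of rechunker's API) that estimate the bytes occupied by a single chunk:

```python
from functools import reduce
from operator import mul

def chunk_nbytes(chunks, itemsize):
    """Estimate the memory footprint of one chunk in bytes."""
    return reduce(mul, chunks, 1) * itemsize

def check_max_mem(chunks, itemsize, max_mem_bytes):
    """Raise if a single chunk of this shape exceeds the memory budget."""
    nbytes = chunk_nbytes(chunks, itemsize)
    if nbytes > max_mem_bytes:
        raise ValueError(
            f"one chunk needs {nbytes} bytes, exceeding max_mem={max_mem_bytes}"
        )
    return True

# A (10, 5, 5) float64 chunk occupies 10 * 5 * 5 * 8 = 2000 bytes,
# comfortably below a 256 MB budget.
check_max_mem((10, 5, 5), itemsize=8, max_mem_bytes=256 * 2**20)
```

This is only a back-of-the-envelope check; rechunker's own internal accounting also covers intermediate buffers.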
Warnings
- breaking Breaking changes for `xarray` and `dask` compatibility have occurred in some minor versions. For example, `v0.5.4` includes a fix for `xarray>=2025.03.1` and `v0.5.3` for `dask>=2024.12.0` and `xarray>=2024.10.0`.
- gotcha The `rechunk` function in Dask can run out of memory for 'full rechunk' operations where every source chunk maps to every target chunk. Rechunker specifically addresses this by leveraging persistent intermediate storage, but users often confuse this with Dask's in-memory `rechunk`.
- gotcha Rechunker currently assumes uniform chunks in the input array (except possibly a smaller final chunk along each dimension). This can cause issues with Dask arrays that have been filtered, or with concatenated Zarr arrays, both of which may have non-uniform chunk sizes.
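The uniform-chunks assumption can be checked up front. A minimal sketch (a hypothetical helper, not part of rechunker), assuming chunks are given Dask-style as a tuple of per-dimension chunk-size tuples:

```python
def chunks_are_uniform(chunks):
    """Check that along every dimension all chunks share one size,
    except possibly a smaller final chunk (a regular Zarr-style grid)."""
    for dim_chunks in chunks:
        first = dim_chunks[0]
        # all interior chunks must equal the first; the last may be smaller
        if any(c != first for c in dim_chunks[:-1]) or dim_chunks[-1] > first:
            return False
    return True

# Regular grid: fine even though the last chunk is short
print(chunks_are_uniform(((4, 4, 2),)))  # True
# Irregular sizes, e.g. after filtering or concatenation: not supported
print(chunks_are_uniform(((4, 2, 4),)))  # False
```

Calling `.rechunk()` on the Dask array first is one way to restore a regular grid before handing it to rechunker.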
Install
- pip install rechunker
Imports
- rechunk
from rechunker import rechunk
Quickstart
import os
import shutil

import zarr
from rechunker import rechunk

# Create a source Zarr array
source_store = 'source.zarr'
if not os.path.exists(source_store):
    zarr.ones((10, 10, 10), chunks=(2, 2, 2), store=source_store, overwrite=True)
source = zarr.open(source_store, mode='r')

# Define target and intermediate stores
intermediate_store = 'intermediate.zarr'
target_store = 'target.zarr'

# Define the target chunking scheme (e.g., contiguous in the first dimension)
target_chunks = (10, 5, 5)

# Define the maximum memory each worker may use (e.g., 256 MB)
max_mem = '256MB'

# Create the rechunking plan; the intermediate store must be passed
# via the temp_store keyword, not positionally
rechunked_plan = rechunk(
    source,
    target_chunks,
    max_mem,
    target_store,
    temp_store=intermediate_store,
)

# Execute the plan
result = rechunked_plan.execute()
print(f"Source array chunks: {source.chunks}")
print(f"Target array chunks: {result.chunks}")

# Clean up example files
shutil.rmtree(source_store)
shutil.rmtree(intermediate_store)
shutil.rmtree(target_store)
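For intuition about what the plan above does, the chunk grid before and after can be computed directly from the shape; this is plain arithmetic, independent of rechunker:

```python
import math

def n_chunks(shape, chunks):
    """Number of chunks along each dimension, and the total chunk count."""
    per_dim = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
    return per_dim, math.prod(per_dim)

shape = (10, 10, 10)
print(n_chunks(shape, (2, 2, 2)))   # ((5, 5, 5), 125) source chunks
print(n_chunks(shape, (10, 5, 5)))  # ((1, 2, 2), 4) target chunks
```

Going from 125 small chunks to 4 large ones is exactly the many-to-many copy pattern that rechunker stages through the intermediate store.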