Icechunk
Icechunk is an open-source (Apache 2.0) transactional storage engine for tensor / ND-array data, designed for cloud object storage. It augments the Zarr core data model with features that improve performance, collaboration, and safety in a multi-user cloud-computing context. The library is currently at version 2.0.1. Its major version tracks the on-disk format, so breaking API changes may land in minor releases.
Warnings
- breaking: Icechunk 2.0.0 and later require Python 3.12 or higher. Support for Python 3.11 was dropped.
- breaking: The on-disk storage format changed in Icechunk 2.0.0. Repositories created with Icechunk 1.x must be migrated with the `upgrade_icechunk_repository()` function. This is an administrative operation that must run in isolation, with no other readers or writers active.
- breaking: Enums such as `ChunkType` had their variants renamed from `UPPER_CASE` to `snake_case` (e.g., `ChunkType.INLINE` became `ChunkType.inline`).
- gotcha: After `commit()` succeeds on a writable session, that session becomes read-only. To make and commit further changes, open a new session with `repo.writable_session()`.
- gotcha: Creating an Icechunk repository in the same location from multiple processes concurrently is not safe.
- gotcha: Icechunk's version policy allows breaking API changes in minor releases (e.g., `2.0.0` to `2.1.0`), not only in major ones, because major versions are reserved for changes to the on-disk format.
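The Python floor and the minor-release policy above can be checked at application startup. The following is a minimal sketch, not part of Icechunk: `same_minor` and `check_environment` are hypothetical helper names, and the `tested` version string stands for whichever release your own code was validated against.

```python
import sys
from importlib import metadata


def same_minor(installed: str, tested: str) -> bool:
    """Return True when two version strings share major and minor components."""
    return installed.split(".")[:2] == tested.split(".")[:2]


def check_environment(tested="2.0.1", python_version=None, installed=None):
    """Return a list of problems; an empty list means the environment looks fine.

    `python_version` and `installed` can be passed explicitly (useful in tests);
    otherwise they are read from the running interpreter and package metadata.
    """
    problems = []
    py = tuple(python_version or sys.version_info[:2])
    if py < (3, 12):
        problems.append(f"Python {py[0]}.{py[1]} found; Icechunk 2.x needs 3.12+")
    if installed is None:
        try:
            installed = metadata.version("icechunk")
        except metadata.PackageNotFoundError:
            problems.append("icechunk is not installed")
            return problems
    if not same_minor(installed, tested):
        # Minor releases may carry breaking API changes, so flag any drift.
        problems.append(f"icechunk {installed} differs in minor version from {tested}")
    return problems
```

The same constraint can be expressed at install time with a standard version specifier such as `icechunk>=2.0,<2.1`.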
Install
pip install icechunk
Imports
- Repository
from icechunk import Repository
- s3_storage
from icechunk import s3_storage
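As a rough sketch of how these two imports fit together, the helper below opens an existing repository on S3. It assumes that `s3_storage` accepts `bucket`, `prefix`, and `from_env` keyword arguments and that `Repository.open` takes a storage object, as in the Icechunk 1.x API; verify both against your installed version. `storage_options` and `open_existing_repo` are hypothetical helper names, and any bucket or prefix you pass is your own. Because concurrent creation is unsafe, the helper only opens a repository and never creates one.

```python
def storage_options(bucket: str, prefix: str) -> dict:
    """Build keyword arguments for s3_storage (assumed parameter names)."""
    # from_env=True is assumed to pick up credentials from the environment.
    return {"bucket": bucket, "prefix": prefix, "from_env": True}


def open_existing_repo(bucket: str, prefix: str):
    """Open (never create) a repository, so many workers can call this safely."""
    # Imported lazily so this module still imports where icechunk is absent.
    from icechunk import Repository, s3_storage

    storage = s3_storage(**storage_options(bucket, prefix))
    return Repository.open(storage)
```

Repository creation would then happen once, in a separate setup step, before any workers call `open_existing_repo`.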
Quickstart
import icechunk as ic
import zarr
import numpy as np
import tempfile
import os

# Create a temporary directory for the local repository
temp_dir = tempfile.TemporaryDirectory()
repo_path = os.path.join(temp_dir.name, "my_icechunk_repo")

try:
    # 1. Create a new Icechunk repository on the local filesystem
    storage = ic.local_filesystem_storage(repo_path)
    repo = ic.Repository.create(storage)
    print(f"Repository created at: {repo_path}")

    # 2. Create a writable session on the 'main' branch
    session = repo.writable_session("main")

    # 3. Access the Zarr store from the session
    store = session.store  # A zarr store

    # 4. Use Zarr to create a group and an array
    root = zarr.group(store=store)
    data = np.arange(1000).reshape(10, 10, 10)
    zarr_array = root.create_array(
        "my_data",
        shape=data.shape,
        dtype=data.dtype,
        chunks=(5, 5, 5),
    )
    zarr_array[:] = data

    # 5. Commit the changes
    snapshot_id = session.commit("Initial data commit")
    print(f"First commit successful with snapshot ID: {snapshot_id}")

    # A new session is required for further writes after a commit
    session_2 = repo.writable_session("main")
    store_2 = session_2.store
    # In zarr-python 3, `path` must be passed as a keyword argument
    zarr_array_2 = zarr.open_array(store=store_2, path="my_data", mode="r+")
    zarr_array_2[:5, :5, :5] = 999  # Overwrite a subset
    snapshot_id_2 = session_2.commit("Overwrite some values")
    print(f"Second commit successful with snapshot ID: {snapshot_id_2}")

    # 6. Explore version history (newest snapshot first)
    print("\nRepository history:")
    for snapshot in repo.ancestry(branch="main"):
        print(f"  ID: {snapshot.id}, Message: {snapshot.message}")
finally:
    # Clean up the temporary directory
    temp_dir.cleanup()
    print(f"\nCleaned up temporary directory: {temp_dir.name}")
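Because a session becomes read-only once it commits, each logical write needs its own session. One way to package that pattern is a small helper that opens a fresh session, applies a caller-supplied function to its store, and commits. This is only a sketch: `commit_with_new_session` is a hypothetical name, not an Icechunk API, and it relies solely on `repo.writable_session()`, `session.store`, and `session.commit()` as used in the quickstart.

```python
from typing import Any, Callable


def commit_with_new_session(
    repo: Any,
    branch: str,
    message: str,
    write: Callable[[Any], None],
) -> Any:
    """Open a fresh writable session, apply `write` to its store, then commit.

    Returns whatever `session.commit()` returns (the snapshot ID in the
    quickstart). The session is not reused, since it is read-only after commit.
    """
    session = repo.writable_session(branch)
    write(session.store)
    return session.commit(message)
```

With the quickstart's repository, the second write could then be expressed as a function that takes the store, opens `my_data`, and assigns `999` to the `[:5, :5, :5]` corner, passed to `commit_with_new_session(repo, "main", "Overwrite some values", ...)`.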