Icechunk

2.0.1 · active · verified Tue Apr 14

Icechunk is an open-source (Apache 2.0) transactional storage engine for tensor / ND-array data, designed for cloud object storage. It extends the Zarr core data model with features that improve performance, collaboration, and safety in multi-user cloud-computing environments. The library is currently at version 2.0.1. Its major version tracks the on-disk format, so breaking API changes can land even in minor releases.
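Because minor releases may break the API, one defensive option is to check that the installed version is in the same MAJOR.MINOR series as the version your code was tested against. The following is a minimal sketch using only the standard library; `parse_version`, `same_minor_series`, and `TESTED_VERSION` are hypothetical names introduced here for illustration and are not part of Icechunk.

```python
import re

# Version this page documents; adjust to whatever you actually tested against.
TESTED_VERSION = "2.0.1"


def parse_version(version: str) -> tuple[int, int, int]:
    """Parse a leading 'MAJOR.MINOR.PATCH' into ints; any suffix is ignored."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def same_minor_series(installed: str, tested: str = TESTED_VERSION) -> bool:
    """Return True when both versions share MAJOR and MINOR (patch may differ)."""
    return parse_version(installed)[:2] == parse_version(tested)[:2]


# A patch bump stays in the series; a minor bump does not.
print(same_minor_series("2.0.3"))
print(same_minor_series("2.1.0"))
```

In practice the installed version string can be obtained with `importlib.metadata.version("icechunk")`. The same constraint can be expressed at install time with a requirement specifier such as `icechunk>=2.0,<2.1`.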

Quickstart

This quickstart creates a local Icechunk repository, writes to it with Zarr through a writable session, and commits the changes. It then opens a second session to make further modifications, since a session cannot be written to after it has been committed, and finally prints the repository's commit history.

import icechunk as ic
import zarr
import numpy as np
import tempfile
import os

# Create a temporary directory for the local repository
temp_dir = tempfile.TemporaryDirectory()
repo_path = os.path.join(temp_dir.name, "my_icechunk_repo")

try:
    # 1. Create a new Icechunk repository on the local filesystem
    storage = ic.local_filesystem_storage(repo_path)
    repo = ic.Repository.create(storage)
    print(f"Repository created at: {repo_path}")

    # 2. Create a writable session on the 'main' branch
    session = repo.writable_session("main")

    # 3. Access the Zarr store from the session
    store = session.store # A zarr store

    # 4. Use Zarr to create a group and an array
    root = zarr.group(store=store)
    data = np.arange(1000).reshape(10, 10, 10)
    zarr_array = root.create_array(
        'my_data',
        shape=data.shape,
        dtype=data.dtype,
        chunks=(5, 5, 5)
    )
    zarr_array[:] = data

    # 5. Commit the changes
    snapshot_id = session.commit("Initial data commit")
    print(f"First commit successful with snapshot ID: {snapshot_id}")

    # A new session is required for further writes after a commit
    session_2 = repo.writable_session("main")
    store_2 = session_2.store
    # In zarr-python 3, `path` is keyword-only in open_array
    zarr_array_2 = zarr.open_array(store_2, path='my_data', mode='r+')
    zarr_array_2[:5, :5, :5] = 999 # Overwrite a subset
    snapshot_id_2 = session_2.commit("Overwrite some values")
    print(f"Second commit successful with snapshot ID: {snapshot_id_2}")

    # 6. Explore version history
    print("\nRepository history:")
    # ancestry() yields snapshot info objects, newest first
    for snapshot in repo.ancestry(branch="main"):
        print(f"  ID: {snapshot.id}, Message: {snapshot.message}")

finally:
    # Clean up the temporary directory
    temp_dir.cleanup()
    print(f"\nCleaned up temporary directory: {temp_dir.name}")
