Blosc2
Blosc2 is a high-performance compressed ndarray library for Python, using the C-Blosc2 compression backend. It provides efficient storage and manipulation of arbitrarily large N-dimensional datasets, following the Array API standard, and includes a flexible compute engine for complex calculations on compressed data. Currently at version 4.1.2, it maintains an active development pace with frequent updates and feature enhancements.
Warnings
- breaking Buffers generated with C-Blosc2 are generally not format-compatible with C-Blosc1 (i.e., forward compatibility is not supported). While C-Blosc2 is backward compatible with the C-Blosc1 API and in-memory format, users upgrading or sharing data between versions should be aware of this limitation.
- breaking The `NDArray.size` property changed its behavior in version 3.11.0. It now returns the number of elements in the array (Array API standard compliant) instead of the size of the array in bytes. Code relying on `NDArray.size` for byte size will need to be updated.
- breaking The `blosc2.concatenate()` function was renamed to `blosc2.concat()` in version 3.5.0 to align with the Array API. While `concatenate` is still available for backward compatibility, it will be removed in a future release.
- gotcha When specifying compression codecs or filters, users must pass members of the `blosc2.Codec` and `blosc2.Filter` enums, respectively, not string literals. Passing strings will result in an `AttributeError`.
- gotcha For in-memory tasks, Blosc2's overhead can sometimes make it slower than pure NumPy/Numexpr, especially on x86 CPUs. However, it consistently outperforms them for on-disk operations or on modern ARM architectures (e.g., Apple Silicon) due to its efficient use of compression and cache optimization.
- gotcha When using Blosc2 as an HDF5 filter, it is important not to activate the shuffle filter directly within HDF5. Blosc2 uses an internal SIMD shuffle that is much faster and should be handled by Blosc2 itself for optimal performance.
Install
- pip install blosc2 --upgrade
- conda install -c conda-forge python-blosc2
Imports
- blosc2
import blosc2
- NDArray
import blosc2
array = blosc2.zeros((10, 10))
- TreeStore
from blosc2 import TreeStore
- Codec
blosc2.Codec.BLOSCLZ
- Filter
blosc2.Filter.SHUFFLE
Quickstart
import os

import blosc2
import numpy as np
# Create a Blosc2 NDArray from a NumPy array
data = np.arange(1_000_000, dtype=np.float64)
ndarray = blosc2.asarray(data)
print(f"Original data size: {data.nbytes / (1024**2):.2f} MB")
print(f"Compressed data size: {ndarray.schunk.cbytes / (1024**2):.2f} MB")
# Perform a computation (e.g., sum) on the compressed array
computed_sum = ndarray.sum()
print(f"Sum of array elements: {computed_sum}")
# Decompress the array back to a NumPy array
decompressed_data = ndarray[:]
assert np.allclose(data, decompressed_data)
print("Data compressed and decompressed successfully.")
# Example of using TreeStore for hierarchical data storage
with blosc2.TreeStore("my_data.b2z", mode="w") as ts:
    ts["/group1/dataset_a"] = np.random.rand(100, 100)
    ts["/group2/dataset_b"] = blosc2.zeros((50, 50), dtype=np.int32)
print("Data stored in TreeStore 'my_data.b2z'.")
with blosc2.TreeStore("my_data.b2z", mode="r") as ts:
    ds_a = ts["/group1/dataset_a"]
    ds_b = ts["/group2/dataset_b"]
    print(f"Read dataset_a shape: {ds_a.shape}")
    print(f"Read dataset_b dtype: {ds_b.dtype}")
os.remove("my_data.b2z")