FastCDC

1.7.0 verified Fri May 01 auth: no python

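FastCDC is a pure Python implementation of the Fast Content-Defined Chunking (FastCDC) algorithm. It splits data into variable-size chunks whose boundaries are derived from the content itself rather than from fixed offsets, so a local insertion or deletion changes only the chunks near the edit. The current version is 1.7.0 and supports Python 3.7+.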

pip install fastcdc
error ModuleNotFoundError: No module named 'fastcdc'
cause fastcdc is not installed in the active Python environment.
fix Run 'pip install fastcdc' in that environment.
error TypeError: 'FastCDC' object is not callable
cause The FastCDC instance was called directly instead of through its .chunk() method.
fix Use 'cdc.chunk(data)' instead of 'cdc(data)'.
gotcha The chunk() method returns an iterator of (offset, length, hash) tuples; the chunk bytes themselves are not included.
fix Slice the original bytes with data[offset:offset+length] to get each chunk's content.
gotcha By default the hash is an 8-byte (64-bit) bytes object. At that width, collisions become a real risk when deduplicating billions of chunks.
fix For stronger hashing, pass a custom hash function to FastCDC or re-hash the chunk bytes with SHA-256.
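
The two gotchas above can be handled together: slice each chunk out of the original bytes, then re-hash it with SHA-256. The sketch below uses only the standard library and assumes the (offset, length, hash) tuple shape described above. The helper name strong_chunk_ids and the sample tuples are hypothetical stand-ins for real chunk() output, not part of the package.

```python
import hashlib

def strong_chunk_ids(data, chunks):
    """Yield (offset, length, sha256_hex) for each (offset, length, short_hash) tuple.

    Hypothetical helper: it ignores the short hash and re-hashes the chunk bytes.
    """
    for offset, length, _short_hash in chunks:
        chunk_bytes = data[offset:offset + length]  # fetch the chunk content
        yield offset, length, hashlib.sha256(chunk_bytes).hexdigest()

# Made-up stand-in for list(cdc.chunk(data)); offsets and hashes are illustrative only.
data = b"abcdefghij" * 3
chunks = [(0, 12, b"\x00" * 8), (12, 18, b"\x01" * 8)]

ids = list(strong_chunk_ids(data, chunks))
for offset, length, digest in ids:
    print(f"Offset: {offset}, Length: {length}, SHA-256: {digest}")
```

Keying a deduplication store on the SHA-256 digest instead of the 8-byte hash trades some CPU time for a far lower collision probability.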

Basic usage: create a FastCDC instance with an average chunk size, then chunk a bytes object.

from fastcdc import FastCDC

# Create a FastCDC instance with desired average chunk size
cdc = FastCDC(avg_size=4096)  # 4KB average chunk size

# Chunk a bytes object
data = b"""Some large binary data repeated many times to demonstrate chunking.""" * 100
chunks = list(cdc.chunk(data))

print(f"Number of chunks: {len(chunks)}")
for offset, length, chunk_hash in chunks[:3]:
    print(f"Offset: {offset}, Length: {length}, Hash: {chunk_hash.hex()}")
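
For intuition about why content-defined boundaries matter, here is a toy chunker written with the standard library only. It is not the FastCDC algorithm and does not use this package: the function toy_cdc, its mask and min_size parameters, and the random lookup table are all assumptions made for illustration. It cuts a chunk wherever the low bits of a small rolling value are zero, so boundaries follow the bytes rather than absolute positions.

```python
import hashlib
import random
from typing import List, Tuple

def toy_cdc(data: bytes, mask: int = 0x3F, min_size: int = 16) -> List[Tuple[int, int]]:
    """Toy content-defined chunker (illustrative only, NOT FastCDC).

    Returns a list of (offset, length) pairs that tile the input.
    """
    # Fixed pseudo-random table mapping each byte value to a 32-bit number.
    table_rng = random.Random(0)
    gear = [table_rng.getrandbits(32) for _ in range(256)]

    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Rolling value: older bytes are shifted out of the low bits.
        h = ((h << 1) + gear[byte]) & 0xFFFFFFFF
        length = i - start + 1
        # Cut when the low bits are all zero and the chunk is long enough.
        if length >= min_size and (h & mask) == 0:
            chunks.append((start, length))
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append((start, len(data) - start))  # trailing remainder
    return chunks

def chunk_digests(data: bytes, chunks: List[Tuple[int, int]]) -> set:
    """SHA-256 digests of each chunk's bytes."""
    return {hashlib.sha256(data[o:o + n]).hexdigest() for o, n in chunks}

# Deterministic sample data, then the same data with two bytes inserted at the front.
data_rng = random.Random(1)
original = bytes(data_rng.getrandbits(8) for _ in range(4096))
edited = b"XX" + original

original_chunks = toy_cdc(original)
edited_chunks = toy_cdc(edited)
shared = chunk_digests(original, original_chunks) & chunk_digests(edited, edited_chunks)
print(f"{len(shared)} of {len(original_chunks)} original chunks also appear after the edit")
```

With fixed-size chunking, the two inserted bytes would shift every later chunk and nothing would match. Here the chunks after the first boundary that both runs agree on are byte-for-byte identical, which is the property deduplication relies on.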