dirhash: Directory Hashing Utility
dirhash is a Python module and CLI tool for computing the hash of file system directories based on their structure and content. It supports all hashing algorithms available in Python's `hashlib` module, offers `.gitignore`-style glob/wildcard path matching for filtering files, and leverages multiprocessing for performance. The library computes hashes according to the Dirhash Standard, aiming for consistent and collision-resistant directory hash generation. It is actively maintained with irregular, feature-driven releases, currently at version 0.5.0.
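Since the algorithm is chosen by name from `hashlib`, it can help to check which names the local interpreter offers before picking one. The sketch below uses only the standard library; it assumes, per the description above, that any name `hashlib` reports can be passed as the algorithm argument.

```python
import hashlib

# Names that every Python build must provide.
guaranteed = sorted(hashlib.algorithms_guaranteed)

# Names provided by this particular interpreter build (may include extras
# supplied by the linked OpenSSL library).
available = sorted(hashlib.algorithms_available)

print("guaranteed:", guaranteed)
print("available: ", available)

# hashlib.new() constructs a hash object from a name given as a string.
digest = hashlib.new("sha256", b"example").hexdigest()
print(digest)
```

Choosing a name from `algorithms_guaranteed` keeps hashes reproducible across machines, because those algorithms do not depend on how Python was built.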
Warnings
- breaking `v0.4.0` removed the upper version limit on the `pathspec` dependency, so `dirhash` now uses `pathspec` versions greater than `0.10.0`. These versions treat some match/ignore patterns differently, in line with `.gitignore` behavior. Users upgrading from `v0.3.0` or earlier may therefore see different hash results for directories filtered with complex patterns.
- breaking Version `0.2.0` introduced 'significant breaking changes' from `v0.1.1`, primarily by adopting the formal 'Dirhash Standard'. This was a fundamental re-implementation that likely affected API calls and internal hash calculation logic.
- deprecated Python 2.7 support was officially dropped in version `0.3.0`. The library now requires Python 3.8 or newer.
- gotcha Windows support in `v0.5.0` is marked as 'experimental'. While `scantree>=0.0.4` was added for this purpose, users on Windows platforms may encounter platform-specific issues.
- gotcha The handling of symbolic links and empty directories, and the choice of file/directory properties (`name`, `data`, `is_link`) included in the hash, all change the resulting hash. Hashes computed with different settings will not match even for identical directories, so keep these options fixed whenever hashes are compared. By default, `name` and `data` are included, but `is_link` is not.
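To make the last point concrete, here is a toy directory hash written with the standard library only. It is a simplified sketch and not the Dirhash Standard: the function `toy_dirhash`, its `properties` argument, and the helper `make_tree` are hypothetical names invented for this illustration, and it assumes a tree without symbolic links. It shows why including or excluding `name` changes whether two directories hash the same.

```python
import hashlib
import os
import tempfile


def toy_dirhash(root, properties=("name", "data")):
    """Toy recursive directory hash (illustration only, NOT the Dirhash Standard)."""
    entry_digests = []
    for entry in os.scandir(root):
        parts = []
        if entry.is_dir():
            # A subdirectory contributes the hash of its own contents.
            parts.append("dirhash:" + toy_dirhash(entry.path, properties))
        elif "data" in properties:
            with open(entry.path, "rb") as f:
                parts.append("data:" + hashlib.sha256(f.read()).hexdigest())
        if "name" in properties:
            parts.append("name:" + entry.name)
        entry_digests.append(hashlib.sha256("\n".join(parts).encode()).hexdigest())
    # Sort per-entry digests so the result does not depend on listing order.
    return hashlib.sha256("\n".join(sorted(entry_digests)).encode()).hexdigest()


def make_tree(root, filename):
    """Create root/src/<filename> with fixed content."""
    os.makedirs(os.path.join(root, "src"))
    with open(os.path.join(root, "src", filename), "w") as f:
        f.write("print('hi')")


with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a")
    b = os.path.join(tmp, "b")
    make_tree(a, "main.py")
    make_tree(b, "renamed.py")  # same content, different file name
    same_with_names = toy_dirhash(a) == toy_dirhash(b)
    same_data_only = toy_dirhash(a, ("data",)) == toy_dirhash(b, ("data",))

print(same_with_names)  # False: the file names differ
print(same_data_only)   # True: the file contents are identical
```

The same reasoning applies to the real library: two runs are only comparable when they include the same properties and use the same link and empty-directory settings.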
Install
pip install dirhash
Imports
- dirhash
from dirhash import dirhash
Quickstart
import os
import tempfile

from dirhash import dirhash

# Create a temporary directory structure for demonstration
with tempfile.TemporaryDirectory() as tmpdir:
    test_dir = os.path.join(tmpdir, 'my_project')
    os.makedirs(os.path.join(test_dir, 'src'))
    os.makedirs(os.path.join(test_dir, 'data'))
    with open(os.path.join(test_dir, 'src', 'main.py'), 'w') as f:
        f.write('print("Hello, dirhash!")')
    with open(os.path.join(test_dir, 'data', 'config.json'), 'w') as f:
        f.write('{"key": "value"}')
    with open(os.path.join(test_dir, '.gitignore'), 'w') as f:
        f.write('*.json')

    # Calculate the MD5 hash of the entire directory
    full_md5_hash = dirhash(test_dir, 'md5')
    print(f"MD5 hash of {test_dir}: {full_md5_hash}")

    # Calculate SHA1 hash, excluding .json files using .gitignore-style patterns
    sha1_hash_no_json = dirhash(test_dir, 'sha1', ignore=['*.json'])
    print(f"SHA1 hash (excluding *.json): {sha1_hash_no_json}")

    # Calculate SHA256 hash, only including .py files
    sha256_hash_only_py = dirhash(test_dir, 'sha256', match=['*.py'])
    print(f"SHA256 hash (only *.py): {sha256_hash_only_py}")

    # Demonstrate empty directories (by default, directories with no included content are excluded)
    # First, a hash without explicitly including empty dirs
    empty_dir_path = os.path.join(test_dir, 'empty_folder')
    os.makedirs(empty_dir_path)
    hash_without_empty = dirhash(test_dir, 'md5')
    print(f"MD5 hash (without explicit empty dirs): {hash_without_empty}")

    # Now, a hash explicitly including empty dirs
    hash_with_empty = dirhash(test_dir, 'md5', empty_dirs=True)
    print(f"MD5 hash (with empty dirs): {hash_with_empty}")

# Cleanup is handled by TemporaryDirectory when the with block exits