Big Data Bag Utilities
bdbag is a Python library that extends the BagIt specification (RFC 8493) with features for big data, focusing on FAIR data principles. It enables creation, validation, and manipulation of data bags, supporting checksums, remote payload manifests, and integration with HDF5 and various compression formats. The current version is 1.8.0, and the library is actively maintained with releases as needed for bug fixes and feature enhancements.
Common errors
- FileNotFoundError: [Errno 2] No such file or directory: '/path/to/bag/data/nonexistent_file.txt'
  cause: A payload file that the bag references does not exist at the expected path on the filesystem, for example because it was moved or deleted after the bag was created, or because the directory passed to `make_bag` did not contain the expected files.
  fix: Ensure every file listed in the bag's payload manifest exists at the location bdbag expects, relative to the bag root.
- BagValidationError: Bag is invalid
  cause: The bag's integrity checks failed. Common causes include missing payload files, checksums that no longer match (files were modified after bag creation), or problems with the bag's tag files (e.g., bag-info.txt, bagit.txt).
  fix: Inspect the exception message and log output from `bdbag_api.validate_bag` for the specific reason. If files were altered intentionally, regenerate the checksums; otherwise restore the missing or modified files.
- ImportError: cannot import name 'BDBag' from 'bdbag'
  cause: `BDBag` is not exported from the top level of the `bdbag` package, so `from bdbag import BDBag` fails. A related `ModuleNotFoundError` appears if `bdbag` is not installed or is installed incorrectly.
  fix: Ensure `bdbag` is correctly installed (`pip install bdbag`). For the high-level functions, use `from bdbag import bdbag_api`. If you need the `BDBag` class itself, import it from the submodule that defines it (believed to be `bdbag.bdbagit`; verify against your installed release).
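
The two most frequent validation failures named above (missing payload files and checksum mismatches) can be checked by hand before digging through logs. The following is a minimal sketch that uses only the standard library and the manifest layout from the BagIt specification (one `<checksum> <path>` pair per line in `manifest-<alg>.txt`). `diagnose_manifest` is a hypothetical helper, not part of bdbag, and it does not handle percent-encoded paths or remote files listed in `fetch.txt` (unfetched remote files would simply be reported as missing).

```python
import hashlib
import os


def diagnose_manifest(bag_dir, alg="sha256"):
    """Return (missing, mismatched) payload paths listed in manifest-<alg>.txt."""
    manifest_path = os.path.join(bag_dir, f"manifest-{alg}.txt")
    missing, mismatched = [], []
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            line = line.strip()
            if not line:
                continue
            # Each manifest line is "<checksum> <path relative to the bag root>".
            expected, rel_path = line.split(None, 1)
            file_path = os.path.join(bag_dir, *rel_path.split("/"))
            if not os.path.isfile(file_path):
                missing.append(rel_path)
                continue
            digest = hashlib.new(alg)
            with open(file_path, "rb") as payload:
                # Hash in chunks so large payload files need not fit in memory.
                for chunk in iter(lambda: payload.read(1024 * 1024), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected.lower():
                mismatched.append(rel_path)
    return missing, mismatched
```

Calling `diagnose_manifest("my_test_bag")` on a bag directory returns two lists, which is usually enough to tell whether files need to be restored or checksums regenerated.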
Warnings
- gotcha bdbag strictly enforces Python version compatibility. It requires Python versions 3.8 through 3.11. Using it with unsupported versions (e.g., Python 3.7 or Python 3.12+) can lead to `ImportError`, `ModuleNotFoundError`, or unexpected runtime errors due to dependency conflicts or syntax incompatibilities.
- gotcha `bdbag_api.make_bag` bags a directory in place: its first argument (`bag_path`) is the directory to convert, and any files already in that directory are moved into the bag's internal `data/` directory. There is no separate source-directory argument, so stage your payload files in that directory before calling `make_bag`. If the directory is empty, the result is a bag with an empty payload; to add files later, place them under `data/` and call `bdbag_api.make_bag(bag_path, update=True)` to regenerate the manifests.
- gotcha Checksum algorithms are selected with the `algs` parameter of `bdbag_api.make_bag`, passed as a list (e.g., `algs=['md5', 'sha256']`). If `algs` is omitted, bdbag falls back to the algorithms listed in its configuration file, so pass the list explicitly whenever downstream consumers depend on a particular algorithm.
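
Because an unsupported interpreter tends to surface as an obscure import or dependency error, it can be clearer to fail early. The sketch below checks the running interpreter against the range stated in the first warning (3.8 through 3.11). The constants and function names are illustrative assumptions, not part of bdbag, and the range should be adjusted if your installed release documents a different one.

```python
import sys

# Supported range as stated in this document (inclusive). Adjust if the
# release you install documents a different range.
MIN_SUPPORTED = (3, 8)
MAX_SUPPORTED = (3, 11)


def is_supported_python(version=None):
    """Return True if (major, minor) falls inside the documented range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_SUPPORTED <= (major, minor) <= MAX_SUPPORTED


def require_supported_python():
    """Raise early with a clear message instead of failing later on import."""
    if not is_supported_python():
        found = ".".join(str(part) for part in sys.version_info[:3])
        raise RuntimeError(
            f"Python {found} is outside the documented range "
            f"{MIN_SUPPORTED[0]}.{MIN_SUPPORTED[1]} to "
            f"{MAX_SUPPORTED[0]}.{MAX_SUPPORTED[1]}"
        )
```

Calling `require_supported_python()` at the top of a script, before importing bdbag, turns a confusing failure into an explicit one.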
Install
- pip install bdbag
Imports
- BDBag
from bdbag.bdbagit import BDBag
- bdbag_api
wrong: import bdbag.api
correct: from bdbag import bdbag_api
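
When an import fails, it helps to distinguish "the package is not installed" from "the name is not where I looked". The following sketch uses the standard library's `importlib.util.find_spec` to ask whether a module can be located; `module_available` is a hypothetical helper name.

```python
import importlib.util


def module_available(name):
    """Return True if the import system can locate the module `name`."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises when a parent package is missing or the name is malformed.
        return False
```

For example, `module_available("bdbag")` returning False points to an installation problem, while `module_available("bdbag")` being True and `module_available("bdbag.api")` being False points to a wrong module path.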
Quickstart
import os
import shutil

from bdbag import bdbag_api

# Directory that will be converted into a bag in place
bag_dir = "my_test_bag"
test_file_path = os.path.join(bag_dir, "example.txt")

# Clean up previous run if directory exists
if os.path.exists(bag_dir):
    shutil.rmtree(bag_dir)

# 1. Create the directory that will become the bag
os.makedirs(bag_dir, exist_ok=True)

# 2. Create some data to put into the bag
with open(test_file_path, "w") as f:
    f.write("This is some example data for the bdbag.\n")
    f.write("It will be bagged and validated.\n")
print(f"Created test data at: {test_file_path}")

try:
    # 3. Create the bag in place. Files already in bag_dir are moved into the
    #    bag's 'data' directory and checksum manifests are written.
    #    `algs` selects the checksum algorithms.
    bag = bdbag_api.make_bag(bag_dir, algs=["sha256"])
    print(f"Bag created successfully at: {bag.path}")

    # 4. Validate the bag. validate_bag signals an invalid bag by raising an
    #    exception rather than by returning False.
    bdbag_api.validate_bag(bag_dir)
    print(f"Bag '{bag_dir}' is valid.")
except Exception as e:
    print(f"An error occurred during bag creation or validation: {e}")
finally:
    # Optional: clean up the created bag directory.
    # Uncomment the lines below to remove the directory after inspection.
    # if os.path.exists(bag_dir):
    #     shutil.rmtree(bag_dir)
    pass
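
After the quickstart runs, it can be useful to look at what ended up on disk. The sketch below uses only the standard library and the file names defined by the BagIt specification (`bagit.txt`, `manifest-<alg>.txt`, `tagmanifest-<alg>.txt`, `fetch.txt`, and the `data/` payload directory). `describe_bag` is a hypothetical helper, not a bdbag function, and it only reports layout; it does not verify checksums.

```python
import os


def describe_bag(bag_dir):
    """Summarise the on-disk layout of a bag directory (BagIt file naming)."""
    entries = sorted(os.listdir(bag_dir))
    payload = []
    data_root = os.path.join(bag_dir, "data")
    for root, _dirs, files in os.walk(data_root):
        for name in files:
            full = os.path.join(root, name)
            # Report payload paths relative to the bag root, using "/" as manifests do.
            payload.append(os.path.relpath(full, bag_dir).replace(os.sep, "/"))
    return {
        "has_declaration": "bagit.txt" in entries,
        "payload_manifests": [
            e for e in entries if e.startswith("manifest-") and e.endswith(".txt")
        ],
        "tag_manifests": [
            e for e in entries if e.startswith("tagmanifest-") and e.endswith(".txt")
        ],
        "has_fetch": "fetch.txt" in entries,
        "payload_files": sorted(payload),
    }
```

Running `describe_bag("my_test_bag")` after the quickstart should list `bagit.txt` as present and show the example file under `data/`.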