DVC Data Management Subsystem
dvc-data is DVC's core data management subsystem. It provides hashing, indexing, and caching, and handles interaction with various storage backends. As a foundational library for DVC, it receives frequent patch and minor releases that improve performance, expand compatibility, and fix bugs.
Warnings
- breaking: DVC 3.0 changed file hashing (CRLF conversion was removed) and cache storage locations. Old '.dvc' files containing deprecated keys (like 'metric', 'param', 'plot', 'cmd') will cause validation errors, and older cache structures are incompatible.
- gotcha: The `bulk_exists` method in earlier 3.x versions had issues with filesystems that do not implement `ls` and with duplicate hashes, potentially returning incorrect results during data existence checks.
- gotcha: Earlier versions might mishandle protocol information when filtering changed metadata in `_filter_changed`, potentially leading to inaccurate detection of data changes, especially with custom protocols.
- gotcha: DVC commands such as `dvc pull`, `fetch`, or `push` (which rely on `dvc-data`) can fail with '[Errno 24] Too many open files' when run with a high `--jobs` value, especially on macOS with S3 remotes.
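The effect of the hashing change in the first warning can be illustrated with plain `hashlib`. This is a minimal sketch and not dvc-data's actual implementation: the helper names `md5_raw` and `md5_crlf_normalized` are hypothetical, and the normalization shown (replacing CRLF with LF before hashing) is an assumption about what "CRLF conversion" meant in older versions.

```python
import hashlib


def md5_raw(data: bytes) -> str:
    """Hash the bytes exactly as stored (no line-ending conversion)."""
    return hashlib.md5(data).hexdigest()


def md5_crlf_normalized(data: bytes) -> str:
    """Hash after converting CRLF to LF (hypothetical legacy-style behaviour)."""
    return hashlib.md5(data.replace(b"\r\n", b"\n")).hexdigest()


unix_text = b"a,b\n1,2\n"
windows_text = b"a,b\r\n1,2\r\n"

# With normalization, both variants hash to the same value.
same_when_normalized = md5_crlf_normalized(windows_text) == md5_raw(unix_text)

# Without normalization, the same file with different line endings
# produces a different hash.
same_when_raw = md5_raw(windows_text) == md5_raw(unix_text)

print(same_when_normalized, same_when_raw)
```

Under these assumptions, a text file saved with Windows line endings can map to a different hash after the upgrade, which is one reason hashes recorded by older versions may not match.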
Install
- pip install dvc-data
- pip install "dvc-data[s3]"
- pip install "dvc-data[all]"
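To confirm which release ended up in the environment after installing, the standard library's `importlib.metadata` can be queried. This is a small sketch; the helper name `installed_version` is hypothetical, and it assumes the distribution is registered under the name `dvc-data`, as in the commands above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints a version string if dvc-data is installed, otherwise None.
print(installed_version("dvc-data"))
```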
Imports
- DataIndex
from dvc_data.index import DataIndex
- HashFile
from dvc_data.hashfile.obj import HashFile
- HashFileDB
from dvc_data.hashfile.db import HashFileDB
Quickstart
import os

from dvc_data.hashfile.hash import file_md5
from dvc_data.hashfile.hash_info import HashInfo
from dvc_data.index import DataIndex, DataIndexEntry

# Create a dummy file
with open('data.txt', 'w') as f:
    f.write('hello dvc-data world')

# Calculate its MD5 hash (file_md5 uses the local filesystem by default)
md5_hash = file_md5('data.txt')

# Create a DataIndexEntry for the file; keys are tuples of path parts
entry = DataIndexEntry(
    key=('data.txt',),
    hash_info=HashInfo('md5', md5_hash),
)

# Add the entry to a DataIndex
index = DataIndex()
index[('data.txt',)] = entry

print(f"DataIndex contains 'data.txt': {('data.txt',) in index}")
print(f"MD5 for 'data.txt': {index[('data.txt',)].hash_info.value}")

# Clean up
os.remove('data.txt')
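The Quickstart addresses the index with a tuple key, `('data.txt',)`, rather than a path string. For nested files, such a key can be derived from a relative POSIX-style path with the standard library alone. This is a sketch under that assumption; `path_to_key` and `key_to_path` are hypothetical helpers, not part of dvc-data.

```python
from pathlib import PurePosixPath
from typing import Tuple


def path_to_key(path: str) -> Tuple[str, ...]:
    """Split a relative POSIX-style path into a tuple of its parts."""
    return PurePosixPath(path).parts


def key_to_path(key: Tuple[str, ...]) -> str:
    """Join a tuple key back into a relative POSIX-style path."""
    return "/".join(key)


key = path_to_key("dir/sub/data.txt")
print(key)
print(key_to_path(key))
```

The helpers are meant for relative paths only; an absolute path would contribute a leading `'/'` part to the tuple.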