DVC Data Management Subsystem

3.18.3 · active · verified Sat Apr 11

dvc-data is DVC's core data management subsystem. It provides hashing, indexing, caching, and access to various storage backends. As a foundational library for DVC, it receives frequent patch and minor releases that improve performance, expand compatibility, and fix bugs.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates the core functionality of `dvc-data` by creating a `DataIndex` and a `DataIndexEntry` for a local file, simulating how DVC tracks data. It uses `dvc_data.hashfile.hash.file_md5` to calculate the MD5 hash and stores it within the index entry.

import os

from dvc_objects.fs.local import localfs
from dvc_data.hashfile.hash import file_md5
from dvc_data.hashfile.hash_info import HashInfo
from dvc_data.index import DataIndex, DataIndexEntry

# Note: the names below assume the dvc-data 3.x API; check them against your installed version.

# Create a dummy file
with open('data.txt', 'w') as f:
    f.write('hello dvc-data world')

# Calculate its MD5 hash; the second argument is a filesystem object, not os.fspath
md5_hash = file_md5('data.txt', localfs)

# Create a DataIndexEntry for the file; entries are identified by a tuple key
entry = DataIndexEntry(
    key=('data.txt',),
    hash_info=HashInfo('md5', md5_hash),
)

# Add the entry to a DataIndex under the same key
index = DataIndex()
index[('data.txt',)] = entry

print(f"Entries in DataIndex: {len(index)}")
print(f"MD5 for 'data.txt': {index[('data.txt',)].hash_info.value}")

# Clean up
os.remove('data.txt')
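
For intuition about the hashing step, the following is a minimal standard-library sketch of computing a file's MD5 digest in chunks. It is not dvc-data's implementation, which may differ in chunk size, filesystem handling, and progress callbacks. The helper name `md5_of_file` and its `chunk_size` parameter are hypothetical and exist only for this illustration.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks.

    Hypothetical helper for illustration; not part of dvc-data.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (EOF)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage: hash a temporary file with the same content as the quickstart
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "w") as f:
        f.write("hello dvc-data world")
    print(md5_of_file(path))
```

Reading in chunks keeps memory use bounded regardless of file size, which matters for the large data files DVC typically tracks.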
