{"id":2940,"library":"dvc-data","title":"DVC Data Management Subsystem","description":"dvc-data is DVC's core data management subsystem. It provides hashing, indexing, caching, and access to various storage backends. As a foundational library for DVC, it receives frequent patch and minor releases to enhance performance, expand compatibility, and fix bugs.","status":"active","version":"3.18.3","language":"en","source_language":"en","source_url":"https://github.com/iterative/dvc-data","tags":["data versioning","data management","dvc","filesystem","hashing","mlops"],"install":[{"cmd":"pip install dvc-data","lang":"bash","label":"Basic Install"},{"cmd":"pip install \"dvc-data[s3]\"","lang":"bash","label":"Install with S3 support"},{"cmd":"pip install \"dvc-data[all]\"","lang":"bash","label":"Install with all cloud support"}],"dependencies":[{"reason":"Provides core object storage abstractions; dvc-data builds upon it.","package":"dvc-objects","optional":false},{"reason":"Abstracts away filesystem operations, used for various storage backends.","package":"fsspec","optional":false},{"reason":"Required for S3 remote storage (installed via `dvc-data[s3]` or `dvc-data[all]`).","package":"boto3","optional":true},{"reason":"Required for Azure remote storage (installed via `dvc-data[azure]` or `dvc-data[all]`).","package":"azure-storage-blob","optional":true},{"reason":"Required for Google Cloud Storage (installed via `dvc-data[gs]` or `dvc-data[all]`).","package":"gcsfs","optional":true}],"imports":[{"note":"Used for managing collections of data entries, often representing directories.","symbol":"DataIndex","correct":"from dvc_data.index import DataIndex"},{"note":"Represents a single file with content-addressable hashing information.","symbol":"HashFile","correct":"from dvc_data.hashfile.obj import HashFile"},{"note":"Manages a database of HashFile objects, similar to Git's object store.","symbol":"HashFileDB","correct":"from dvc_data.hashfile.db import 
HashFileDB"}],"quickstart":{"code":"import os\n\nfrom dvc_data.hashfile.hash import file_md5\nfrom dvc_data.hashfile.hash_info import HashInfo\nfrom dvc_data.index import DataIndex, DataIndexEntry\nfrom dvc_objects.fs.local import localfs\n\n# Create a dummy file\nwith open('data.txt', 'w') as f:\n    f.write('hello dvc-data world')\n\n# Calculate its MD5 hash on the local filesystem\nmd5_hash = file_md5('data.txt', localfs)\n\n# Create a DataIndexEntry for the file; index keys are tuples of path parts\nkey = ('data.txt',)\nentry = DataIndexEntry(key=key, hash_info=HashInfo('md5', md5_hash))\n\n# Add the entry to a DataIndex\nindex = DataIndex()\nindex[key] = entry\n\nprint(f\"DataIndex contains 'data.txt': {key in index}\")\nprint(f\"MD5 for 'data.txt': {index[key].hash_info.value}\")\n\n# Clean up\nos.remove('data.txt')","lang":"python","description":"This quickstart demonstrates the core functionality of `dvc-data` by creating a `DataIndex` and a `DataIndexEntry` for a local file, simulating how DVC tracks data. It uses `dvc_data.hashfile.hash.file_md5` to calculate the MD5 hash and stores it in the index entry as a `HashInfo`."},"warnings":[{"fix":"Run `dvc cache migrate` to move existing DVC 2.x cache data to the 3.x format. Manually remove deprecated keys from `.dvc` files or regenerate them using `dvc add` if issues persist. Ensure consistent line endings in text files across platforms.","message":"DVC 3.0 introduced significant changes to file hashing (removed CRLF conversion) and cache storage locations. 
Old '.dvc' files containing deprecated keys (like 'metric', 'param', 'plot', 'cmd') will cause validation errors, and older cache structures are incompatible.","severity":"breaking","affected_versions":">=3.0.0"},{"fix":"Upgrade to `dvc-data` version 3.18.3 or newer to benefit from fixes for `bulk_exists` behavior across various filesystems and scenarios.","message":"The `bulk_exists` method in earlier 3.x versions had issues with filesystems that do not implement `ls` and with duplicate hashes, potentially leading to incorrect results during data existence checks.","severity":"gotcha","affected_versions":"<3.18.3"},{"fix":"Upgrade to `dvc-data` version 3.18.3 or newer to ensure correct `protocol` propagation in metadata operations.","message":"Earlier versions might mishandle protocol information when filtering changed metadata in `_filter_changed`, potentially leading to inaccurate detection of data changes, especially with custom protocols.","severity":"gotcha","affected_versions":"<3.18.3"},{"fix":"Increase the operating system's open file descriptor limit (`ulimit -n` on UNIX-like systems) or use a lower value for the `--jobs` parameter.","message":"DVC commands such as `dvc pull`, `dvc fetch`, and `dvc push` (which rely on `dvc-data`) can fail with '[Errno 24] Too many open files', especially on macOS with S3 remotes and a high `--jobs` value.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}