{"id":2941,"library":"dvc","title":"Data Version Control (DVC)","description":"DVC (Data Version Control) extends Git to handle large files and machine learning pipelines, providing version control for datasets and models, and enabling reproducible ML workflows. It stores data and model files in a cache outside of Git, supporting various remote storage platforms (S3, Azure, Google Cloud, SSH, etc.). The current version is 3.67.1, with frequent releases.","status":"active","version":"3.67.1","language":"en","source_language":"en","source_url":"https://github.com/iterative/dvc","tags":["data versioning","MLOps","data management","reproducibility","machine learning","git"],"install":[{"cmd":"pip install dvc","lang":"bash","label":"Core DVC"},{"cmd":"pip install \"dvc[s3]\"","lang":"bash","label":"With S3 support (add other extras like [gs], [azure], [ssh], or [all])"}],"dependencies":[{"reason":"Requires Python >=3.9","package":"python","optional":false},{"reason":"Core DVC component for data handling","package":"dvc-data","optional":false},{"reason":"Required for `.dvcignore` and file pattern matching.","package":"pathspec","optional":false},{"reason":"Needed for `dvc version` to report file system information accurately.","package":"psutil","optional":true}],"imports":[{"note":"The primary module for programmatic interaction with DVC-tracked data and experiments.","symbol":"dvc.api","correct":"import dvc.api"}],"quickstart":{"code":"import os\nimport subprocess\nimport dvc.api\n\n# --- CLI Setup (normally run in shell) ---\n# This part simulates initial DVC project setup if not already done.\n# In a real scenario, you'd run these in your terminal.\n\ndef setup_dvc_project():\n    if not os.path.exists('dvc_quickstart_repo'):\n        os.makedirs('dvc_quickstart_repo')\n    os.chdir('dvc_quickstart_repo')\n\n    if not os.path.exists('.git'):\n        subprocess.run(['git', 'init', '-b', 'main'], check=True)\n    \n    # Ensure dvc is initialized\n    if not 
os.path.exists('.dvc'):\n        subprocess.run(['dvc', 'init'], check=True)\n        # Commit only on first initialization; `git commit` fails when nothing is staged\n        subprocess.run(['git', 'add', '.dvcignore', '.dvc/config', '.dvc/.gitignore'], check=True)\n        subprocess.run(['git', 'commit', '-m', 'Initialize DVC'], check=True)\n\n    # Track the data file only once so that re-running this script does not fail\n    if not os.path.exists('data.csv.dvc'):\n        # Create a dummy data file\n        with open('data.csv', 'w') as f:\n            f.write('col1,col2\\n1,A\\n2,B\\n3,C\\n')\n\n        # Add data to DVC, then commit the .dvc file and the generated .gitignore to Git\n        subprocess.run(['dvc', 'add', 'data.csv'], check=True)\n        subprocess.run(['git', 'add', 'data.csv.dvc', '.gitignore'], check=True)\n        subprocess.run(['git', 'commit', '-m', 'Add data.csv'], check=True)\n\n    print(\"DVC project setup complete in 'dvc_quickstart_repo'\")\n    os.chdir('..') # Go back to original directory\n\n# Run the setup\nsetup_dvc_project()\n\n# --- Python API Usage ---\n# Now, demonstrate reading the DVC-tracked file programmatically\nrepo_path = 'dvc_quickstart_repo'\nfile_path = 'data.csv'\n\ntry:\n    # Read the content of the DVC-tracked file\n    # dvc.api will automatically handle fetching from cache or remote if needed\n    content = dvc.api.read(\n        path=file_path,\n        repo=repo_path,\n        rev='HEAD' # Or a specific Git commit/tag/branch\n    )\n    print(f\"\\nContent of {file_path} from DVC repo '{repo_path}':\\n{content}\")\n\n    # Example: Reading a specific parameter from params.yaml if it existed\n    # (This example assumes no params.yaml is set up in the quickstart for simplicity)\n    # params_content = dvc.api.read(path='params.yaml', repo=repo_path, rev='HEAD')\n    # import yaml\n    # params = yaml.safe_load(params_content)\n    # print(f\"Parameters: {params}\")\n\nexcept Exception as e:\n    print(f\"An error occurred while reading DVC-tracked file: {e}\")\n\n","lang":"python","description":"This quickstart first sets up a minimal DVC project by running the `git` and `dvc` CLI commands through `subprocess`, initializing DVC within a Git repository and tracking a `data.csv` file. 
It then demonstrates how to use the `dvc.api.read()` function in Python to programmatically access the content of the DVC-tracked file."},"warnings":[{"fix":"To restore the previous recursive behavior when checking a directory with `dvc status --cloud`, add the `--recursive` option.","message":"In DVC 3.63.0, the `dvc status --cloud` command changed its behavior for directory targets. It now treats the path as a specific dataset, rather than recursively searching for `.dvc` and `dvc.yaml` files within it.","severity":"breaking","affected_versions":">=3.63.0"},{"fix":"After migrating your local repository to 3.x (e.g., using `dvc cache migrate --dvc-files`), you may need to re-upload all 3.x data to your remote storage for consistency. Consider the impact on remote storage and network usage.","message":"Upgrading from DVC 2.x to 3.x involves a change in how file hashes are calculated. This means that a minor change to a file within a DVC-tracked directory can trigger a full migration of the entire directory to the 3.x hashing scheme.","severity":"breaking","affected_versions":"3.x (from 2.x)"},{"fix":"Use `dvc add` for initial data versioning and `dvc commit` only when data or pipelines are in a stable, significant state. For intermediate or pipeline-generated outputs, rely on DVC's pipeline system (`dvc.yaml`), which handles caching automatically. Use the `--no-commit` option with `dvc add` or `dvc repro` if you want to track data without immediately caching it (`dvc run` was removed in DVC 3.0).","message":"It is a common beginner mistake to use `dvc add` too broadly or to run `dvc commit` too frequently alongside every `git commit`. 
This can lead to tracked pipeline outputs being re-added as data, or excessive cache usage for minor changes.","severity":"gotcha","affected_versions":"All"},{"fix":"If encountering `pathspec` dependency conflicts around DVC 3.66.0, upgrade to DVC 3.67.0 or later, which includes `pathspec v1` support, or adjust your environment to ensure compatible `pathspec` versions for all tools.","message":"DVC 3.66.0 introduced a restriction on the `pathspec` dependency to `<1`. This could cause conflicts if other tools in your environment (e.g., `black` formatter) required `pathspec>=1`. DVC 3.67.0 subsequently added support for `pathspec v1`.","severity":"gotcha","affected_versions":"3.66.0"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}