Data Version Control (DVC)
DVC (Data Version Control) extends Git to handle large files and machine learning pipelines. It provides version control for datasets and models and enables reproducible ML workflows. Data and model files are stored in a cache outside of Git, and the cache can be synchronized with various remote storage platforms (S3, Azure, Google Cloud, SSH, etc.). The current version is 3.67.1, with frequent releases.
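To make the cache idea concrete, the sketch below computes where a file's content would live in a content-addressed cache. It assumes the DVC 3.x local layout (`.dvc/cache/files/md5/<first two hex digits>/<remaining digits>`) and a plain MD5 over the raw bytes. The helpers `md5_of_bytes` and `cache_relpath` are hypothetical names written for illustration and are not part of DVC's API.

```python
import hashlib
from pathlib import PurePosixPath


def md5_of_bytes(data: bytes) -> str:
    """Return the hex MD5 digest of raw bytes (no line-ending normalization)."""
    return hashlib.md5(data).hexdigest()


def cache_relpath(digest: str) -> str:
    """Map a hex digest to a two-level, content-addressed relative path.

    Assumption: this mirrors the DVC 3.x layout under .dvc/cache and is
    illustrative only.
    """
    return str(PurePosixPath("files", "md5", digest[:2], digest[2:]))


if __name__ == "__main__":
    # Same bytes as the quickstart's data.csv
    digest = md5_of_bytes(b"col1,col2\n1,A\n2,B\n3,C\n")
    print(cache_relpath(digest))
```

Because the path depends only on the content, two identical files map to the same cache entry, and Git only needs to track the small `.dvc` file that records the digest.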
Warnings
- breaking: In DVC 3.63.0, the `dvc status --cloud` command changed its behavior for directory targets. It now treats the path as a specific dataset rather than recursively searching for `.dvc` and `dvc.yaml` files within it.
- breaking: Upgrading from DVC 2.x to 3.x changes how file hashes are calculated. As a result, even a minor change to one file within a DVC-tracked directory can trigger a full migration of the entire directory to the 3.x hashing scheme.
- gotcha: A common beginner mistake is to use `dvc add` too broadly, or to run `dvc commit` alongside every `git commit`. This can cause tracked pipeline outputs to be re-added as data, or inflate cache usage for minor changes.
- gotcha: DVC 3.66.0 restricted its `pathspec` dependency to `<1`. This could cause conflicts if other tools in your environment (e.g., the `black` formatter) required `pathspec>=1`. DVC 3.67.0 subsequently added support for `pathspec` v1.
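The `pathspec` warning above can be turned into a small pre-install check. This is a minimal sketch that assumes the pin applies to every release from 3.66.0 up to, but not including, 3.67.0. Both `parse_version` and `pathspec_conflict` are hypothetical helpers, and the version parsing is deliberately simplistic.

```python
def parse_version(text: str) -> tuple:
    """Parse 'X.Y.Z' into a tuple of ints, ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def pathspec_conflict(dvc_version: str, pathspec_version: str) -> bool:
    """Return True if a dependency-resolver conflict is expected.

    Assumption (taken from the warning above): only DVC releases in the
    range [3.66.0, 3.67.0) pin pathspec to <1.
    """
    dvc = parse_version(dvc_version)
    pinned = (3, 66, 0) <= dvc < (3, 67, 0)
    return pinned and parse_version(pathspec_version)[0] >= 1


if __name__ == "__main__":
    print(pathspec_conflict("3.66.0", "1.0.0"))
    print(pathspec_conflict("3.67.1", "1.0.0"))
```

The check only models the declared dependency pin. It says nothing about whether DVC releases older than 3.66.0 actually work with `pathspec` v1.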
Install
- pip install dvc
- pip install "dvc[s3]"
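After installing, it can be useful to confirm what actually landed in the environment. The sketch below uses the standard library's `importlib.metadata`. It assumes the `s3` extra pulls in a distribution named `dvc-s3`, and `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: "dvc-s3" is the distribution installed by the "dvc[s3]" extra.
    for name in ("dvc", "dvc-s3"):
        found = installed_version(name)
        print(f"{name}: {found if found else 'not installed'}")
```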
Imports
- dvc.api
import dvc.api
Quickstart
import os
import subprocess
import dvc.api

# --- CLI Setup (normally run in shell) ---
# This part simulates initial DVC project setup if not already done.
# In a real scenario, you'd run these commands in your terminal.
# Note: `git commit` requires user.name and user.email to be configured.
def setup_dvc_project():
    original_dir = os.getcwd()
    os.makedirs('dvc_quickstart_repo', exist_ok=True)
    os.chdir('dvc_quickstart_repo')
    try:
        if not os.path.exists('.git'):
            subprocess.run(['git', 'init', '-b', 'main'], check=True)
        # Ensure DVC is initialized
        if not os.path.exists('.dvc'):
            subprocess.run(['dvc', 'init'], check=True)
            subprocess.run(['git', 'add', '.dvcignore', '.dvc/config', '.dvc/.gitignore'], check=True)
            subprocess.run(['git', 'commit', '-m', 'Initialize DVC'], check=True)
        # Create a dummy data file
        with open('data.csv', 'w') as f:
            f.write('col1,col2\n1,A\n2,B\n3,C\n')
        # Add data to DVC, then commit the .dvc file and the .gitignore
        # entry that `dvc add` creates for data.csv
        subprocess.run(['dvc', 'add', 'data.csv'], check=True)
        subprocess.run(['git', 'add', 'data.csv.dvc', '.gitignore'], check=True)
        subprocess.run(['git', 'commit', '-m', 'Add data.csv'], check=True)
        print("DVC project setup complete in 'dvc_quickstart_repo'")
    finally:
        os.chdir(original_dir)  # Go back to the original directory, even on failure

# Run the setup
setup_dvc_project()

# --- Python API Usage ---
# Now, demonstrate reading the DVC-tracked file programmatically
repo_path = 'dvc_quickstart_repo'
file_path = 'data.csv'

try:
    # Read the content of the DVC-tracked file
    # dvc.api will automatically handle fetching from cache or remote if needed
    content = dvc.api.read(
        path=file_path,
        repo=repo_path,
        rev='HEAD'  # Or a specific Git commit/tag/branch
    )
    print(f"\nContent of {file_path} from DVC repo '{repo_path}':\n{content}")

    # Example: Reading a specific parameter from params.yaml if it existed
    # (This example assumes no params.yaml is set up in the quickstart for simplicity)
    # params_content = dvc.api.read(path='params.yaml', repo=repo_path, rev='HEAD')
    # import yaml
    # params = yaml.safe_load(params_content)
    # print(f"Parameters: {params}")
except Exception as e:
    print(f"An error occurred while reading DVC-tracked file: {e}")
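
`dvc.api.read` loads the whole file into memory. For larger files, `dvc.api.open` returns a file-like object that can be consumed incrementally inside a `with` block. The sketch below wraps it in a hypothetical helper, `read_tracked_csv`. The `opener` parameter is an illustrative injection point (not a DVC feature) so the function can be exercised without a DVC repository.

```python
import csv


def read_tracked_csv(path, repo=None, rev=None, opener=None):
    """Stream a DVC-tracked CSV file and return its rows as dictionaries.

    `opener` is a hypothetical hook added for testing. By default the
    function uses dvc.api.open, imported lazily so this module can be
    imported even where DVC is not installed.
    """
    if opener is None:
        import dvc.api
        opener = dvc.api.open
    # dvc.api.open must be used as a context manager
    with opener(path, repo=repo, rev=rev) as f:
        return list(csv.DictReader(f))
```

With the quickstart repository in place, `read_tracked_csv('data.csv', repo='dvc_quickstart_repo', rev='HEAD')` would return one dictionary per data row, keyed by the `col1` and `col2` headers.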