DVC S3 Remote Plugin
dvc-s3 is a plugin for Data Version Control (DVC) that adds Amazon S3 as a remote storage backend, so data, models, and pipeline outputs tracked by DVC can be pushed to and pulled from an S3 bucket through DVC's CLI and Python API. The current version is 3.3.0; releases follow a minor-version cadence driven by DVC's core development.
Warnings
- breaking The `boto3` library is no longer a direct dependency as of version 3.3.0. If your project implicitly relied on `boto3` being installed alongside `dvc-s3` for other S3-related operations, you will now need to install it explicitly (e.g., `pip install boto3`).
- gotcha The `dvc-s3` plugin must be installed separately from `dvc`. Installing just `dvc` will not provide S3 remote support. Always use `pip install dvc dvc-s3` to ensure S3 functionality.
- gotcha DVC needs valid AWS credentials (for example `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`) and a bucket it can read and write. Missing or incorrect credentials and non-existent or inaccessible buckets are common causes of errors when accessing S3 remotes.
- breaking Version 3.2.1 and later require `s3fs>=2024.12.0`. If you are running an older version of `s3fs`, you must upgrade to ensure compatibility and stability when using `dvc-s3`.
- gotcha Set the S3 region on the remote explicitly (`dvc remote modify <remote> region <region>`) rather than relying on defaults. This avoids AWS endpoint-resolution problems, especially when the bucket is in a different region from your default AWS configuration.
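The dependency and credential warnings above can be checked before running any DVC command. The following is a minimal preflight sketch, not part of dvc-s3: the helper names `installed_version`, `version_tuple`, and `missing_aws_env` are hypothetical, and it assumes credentials are supplied through environment variables. Credentials that come from an AWS profile or an IAM role would also work with DVC but are not detected here.

```python
import os
import re
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Parse the leading numeric components: '2024.12.0' -> (2024, 12, 0)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'post1'
        parts.append(int(match.group()))
    return tuple(parts)


def missing_aws_env(env):
    """List the credential variables that are unset or empty in env."""
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    for dist in ("dvc-s3", "s3fs", "boto3"):
        print(f"{dist}: {installed_version(dist) or 'not installed'}")

    s3fs_version = installed_version("s3fs")
    # Minimum taken from the s3fs warning above (dvc-s3 3.2.1 and later).
    if s3fs_version and version_tuple(s3fs_version) < (2024, 12, 0):
        print("s3fs is older than 2024.12.0; upgrade it.")

    for name in missing_aws_env(os.environ):
        print(f"{name} is not set")
```

The version parser is deliberately simple and only compares the leading numeric components; a project that already depends on `packaging` could use its version parsing instead.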
Install
pip install dvc dvc-s3
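After installing, it can help to confirm that both pieces are visible in the current environment. This is a small sketch with a hypothetical helper, `check_install`; it assumes the plugin's import name is `dvc_s3` (the distribution name is `dvc-s3`), which is worth verifying against your installed package.

```python
import importlib.util
import shutil


def check_install(cli="dvc", module="dvc_s3"):
    """Report whether the CLI is on PATH and the plugin module can be found."""
    return {
        "cli_on_path": shutil.which(cli) is not None,
        "plugin_importable": importlib.util.find_spec(module) is not None,
    }


if __name__ == "__main__":
    print(check_install())
```

If `cli_on_path` is false, the `dvc` executable is probably installed in a different environment from the one running the script.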
Quickstart
import os
import subprocess
import shutil
# Ensure dvc and dvc-s3 are installed: `pip install dvc dvc-s3`
# --- Configuration for S3 (replace with your actual details) ---
# For this example to work with a real S3 bucket, you need valid AWS credentials
# and an S3 bucket. It's recommended to set them as environment variables:
# export AWS_ACCESS_KEY_ID='AKIA...'
# export AWS_SECRET_ACCESS_KEY='YOUR_SECRET_KEY'
# export DVC_S3_BUCKET='your-dvc-test-bucket-name'
# export AWS_DEFAULT_REGION='us-east-1'
aws_access_key_id = os.environ.get('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID_PLACEHOLDER')
aws_secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY_PLACEHOLDER')
s3_bucket_name = os.environ.get('DVC_S3_BUCKET', 'your-dvc-test-bucket-name')
s3_region = os.environ.get('AWS_DEFAULT_REGION', 'us-east-1')
if 'YOUR_AWS_ACCESS_KEY_ID_PLACEHOLDER' in aws_access_key_id or 'your-dvc-test-bucket-name' in s3_bucket_name:
    print("\nWARNING: Please set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, DVC_S3_BUCKET, and AWS_DEFAULT_REGION environment variables for the quickstart to interact with real S3.")
    print("Proceeding with placeholder values, push will likely fail.")
project_dir = "dvc-s3-quickstart"
# Clean up previous runs if any (optional, uncomment if needed for repeated runs)
# if os.path.exists(project_dir):
#     shutil.rmtree(project_dir)
os.makedirs(project_dir, exist_ok=True)
os.chdir(project_dir)
try:
    # 1. Initialize DVC repository (without Git for simplicity)
    print("\n1. Initializing DVC repository...")
    subprocess.run(["dvc", "init", "--no-scm"], check=True)
    # 2. Create a dummy data directory and file
    print("\n2. Creating 'data/my_data.txt'...")
    os.makedirs("data", exist_ok=True)
    with open("data/my_data.txt", "w") as f:
        f.write("Hello, DVC and S3!")
    # 3. Add S3 remote and set its region explicitly
    print(f"\n3. Adding S3 remote 'my_s3_remote' to s3://{s3_bucket_name}/dvc-store")
    subprocess.run(["dvc", "remote", "add", "-d", "my_s3_remote", f"s3://{s3_bucket_name}/dvc-store"], check=True)
    subprocess.run(["dvc", "remote", "modify", "my_s3_remote", "region", s3_region], check=True)
    # 4. Add data to DVC
    print("\n4. Adding 'data/my_data.txt' to DVC...")
    subprocess.run(["dvc", "add", "data/my_data.txt"], check=True)
    # 5. Push data to the S3 remote
    print("\n5. Pushing data to S3 remote...")
    subprocess.run(["dvc", "push"], check=True)
    print("\nQuickstart completed! Check your S3 bucket for the DVC store.")
except subprocess.CalledProcessError as e:
    # Output is not captured, so e.stdout and e.stderr are None here;
    # DVC's own messages have already been printed to the terminal above.
    print(f"\nERROR: DVC command failed with exit code {e.returncode}.\nCommand: {' '.join(e.cmd)}")
    print("Please ensure DVC and dvc-s3 are installed, AWS credentials are set, and the S3 bucket exists and is writable.")
except FileNotFoundError:
    print("ERROR: 'dvc' command not found. Please ensure DVC is installed and in your PATH.")
finally:
    os.chdir("..")  # Return to original directory
    # Optional: Clean up the created project directory
    # shutil.rmtree(project_dir)
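To know what to look for in the bucket after the push, the object key can be estimated from the file content. This sketch rests on an assumption about DVC 3.x that the quickstart itself does not state: that each file is stored under `files/md5/<first two hex digits>/<remaining digits>` beneath the remote URL. Older DVC versions are believed to use a different layout, so treat the result as a hint to verify against your own bucket. The function name `expected_remote_key` is hypothetical.

```python
import hashlib


def expected_remote_key(prefix, content):
    """Build the S3 key a DVC 3.x remote is assumed to use for this content."""
    digest = hashlib.md5(content).hexdigest()
    return f"{prefix.strip('/')}/files/md5/{digest[:2]}/{digest[2:]}"


if __name__ == "__main__":
    # Same remote prefix and file content as the quickstart.
    print(expected_remote_key("dvc-store", b"Hello, DVC and S3!"))
```

The printed key can be compared against a bucket listing, for example the output of `aws s3 ls s3://<bucket>/dvc-store/ --recursive`.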