DVC S3 Remote Plugin

3.3.0 · active · verified Sat Apr 11

dvc-s3 is a plugin for Data Version Control (DVC) that adds Amazon S3 as a remote storage backend, so that DVC-tracked data and models can be pushed to and pulled from an S3 bucket. It works through DVC's regular CLI and Python API rather than a separate interface. The current version is 3.3.0; new minor releases follow DVC's core development.
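As an illustration of the Python API side, the sketch below reads a DVC-tracked file back through `dvc.api.read`. It is a minimal example rather than the plugin's own code: the wrapper name `read_tracked_text` and the path and remote names in the usage note are hypothetical, and it assumes the repository already has an S3 remote configured and that `dvc` and `dvc-s3` are installed.

```python
def read_tracked_text(path, repo=".", remote=None):
    """Return the text of a DVC-tracked file, fetching it from the remote if needed.

    `read_tracked_text` is a hypothetical helper; only `dvc.api.read` belongs to DVC.
    """
    # Imported lazily so this module can be loaded even where DVC is not installed.
    import dvc.api

    # remote=None lets DVC fall back to the repository's default remote.
    return dvc.api.read(path, repo=repo, remote=remote, mode="r")
```

For example, `read_tracked_text("data/example.txt", remote="my_s3_remote")` would return the file's contents as a string, assuming such a file had been added and pushed.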

Warnings

Install

Install both packages from PyPI: `pip install dvc dvc-s3`

Quickstart

This quickstart is a Python script that drives the `dvc` CLI through `subprocess`. It initializes a DVC project, configures an S3 remote, adds a data file to DVC, and pushes it to your S3 bucket. Before running it, make sure `dvc` and `dvc-s3` are installed and that your AWS credentials (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) and target bucket name (DVC_S3_BUCKET) are set as environment variables. AWS_DEFAULT_REGION is optional; the script falls back to `us-east-1`.
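The prerequisites above can be checked up front before any DVC command runs. The following is a minimal sketch, not part of dvc-s3: the helper name `missing_prerequisites` and the `REQUIRED_VARS` tuple are hypothetical, and the check only confirms that the variables are non-empty and that a `dvc` executable is on PATH, not that the credentials are valid.

```python
import os
import shutil

# Environment variables the quickstart expects (names taken from the text above).
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "DVC_S3_BUCKET")


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = [
        f"environment variable {name} is not set"
        for name in REQUIRED_VARS
        if not env.get(name)
    ]
    # shutil.which returns None when the executable cannot be found on PATH.
    if which("dvc") is None:
        problems.append("'dvc' executable not found on PATH")
    return problems
```

Calling `missing_prerequisites()` with no arguments inspects the current process environment; the `env` and `which` parameters exist only so the check can be exercised without touching the real environment.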

import os
import subprocess
import shutil

# Ensure dvc and dvc-s3 are installed: `pip install dvc dvc-s3`

# --- Configuration for S3 (replace with your actual details) ---
# For this example to work with a real S3 bucket, you need valid AWS credentials
# and an S3 bucket. It's recommended to set them as environment variables:
# export AWS_ACCESS_KEY_ID='AKIA...'
# export AWS_SECRET_ACCESS_KEY='YOUR_SECRET_KEY'
# export DVC_S3_BUCKET='your-dvc-test-bucket-name'
# export AWS_DEFAULT_REGION='us-east-1'

aws_access_key_id = os.environ.get('AWS_ACCESS_KEY_ID', 'YOUR_AWS_ACCESS_KEY_ID_PLACEHOLDER')
aws_secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY', 'YOUR_AWS_SECRET_ACCESS_KEY_PLACEHOLDER')
s3_bucket_name = os.environ.get('DVC_S3_BUCKET', 'your-dvc-test-bucket-name')
s3_region = os.environ.get('AWS_DEFAULT_REGION', 'us-east-1')

if ('PLACEHOLDER' in aws_access_key_id or 'PLACEHOLDER' in aws_secret_access_key
        or s3_bucket_name == 'your-dvc-test-bucket-name'):
    print("\nWARNING: Set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, DVC_S3_BUCKET, and optionally AWS_DEFAULT_REGION to run this quickstart against a real S3 bucket.")
    print("Proceeding with placeholder values; the push will likely fail.")

project_dir = "dvc-s3-quickstart"

# Clean up previous runs if any (optional, uncomment if needed for repeated runs)
# if os.path.exists(project_dir):
#     shutil.rmtree(project_dir)

os.makedirs(project_dir, exist_ok=True)
os.chdir(project_dir)

try:
    # 1. Initialize DVC repository (without Git for simplicity)
    print("\n1. Initializing DVC repository...")
    subprocess.run(["dvc", "init", "--no-scm"], check=True)

    # 2. Create a dummy data directory and file
    os.makedirs("data", exist_ok=True)
    with open("data/my_data.txt", "w") as f:
        f.write("Hello, DVC and S3!")

    # 3. Add S3 remote
    print(f"\n3. Adding S3 remote 'my_s3_remote' to s3://{s3_bucket_name}/dvc-store")
    subprocess.run(["dvc", "remote", "add", "-d", "my_s3_remote", f"s3://{s3_bucket_name}/dvc-store"], check=True)
    subprocess.run(["dvc", "remote", "modify", "my_s3_remote", "region", s3_region], check=True)

    # 4. Add data to DVC
    print("\n4. Adding 'data/my_data.txt' to DVC...")
    subprocess.run(["dvc", "add", "data/my_data.txt"], check=True)

    # 5. Push data to the S3 remote
    print("\n5. Pushing data to S3 remote...")
    subprocess.run(["dvc", "push"], check=True)
    print("\nQuickstart completed! Check your S3 bucket for the DVC store.")

except subprocess.CalledProcessError as e:
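    # Output is not captured by subprocess.run above, so e.stdout and e.stderr are None;
    # DVC's own messages have already been printed to the terminal.
    print(f"\nERROR: DVC command failed with exit code {e.returncode}.\nCommand: {' '.join(e.cmd)}")
    print("Please ensure DVC and dvc-s3 are installed, AWS credentials are set, and the S3 bucket exists and is writable.")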
except FileNotFoundError:
    print("ERROR: 'dvc' command not found. Please ensure DVC is installed and in your PATH.")
finally:
    os.chdir("..") # Return to original directory
    # Optional: Clean up the created project directory
    # shutil.rmtree(project_dir)
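After the push, it can be useful to confirm that the local cache and the remote agree. The sketch below builds and runs two follow-up commands, `dvc remote list` and `dvc status --cloud`, inside the project directory. The helper names `verification_commands` and `run_all` are hypothetical, and the remote and directory names are the ones used in the quickstart script.

```python
import subprocess


def verification_commands(remote="my_s3_remote"):
    """Return the DVC CLI invocations used to check the remote after a push."""
    return [
        ["dvc", "remote", "list"],                         # show configured remotes
        ["dvc", "status", "--cloud", "--remote", remote],  # compare cache with remote
    ]


def run_all(commands, cwd="dvc-s3-quickstart", runner=subprocess.run):
    """Run each command in `cwd` and collect the exit codes without raising."""
    return [runner(cmd, cwd=cwd, check=False).returncode for cmd in commands]
```

Running `run_all(verification_commands())` from the directory that contains `dvc-s3-quickstart` returns one exit code per command; a non-zero code suggests the remote is misconfigured or unreachable.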
