Deep Lake

4.5.10 · active · verified Thu Apr 16

Deep Lake is a Python library for building, managing, and querying multi-modal datasets for AI. It stores data (images, videos, audio, text, embeddings) and streams it directly from cloud storage to machine learning models, and it supports version control, indexing, and queries over that data. As of version 4.5.10, it has a C++ core for performance. The project is actively developed with frequent minor releases.

Quickstart

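This quickstart creates a new Deep Lake dataset, defines image and label tensors, appends synthetic samples, then reloads the dataset and reads back a sample. Authentication via the DEEPLAKE_TOKEN environment variable (recommended for non-interactive use) is shown commented out, because it is only needed for cloud (hub://) paths; the default is a local path. Note that deeplake.empty, create_tensor, and deeplake.load are 3.x-style calls; the 4.x API uses different entry points (such as deeplake.create and deeplake.open), so check the snippet against the installed version.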

import deeplake
import numpy as np
import os

# Authenticate to Deep Lake Hub (optional for local, required for cloud storage)
# For cloud storage, ensure DEEPLAKE_TOKEN is set as an environment variable or use deeplake.login()
# DEEPLAKE_TOKEN = os.environ.get('DEEPLAKE_TOKEN', '')
# if DEEPLAKE_TOKEN:
#     deeplake.login(token=DEEPLAKE_TOKEN)

# Use a local path for quick testing without authentication, or a hub:// path for cloud storage
ds_path = os.environ.get("DEEPLAKE_PATH", "./my_local_dataset")
# For cloud: ds_path = os.environ.get("DEEPLAKE_CLOUD_PATH", "hub://activeloop/quickstart-test")

# Create an empty dataset or overwrite existing one
ds = deeplake.empty(ds_path, overwrite=True)

# Define schema and append data within a 'with' block
with ds:
    ds.create_tensor('images', htype='image', sample_compression='jpeg')
    ds.create_tensor('labels', htype='class_label')
    
    for i in range(5):
        # Append random image and label data
        ds.images.append(np.random.rand(64, 64, 3) * 255)
        ds.labels.append(i % 2)

print(f"Dataset created at {ds_path} with {len(ds)} samples.")

# Load the dataset
ds_loaded = deeplake.load(ds_path)

# Inspect the loaded dataset
print(f"Loaded dataset has {len(ds_loaded)} samples.")
ds_loaded.summary()  # summary() prints the overview itself and returns None

# Access a sample
first_image = ds_loaded.images[0].numpy()
first_label = ds_loaded.labels[0].numpy()
print(f"First image shape: {first_image.shape}, First label: {first_label}")
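
The quickstart reads its dataset path and token from environment variables inline. As a rough sketch, that logic can be pulled into a small helper that fails early when a cloud path is configured without a token. The names resolve_storage, StorageConfig, and CLOUD_PREFIX are hypothetical and not part of Deep Lake; the helper uses only the standard library and the DEEPLAKE_PATH and DEEPLAKE_TOKEN variables from the quickstart.

```python
import os
from typing import Mapping, NamedTuple, Optional

# Path scheme the quickstart uses for cloud datasets.
CLOUD_PREFIX = "hub://"


class StorageConfig(NamedTuple):
    """Hypothetical container for the resolved settings (not a Deep Lake type)."""
    path: str
    is_cloud: bool
    token: Optional[str]


def resolve_storage(
    env: Mapping[str, str] = os.environ,
    default_path: str = "./my_local_dataset",
) -> StorageConfig:
    """Pick the dataset path and token from environment-style settings.

    Local paths need no token. A hub:// path without DEEPLAKE_TOKEN raises,
    so a missing credential is reported before any dataset call is made.
    """
    path = env.get("DEEPLAKE_PATH", default_path)
    is_cloud = path.startswith(CLOUD_PREFIX)
    token = env.get("DEEPLAKE_TOKEN") or None  # treat an empty string as unset
    if is_cloud and token is None:
        raise RuntimeError("DEEPLAKE_TOKEN must be set to use a hub:// path")
    return StorageConfig(path=path, is_cloud=is_cloud, token=token)


# Usage with an explicit mapping, so the example does not depend on the real environment.
cfg = resolve_storage({"DEEPLAKE_PATH": "./demo_dataset"})
print(cfg.path, cfg.is_cloud)
```

Passing the mapping explicitly keeps the helper easy to test; calling resolve_storage() with no arguments reads the process environment instead.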

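The synthetic samples in the quickstart go into a JPEG-compressed image tensor, and baseline JPEG stores 8-bit channels, so the arrays are best generated as uint8 from the start. The sketch below shows one way to do that reproducibly with NumPy alone. The helper name make_synthetic_images and its parameters are illustrative assumptions, not part of Deep Lake.

```python
import numpy as np


def make_synthetic_images(n, height=64, width=64, seed=0):
    """Return n random RGB images as uint8 arrays of shape (height, width, 3).

    Hypothetical helper for test data; a fixed seed makes the output repeatable.
    """
    rng = np.random.default_rng(seed)
    # integers() excludes the upper bound, so values fall in 0..255.
    return [
        rng.integers(0, 256, size=(height, width, 3), dtype=np.uint8)
        for _ in range(n)
    ]


images = make_synthetic_images(5)
labels = [i % 2 for i in range(len(images))]  # same alternating labels as the quickstart

# A float array in [0, 255) needs an explicit cast before it is used as 8-bit pixel data.
float_image = np.random.rand(64, 64, 3) * 255
as_uint8 = float_image.astype(np.uint8)

print(images[0].dtype, images[0].shape, as_uint8.dtype)
```

Because JPEG is lossy, pixel values read back from a JPEG-compressed tensor will generally not match the appended arrays exactly; compare shapes and dtypes rather than exact values when checking a round trip.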