Deep Lake
Deep Lake is a Python library for building, managing, and querying multi-modal datasets for AI. It enables storing and streaming data (images, videos, audio, text, embeddings) directly from cloud storage to machine learning models, supporting various operations like version control, indexing, and complex queries. As of version 4.5.10, it features a C++ core for enhanced performance and offers robust data management for AI workflows. The project is actively developed with frequent minor releases.
Common errors
- error: Hub API token not provided. Please provide a token through `deeplake.login()` or by setting the `DEEPLAKE_TOKEN` environment variable.
  cause: Attempting to access a Deep Lake Hub dataset without authentication.
  fix: Run `deeplake.login()` interactively, or set the `DEEPLAKE_TOKEN` environment variable to your Activeloop token before running your script.
- error: Dataset not found. Please check the path and permissions.
  cause: The specified dataset path (local or `hub://`) does not exist, or the user lacks read/write permissions.
  fix: Verify the dataset path is correct. For Hub datasets, ensure the token has access to the specified path. For local datasets, check file system permissions.
- error: TypeError: object of type 'Tensor' has no len()
  cause: Calling `len()` directly on a Deep Lake Tensor object instead of on the dataset or a specific tensor property.
  fix: To get the number of samples in a dataset, use `len(ds)`. To get the number of samples in a specific tensor, use `len(ds.tensor_name)`.
- error: ValueError: Mismatch in data type. Expected 'image', got 'video' for tensor 'my_tensor'.
  cause: Appending data of a different `htype` than the tensor was created with or inferred to have.
  fix: Ensure the tensor's `htype` matches the data you are appending. To store different kinds of data, create separate tensors or define a more flexible schema.
Warnings
- breaking Deep Lake v3.0 introduced significant API changes, including the reorganization of dataset creation functions and tensor access patterns.
- breaking The `activeloop` package and its authentication methods are deprecated in favor of `deeplake.login()` and environment variables.
- gotcha Deep Lake datasets are designed for efficient streaming; loading an entire tensor into memory (`tensor.numpy()`) for very large tensors can lead to OOM errors.
- gotcha Dataset paths for Hub (cloud) storage must include `hub://` prefix and a valid organization/username, e.g., `hub://org_name/dataset_name`.
- gotcha When appending data to a dataset, ensure the data type and shape are consistent with the tensor's `htype` and inferred dimensions, or explicitly define the schema.
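The OOM gotcha above is avoided by reading large tensors in slices instead of materializing them with a single `tensor.numpy()` call. A minimal sketch, assuming a loaded dataset `ds`: the `batch_ranges` helper below is illustrative pure Python, not part of the Deep Lake API.

```python
def batch_ranges(n, batch_size):
    """Yield (start, stop) index pairs covering range(n) in consecutive batches."""
    for start in range(0, n, batch_size):
        yield start, min(start + batch_size, n)

# Illustrative usage with a Deep Lake dataset (assumes `ds` is already loaded):
# for start, stop in batch_ranges(len(ds.images), 256):
#     batch = ds.images[start:stop].numpy()  # only this slice is fetched into memory
#     ...process batch...

print(list(batch_ranges(10, 4)))  # → [(0, 4), (4, 8), (8, 10)]
```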
Install
pip install deeplake
Imports
- deeplake
import deeplake
- VectorStore
from deeplake.core.vectorstore import VectorStore
- empty
deeplake.empty(...)
Quickstart
import deeplake
import numpy as np
import os
# Authenticate to Deep Lake Hub (optional for local, required for cloud storage)
# For cloud storage, ensure DEEPLAKE_TOKEN is set as an environment variable or use deeplake.login()
# DEEPLAKE_TOKEN = os.environ.get('DEEPLAKE_TOKEN', '')
# if DEEPLAKE_TOKEN:
# deeplake.login(token=DEEPLAKE_TOKEN)
# Use a local path for quick testing without authentication, or hub:// for cloud
ds_path = os.environ.get("DEEPLAKE_PATH", "./my_local_dataset")
# For cloud: hub_path = os.environ.get("DEEPLAKE_CLOUD_PATH", "hub://activeloop/quickstart-test")
# Create an empty dataset or overwrite existing one
ds = deeplake.empty(ds_path, overwrite=True)
# Define schema and append data within a 'with' block
with ds:
    ds.create_tensor('images', htype='image', sample_compression='jpeg')
    ds.create_tensor('labels', htype='class_label')
    for i in range(5):
        # Append a random uint8 image (jpeg compression requires uint8 pixels) and a label
        ds.images.append((np.random.rand(64, 64, 3) * 255).astype(np.uint8))
        ds.labels.append(i % 2)
print(f"Dataset created at {ds_path} with {len(ds)} samples.")
# Load the dataset
ds_loaded = deeplake.load(ds_path)
# Query and access data
print(f"Loaded dataset has {len(ds_loaded)} samples.")
ds_loaded.summary()  # prints tensor names, htypes, shapes, and compression
# Access a sample
first_image = ds_loaded.images[0].numpy()
first_label = ds_loaded.labels[0].numpy()
print(f"First image shape: {first_image.shape}, First label: {first_label}")
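As in the quickstart, jpeg-compressed image tensors store 8-bit pixels, so float arrays must be converted to uint8 before appending. A small numpy-only sketch of that conversion; the `to_uint8_image` helper is illustrative, not part of Deep Lake:

```python
import numpy as np

def to_uint8_image(arr):
    """Scale a float array in [0, 1] to uint8 pixels in [0, 255]."""
    return np.clip(arr * 255.0, 0, 255).astype(np.uint8)

img = to_uint8_image(np.random.rand(64, 64, 3))
print(img.dtype, img.shape)  # uint8 (64, 64, 3)
```

Appending `to_uint8_image(...)` output (rather than the raw float array) avoids the dtype errors that sample compression can raise.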