Daft Distributed Dataframes

0.7.9 · active · verified Tue Apr 14

Daft is a high-performance data engine for AI and multimodal workloads. It provides a Python DataFrame API for processing images, audio, video, and structured data, with a core written in Rust for performance. The same code runs on a single machine or on a distributed cluster. It is actively developed by Eventual Inc., with frequent releases.

Warnings

Install

Imports

Quickstart

This quickstart loads a dataset from Hugging Face, inspects the schema of the lazy DataFrame, selects two columns, and materializes the first five rows into a pandas DataFrame for viewing.

import daft

# Load an e-commerce dataset from Hugging Face
# Requires 'daft[huggingface]' to be installed if not already part of your setup
df = daft.read_huggingface("calmgoose/amazon-product-data-2020")

# Inspect the schema (Daft is lazy, so no data is fetched yet)
print("Schema:")
print(df)

# Select a few columns and materialize the first 5 rows to see data
df_subset = df.select(df["Product Name"], df["Category"]).limit(5)
result = df_subset.collect()

print("\nFirst 5 rows:")
print(result.to_pandas())
