Daft Distributed Dataframes
Daft is a high-performance data engine for AI and multimodal workloads. It provides a Python DataFrame API for processing images, audio, video, and structured data, with a core written in Rust for performance. The same code runs on a single machine or on a distributed cluster. It is actively developed by Eventual Inc., with frequent releases.
Warnings
- breaking The original package `getdaft` was renamed to `daft`. Older installations or scripts that reference `getdaft` will fail; install `daft` instead.
- gotcha Daft DataFrames are lazy by default. Operations like `select`, `where`, `with_column` build a query plan but do not execute or fetch data until an action such as `collect()`, `show()`, `to_pandas()`, or `write_*()` is explicitly called.
- gotcha There is another distinct Python package named `daft-pgm` (for Probabilistic Graphical Models). Ensure you install and import the correct `daft` library for distributed dataframes, which is `daft` from the `Eventual-Inc/Daft` project.
- breaking The default `ddof` parameter for the `stddev` aggregation function changed to `1` (sample standard deviation). Earlier versions may have defaulted to `0` or behaved inconsistently depending on the underlying implementation, so results that relied on the old default can differ.
- breaking Significant internal refactors have occurred, especially concerning interactions with Apache Arrow libraries (migrating from `arrow2` to `arrow-rs`). While primarily internal, users with highly customized integrations or those relying on specific Arrow-related behaviors might experience subtle changes or compatibility issues.
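The `ddof` change listed above is easiest to see numerically. The sketch below uses only Python's standard `statistics` module, not Daft itself, to show how the two conventions diverge on a small sample: `ddof=0` corresponds to the population standard deviation and `ddof=1` to the sample standard deviation. The sample values are made up for illustration.

```python
import statistics

# Illustrative values only; not taken from any Daft dataset
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# ddof=0: divide the sum of squared deviations by n (population convention)
population_std = statistics.pstdev(values)

# ddof=1: divide by n - 1 (sample convention)
sample_std = statistics.stdev(values)

print(f"ddof=0: {population_std:.4f}")
print(f"ddof=1: {sample_std:.4f}")
```

The gap between the two shrinks as the number of rows grows, so the change matters most for aggregations over small groups.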
Install
- pip install daft
- pip install "daft[openai]"
- pip install "daft[ray,aws]"
Imports
- daft
import daft
Quickstart
import daft
# Load an e-commerce dataset from Hugging Face
# Requires 'daft[huggingface]' to be installed if not already part of your setup
df = daft.read_huggingface("calmgoose/amazon-product-data-2020")
# Inspect the schema (Daft is lazy, so no data is fetched yet)
print("Schema:")
print(df)
# Select a few columns and materialize the first 5 rows to see data
df_subset = df.select(df["Product Name"], df["Category"]).limit(5)
result = df_subset.collect()
print("\nFirst 5 rows:")
print(result.to_pandas())