{"id":5761,"library":"daft","title":"Daft Distributed Dataframes","description":"Daft is a high-performance data engine for AI and multimodal workloads. It provides a Python DataFrame API for processing images, audio, video, and structured data at any scale. Built in Rust for performance, it scales seamlessly from a single machine to distributed clusters. It is actively developed by Eventual Inc., with frequent releases.","status":"active","version":"0.7.9","language":"en","source_language":"en","source_url":"https://github.com/Eventual-Inc/Daft","tags":["dataframe","distributed","multimodal","AI","ML","data processing","Rust","lazy evaluation"],"install":[{"cmd":"pip install daft","lang":"bash","label":"Basic Installation"},{"cmd":"pip install \"daft[openai]\"","lang":"bash","label":"With OpenAI integration"},{"cmd":"pip install \"daft[ray,aws]\"","lang":"bash","label":"With Ray and AWS integration"}],"dependencies":[{"reason":"Requires Python 3.10 or higher.","package":"python","optional":false},{"reason":"Often used for numerical data in examples and UDFs.","package":"numpy","optional":true},{"reason":"Required for image processing capabilities.","package":"pillow","optional":true},{"reason":"Needed for built-in AI operations using OpenAI models (e.g., embeddings, LLM prompts).","package":"openai","optional":true}],"imports":[{"note":"The package was renamed from `getdaft` to `daft`.","wrong":"import getdaft","symbol":"daft","correct":"import daft"}],"quickstart":{"code":"import daft\n\n# Load an e-commerce dataset from Hugging Face\n# (may require the `daft[huggingface]` extra, depending on your setup)\ndf = daft.read_huggingface(\"calmgoose/amazon-product-data-2020\")\n\n# Inspect the schema (Daft is lazy, so no data is fetched yet)\nprint(\"Schema:\")\nprint(df)\n\n# Select a few columns and materialize the first 5 rows\ndf_subset = df.select(df[\"Product Name\"], df[\"Category\"]).limit(5)\nresult = df_subset.collect()\n\nprint(\"\\nFirst 5 rows:\")\nprint(result.to_pandas())","lang":"python","description":"This quickstart loads a dataset from Hugging Face, inspects the lazy DataFrame schema, selects two columns, and materializes the first five rows into a Pandas DataFrame for viewing."},"warnings":[{"fix":"Use `pip install daft` and update imports from `import getdaft` to `import daft`.","message":"The original package `getdaft` was renamed to `daft`. Older installations or scripts referencing `getdaft` will fail.","severity":"breaking","affected_versions":"<=0.5.0 (for `getdaft`)"},{"fix":"Call an action method (e.g., `.collect()` followed by `.to_pandas()`) to materialize and retrieve data from a DataFrame.","message":"Daft DataFrames are lazy by default. Operations like `select`, `where`, and `with_column` build a query plan but do not execute or fetch data until an action such as `collect()`, `show()`, `to_pandas()`, or `write_*()` is explicitly called.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Verify you are installing `daft` (not `daft-pgm`) and importing `daft`.","message":"A distinct Python package named `daft-pgm` (for Probabilistic Graphical Models) also exists. For distributed dataframes, install and import `daft` from the `Eventual-Inc/Daft` project.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Explicitly set `ddof` when calling `stddev()` if you rely on a specific degrees-of-freedom correction, e.g. `df.agg(df[\"price\"].stddev(ddof=0))`.","message":"The default `ddof` parameter for the `stddev` aggregation changed to `1` (sample standard deviation). Earlier versions may have defaulted to `0` or behaved inconsistently depending on the underlying implementation.","severity":"breaking","affected_versions":">=0.7.5"},{"fix":"Review existing code for direct interaction with `arrow2` or with specific Arrow table/array implementations. Test thoroughly after upgrading, especially data loading and serialization/deserialization workflows.","message":"Significant internal refactors have occurred, especially concerning interactions with Apache Arrow libraries (migrating from `arrow2` to `arrow-rs`). While primarily internal, users with highly customized integrations or those relying on specific Arrow-related behaviors might experience subtle changes or compatibility issues.","severity":"breaking","affected_versions":">=0.7.5"}],"env_vars":null,"last_verified":"2026-04-14T00:00:00.000Z","next_check":"2026-07-13T00:00:00.000Z"}