Dask Expressions
Dask-expr provides a high-level expression system for Dask DataFrames, focusing on query optimization and a better-organized implementation. It has been the default backend for `dask.dataframe` since Dask version 2024.3.0. The library, currently at version 2.0.0 (released January 21, 2025), is now maintained as part of the main Dask project; its separate GitHub repository is no longer actively maintained.
Warnings
- breaking `dask-expr` (versions 0.5.1 and above in particular) requires `pandas>=2`. This can cause dependency conflicts if another library in your environment pins `pandas` to an older version (e.g., `<2`).
- gotcha The `dask-expr` GitHub repository is no longer actively maintained, as its implementation has been moved into the main `dask/dask` repository and it's now the default backend for `dask.dataframe`. While the PyPI package still exists, new development and core maintenance happen within the Dask project itself.
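- gotcha Calling `df.persist()` materializes the collection at that point, so the dask-expr query optimizer can no longer push projections or filters applied afterwards down into the I/O layer. Select columns and filter rows before persisting where possible.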
- gotcha Dask-expr does not currently support 'named GroupBy Aggregations', which is a feature available in the legacy Dask DataFrame API.
Install
- pip install dask-expr
- conda install conda-forge::dask-expr
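After installing, it can help to confirm that the environment satisfies the `pandas>=2` requirement noted in the warnings. The sketch below uses only the standard library; the helper names (`installed_version`, `major_version`, `pandas_is_compatible`) are hypothetical and not part of Dask.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major_version(version_string):
    """Extract the leading integer of a version string such as '2.2.1'."""
    return int(version_string.split(".")[0])


def pandas_is_compatible(pandas_version):
    """Check the pandas>=2 requirement stated in the warnings above."""
    return pandas_version is not None and major_version(pandas_version) >= 2


# Report what is installed in the current environment.
for name in ("dask", "dask-expr", "pandas"):
    print(f"{name}: {installed_version(name) or 'not installed'}")

print("pandas>=2 satisfied:", pandas_is_compatible(installed_version("pandas")))
```

If the last line prints `False`, another package is probably pinning `pandas` below 2 and the pin needs to be resolved before `dask-expr` will install cleanly.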
Imports
- dask_expr
import dask_expr as dx
- dask.dataframe
import dask.dataframe as dd
Quickstart
import dask.dataframe as dd
# Create a Dask DataFrame (internally uses the dask-expr system)
df = dd.from_dict({'a': range(1000), 'b': [f'cat_{i%5}' for i in range(1000)]}, npartitions=4)
# Perform some operations
result = df.groupby('b')['a'].mean()
# Compute the result
print(result.compute())
# To see the optimized query plan for the groupby above
# (requires graphviz to be installed)
# try:
#     result.optimize().explain()
# except ImportError:
#     print("Install graphviz to visualize the query plan: pip install graphviz")