dask-ml

2025.1.0 verified Mon Apr 27 auth: no python

A library for distributed and parallel machine learning built on top of Dask and scikit-learn. The current version is 2025.1.0; new releases appear a few times a year.

pip install dask-ml
error ValueError: could not broadcast input array from shape (X,) into shape (Y,)
cause Mismatched or unknown chunk sizes in the Dask arrays passed to fit.
fix
Ensure all input Dask arrays have known chunk sizes, and that X and y are chunked identically along the first axis: X = X.compute_chunk_sizes()
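As a minimal sketch of the fix above, the following builds a Dask array whose chunk sizes become unknown after boolean masking, then resolves them with compute_chunk_sizes(). The array contents and the variable names (x, evens) are illustrative assumptions, not taken from any dask-ml example.

```python
import numpy as np
import dask.array as da

# Illustrative data: 20 integers split into chunks of 5.
x = da.from_array(np.arange(20), chunks=5)

# Boolean masking leaves the chunk sizes unknown (NaN) until computed.
evens = x[x % 2 == 0]
unknown_before = bool(np.isnan(evens.shape[0]))

# compute_chunk_sizes() evaluates the chunk lengths and stores them on the array.
evens = evens.compute_chunk_sizes()
known_after = evens.chunks

print("unknown before:", unknown_before)
print("chunks after:", known_after)
```

Note that compute_chunk_sizes() triggers a computation, so on large inputs it is usually worth calling once and reusing the result.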
error TypeError: Only Dask DataFrames are supported; got pandas DataFrame
cause Passed a pandas DataFrame to dask_ml.model_selection functions.
fix
Convert using ddf = dask.dataframe.from_pandas(df, npartitions=...)
error ModuleNotFoundError: No module named 'dask_ml.models'
cause Wrong import path for estimators.
fix
Use correct submodule, e.g., from dask_ml.linear_model import LogisticRegression
breaking dask-ml 2024.3.20+ requires dask-expr; DataFrame code written against dask<2024.3 may break.
fix Update dask to the latest release, or pin dask-ml<2024.3.20.
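Before choosing between upgrading and pinning, it can help to see which versions are actually installed. The sketch below uses only the standard library; the helper name installed_version is a hypothetical convenience, not part of dask or dask-ml.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages involved in the compatibility note above.
for name in ("dask", "dask-ml", "dask-expr"):
    print(name, installed_version(name))
```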
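deprecated dask_ml.cluster.KMeans is reportedly deprecated and slated for replacement; confirm its status in the dask-ml changelog for your installed version.
fix As an alternative, fit scikit-learn's KMeans on an in-memory sample and apply its predict to each chunk with map_blocks. Fitting a separate KMeans per chunk produces unrelated models.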
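One way to read the map_blocks suggestion is to fit scikit-learn's KMeans on a sample that fits in memory and then label every chunk with the fitted model. The sketch below assumes synthetic two-cluster data and a strided sample; both are illustrative choices, and this is not a distributed k-means fit.

```python
import numpy as np
import dask.array as da
from sklearn.cluster import KMeans

# Synthetic data (assumption): two well-separated blobs of 200 points each.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(0.0, 0.5, size=(200, 2)),
    rng.normal(5.0, 0.5, size=(200, 2)),
])
X = da.from_array(data, chunks=(100, 2))

# Fit on an in-memory sample (every 4th row) so the model sees both blobs.
sample = X[::4].compute()
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sample)

# Apply the fitted model chunk by chunk; each 2-D block maps to 1-D labels.
labels = X.map_blocks(
    model.predict,
    drop_axis=1,
    meta=np.array((), dtype=np.int32),
)

result = labels.compute()
print(result.shape, np.unique(result))
```

The quality of the labels depends entirely on how representative the sample is, so this approach suits data whose clusters are visible in a modest subsample.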
gotcha train_test_split from dask_ml.model_selection is designed for Dask collections (DataFrames or arrays); passing a pandas DataFrame may give unexpected results.
fix Convert pandas DataFrame to dask DataFrame using dask.dataframe.from_pandas() before splitting.
gotcha Many estimators do not support Dask arrays with unknown chunk sizes (NaN in the shape), such as arrays converted from a Dask DataFrame without lengths.
fix Call .compute_chunk_sizes() on the Dask array before fitting, or convert with to_dask_array(lengths=True).
conda install -c conda-forge dask-ml

Basic usage: load data with Dask, split, train a logistic regression model, and compute accuracy.

import dask.dataframe as dd
from dask_ml.linear_model import LogisticRegression
from dask_ml.model_selection import train_test_split

# Create a Dask DataFrame from a CSV
df = dd.read_csv('data.csv')
X = df[['feature1', 'feature2']].to_dask_array(lengths=True)
y = df['label'].to_dask_array(lengths=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit model
model = LogisticRegression()
model.fit(X_train, y_train)

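# Score: accuracy on the held-out split.
# score() may return a concrete number rather than a lazy Dask scalar, in which case
# .compute() would fail; float() handles both cases.
accuracy = model.score(X_test, y_test)
print(float(accuracy))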