dask-ml

2025.1.0 verified Mon Apr 27 auth: no python

A library for distributed and parallel machine learning built on top of Dask and scikit-learn. The current version is 2025.1.0; new releases appear a few times a year.

pip install dask-ml

Common errors

error ValueError: could not broadcast input array from shape (X,) into shape (Y,)

cause Mismatched or unknown chunk sizes in the Dask arrays passed to fit.

fix

Ensure all input Dask arrays have known chunk sizes: X = X.compute_chunk_sizes()
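
To show where unknown chunk sizes tend to come from, here is a minimal sketch that assumes only dask.array and NumPy; the data and variable names are illustrative. Boolean masking is one common source, because Dask cannot know how many elements survive the filter until the mask is evaluated.

```python
import dask.array as da
import numpy as np

# Illustrative data: ten integers split into two chunks of five.
x = da.from_array(np.arange(10), chunks=5)

# Boolean masking yields an array whose chunk sizes are unknown (NaN).
evens = x[x % 2 == 0]
had_unknown_chunks = any(np.isnan(c) for c in evens.chunks[0])

# compute_chunk_sizes() evaluates the mask and records the real sizes in place.
evens.compute_chunk_sizes()
known_chunks = evens.chunks

print(had_unknown_chunks, known_chunks)
```

Note that compute_chunk_sizes() triggers a computation, so on large inputs it is worth calling once and reusing the result.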

error TypeError: Only Dask DataFrames are supported; got pandas DataFrame

cause Passed a pandas DataFrame to dask_ml.model_selection functions.

fix

Convert using ddf = dask.dataframe.from_pandas(df, npartitions=...)

error ModuleNotFoundError: No module named 'dask_ml.models'

cause Wrong import path for estimators.

fix

Use correct submodule, e.g., from dask_ml.linear_model import LogisticRegression

Warnings

breaking dask-ml 2024.3.20+ requires dask-expr; DataFrame code written for dask<2024.3 may break.

fix Update dask to latest or pin dask-ml<2024.3.20.

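deprecated dask_ml.cluster.KMeans is flagged as deprecated. It can still be imported from dask_ml.cluster, but it is reportedly being replaced by a native k-means for Dask arrays.

fix Consider running sklearn's KMeans on Dask arrays via map_blocks. Note that map_blocks applies the function to each block independently, so this fits one model per block rather than a single global clustering.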

gotcha train_test_split from dask_ml.model_selection is designed for Dask collections (Dask arrays or Dask DataFrames); passing a pandas DataFrame gives unexpected results.

fix Convert pandas DataFrame to dask DataFrame using dask.dataframe.from_pandas() before splitting.

gotcha Many estimators do not support Dask arrays with unknown chunk sizes.

fix Call .compute_chunk_sizes() on the Dask array before fitting.

Install

conda install -c conda-forge dask-ml

Imports

LogisticRegression

wrong

from dask_ml.models import LogisticRegression

correct

from dask_ml.linear_model import LogisticRegression

wrong module path

train_test_split

wrong

from sklearn.model_selection import train_test_split

correct

from dask_ml.model_selection import train_test_split

sklearn's train_test_split is not Dask-aware; the dask_ml version returns lazy Dask collections

preprocessing

wrong

from sklearn.preprocessing import StandardScaler

correct

from dask_ml.preprocessing import StandardScaler

sklearn's StandardScaler does not work on Dask arrays

Quickstart

Basic usage: load data with Dask, split, train a logistic regression model, and compute accuracy.

import dask.dataframe as dd
from dask_ml.linear_model import LogisticRegression
from dask_ml.model_selection import train_test_split

# Create a Dask DataFrame from a CSV
df = dd.read_csv('data.csv')
X = df[['feature1', 'feature2']].to_dask_array(lengths=True)
y = df['label'].to_dask_array(lengths=True)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit model
model = LogisticRegression()
model.fit(X_train, y_train)

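# Score: mean accuracy on the held-out split
accuracy = model.score(X_test, y_test)
# float() works whether score returns a plain float or a lazy Dask scalar
print(float(accuracy))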