dask-ml
A library for distributed and parallel machine learning built on top of Dask and scikit-learn. Current version is 2025.1.0, with releases roughly a few times a year.
pip install dask-ml

Common errors
error ValueError: could not broadcast input array from shape (X,) into shape (Y,) ↓
cause Mismatched chunk sizes in Dask array during fit.
fix
Ensure all input Dask arrays have known chunk sizes: X = X.compute_chunk_sizes()
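A minimal sketch of this fix, assuming the unknown chunk sizes came from boolean-mask indexing (a common trigger):

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(10), chunks=5)
x = x[x % 2 == 0]  # boolean indexing leaves chunk sizes unknown (nan)

# Resolve the sizes by computing them; estimators can now consume x
x = x.compute_chunk_sizes()
print(x.chunks)
```

After the call, `x.chunks` contains concrete integers instead of `nan`, which is what dask-ml estimators require.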
error TypeError: Only Dask DataFrames are supported; got pandas DataFrame ↓
cause Passed a pandas DataFrame to dask_ml.model_selection functions.
fix
Convert using ddf = dask.dataframe.from_pandas(df, npartitions=...)
error ModuleNotFoundError: No module named 'dask_ml.models' ↓
cause Wrong import path for estimators.
fix
Use correct submodule, e.g., from dask_ml.linear_model import LogisticRegression
Warnings
breaking dask-ml 2024.3.20+ requires dask-expr; old dataframes from dask<2024.3 may break. ↓
fix Update dask to latest or pin dask-ml<2024.3.20.
deprecated dask_ml.cluster.KMeans is deprecated and may be removed in a future release. ↓
fix Consider using sklearn's KMeans on dask arrays via map_blocks.
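One way to follow that suggestion, as a sketch: fit scikit-learn's KMeans on a computed in-memory sample, then assign cluster labels chunk-wise with map_blocks (the sample size and data here are illustrative):

```python
import numpy as np
import dask.array as da
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = da.from_array(rng.randn(100, 2), chunks=(25, 2))

# Fit on a computed sample small enough for memory
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[:50].compute())

# Lazily predict a label per row in each chunk; the feature axis disappears
labels = X.map_blocks(km.predict, dtype=np.int64, drop_axis=1)
print(labels.compute().shape)
```

`drop_axis=1` tells Dask that `km.predict` maps a `(rows, features)` block to a 1-D `(rows,)` label array.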
gotcha train_test_split from dask_ml.model_selection requires Dask DataFrames; passing pandas DataFrame gives unexpected results. ↓
fix Convert pandas DataFrame to dask DataFrame using dask.dataframe.from_pandas() before splitting.
gotcha Many estimators do not support Dask arrays with unknown chunk sizes; call .compute_chunk_sizes() first. ↓
fix Call .compute_chunk_sizes() on the Dask array before fitting.
Install
conda install -c conda-forge dask-ml

Imports
- LogisticRegression
  wrong: from dask_ml.models import LogisticRegression
  correct: from dask_ml.linear_model import LogisticRegression
- train_test_split
  wrong: from sklearn.model_selection import train_test_split
  correct: from dask_ml.model_selection import train_test_split
- preprocessing
  wrong: from sklearn.preprocessing import StandardScaler
  correct: from dask_ml.preprocessing import StandardScaler
Quickstart
import dask.dataframe as dd
from dask_ml.linear_model import LogisticRegression
from dask_ml.model_selection import train_test_split
# Create a Dask DataFrame from a CSV
df = dd.read_csv('data.csv')
X = df[['feature1', 'feature2']].to_dask_array(lengths=True)
y = df['label'].to_dask_array(lengths=True)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit model
model = LogisticRegression()
model.fit(X_train, y_train)
# Score (dask-ml metrics compute eagerly by default, so this is a plain float)
accuracy = model.score(X_test, y_test)
print(accuracy)