Skrub
Skrub is a Python library for machine learning with dataframes, offering robust tools for cleaning, preprocessing, and encoding tabular data, particularly for heterogeneous or messy datasets. It provides scikit-learn compatible transformers and a powerful DataOps API for complex data pipelines. The current version is 0.8.0, with regular minor and patch releases.
Common errors
- ImportError: cannot import name 'ApplyToCols' from 'skrub'
cause: The `ApplyToCols` transformer class was removed in skrub 0.8.0.
fix: Remove the `ApplyToCols` import and reimplement the transformation with other skrub transformers or direct pandas operations.
- DeprecationWarning: Ken embeddings are deprecated and will be removed in a future version.
cause: Your code uses `skrub.KenEmbeddings`, which has been deprecated since skrub 0.6.2.
fix: Replace `KenEmbeddings` with an alternative encoder such as `GapEncoder` or `TableVectorizer`. Compare results on your own data, since the encoders work differently.
- Your Python version is 3.9.x, but skrub >=0.7.0 requires Python >= 3.10. Please upgrade your Python version.
cause: skrub 0.7.0 and later require Python 3.10 or newer.
fix: Check the interpreter with `python --version` and upgrade the environment to Python 3.10 or newer.
- AttributeError: 'DropCols' object has no attribute 'columns_'
cause: Attribute names on `DropCols` and `SelectCols` instances were renamed in 0.7.1 for consistency, so `columns_` may no longer exist.
fix: Check the skrub 0.7.1 release notes or the current documentation for `DropCols` and `SelectCols` to find the current attribute names. Names such as `cols_to_drop_`, `cols_to_select_`, `drop_cols_`, or `select_cols_` are only plausible guesses; confirm them against the installed version, for example with `dir(transformer)`.
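Several of the errors above depend on which Python and skrub versions are installed. A minimal sketch of a pre-flight check, using only the standard library, might look like the following. The helper names `parse_release` and `check_environment` are hypothetical, and the version thresholds are taken from the entries above. The parser ignores pre-release suffixes, so it approximates a real version comparison rather than replacing one.

```python
import re
import sys
from importlib import metadata


def parse_release(version: str) -> tuple:
    """Return the leading numeric components of a version string as ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def check_environment(package: str = "skrub") -> list:
    """Collect human-readable notes about version-related pitfalls."""
    notes = []
    if sys.version_info < (3, 10):
        notes.append("Python < 3.10: skrub >= 0.7.0 cannot be installed here.")
    try:
        installed = parse_release(metadata.version(package))
    except metadata.PackageNotFoundError:
        notes.append(f"{package} is not installed.")
        return notes
    # Thresholds below come from the error descriptions in this section.
    if installed >= (0, 8):
        notes.append("skrub >= 0.8.0: ApplyToCols is no longer importable.")
    if installed >= (0, 7, 1):
        notes.append("skrub >= 0.7.1: DropCols/SelectCols attributes were renamed.")
    return notes


if __name__ == "__main__":
    for note in check_environment():
        print(note)
```

Running the check at application start-up surfaces these conditions as readable notes instead of an import-time traceback.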
Warnings
- breaking The `ApplyToCols` and `ApplyToFrame` transformers have been removed in version 0.8.0. No direct replacement is confirmed here; check the 0.8.0 release notes for the recommended migration path before refactoring.
- breaking Minimum Python version increased to 3.10 in skrub 0.7.0. Installing or running on older Python environments will fail.
- breaking Minimum versions for the key dependencies `scikit-learn` (>=1.4.2) and `requests` (>=2.27.1) were increased in 0.7.0. The minimum version of `polars`, an optional dependency, increased to >=1.5 in 0.8.0.
- deprecated Ken embeddings (`skrub.KenEmbeddings`) were deprecated in skrub 0.6.2 and will be removed in a future version. Usage will emit a `DeprecationWarning`.
- gotcha The `compute_ngram_distance` utility function was made private (`_compute_ngram_distance`) in 0.7.2 to reduce API clutter and indicate it's not part of the public API.
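Because names have been removed or made private across releases, code that must run on several skrub versions can probe for a name before relying on it. The sketch below uses only `importlib`. The helper `has_public_name` is a hypothetical name introduced for illustration and is not part of skrub.

```python
import importlib


def has_public_name(module_name: str, attr: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError too: the package is missing or broken.
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    # Names taken from the warnings above; results depend on the installed version.
    for name in ("ApplyToCols", "ApplyToFrame", "TableVectorizer"):
        print(name, has_public_name("skrub", name))
```

A probe like this keeps version-specific branches explicit, which tends to be easier to maintain than scattering `try`/`except ImportError` blocks through the code.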
Install
- pip install skrub
- pip install 'skrub[polars,all]'
Imports
- GapEncoder
from skrub import GapEncoder
- TableVectorizer
from skrub import TableVectorizer
- StringEncoder
from skrub import StringEncoder
- DataOps
from skrub.dataops import DataOps
from skrub import DataOps
- MinHashEncoder
from skrub import MinHashEncoder
Quickstart
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from skrub import TableVectorizer
# Sample DataFrame with messy categorical data
df = pd.DataFrame({
    'city': ['Paris', 'london', 'New-York', 'paris', 'tokyo', 'new york'],
    'country': ['France', 'United Kingdom', 'USA', 'France', 'Japan', 'United States'],
    'price': [100, 150, 200, 110, 180, 210]
})
X = df[['city', 'country']]
y = (df['price'] > 150).astype(int) # Binary target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Create a scikit-learn pipeline with TableVectorizer
pipeline = Pipeline([
    # TableVectorizer picks an encoder per column type; defaults are used here
    # because `low_memory` is not a TableVectorizer parameter.
    ('table_vectorizer', TableVectorizer()),
    ('classifier', LogisticRegression(random_state=42))
])
# Fit and evaluate the pipeline
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"Pipeline score: {score:.2f}")