Skrub

0.8.0 · active · verified Thu Apr 16

Skrub is a Python library for machine learning with dataframes. It offers tools for cleaning, preprocessing, and encoding tabular data, particularly heterogeneous or messy datasets, and provides scikit-learn-compatible transformers along with a DataOps API for building complex data pipelines. The current version is 0.8.0, with regular minor and patch releases.

Quickstart

This quickstart uses `skrub.TableVectorizer` inside a scikit-learn pipeline to preprocess a DataFrame of messy categorical text columns and then train a logistic regression model. `TableVectorizer` inspects each column's data type and cardinality and applies a suitable encoder to it.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from skrub import TableVectorizer

# Sample DataFrame with messy categorical data
df = pd.DataFrame({
    'city': ['Paris', 'london', 'New-York', 'paris', 'tokyo', 'new york'],
    'country': ['France', 'United Kingdom', 'USA', 'France', 'Japan', 'United States'],
    'price': [100, 150, 200, 110, 180, 210]
})

X = df[['city', 'country']]
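y = (df['price'] > 150).astype(int)  # Binary target: 1 if price > 150, else 0

# Stratify so that both classes appear in the very small training set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)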

# Create a scikit-learn pipeline with TableVectorizer
pipeline = Pipeline([
    ('table_vectorizer', TableVectorizer()),  # Chooses an encoder per column from its type and cardinality
    ('classifier', LogisticRegression(random_state=42))
])

# Fit and evaluate the pipeline
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"Pipeline score: {score:.2f}")
