{"library":"skrub","title":"Skrub","description":"Skrub is a Python library for machine learning with dataframes, offering robust tools for cleaning, preprocessing, and encoding tabular data, particularly for heterogeneous or messy datasets. It provides scikit-learn compatible transformers and a powerful DataOps API for complex data pipelines. The current version is 0.8.0, with regular minor and patch releases.","language":"python","status":"active","last_verified":"Thu Apr 16","install":{"commands":["pip install skrub","pip install 'skrub[polars,all]'"],"cli":null},"imports":["from skrub import GapEncoder","from skrub import TableVectorizer","from skrub import StringEncoder","from skrub import DataOps","from skrub import MinHashEncoder"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"import pandas as pd\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.pipeline import Pipeline\nfrom skrub import TableVectorizer\n\n# Sample DataFrame with messy categorical data\ndf = pd.DataFrame({\n    'city': ['Paris', 'london', 'New-York', 'paris', 'tokyo', 'new york'],\n    'country': ['France', 'United Kingdom', 'USA', 'France', 'Japan', 'United States'],\n    'price': [100, 150, 200, 110, 180, 210]\n})\n\nX = df[['city', 'country']]\ny = (df['price'] > 150).astype(int) # Binary target\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n\n# Create a scikit-learn pipeline with TableVectorizer\npipeline = Pipeline([\n    ('table_vectorizer', TableVectorizer(low_memory=True)), # Automatically handles different column types\n    ('classifier', LogisticRegression(random_state=42))\n])\n\n# Fit and evaluate the pipeline\npipeline.fit(X_train, y_train)\nscore = pipeline.score(X_test, y_test)\nprint(f\"Pipeline score: {score:.2f}\")","lang":"python","description":"This quickstart demonstrates how to use `skrub.TableVectorizer` within a scikit-learn pipeline to automatically preprocess a DataFrame containing mixed data types (here, messy categorical text) and then train a logistic regression model. `TableVectorizer` intelligently applies appropriate encoders to different column types.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":null}