{"id":9307,"library":"skrub","title":"Skrub","description":"Skrub is a Python library for machine learning with dataframes, offering robust tools for cleaning, preprocessing, and encoding tabular data, particularly for heterogeneous or messy datasets. It provides scikit-learn compatible transformers and a powerful DataOps API for complex data pipelines. The current version is 0.8.0, with regular minor and patch releases.","status":"active","version":"0.8.0","language":"en","source_language":"en","source_url":"https://github.com/skrub-data/skrub","tags":["machine-learning","data-preprocessing","tabular-data","scikit-learn-compatible","dataframe","categorical-encoding"],"install":[{"cmd":"pip install skrub","lang":"bash","label":"Install base library"},{"cmd":"pip install 'skrub[polars,all]'","lang":"bash","label":"Install with optional dependencies for Polars and all features"}],"dependencies":[{"reason":"Core dependency for transformers and pipelines, requires >=1.4.2 since 0.7.0.","package":"scikit-learn"},{"reason":"Used for some data fetching utilities, requires >=2.27.1 since 0.7.0.","package":"requests"},{"reason":"Optional dependency for enhanced performance/features in DataOps, requires >=1.5 since 0.8.0.","package":"polars","optional":true},{"reason":"Optional dependency for tuning DataOps pipelines.","package":"optuna","optional":true}],"imports":[{"symbol":"GapEncoder","correct":"from skrub import GapEncoder"},{"symbol":"TableVectorizer","correct":"from skrub import TableVectorizer"},{"symbol":"StringEncoder","correct":"from skrub import StringEncoder"},{"note":"DataOps was promoted to top-level import in 0.6.0. 
The old path still works, but the top-level import is preferred.","wrong":"from skrub.dataops import DataOps","symbol":"DataOps","correct":"from skrub import DataOps"},{"symbol":"MinHashEncoder","correct":"from skrub import MinHashEncoder"}],"quickstart":{"code":"import pandas as pd\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.pipeline import Pipeline\nfrom skrub import TableVectorizer\n\n# Sample DataFrame with messy categorical data\ndf = pd.DataFrame({\n    'city': ['Paris', 'london', 'New-York', 'paris', 'tokyo', 'new york'],\n    'country': ['France', 'United Kingdom', 'USA', 'France', 'Japan', 'United States'],\n    'price': [100, 150, 200, 110, 180, 210]\n})\n\nX = df[['city', 'country']]\ny = (df['price'] > 150).astype(int) # Binary target\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)\n\n# Create a scikit-learn pipeline with TableVectorizer\npipeline = Pipeline([\n    ('table_vectorizer', TableVectorizer()), # Automatically handles different column types\n    ('classifier', LogisticRegression(random_state=42))\n])\n\n# Fit and evaluate the pipeline\npipeline.fit(X_train, y_train)\nscore = pipeline.score(X_test, y_test)\nprint(f\"Pipeline score: {score:.2f}\")","lang":"python","description":"This quickstart demonstrates how to use `skrub.TableVectorizer` within a scikit-learn pipeline to automatically preprocess a DataFrame containing mixed data types (here, messy categorical text) and then train a logistic regression model. `TableVectorizer` intelligently applies appropriate encoders to different column types."},"warnings":[{"fix":"Remove imports and usages of `ApplyToCols` and `ApplyToFrame`. Consult the skrub 0.8.0 documentation for alternative strategies for column-wise transformations.","message":"The `ApplyToCols` and `ApplyToFrame` transformers have been removed in version 0.8.0. 
Their functionality is intended to be covered by other methods; whether a simplified `ApplyToCols` is available under a different path should be checked against the release notes.","severity":"breaking","affected_versions":">=0.8.0"},{"fix":"Ensure your Python environment is 3.10 or newer. Upgrade Python using your preferred package manager (e.g., `conda install python=3.10` or `pyenv install 3.10.12`).","message":"The minimum Python version was raised to 3.10 in skrub 0.7.0. Installing or running on older Python environments will fail.","severity":"breaking","affected_versions":">=0.7.0"},{"fix":"Upgrade your `scikit-learn`, `requests`, and `polars` (if used) installations to meet the new minimum requirements: `pip install -U scikit-learn requests 'polars>=1.5'`.","message":"The minimum versions of the key dependencies `scikit-learn` (>=1.4.2) and `requests` (>=2.27.1) were raised in 0.7.0. Additionally, the minimum `polars` version (optional dependency) was raised to >=1.5 in 0.8.0.","severity":"breaking","affected_versions":">=0.7.0, >=0.8.0 (for polars)"},{"fix":"Migrate away from `KenEmbeddings` to other available encoders provided by skrub, such as `GapEncoder` or `TableVectorizer`, which offer similar or enhanced functionality.","message":"Ken embeddings (`skrub.KenEmbeddings`) were deprecated in skrub 0.6.2 and will be removed in a future version. Usage will emit a `DeprecationWarning`.","severity":"deprecated","affected_versions":">=0.6.2"},{"fix":"If you were directly using `compute_ngram_distance`, switch to its private counterpart `_compute_ngram_distance`. 
Be aware that private functions may change without notice.","message":"The `compute_ngram_distance` utility function was made private (`_compute_ngram_distance`) in 0.7.2 to reduce API clutter and signal that it is not part of the public API.","severity":"gotcha","affected_versions":">=0.7.2"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Update your code to remove imports of `ApplyToCols` and refactor the transformation logic using other `skrub` transformers or direct pandas operations, as the old class no longer exists.","cause":"The `ApplyToCols` transformer class was removed in skrub version 0.8.0.","error":"ImportError: cannot import name 'ApplyToCols' from 'skrub'"},{"fix":"Replace `KenEmbeddings` with alternative encoders like `GapEncoder` or `TableVectorizer`. For example, `GapEncoder` often provides similar or better performance.","cause":"Your code is using `skrub.KenEmbeddings`, which has been deprecated since skrub 0.6.2.","error":"DeprecationWarning: Ken embeddings are deprecated and will be removed in a future version."},{"fix":"Upgrade your Python environment to version 3.10 or newer. Check `python --version` and install a newer version if needed.","cause":"Skrub versions 0.7.0 and later enforce a minimum Python version of 3.10.","error":"Your Python version is 3.9.x, but skrub >=0.7.0 requires Python >= 3.10. Please upgrade your Python version."},{"fix":"Review the skrub 0.7.1 release notes or the current documentation for `DropCols` and `SelectCols` to identify the correct attribute names.","cause":"Attribute names on `DropCols` and `SelectCols` instances were renamed in 0.7.1 for consistency, so the old `columns_` attribute is no longer available.","error":"AttributeError: 'DropCols' object has no attribute 'columns_'"}]}