{"id":8646,"library":"sklearn-pandas","title":"sklearn-pandas","description":"sklearn-pandas (current version 2.2.0) provides a bridge between Scikit-Learn's machine learning methods and pandas DataFrames. It allows users to map DataFrame columns to different scikit-learn transformations, which are then recombined into features for model training. The library aims to streamline data preprocessing workflows involving both pandas and scikit-learn.","status":"active","version":"2.2.0","language":"en","source_language":"en","source_url":"https://github.com/scikit-learn-contrib/sklearn-pandas","tags":["data transformation","scikit-learn","pandas","machine learning","feature engineering","preprocessing"],"install":[{"cmd":"pip install sklearn-pandas","lang":"bash","label":"Install with pip"}],"dependencies":[{"reason":"Fundamental numerical computing library for array operations.","package":"numpy","optional":false},{"reason":"Core data structure for DataFrames used in mapping transformations.","package":"pandas","optional":false},{"reason":"Scientific computing library, often a dependency of scikit-learn.","package":"scipy","optional":false},{"reason":"Machine learning library providing the transformers and estimators.","package":"scikit-learn","optional":false}],"imports":[{"symbol":"DataFrameMapper","correct":"from sklearn_pandas import DataFrameMapper"},{"note":"CategoricalImputer was removed in sklearn-pandas 2.0.0 as similar functionality is now available directly in scikit-learn (e.g., SimpleImputer with 'most_frequent' strategy).","wrong":"from sklearn_pandas import CategoricalImputer","symbol":"CategoricalImputer","correct":"from sklearn.impute import SimpleImputer # Replaced in sklearn-pandas >= 2.0.0"}],"quickstart":{"code":"import pandas as pd\nfrom sklearn_pandas import DataFrameMapper\nfrom sklearn.preprocessing import LabelBinarizer, StandardScaler\n\ndata = pd.DataFrame({\n    'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],\n    
'children': [4., 6, 3, 3, 2, 3, 5, 4],\n    'salary': [90., 24, 44, 27, 32, 59, 36, 27]\n})\n\n# Map DataFrame columns to scikit-learn transformations\nmapper = DataFrameMapper([\n    ('pet', LabelBinarizer()),\n    (['children'], StandardScaler()),\n    ('salary', None)  # None keeps the column without transformation\n], df_out=True)  # Set df_out=True to get a DataFrame output (requires pandas >= 1.0)\n\ntransformed_data = mapper.fit_transform(data.copy())\nprint(transformed_data.head())\nprint(transformed_data.columns)","lang":"python","description":"This quickstart demonstrates how to use `DataFrameMapper` to apply different scikit-learn transformers to specific columns of a pandas DataFrame. The categorical 'pet' column is binarized, 'children' is standardized, and 'salary' is passed through unchanged. Setting `df_out=True` (requires pandas >= 1.0) makes the output a DataFrame rather than a NumPy array."},"warnings":[{"fix":"Avoid using `NumericalTransformer`. For common numerical transformations, use transformers from `sklearn.preprocessing` or custom `FunctionTransformer` instances.","message":"`NumericalTransformer` was deprecated in `v2.1.0` and is slated for removal in a future release. Users should migrate to native scikit-learn transformers or implement custom transformers.","severity":"deprecated","affected_versions":">=2.1.0"},{"fix":"Replace `CategoricalImputer` with `sklearn.impute.SimpleImputer(strategy='most_frequent')`. For cross-validation and grid search, use `sklearn.model_selection.cross_val_score` and `GridSearchCV` directly, as they now support pandas DataFrames.","message":"`CategoricalImputer`, `cross_val_score`, and `GridSearchCV` were removed in `sklearn-pandas v2.0.0`. Equivalent functionality is now available directly in `scikit-learn`.","severity":"breaking","affected_versions":">=2.0.0"},{"fix":"To receive a pandas DataFrame as output (if using pandas >= 1.0), initialize `DataFrameMapper` with `df_out=True`. 
Otherwise, manually convert the output NumPy array back to a DataFrame and re-add column names if needed.","message":"By default, `DataFrameMapper.transform()` outputs a NumPy array, not a pandas DataFrame. This can lead to loss of column names and type information.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For transformers expecting 2D input (e.g., `StandardScaler`), always pass column names as a list: `(['column_name'], Transformer())`. For transformers that can handle 1D input (e.g., `LabelBinarizer`), a string is often sufficient, but using a list ensures 2D input.","message":"The way a column is specified in `DataFrameMapper` (e.g., `'column_name'` vs. `['column_name']`) affects the shape of the array passed to the transformer (1D array vs. 2D array/column vector). Some scikit-learn transformers expect a 2D input.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Consider `scikit-learn`'s native DataFrame output features and `ColumnTransformer` alongside `sklearn-pandas` for new projects or when refactoring, especially if a simpler pipeline without complex column selection logic is sufficient.","message":"Recent versions of `scikit-learn` (v1.2+) introduced native DataFrame output for transformers via `set_config(transform_output=\"pandas\")`. This may reduce the need for `sklearn-pandas` in certain `sklearn.pipeline` contexts, but `DataFrameMapper` still offers granular column-wise transformation definition.","severity":"gotcha","affected_versions":"scikit-learn >= 1.2"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Verify that all column names specified in your `DataFrameMapper` configuration exactly match the column names in your input pandas DataFrame. 
Use `df.columns` to inspect available columns.","cause":"DataFrameMapper was initialized with a column name that does not exist in the DataFrame being passed to `fit_transform` or `transform`.","error":"KeyError: '[column_name] not in index'"},{"fix":"When specifying columns for transformers that require 2D input, enclose the column name in a list: `(['my_column'], StandardScaler())` instead of `('my_column', StandardScaler())`.","cause":"A scikit-learn transformer (e.g., `StandardScaler`, `MinMaxScaler`) that expects a 2D array as input received a 1D array, often because a single column was passed as a string instead of a list.","error":"ValueError: Expected 2D array, got 1D array instead:"},{"fix":"Ensure all columns intended for numerical processing contain only numerical data and handle missing values appropriately (e.g., `fillna` on the DataFrame before passing to `DataFrameMapper` or use an imputer within the mapper for numerical columns).","cause":"This often indicates inconsistent data types within a column or unexpected non-numeric values (e.g., empty strings in a numerical column) that a downstream estimator cannot process, especially after `DataFrameMapper` converts to a NumPy array.","error":"ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float)"}]}