sklearn-pandas
sklearn-pandas (current version 2.2.0) bridges scikit-learn's machine learning methods and pandas DataFrames. It lets users map DataFrame columns to different scikit-learn transformations, whose outputs are then recombined into a single feature set for model training. The library aims to streamline preprocessing workflows that use both pandas and scikit-learn.
Common errors
- KeyError: '[column_name] not in index'
cause `DataFrameMapper` was initialized with a column name that does not exist in the DataFrame passed to `fit_transform` or `transform`.
fix Verify that every column name in your `DataFrameMapper` configuration exactly matches a column name in the input pandas DataFrame. Use `df.columns` to inspect the available columns.
- ValueError: Expected 2D array, got 1D array instead:
cause A scikit-learn transformer (e.g., `StandardScaler`, `MinMaxScaler`) that expects a 2D array received a 1D array, often because a single column was specified as a string instead of a list.
fix When specifying columns for transformers that require 2D input, enclose the column name in a list: `(['my_column'], StandardScaler())` instead of `('my_column', StandardScaler())`.
- ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float)
cause This error comes from TensorFlow rather than from sklearn-pandas itself. It often indicates mixed data types within a column or unexpected non-numeric values (e.g., empty strings in a numerical column), which leave the NumPy array produced by `DataFrameMapper` with a dtype the downstream model cannot process.
fix Ensure that all columns intended for numerical processing contain only numerical data, and handle missing values explicitly (e.g., call `fillna` on the DataFrame before passing it to `DataFrameMapper`, or use an imputer within the mapper for numerical columns).
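The `KeyError` fix above amounts to comparing the mapper's column selectors with `df.columns` before fitting. A minimal sketch of that check follows, using only pandas; the helper name `find_missing_columns` and the sample data are hypothetical and not part of sklearn-pandas.

```python
import pandas as pd


def find_missing_columns(df, selectors):
    """Return selector column names that are absent from df.columns.

    `selectors` mirrors the first element of each mapper entry: either a
    single column name (str) or a list of column names.
    """
    missing = []
    for selector in selectors:
        names = selector if isinstance(selector, list) else [selector]
        missing.extend(name for name in names if name not in df.columns)
    return missing


df = pd.DataFrame({
    "pet": ["cat", "dog"],
    "children": [4.0, 6.0],
    "salary": [90.0, 24.0],
})

# "salry" is a deliberate typo to show what the check reports.
selectors = ["pet", ["children"], "salry"]
missing = find_missing_columns(df, selectors)
print(missing)
```

Running a check like this before building the mapper surfaces typos at configuration time rather than inside `fit_transform`.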
Warnings
- deprecated `NumericalTransformer` was deprecated in `v2.1.0` and is slated for removal in a future release. Users should migrate to native scikit-learn transformers or implement custom transformers.
- breaking `CategoricalImputer`, `cross_val_score`, and `GridSearchCV` were removed in `sklearn-pandas v2.0.0`. Equivalent functionality is available directly in `scikit-learn` (for example, `SimpleImputer` in place of `CategoricalImputer`).
- gotcha By default, `DataFrameMapper.transform()` outputs a NumPy array, not a pandas DataFrame. This can lead to loss of column names and type information.
- gotcha The way a column is specified in `DataFrameMapper` (e.g., `'column_name'` vs. `['column_name']`) affects the shape of the array passed to the transformer (1D array vs. 2D array/column vector). Some scikit-learn transformers expect a 2D input.
- gotcha While `sklearn-pandas` bridges DataFrame functionality, recent versions of `scikit-learn` (v1.2+) introduced native `set_config(transform_output="pandas")` for transformers. This may reduce the need for `sklearn-pandas` in certain `sklearn.pipeline` contexts, but `DataFrameMapper` still offers granular column-wise transformation definition.
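The shape difference described in the second gotcha can be reproduced without a mapper. The sketch below uses plain pandas selection as a stand-in for the two selector styles (a string for the 1D case, a list for the 2D case) and shows that `StandardScaler` accepts only the 2D form. The sample values are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"children": [4.0, 6.0, 3.0, 3.0]})

# Selecting with a string yields a 1D array; selecting with a list yields
# a 2D column vector. This mirrors 'children' vs. ['children'] in a mapper.
one_d = df["children"].to_numpy()
two_d = df[["children"]].to_numpy()
print(one_d.shape, two_d.shape)

# The 2D input is accepted.
scaled = StandardScaler().fit_transform(two_d)

# The 1D input is rejected with the ValueError listed under Common errors.
try:
    StandardScaler().fit_transform(one_d)
    raised = False
    message = ""
except ValueError as exc:
    raised = True
    message = str(exc)

print(raised)
```

Transformers that operate on a single label-like sequence, such as `LabelBinarizer`, are the usual case where the string (1D) form is the right choice.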
Install
pip install sklearn-pandas
Imports
- DataFrameMapper
from sklearn_pandas import DataFrameMapper
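- CategoricalImputer (removed in sklearn-pandas 2.0.0)
from sklearn_pandas import CategoricalImputer # Wrong: raises ImportError in sklearn-pandas >= 2.0.0
from sklearn.impute import SimpleImputer # Correct: replacement in sklearn-pandas >= 2.0.0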
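As a rough illustration of the replacement import, `SimpleImputer` with `strategy="most_frequent"` can fill missing values in a categorical column. This is a sketch under the assumption that most-frequent imputation is the behavior you want; the sample data is made up, and the column is forced to `object` dtype to keep the example independent of pandas' string dtype defaults.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# One missing value in a categorical column; "dog" is the most frequent.
df = pd.DataFrame({
    "pet": pd.Series(["cat", "dog", np.nan, "dog"], dtype=object),
})

# SimpleImputer expects 2D input, so the column is selected with a list.
imputer = SimpleImputer(strategy="most_frequent")
filled = imputer.fit_transform(df[["pet"]])

print(filled[:, 0].tolist())
```

Inside a `DataFrameMapper`, the same transformer would be paired with a list selector such as `['pet']`, since it requires 2D input.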
Quickstart
import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelBinarizer, StandardScaler
data = pd.DataFrame({
    'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    'children': [4., 6, 3, 3, 2, 3, 5, 4],
    'salary': [90., 24, 44, 27, 32, 59, 36, 27]
})
# Map DataFrame columns to Scikit-learn transformations
mapper = DataFrameMapper([
    ('pet', LabelBinarizer()),
    (['children'], StandardScaler()),
    ('salary', None)  # 'None' keeps the column without transformation
], df_out=True)  # Set df_out=True to get a DataFrame output (requires pandas >= 1.0)
transformed_data = mapper.fit_transform(data.copy())
print(transformed_data.head())
print(transformed_data.columns)