sklearn-pandas

2.2.0 · active · verified Thu Apr 16

sklearn-pandas (current version 2.2.0) provides a bridge between Scikit-Learn's machine learning methods and pandas DataFrames. It allows users to map DataFrame columns to different scikit-learn transformations, which are then recombined into features for model training. The library aims to streamline data preprocessing workflows involving both pandas and scikit-learn.

Quickstart

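This quickstart demonstrates how to use `DataFrameMapper` to apply different scikit-learn transformers to specific columns of a pandas DataFrame. The categorical 'pet' column is binarized, 'children' is standardized, and 'salary' is passed through unchanged. A column named as a plain string is handed to its transformer as a 1-D array, while a column named inside a list is handed over as a 2-D array, which is why 'pet' and ['children'] are written differently. Setting `df_out=True` (requires pandas >= 1.0) makes the output a DataFrame rather than a NumPy array.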

```python
import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelBinarizer, StandardScaler

data = pd.DataFrame({
    'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    'children': [4., 6, 3, 3, 2, 3, 5, 4],
    'salary': [90., 24, 44, 27, 32, 59, 36, 27]
})

# Map DataFrame columns to scikit-learn transformations
mapper = DataFrameMapper([
    ('pet', LabelBinarizer()),
    (['children'], StandardScaler()),
    ('salary', None)  # None keeps the column without transformation
], df_out=True)  # df_out=True returns a DataFrame (requires pandas >= 1.0)

transformed_data = mapper.fit_transform(data.copy())
print(transformed_data.head())
print(transformed_data.columns)
```
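To make the mapping step less opaque, the sketch below assembles a comparable feature table by hand with only pandas and scikit-learn. It is an illustration of the idea, not the library's implementation: the `manual` variable and the `pet_<label>` column naming are assumptions chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, StandardScaler

data = pd.DataFrame({
    'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    'children': [4., 6, 3, 3, 2, 3, 5, 4],
    'salary': [90., 24, 44, 27, 32, 59, 36, 27]
})

# 'pet' is selected as a 1-D Series because LabelBinarizer expects 1-D input.
binarizer = LabelBinarizer()
pet_encoded = binarizer.fit_transform(data['pet'])
# Hypothetical naming scheme: one column per class found during fitting.
pet_columns = [f'pet_{label}' for label in binarizer.classes_]

# 'children' is selected as a 2-D frame because StandardScaler expects 2-D input.
scaler = StandardScaler()
children_scaled = scaler.fit_transform(data[['children']])

# Recombine the transformed pieces, keeping 'salary' untouched.
manual = pd.concat([
    pd.DataFrame(pet_encoded, columns=pet_columns, index=data.index),
    pd.DataFrame(children_scaled, columns=['children'], index=data.index),
    data[['salary']],
], axis=1)

print(manual.head())
print(list(manual.columns))
```

Doing this by hand means tracking each fitted transformer and the column order yourself, which is the bookkeeping a single mapper object is meant to take over.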
