Feature-engine

1.9.4 · active · verified Thu Apr 16

Feature-engine is an open-source Python library of transformers for feature engineering and feature selection in machine learning. It covers missing-data imputation, categorical encoding, discretisation, outlier handling, variable transformation, feature creation, and feature selection. Its transformers follow Scikit-learn's `fit()` and `transform()` API. The current version is 1.9.4, and the project is actively developed with regular releases.

Common errors

Warnings

Install

Imports

Quickstart

This quickstart uses Feature-engine's `MeanMedianImputer` to handle missing data. It loads a dataset, splits it into training and test sets, fits the imputer on the training set, and then transforms both sets. This is the standard Scikit-learn `fit()` and `transform()` pattern: the imputation values (here, medians) are learned from the training data only, so no information from the test set leaks into them.

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from feature_engine.imputation import MeanMedianImputer

# Load dataset
X, y = fetch_openml(name="house_prices", version=1, as_frame=True, return_X_y=True)

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Set up a MeanMedianImputer for two numerical variables.
# Missing values in LotFrontage and MasVnrArea will be replaced
# with each variable's median, learned during fit().
median_imputer = MeanMedianImputer(
    imputation_method='median',
    variables=['LotFrontage', 'MasVnrArea']
)

# Fit the imputer on the training data
median_imputer.fit(X_train)

# Transform both training and test data
X_train_imputed = median_imputer.transform(X_train)
X_test_imputed = median_imputer.transform(X_test)

print("Missing values in 'LotFrontage' before imputation (train):", X_train['LotFrontage'].isnull().sum())
print("Missing values in 'LotFrontage' after imputation (train):", X_train_imputed['LotFrontage'].isnull().sum())
print("Missing values in 'MasVnrArea' before imputation (test):", X_test['MasVnrArea'].isnull().sum())
print("Missing values in 'MasVnrArea' after imputation (test):", X_test_imputed['MasVnrArea'].isnull().sum())
