Feature-engine
Feature-engine is an open-source Python library offering a suite of transformers for feature engineering and feature selection in machine learning. It covers missing data imputation, categorical encoding, discretisation, outlier handling, feature transformation, feature creation, and feature selection. Its transformers follow Scikit-learn's `fit()` and `transform()` API. The current version is 1.9.4, and the library is actively developed with regular releases.
Common errors
- TypeError: Some of the variables are not categorical. Please cast them as object or category before calling this transformer
  cause: A categorical encoder was applied to numerical variables instead of variables of pandas `object` or `category` dtype.
  fix: Cast the numerical column(s) to `object` or `category` dtype before fitting the encoder (e.g., `df['col'] = df['col'].astype('object')`), or set `ignore_format=True` in the encoder's constructor if you deliberately want to encode numerical columns.
- KeyError: "['some_category'] not in index"
  cause: The test set contains categories that the fitted encoder or discretiser did not see in the training set, and later code tries to access or manipulate them directly.
  fix: Use `feature_engine.preprocessing.MatchCategories()` at the preprocessing stage to keep categories consistent across train and test sets. Alternatively, review how your chosen encoder handles unknown categories (e.g., `RareLabelEncoder(tol=...)`, or the `unseen` parameter of encoders such as `OrdinalEncoder`). Note that `handle_unknown='ignore'` is a parameter of Scikit-learn's `OneHotEncoder`, not of Feature-engine's.
Warnings
- breaking: Module paths and some class names were renamed in v1.0.0 to better reflect their functionality and align with Scikit-learn conventions. For example, `feature_engine.imputers` became `feature_engine.imputation`.
- gotcha: Feature-engine's categorical encoders (e.g., `MeanEncoder`, `OneHotEncoder`) by default expect variables of pandas `object` or `category` dtype. Passing numerical variables without explicit handling raises a `TypeError`.
- gotcha: The test set often contains categories that are not present in the training set ('unseen categories'). Depending on the encoder, this can raise errors or introduce missing values during transformation.
- breaking: In v1.1.0, most transformers gained a new attribute `variables_`, which contains the names of the variables that the transformer actually modifies. The old `variables` attribute is generally retained, but `variables_` should be preferred for consistency and accuracy.
Install
- pip install feature-engine
- conda install -c conda-forge feature_engine
Imports
- MeanMedianImputer
  outdated path: from feature_engine.imputers import MeanMedianImputer
  current path: from feature_engine.imputation import MeanMedianImputer
- OneHotEncoder
  outdated path: from feature_engine.categorical_encoders import OneHotEncoder
  current path: from feature_engine.encoding import OneHotEncoder
- DropCorrelatedFeatures
  outdated path: from feature_engine.variable_selection import DropCorrelatedFeatures
  current path: from feature_engine.selection import DropCorrelatedFeatures
Quickstart
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from feature_engine.imputation import MeanMedianImputer
# Load dataset
X, y = fetch_openml(name="house_prices", version=1, as_frame=True, return_X_y=True)
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
# Initialize a MeanMedianImputer for specified numerical variables
# It will impute 'median' for LotFrontage and MasVnrArea
median_imputer = MeanMedianImputer(
    imputation_method='median',
    variables=['LotFrontage', 'MasVnrArea']
)
# Fit the imputer on the training data
median_imputer.fit(X_train)
# Transform both training and test data
X_train_imputed = median_imputer.transform(X_train)
X_test_imputed = median_imputer.transform(X_test)
print("Missing values in 'LotFrontage' before imputation (train):", X_train['LotFrontage'].isnull().sum())
print("Missing values in 'LotFrontage' after imputation (train):", X_train_imputed['LotFrontage'].isnull().sum())
print("Missing values in 'MasVnrArea' before imputation (test):", X_test['MasVnrArea'].isnull().sum())
print("Missing values in 'MasVnrArea' after imputation (test):", X_test_imputed['MasVnrArea'].isnull().sum())