{"id":7220,"library":"feature-engine","title":"Feature-engine","description":"Feature-engine is an open-source Python library offering a comprehensive suite of transformers for feature engineering and selection in machine learning. It provides functionality for missing data imputation, categorical encoding, discretisation, outlier handling, and feature transformation, creation, and selection. Feature-engine transformers follow Scikit-learn's `fit()` and `transform()` API; the library currently stands at version 1.9.4 and is actively developed, with regular releases.","status":"active","version":"1.9.4","language":"en","source_language":"en","source_url":"http://github.com/feature-engine/feature_engine","tags":["machine learning","feature engineering","feature selection","data science","preprocessing","scikit-learn","imputation","encoding"],"install":[{"cmd":"pip install feature-engine","lang":"bash","label":"PyPI"},{"cmd":"conda install -c conda-forge feature_engine","lang":"bash","label":"Anaconda/Conda-forge"}],"dependencies":[{"reason":"Core data structure for transformers (DataFrame in, DataFrame out).","package":"pandas","optional":false},{"reason":"Provides the API compatibility (fit/transform) and pipeline integration.","package":"scikit-learn","optional":false},{"reason":"Underlying numerical operations.","package":"numpy","optional":false},{"reason":"Statistical operations for some transformers.","package":"scipy","optional":false},{"reason":"Statistical models for certain transformations/selections.","package":"statsmodels","optional":true},{"reason":"Used in examples and for plotting distributions.","package":"matplotlib","optional":true},{"reason":"Used in examples and for plotting distributions.","package":"seaborn","optional":true}],"imports":[{"note":"Module paths were renamed in v1.0.0. The correct path is now `feature_engine.imputation`.","wrong":"from feature_engine.missing_data_imputers import MeanMedianImputer","symbol":"MeanMedianImputer","correct":"from feature_engine.imputation import MeanMedianImputer"},{"note":"Module paths were renamed in v1.0.0. The correct path is now `feature_engine.encoding`.","wrong":"from feature_engine.categorical_encoders import OneHotEncoder","symbol":"OneHotEncoder","correct":"from feature_engine.encoding import OneHotEncoder"},{"note":"Module paths were renamed in v1.0.0. The correct path is now `feature_engine.selection`.","wrong":"from feature_engine.variable_selection import DropCorrelatedFeatures","symbol":"DropCorrelatedFeatures","correct":"from feature_engine.selection import DropCorrelatedFeatures"}],"quickstart":{"code":"import pandas as pd\nfrom sklearn.datasets import fetch_openml\nfrom sklearn.model_selection import train_test_split\nfrom feature_engine.imputation import MeanMedianImputer\n\n# Load dataset\nX, y = fetch_openml(name=\"house_prices\", version=1, as_frame=True, return_X_y=True)\n\n# Separate into train and test sets\nX_train, X_test, y_train, y_test = train_test_split(\n    X, y, test_size=0.3, random_state=0\n)\n\n# Initialize a MeanMedianImputer for specified numerical variables\n# It will impute 'median' for LotFrontage and MasVnrArea\nmedian_imputer = MeanMedianImputer(\n    imputation_method='median',\n    variables=['LotFrontage', 'MasVnrArea']\n)\n\n# Fit the imputer on the training data\nmedian_imputer.fit(X_train)\n\n# Transform both training and test data\nX_train_imputed = median_imputer.transform(X_train)\nX_test_imputed = median_imputer.transform(X_test)\n\nprint(\"Missing values in 'LotFrontage' before imputation (train):\", X_train['LotFrontage'].isnull().sum())\nprint(\"Missing values in 'LotFrontage' after imputation (train):\", X_train_imputed['LotFrontage'].isnull().sum())\nprint(\"Missing values in 'MasVnrArea' before imputation (test):\", X_test['MasVnrArea'].isnull().sum())\nprint(\"Missing values in 'MasVnrArea' after imputation (test):\", X_test_imputed['MasVnrArea'].isnull().sum())","lang":"python","description":"This quickstart demonstrates how to use `feature-engine`'s `MeanMedianImputer` to handle missing data. It loads a dataset, splits it into training and testing sets, fits the imputer on the training data, and then transforms both sets. This follows the standard Scikit-learn `fit()` and `transform()` pattern, ensuring that imputation parameters are learned only from the training data."},"warnings":[{"fix":"Update import statements to the new module paths. Refer to the official documentation or the v1.0.0 release notes for a complete list of changes. E.g., `from feature_engine.imputation import MeanMedianImputer`.","message":"Module paths and some class names were renamed in v1.0.0 to better reflect their functionality and align with Scikit-learn conventions. For example, `feature_engine.missing_data_imputers` became `feature_engine.imputation`.","severity":"breaking","affected_versions":">=1.0.0"},{"fix":"Ensure the variables to be encoded are cast to `object` or `category` dtype before applying the encoder, or set the `ignore_format=True` parameter in the transformer if you intend to encode numerical variables (use with caution). Example: `df['numerical_col'] = df['numerical_col'].astype('object')`.","message":"Feature-engine's categorical encoders (e.g., `MeanEncoder`, `OneHotEncoder`) by default expect variables to be of pandas `object` or `category` dtype. Providing numerical variables without explicit handling will raise a `TypeError`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Use `feature-engine`'s `MatchCategories` transformer (introduced in v1.5.0) as part of your pipeline to align categories between training and test sets. Alternatively, ensure your chosen encoder has a strategy for handling unseen categories (e.g., grouping infrequent labels with `RareLabelEncoder`, or applying a custom mapping). For `OneHotEncoder` specifically, a bug fix in v1.1.2 addressed how it handles binary variables with `drop_last_binary=True`.","message":"When dealing with categorical features, it's common for the test set to contain categories not present in the training set ('unseen categories'). This can lead to errors during transformation, especially with certain encoding schemes.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If your code inspects the list of variables transformed, use `transformer.variables_` instead of `transformer.variables`: `variables` stores the user-supplied input, while `variables_` stores the variables the fitted transformer actually processed.","message":"In v1.1.0, most transformers gained a new attribute `variables_` which contains the names of the variables that were actually modified by the transformer. While the old `variables` attribute is generally retained, `variables_` should be preferred for consistency and accuracy.","severity":"gotcha","affected_versions":">=1.1.0"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Convert the numerical column(s) you want to encode to 'object' or 'category' dtype before fitting the encoder, or set `ignore_format=True` in the encoder's constructor if you deliberately want to encode numerical columns (e.g., `df['col'] = df['col'].astype('object')`).","cause":"A categorical encoding transformer was applied to variables that are not of pandas 'object' or 'category' dtype, but rather numerical.","error":"TypeError: Some of the variables are not categorical. Please cast them as object or category before calling this transformer"},{"fix":"Use `feature_engine.preprocessing.MatchCategories()` at the preprocessing stage to ensure consistent categories across train and test sets. Alternatively, review your chosen encoder's parameters for handling unknown categories (e.g., `RareLabelEncoder(tol=...)` to group infrequent categories before encoding).","cause":"This often occurs when trying to directly access or manipulate categories that are present in the test set but were not seen in the training set by a fitted encoder or discretizer.","error":"KeyError: \"['some_category'] not in index\""}]}