Category Encoders
Category Encoders is a Python library of scikit-learn-style transformers that encode categorical variables as numeric features using a range of techniques. It accepts and returns pandas DataFrames and works inside scikit-learn pipelines. The library is actively maintained: the current version is 2.9.0, and regular releases add new encoders, features, and bug fixes.
Warnings
- breaking Version 2.x dropped support for older Python, pandas, and scikit-learn versions. Specifically, `category-encoders` v2.x requires Python >=3.11 and pandas >=1.0, and no longer supports scikit-learn 0.x.
- breaking Default parameters for some encoders, such as `TargetEncoder` (issue 327) and `HelmertEncoder` (`handle_missing`, `handle_unknown`), changed in minor 2.x releases. Upgrading can therefore subtly alter encoded values compared to earlier versions.
- gotcha For supervised encoders (e.g., `TargetEncoder`, `LeaveOneOutEncoder`), always use `fit_transform(X_train, y_train)` for training data and `transform(X_test)` for test data. Using `fit().transform()` on training data can give different results: `fit_transform` passes `y` through to `transform`, and some encoders (e.g., `LeaveOneOutEncoder`) use it to exclude each row's own target value from its encoding, which reduces overfitting during training.
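- gotcha Categories that appear in new data (e.g., in a production environment) but were not seen during `fit` can lead to errors or unexpected values. How they are handled is controlled by the `handle_unknown` parameter, and the resulting behavior varies by encoder; for `TargetEncoder`, the default `handle_unknown='value'` encodes unknown categories as the target mean.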
- gotcha If the `cols` parameter is not provided during encoder instantiation, `category-encoders` will attempt to encode *all* non-numeric columns (object or pandas categorical dtype). This can unintentionally encode ID columns or numerical columns that were loaded as strings.
- gotcha Using `OrdinalEncoder` for nominal (unordered) categorical variables can introduce an artificial, misleading order into the data, which may negatively impact models sensitive to numerical relationships (e.g., linear models).
- gotcha Installing `category-encoders` via `conda-forge` might provide an older version of the library (e.g., 1.x) that lacks recent features, bug fixes, or compatibility updates present in the latest pip release.
Install
pip install category-encoders
Imports
- TargetEncoder
from category_encoders import TargetEncoder
- OneHotEncoder
from category_encoders import OneHotEncoder
- OrdinalEncoder
from category_encoders import OrdinalEncoder
- BinaryEncoder
from category_encoders import BinaryEncoder
Quickstart
import pandas as pd
import category_encoders as ce
# Sample data
data = {
    'city': ['New York', 'London', 'Paris', 'New York', 'London', 'Berlin'],
    'country': ['USA', 'UK', 'France', 'USA', 'UK', 'Germany'],
    'target': [10, 20, 15, 12, 22, 18],
}
df = pd.DataFrame(data)
# Separate the features from the target so the target is not passed in as a feature.
X = df.drop(columns=['target'])
y = df['target']
# Initialize and fit the TargetEncoder.
# Specify 'cols' explicitly so that only the intended columns are encoded.
# For supervised encoders, 'y' is passed to fit_transform.
encoder = ce.TargetEncoder(cols=['city', 'country'])
encoded_df = encoder.fit_transform(X, y)
print("Original DataFrame:")
print(df)
print("\nEncoded DataFrame:")
print(encoded_df)