SageMaker Scikit-learn Extension
SageMaker Scikit-learn Extension is an open-source library that extends scikit-learn for use with Amazon SageMaker. It provides robust encoders, time series feature extractors, and other transformers that streamline machine learning workflows on SageMaker. The current version is 2.5.0; releases typically arrive every few months and focus on new features and bug fixes.
Warnings
- breaking Version 2.0.0 introduced breaking changes by updating core dependencies. It requires `scikit-learn>=0.23,<1.2` and `mlio>=0.5,<0.6`. Earlier versions of these libraries are no longer supported.
- gotcha When serializing estimators that hold custom functions (for example, the `DateTimeDefinitions` used for date-time feature extraction), avoid lambda functions, because the standard `pickle` module cannot serialize them. Version 2.1.0 fixed `pickle` issues by using named functions instead; do the same for any callables you pass in.
- gotcha `TSFreshFeatureExtractor` explicitly disables parallelism when it runs in a `sagemaker_serve` execution environment. This is by design, to prevent resource contention during inference, but feature extraction can be slower in that context.
- breaking Version 2.5.0 includes bug fixes related to the `tsfresh` dependency. The library now expects `tsfresh>=0.17.0,<0.18.0`. Incompatible versions of `tsfresh` may cause broken functionality or runtime errors, especially in `TSFreshFeatureExtractor`.
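The lambda gotcha above can be reproduced with the standard library alone. The sketch below is illustrative only: `Extractor` is a hypothetical stand-in for an estimator that stores a callable, and `extract_hour` is an invented helper, neither is part of this library.

```python
import pickle


def extract_hour(hour_value):
    """Named, module-level function: pickle stores it by reference (module + name)."""
    return hour_value % 24


class Extractor:
    """Hypothetical stand-in for an estimator that keeps a user-supplied callable."""

    def __init__(self, func):
        self.func = func


def can_pickle(obj):
    """Return True if obj survives a pickle round trip, False otherwise."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


named_ok = can_pickle(Extractor(extract_hour))        # named function
lambda_ok = can_pickle(Extractor(lambda h: h % 24))   # anonymous function
print("named function picklable:", named_ok)
print("lambda picklable:", lambda_ok)
```

Running a check like `can_pickle` on a fitted estimator before deploying it is a cheap way to catch this class of problem early.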
Install
pip install sagemaker-scikit-learn-extension
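After installing, it can be useful to confirm that the dependency ranges listed under Warnings are satisfied in the current environment. The following is a minimal sketch using only the standard library; the `PINS` table is copied from the warnings in this document, the helper names are illustrative, and the version parser is deliberately simple (it ignores pre-release suffixes).

```python
from importlib import metadata

# Ranges copied from the warnings in this document: (inclusive lower, exclusive upper).
PINS = {
    "scikit-learn": ("0.23", "1.2"),
    "mlio": ("0.5", "0.6"),
    "tsfresh": ("0.17.0", "0.18.0"),
}


def parse_version(text):
    """Turn '0.23.2' into (0, 23, 2), keeping only the leading digits of each piece."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def in_range(installed, lower, upper):
    """True when lower <= installed < upper, comparing zero-padded integer tuples."""
    versions = [parse_version(v) for v in (lower, installed, upper)]
    width = max(len(v) for v in versions)
    low, cur, high = (v + (0,) * (width - len(v)) for v in versions)
    return low <= cur < high


def check_pins(pins=PINS):
    """Map each package name to 'ok', 'out of range', or 'not installed'."""
    report = {}
    for name, (lower, upper) in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        report[name] = "ok" if in_range(installed, lower, upper) else "out of range"
    return report


for package, status in check_pins().items():
    print(f"{package}: {status}")
```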
Imports
- RobustOrdinalEncoder
from sagemaker_sklearn_extension.preprocessing import RobustOrdinalEncoder
- TSFreshFeatureExtractor
from sagemaker_sklearn_extension.feature_extraction.sequences import TSFreshFeatureExtractor
- WOEEncoder
from sagemaker_sklearn_extension.preprocessing import WOEEncoder
- ThresholdOneHotEncoder
from sagemaker_sklearn_extension.preprocessing import ThresholdOneHotEncoder
Quickstart
import pandas as pd
from sagemaker_sklearn_extension.preprocessing import RobustOrdinalEncoder
data = pd.DataFrame({
'category': ['A', 'B', 'A', 'C', 'B', 'A', None, 'D'],
'value': [10, 20, 15, 25, 30, 12, 18, 22]
})
# Initialize the encoder.
# `max_categories` caps how many categories are kept per column; less frequent
# values beyond the cap are encoded the same way as unseen values.
# With `unknown_as_nan=False` (the default), unseen values map to n, where n - 1
# is the code of the last kept category; with `unknown_as_nan=True` they map to NaN.
encoder = RobustOrdinalEncoder(max_categories=3, unknown_as_nan=False)
# Fit and transform the 'category' column
encoded_data = encoder.fit_transform(data[['category']])
print("Original Data:\n", data)
print("\nEncoded 'category' column:\n", encoded_data.reshape(-1))
print("\nLearned categories:", encoder.categories_[0])
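To make the idea of capping the number of categories concrete, here is a pure-Python sketch of frequency-capped ordinal encoding on the same `category` values. It is a conceptual illustration, not the library's implementation: the function names, the tie-breaking rule, and the `-1` sentinel for unknown values are assumptions made for this example.

```python
from collections import Counter


def fit_capped_categories(values, max_categories):
    """Keep the max_categories most frequent non-missing values, sorted for stable codes."""
    counts = Counter(v for v in values if v is not None)
    # Rank by descending count, then by value, so ties are broken deterministically.
    ranked = sorted(counts, key=lambda v: (-counts[v], v))
    return sorted(ranked[:max_categories])


def encode(values, categories, unknown_code):
    """Map each value to its category index; anything else gets unknown_code."""
    index = {cat: i for i, cat in enumerate(categories)}
    return [index.get(v, unknown_code) for v in values]


values = ['A', 'B', 'A', 'C', 'B', 'A', None, 'D']
categories = fit_capped_categories(values, max_categories=3)
codes = encode(values, categories, unknown_code=-1)

print(categories)  # ['A', 'B', 'C']
print(codes)       # [0, 1, 0, 2, 1, 0, -1, -1]
```

Here `'D'` falls outside the three most frequent categories and `None` is missing, so both receive the sentinel code. The real encoder's handling of ties and missing values may differ, so check its output on your own data.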