SageMaker Scikit-learn Extension

2.5.0 · active · verified Tue Apr 14

An open-source library of scikit-learn-compatible estimators and transformers, built for use with Amazon SageMaker. It provides robust encoders, time series feature extractors, and other transformers that streamline machine learning workflows on SageMaker. The current version is 2.5.0; updates are typically released every few months and focus on new features and bug fixes.

Warnings

Install

Imports

Quickstart

This quickstart uses `RobustOrdinalEncoder` to encode categorical data. A standard ordinal encoder raises an error by default when it meets a category it did not see during fitting; this encoder instead maps unknown categories to a dedicated value. The example fits and transforms a single pandas DataFrame column.

import pandas as pd
# The encoders live in the `preprocessing` subpackage.
from sagemaker_sklearn_extension.preprocessing import RobustOrdinalEncoder

data = pd.DataFrame({
    'category': ['A', 'B', 'A', 'C', 'B', 'A', None, 'D'],
    'value': [10, 20, 15, 25, 30, 12, 18, 22]
})

# Initialize the encoder.
# `max_categories` caps how many categories are kept per feature; the rest are treated as unknown.
# `unknown_as_nan=False` encodes unknown values as one extra integer code instead of NaN.
# (Parameter names follow the library's preprocessing docs; confirm them against your installed version.)
encoder = RobustOrdinalEncoder(max_categories=3, unknown_as_nan=False)

# Fit and transform the 'category' column
encoded_data = encoder.fit_transform(data[['category']])

print("Original Data:\n", data)
print("\nEncoded 'category' column:\n", encoded_data.reshape(-1))
print("\nLearned categories:", encoder.categories_[0])
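
For contrast, the failure mode that the quickstart refers to can be reproduced with plain scikit-learn. The following is a minimal sketch that uses only `sklearn.preprocessing.OrdinalEncoder`, not this library. The toy arrays `train` and `new` are made up for illustration, and the `handle_unknown="use_encoded_value"` option assumes scikit-learn 0.24 or later.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: "D" appears only at transform time.
train = np.array([["A"], ["B"], ["C"]], dtype=object)
new = np.array([["A"], ["D"]], dtype=object)

# Default behaviour: an unseen category raises ValueError during transform.
strict = OrdinalEncoder().fit(train)
try:
    strict.transform(new)
    strict_failed = False
except ValueError:
    strict_failed = True

# Opt-in tolerant behaviour: unseen categories are mapped to -1.
tolerant = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
).fit(train)
encoded = tolerant.transform(new).ravel()

print("strict encoder raised:", strict_failed)
print("tolerant encoding:", encoded)
```

A robust encoder makes the tolerant behaviour the default, so a pipeline fitted on training data does not fail when it is later given data containing new categories.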
