{"id":6232,"library":"sagemaker-scikit-learn-extension","title":"SageMaker Scikit-learn Extension","description":"An open-source library of scikit-learn-compatible extensions designed for use with Amazon SageMaker. It provides robust encoders, time series feature extractors, and other transformers to streamline machine learning workflows on SageMaker. The current version is 2.5.0; updates are typically released every few months and focus on new features and bug fixes.","status":"active","version":"2.5.0","language":"en","source_language":"en","source_url":"https://github.com/aws/sagemaker-scikit-learn-extension/","tags":["aws","sagemaker","machine-learning","scikit-learn","feature-engineering","time-series","data-preprocessing"],"install":[{"cmd":"pip install sagemaker-scikit-learn-extension","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Core dependency for scikit-learn compatible estimators.","package":"scikit-learn","optional":false},{"reason":"Internal data handling and processing library.","package":"mlio","optional":false},{"reason":"Used by TSFreshFeatureExtractor for time series feature engineering.","package":"tsfresh","optional":false},{"reason":"Data manipulation for various transformers.","package":"pandas","optional":false}],"imports":[{"symbol":"RobustOrdinalEncoder","correct":"from sagemaker_sklearn_extension.preprocessing import RobustOrdinalEncoder"},{"symbol":"TSFreshFeatureExtractor","correct":"from sagemaker_sklearn_extension.feature_extraction.sequences import TSFreshFeatureExtractor"},{"symbol":"WOEEncoder","correct":"from sagemaker_sklearn_extension.preprocessing import WOEEncoder"},{"symbol":"ThresholdOneHotEncoder","correct":"from sagemaker_sklearn_extension.preprocessing import ThresholdOneHotEncoder"}],"quickstart":{"code":"import pandas as pd\nfrom sagemaker_sklearn_extension.preprocessing import RobustOrdinalEncoder\n\ndata = pd.DataFrame({\n    'category': ['A', 'B', 'A', 'C', 'B', 'A', 
None, 'D'],\n    'value': [10, 20, 15, 25, 30, 12, 18, 22]\n})\n\n# Initialize the encoder.\n# `max_categories` caps how many categories are learned per feature.\n# Values not seen during fit map to a reserved code instead of raising an error.\nencoder = RobustOrdinalEncoder(max_categories=3)\n\n# Fit and transform the 'category' column\nencoded_data = encoder.fit_transform(data[['category']])\n\nprint(\"Original Data:\\n\", data)\nprint(\"\\nEncoded 'category' column:\\n\", encoded_data.reshape(-1))\nprint(\"\\nLearned categories:\", encoder.categories_[0])\n","lang":"python","description":"This quickstart demonstrates how to use the `RobustOrdinalEncoder` to encode categorical data. It maps unknown categories to a reserved code, preventing the errors that standard ordinal encoders can raise on new data. The example fits and transforms a pandas DataFrame column."},"warnings":[{"fix":"Ensure your environment has `scikit-learn` and `mlio` installed within the specified version ranges. Install compatible versions if necessary: `pip install 'scikit-learn>=0.23,<1.2' 'mlio>=0.5,<0.6'`.","message":"Version 2.0.0 introduced breaking changes by updating core dependencies. It requires `scikit-learn>=0.23,<1.2` and `mlio>=0.5,<0.6`. Earlier versions of these libraries are no longer supported.","severity":"breaking","affected_versions":">=2.0.0"},{"fix":"Always use named functions instead of anonymous lambda functions when defining custom logic within estimators that need to be serialized (e.g., using `pickle` for model deployment).","message":"When serializing estimators that use custom functions (like `DateTimeDefinition` in `DateTimeVectorizer`), avoid using lambda functions. 
Version 2.1.0 addressed `pickle` issues, and named functions are recommended for stable serialization.","severity":"gotcha","affected_versions":"All"},{"fix":"Be aware of this limitation when deploying models that use `TSFreshFeatureExtractor` to SageMaker inference endpoints. Optimize input data or pre-compute features if latency becomes an issue in `sagemaker_serve` environments.","message":"The parallelism feature of `TSFreshFeatureExtractor` is explicitly disabled when running in a `sagemaker_serve` execution environment. This is by design, to prevent resource contention during inference, but it means feature extraction may be slower in such contexts.","severity":"gotcha","affected_versions":">=2.5.0"},{"fix":"Pin your `tsfresh` dependency to the required range: `pip install 'tsfresh>=0.17.0,<0.18.0'`.","message":"Version 2.5.0 includes bug fixes related to the `tsfresh` dependency. The library now expects `tsfresh>=0.17.0,<0.18.0`. Incompatible versions of `tsfresh` may lead to broken functionality or runtime errors, especially with `TSFreshFeatureExtractor`.","severity":"breaking","affected_versions":">=2.5.0"}],"env_vars":null,"last_verified":"2026-04-14T00:00:00.000Z","next_check":"2026-07-13T00:00:00.000Z","problems":[]}