SageMaker Inference Toolkit
The sagemaker-inference toolkit is an open-source Python library designed to simplify the creation of serving containers for machine learning models on Amazon SageMaker. It provides a model serving stack built on Multi Model Server (MMS), enabling users to easily implement custom inference logic. The current version is 1.10.1, with a regular release cadence addressing bug fixes and new features, including support for newer Python versions and improved dependency management.
Common errors
- psutil.ZombieProcess: PID still exists but it's a zombie
  - cause A race condition in the model server's process monitoring, related to the `psutil` library, caused the inference process to incorrectly identify a running process as a zombie. This was particularly prevalent in certain PyTorch inference containers.
  - fix Update `sagemaker-inference` to version 1.10.1 or newer. If using PyTorch, ensure your PyTorch inference DLC version is recent enough to include the fix.
- ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (413) from primary and could not load the entire response body
  - cause This error (HTTP 413 Payload Too Large) typically indicates that the inference request payload exceeds the server's configured size limit; less commonly, a worker timeout occurred even though the payload is within limits.
  - fix Increase the `SAGEMAKER_MAX_REQUEST_SIZE` environment variable for your endpoint. If the payload is small, check `SAGEMAKER_MODEL_SERVER_TIMEOUT_SECONDS` (introduced in v1.9.3) and `InvocationTimeoutSeconds` on the endpoint, and increase the worker count if necessary.
- ModuleNotFoundError: No module named 'sagemaker_inference'
  - cause The `sagemaker-inference` library is not installed in the Python environment of your SageMaker container or local setup.
  - fix Ensure `pip install sagemaker-inference` is executed in your Dockerfile (for containers) or local development environment. If using a custom Dockerfile, verify the `RUN pip install ...` command is placed before your Python code runs.
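The server-side environment variables mentioned in the fixes above are set on the model itself. A minimal sketch of the `Environment` map as it could be passed to `CreateModel` (the specific values, and the commented boto3 call with `image_uri`, `model_data_url`, and `role_arn`, are illustrative assumptions, not recommendations):

```python
# Illustrative model-server settings; tune the values for your workload.
environment = {
    "SAGEMAKER_MAX_REQUEST_SIZE": str(100 * 1024 * 1024),  # allow payloads up to ~100 MB
    "SAGEMAKER_MODEL_SERVER_TIMEOUT_SECONDS": "120",       # worker timeout (v1.9.3+)
    "SAGEMAKER_MODEL_SERVER_WORKERS": "2",                 # number of model server workers
}

# Sketch of wiring this into a model, e.g. with boto3's SageMaker client:
# sagemaker_client.create_model(
#     ModelName="my-model",
#     PrimaryContainer={
#         "Image": image_uri,
#         "ModelDataUrl": model_data_url,
#         "Environment": environment,
#     },
#     ExecutionRoleArn=role_arn,
# )
```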
Warnings
- breaking Handler functions (`model_fn`, `input_fn`, `predict_fn`, `output_fn`) were updated in v1.7.0 to optionally accept a `context` object. If you were using older versions and had strict function signatures without `context`, this update might require changes, though omitting `context` from the declaration is still supported if not needed.
- gotcha Persistent 'psutil.ZombieProcess: PID still exists but it's a zombie' errors can occur, leading to endpoint instability or restarts. This was a known issue with specific `psutil` versions and PyTorch inference images.
- gotcha When using custom Docker containers with `sagemaker-inference`, the `multi-model-server` (MMS) must be explicitly installed within your Dockerfile. Forgetting this can lead to the model server failing to start.
- gotcha Custom Python dependencies specified in `requirements.txt` might fail to install if they are hosted in a private repository like AWS CodeArtifact without proper configuration.
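The v1.7.0 `context` change noted above is backwards compatible because the toolkit only passes `context` to handlers that declare it. A minimal sketch of that dispatch pattern (the `call_handler` helper is illustrative, not part of the library):

```python
import inspect

def call_handler(fn, *args, context=None):
    # Illustrative dispatcher: pass `context` only if the handler declares it,
    # mirroring how the toolkit keeps pre-1.7.0 signatures working.
    params = inspect.signature(fn).parameters
    if "context" in params:
        return fn(*args, context=context)
    return fn(*args)

# Pre-1.7.0 style: no `context` parameter -- still supported.
def model_fn_old(model_dir):
    return f"loaded:{model_dir}"

# v1.7.0+ style: optionally accepts `context`.
def model_fn_new(model_dir, context=None):
    return f"loaded:{model_dir}:ctx={context}"

print(call_handler(model_fn_old, "/opt/ml/model"))               # loaded:/opt/ml/model
print(call_handler(model_fn_new, "/opt/ml/model", context="c"))  # loaded:/opt/ml/model:ctx=c
```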
Install
-
pip install sagemaker-inference
Imports
- DefaultInferenceHandler
from sagemaker_inference.default_inference_handler import DefaultInferenceHandler
- model_server
from sagemaker_inference import model_server
- Transformer
from sagemaker_inference.transformer import Transformer
- DefaultHandlerService
from sagemaker_inference.default_handler_service import DefaultHandlerService
Quickstart
import os
import json

from sagemaker_inference.default_inference_handler import DefaultInferenceHandler
from sagemaker_inference import content_types, encoder


class CustomInferenceHandler(DefaultInferenceHandler):
    def default_model_fn(self, model_dir, context=None):
        """Loads a dummy model for demonstration. In a real scenario, this would
        load your actual trained model from `model_dir`.
        """
        print(f"Loading model from: {model_dir}")
        # Simulate loading a model artifact.
        # For example, if you had a 'model.pkl' in model_dir:
        # model_path = os.path.join(model_dir, 'model.pkl')
        # model = joblib.load(model_path)
        return {"status": "model_loaded", "path": model_dir}

    def default_input_fn(self, input_data, content_type, context=None):
        """Deserializes the input data from the request. Supports JSON and CSV."""
        if content_type == content_types.JSON:
            # decoder.decode would convert the JSON payload into a NumPy array;
            # here we want the raw dict, so we parse it directly.
            return json.loads(input_data)
        elif content_type == content_types.CSV:
            # Assuming CSV is a simple string for this example
            return input_data.decode('utf-8').split(',')
        else:
            raise ValueError(f"Unsupported content type: {content_type}")

    def default_predict_fn(self, data, model, context=None):
        """Makes a dummy prediction based on the input data and the loaded model."""
        print(f"Performing prediction with model: {model} and data: {data}")
        if isinstance(data, dict) and 'instances' in data:
            # Assume a common inference request format
            predictions = [item * 2 for item in data['instances']]
        elif isinstance(data, list):
            predictions = [item + "_processed" for item in data]
        else:
            predictions = f"Processed: {data}"
        return {"predictions": predictions}

    def default_output_fn(self, prediction, accept, context=None):
        """Serializes the prediction result to the requested accept type.
        Supports JSON.
        """
        if accept == content_types.JSON:
            # encoder.encode returns a JSON string for the JSON content type.
            return encoder.encode(prediction, accept)
        else:
            raise ValueError(f"Unsupported accept type: {accept}")


# To run this in a SageMaker container, you would have a Dockerfile that
# installs sagemaker-inference and multi-model-server, copies this file as
# 'inference.py', and sets up the entrypoint to start the model server,
# e.g., using sagemaker_inference.model_server.start_model_server().

# Example of how to manually test the handler (not typically run directly in a quickstart)
if __name__ == '__main__':
    handler = CustomInferenceHandler()
    model = handler.default_model_fn('/opt/ml/model')  # Simulates model_dir

    test_json_input = json.dumps({"instances": [1, 2, 3]}).encode('utf-8')
    json_data = handler.default_input_fn(test_json_input, content_types.JSON)
    json_prediction = handler.default_predict_fn(json_data, model)
    json_output = handler.default_output_fn(json_prediction, content_types.JSON)
    print(f"JSON Inference Result: {json_output}")

    test_csv_input = b'hello,world'
    csv_data = handler.default_input_fn(test_csv_input, content_types.CSV)
    csv_prediction = handler.default_predict_fn(csv_data, model)
    csv_output = handler.default_output_fn(csv_prediction, content_types.JSON)  # Output as JSON for simplicity
    print(f"CSV Inference Result: {csv_output}")
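The Dockerfile described in the comments above might look like the sketch below. The base image, package versions, and the `serve.py` entrypoint script are illustrative assumptions; MMS needs a Java runtime, and as noted in Warnings, `multi-model-server` must be installed explicitly alongside `sagemaker-inference`.

```dockerfile
# Minimal serving-container sketch; base image and file names are illustrative.
FROM python:3.10-slim

# MMS requires a Java runtime.
RUN apt-get update \
    && apt-get install -y --no-install-recommends openjdk-17-jre-headless \
    && rm -rf /var/lib/apt/lists/*

# Both packages must be installed explicitly.
RUN pip install --no-cache-dir sagemaker-inference multi-model-server

# inference.py is the handler above; serve.py is a small entrypoint that calls
# sagemaker_inference.model_server.start_model_server().
COPY inference.py serve.py /opt/ml/code/
ENV PYTHONPATH=/opt/ml/code

ENTRYPOINT ["python", "/opt/ml/code/serve.py"]
```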