Model Hosting Container Standards
`model-hosting-container-standards` is a Python toolkit for building standardized model hosting containers, with first-class Amazon SageMaker integration. It provides utilities for efficient model deployment and inference, including support for engines such as TensorRT-LLM and vLLM. Currently at version 0.1.14, the library is actively developed with frequent patch releases, indicating ongoing enhancements and maintenance.
Warnings
- gotcha Avoid hardcoding sensitive information like API keys or Hugging Face tokens directly into container images or deployment scripts. Always use environment variables, AWS Secrets Manager, or other secure credential management systems for runtime injection.
- gotcha Failing to set resource limits (CPU/Memory) for containers can lead to resource starvation, instability, and poor performance, especially in multi-container environments on a single instance.
- gotcha Relying on generic or 'latest' tags for container images in production deployments can introduce instability and make reproducibility difficult due to unexpected upstream changes.
- gotcha Building overly large container images increases deployment times, storage costs, and the attack surface. This is a common issue with custom ML containers.
- breaking The library explicitly requires Python 3.10 or newer. Deploying or developing with older Python versions (e.g., 3.9 or earlier) will lead to compatibility issues and failures.
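The first warning above can be made concrete. Below is a minimal sketch of runtime credential injection: prefer an environment variable set by the deployment platform, and fall back to AWS Secrets Manager. The helper name `resolve_hf_token` and the secret ID `hf-hub-token` are hypothetical, not part of the library.

```python
import os


def resolve_hf_token(secret_id: str = "hf-hub-token") -> str:
    """Resolve the Hugging Face token at runtime, never at image build time.

    Prefers an injected environment variable; falls back to AWS Secrets
    Manager (requires AWS credentials and a secret named `secret_id`,
    which is a hypothetical example here).
    """
    token = os.environ.get("HUGGING_FACE_HUB_TOKEN")
    if token:
        return token
    # Lazy import so the env-var path works without boto3 installed.
    import boto3
    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_id)["SecretString"]
```

Because the token is resolved at container start, rotating it requires only updating the secret or environment variable, not rebuilding the image.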
Install
pip install model-hosting-container-standards
Imports
- ModelHandler
from model_hosting_container_standards.common.handler.model_handler import ModelHandler
- FastAPI
from fastapi import FastAPI
Quickstart
import boto3
import os
sagemaker_client = boto3.client('sagemaker')
# Replace with your AWS account ID and region
account_id = os.environ.get('AWS_ACCOUNT_ID', '123456789012')
region = os.environ.get('AWS_REGION', 'us-east-1')
model_name = 'my-vllm-standard-model'
execution_role_arn = os.environ.get('SAGEMAKER_EXECUTION_ROLE_ARN', 'arn:aws:iam::123456789012:role/SageMakerExecutionRole')
# Example of a vLLM container image that adheres to the standards.
# Such images are typically published in the Amazon ECR Public Gallery or
# stored in a private ECR repository; substitute a real image URI here.
vllm_image = f"{account_id}.dkr.ecr.{region}.amazonaws.com/vllm:0.11.2-sagemaker-v1.2"
response = sagemaker_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=execution_role_arn,
    PrimaryContainer={
        'Image': vllm_image,
        'Environment': {
            'SM_VLLM_MODEL': 'meta-llama/Meta-Llama-3-8B-Instruct',  # Hugging Face model ID or S3 path
            'HUGGING_FACE_HUB_TOKEN': os.environ.get('HUGGING_FACE_HUB_TOKEN', ''),  # Inject securely at runtime
            'SM_VLLM_MAX_MODEL_LEN': '2048',
            'SM_VLLM_GPU_MEMORY_UTILIZATION': '0.9',
            'SM_VLLM_DTYPE': 'auto',
            'SM_VLLM_TENSOR_PARALLEL_SIZE': '1'
        }
    }
)
print(f"Model creation initiated: {response['ModelArn']}")
# Further steps would involve creating an Endpoint Configuration and an Endpoint
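Those further steps follow the standard SageMaker flow: `create_endpoint_config`, then `create_endpoint`. A hedged sketch is below; the variant name, instance type, and the `-config`/`-endpoint` naming convention are example choices, not requirements of the library.

```python
def build_endpoint_config(model_name: str,
                          instance_type: str = "ml.g5.2xlarge") -> dict:
    # Request payload for sagemaker.create_endpoint_config; values are examples.
    return {
        "EndpointConfigName": f"{model_name}-config",
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
        }],
    }


def deploy_endpoint(model_name: str) -> str:
    # Requires AWS credentials and an existing SageMaker Model; not executed here.
    import boto3
    sm = boto3.client("sagemaker")
    sm.create_endpoint_config(**build_endpoint_config(model_name))
    response = sm.create_endpoint(
        EndpointName=f"{model_name}-endpoint",
        EndpointConfigName=f"{model_name}-config",
    )
    return response["EndpointArn"]
```

Once `create_endpoint` returns, the endpoint transitions through `Creating` to `InService`, which can be polled with `describe_endpoint` before sending invocations.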