Model Hosting Container Standards

0.1.14 · active · verified Fri Apr 10

The `model-hosting-container-standards` package is a Python toolkit for building standardized model hosting containers, with first-class Amazon SageMaker integration. It provides utilities for efficient model deployment and inference, including support for engines such as TensorRT-LLM and vLLM. Currently at version 0.1.14, the library sees frequent patch releases, indicating active enhancement and maintenance.

Warnings

Install
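Assuming the package is published on PyPI under its repository name (an assumption worth verifying against the project's own install instructions), installation is a single pip command:

```shell
pip install model-hosting-container-standards
```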

Imports

Quickstart

This quickstart demonstrates how to deploy a model on Amazon SageMaker using a container that adheres to the `model-hosting-container-standards`. It registers a SageMaker model backed by a vLLM-powered container image, setting the key environment variables for model ID, resource allocation, and an optional authentication token. The example assumes AWS credentials and a SageMaker execution role are configured in your environment. Note that the toolkit itself is for *building* such containers; this quickstart shows how to *consume* one on SageMaker.

import boto3
import os

sagemaker_client = boto3.client('sagemaker')

# Replace with your AWS account ID and region
account_id = os.environ.get('AWS_ACCOUNT_ID', '123456789012')
region = os.environ.get('AWS_REGION', 'us-east-1')

model_name = 'my-vllm-standard-model'
execution_role_arn = os.environ.get('SAGEMAKER_EXECUTION_ROLE_ARN', 'arn:aws:iam::123456789012:role/SageMakerExecutionRole')

# Example vLLM container image that adheres to the standards.
# Replace with an actual image URI from your private ECR repository
# or the Amazon ECR Public Gallery; the tag below is illustrative.
vllm_image = f"{account_id}.dkr.ecr.{region}.amazonaws.com/vllm:0.11.2-sagemaker-v1.2"

response = sagemaker_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=execution_role_arn,
    PrimaryContainer={
        'Image': vllm_image,
        'Environment': {
            'SM_VLLM_MODEL': 'meta-llama/Meta-Llama-3-8B-Instruct', # Hugging Face Model ID or S3 path
            'HUGGING_FACE_HUB_TOKEN': os.environ.get('HUGGING_FACE_HUB_TOKEN', ''), # Securely provide token
            'SM_VLLM_MAX_MODEL_LEN': '2048',
            'SM_VLLM_GPU_MEMORY_UTILIZATION': '0.9',
            'SM_VLLM_DTYPE': 'auto',
            'SM_VLLM_TENSOR_PARALLEL_SIZE': '1'
        }
    }
)

print(f"Model creation initiated: {response['ModelArn']}")
# Further steps would involve creating an Endpoint Configuration and an Endpoint
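The remaining steps mentioned above can be sketched as follows. This is a minimal sketch, not the library's own API: the endpoint names and the `ml.g5.2xlarge` instance type are illustrative assumptions, while `create_endpoint_config` and `create_endpoint` are the standard boto3 SageMaker hosting calls.

```python
def deploy_endpoint(sagemaker_client, model_name, endpoint_name,
                    instance_type="ml.g5.2xlarge"):
    """Create an endpoint configuration and an endpoint for an existing SageMaker model."""
    config_name = f"{endpoint_name}-config"
    # One production variant serving all traffic from a single GPU instance.
    sagemaker_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            'VariantName': 'AllTraffic',
            'ModelName': model_name,
            'InstanceType': instance_type,
            'InitialInstanceCount': 1,
        }],
    )
    # Endpoint creation is asynchronous; poll DescribeEndpoint (or use a boto3
    # waiter) until EndpointStatus is 'InService' before sending requests.
    return sagemaker_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config_name,
    )
```

Taking the client as a parameter keeps the helper easy to exercise against a stub; in practice you would call `deploy_endpoint(boto3.client('sagemaker'), 'my-vllm-standard-model', 'my-vllm-endpoint')` and wait for the endpoint to reach `InService`.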
