NVIDIA Model Optimizer

0.42.0 · active · verified Wed Apr 15

NVIDIA Model Optimizer (nvidia-modelopt) is an open toolkit designed to accelerate AI inference by applying state-of-the-art model optimization techniques such as quantization, pruning, and distillation. It primarily targets PyTorch and ONNX models, integrating directly into the training loop and enabling seamless deployment to NVIDIA's inference frameworks like TensorRT-LLM and TensorRT. The library is actively developed, with its current stable version being 0.42.0, and frequent pre-release candidates (e.g., 0.43.0rcX) indicating a rapid release cadence.
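To make the quantization idea concrete before the library-specific quickstart below, here is a minimal, self-contained sketch of symmetric per-tensor int8 quantization in plain Python. This illustrates the core mapping (float weights onto a low-precision integer grid and back) only; it is not the nvidia-modelopt API, and the helper names are made up for this example.

```python
# Conceptual illustration of post-training quantization:
# map float weights to an int8 grid, then dequantize back.
# Not the nvidia-modelopt API -- helper names are hypothetical.

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: returns (int codes, scale)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to floats."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 1.27, -1.27]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
print(q)        # integer codes, e.g. [12, -50, 33, 127, -127]
print(max_err)  # reconstruction error is bounded by scale / 2
```

Real FP8 quantization in Model Optimizer additionally uses calibration data to pick scales, but the quantize/dequantize round trip above is the underlying mechanism.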

Warnings

Install
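The package is published on PyPI as `nvidia-modelopt` (the name used throughout this page). A typical install, with NVIDIA's extra package index as a fallback when prebuilt wheels are missing for your platform:

```shell
# Install the stable release from PyPI
pip install nvidia-modelopt

# If wheels for your platform are missing, add NVIDIA's package index
pip install nvidia-modelopt --extra-index-url https://pypi.nvidia.com
```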

Imports

Quickstart

This quickstart demonstrates how to load a Hugging Face model and apply FP8 quantization using `NVIDIAModelOptConfig`. It shows how Model Optimizer integrates with Hugging Face Diffusers to prepare a model for efficient deployment.

import torch
from diffusers import AutoModel, NVIDIAModelOptConfig
from modelopt.torch.opt import enable_huggingface_checkpointing

# Allow Model Optimizer state to round-trip through the standard
# Hugging Face save_pretrained / from_pretrained workflow
enable_huggingface_checkpointing()

# Model ID and compute dtype
model_id = "Efficient-Large-Model/Sana_600M_1024px_diffusers"
dtype = torch.bfloat16

# FP8 quantization, handled by the "modelopt" backend
quantization_config = NVIDIAModelOptConfig(quant_type="FP8", quant_method="modelopt")

# Loading requires a CUDA-capable GPU with recent NVIDIA drivers
try:
    print(f"Loading {model_id} with FP8 quantization...")
    model = AutoModel.from_pretrained(
        model_id,
        subfolder="transformer",
        quantization_config=quantization_config,
        torch_dtype=dtype,
    )
    print("Model loaded with quantization enabled.")

    # To persist the quantized checkpoint for later deployment:
    # model.save_pretrained('path/to/sana_fp8', safe_serialization=False)
except Exception as e:
    print(f"Error loading or processing model: {e}")
    print(
        "Ensure `diffusers` is installed, a compatible GPU is available, and "
        "consider installing with `--extra-index-url https://pypi.nvidia.com` "
        "if wheels are missing."
    )
