{"id":6744,"library":"nvidia-modelopt","title":"NVIDIA Model Optimizer","description":"NVIDIA Model Optimizer (nvidia-modelopt) is an open-source toolkit that accelerates AI inference by applying state-of-the-art model optimization techniques such as quantization, pruning, and distillation. It primarily targets PyTorch and ONNX models, integrates directly into the training loop, and enables seamless deployment to NVIDIA inference frameworks such as TensorRT-LLM and TensorRT. The library is actively developed, with a current stable version of 0.42.0 and frequent pre-release candidates (e.g., 0.43.0rcX) indicating a rapid release cadence.","status":"active","version":"0.42.0","language":"en","source_language":"en","source_url":"https://github.com/NVIDIA/Model-Optimizer","tags":["AI","Machine Learning","Optimization","Quantization","Pruning","Distillation","PyTorch","ONNX","NVIDIA","GPU","Inference"],"install":[{"cmd":"pip install nvidia-modelopt","lang":"bash","label":"Base Installation"},{"cmd":"pip install \"nvidia-modelopt[all]\" --extra-index-url https://pypi.nvidia.com","lang":"bash","label":"Full Installation (Recommended)"}],"dependencies":[{"reason":"Required Python version range.","package":"python","version":">=3.10, <3.13","optional":false},{"reason":"Core dependency for PyTorch model optimization.","package":"torch","version":">=2.6","optional":false},{"reason":"Used for configuration validation.","package":"pydantic","version":">=2.0","optional":false},{"reason":"Optional dependency for ONNX model optimization, included with the '[onnx]' or '[all]' extra.","package":"onnx","version":"~=1.19.0","optional":true},{"reason":"Optional dependency for the ONNX runtime, included with the '[onnx]' or '[all]' extra (platform/Python version dependent).","package":"onnxruntime-gpu","version":"~=1.24.2","optional":true}],"imports":[{"symbol":"NVIDIAModelOptConfig","correct":"from diffusers import NVIDIAModelOptConfig"},{"symbol":"enable_huggingface_checkpointing","correct":"from modelopt.torch.opt import enable_huggingface_checkpointing"},{"symbol":"quantization","correct":"import modelopt.torch.quantization as mtq"},{"symbol":"export_hf_checkpoint","correct":"from modelopt.torch.export import export_hf_checkpoint"}],"quickstart":{"code":"import torch\nfrom diffusers import AutoModel, NVIDIAModelOptConfig\nfrom modelopt.torch.opt import enable_huggingface_checkpointing\n\n# Enable saving/loading of ModelOpt-optimized Hugging Face checkpoints\nenable_huggingface_checkpointing()\n\n# Define the model ID and data type\nmodel_id = \"Efficient-Large-Model/Sana_600M_1024px_diffusers\"\ndtype = torch.bfloat16\n\n# Define the FP8 quantization configuration\n# If the model requires authentication, pass token=os.environ.get('HF_TOKEN')\n# to from_pretrained() (requires `import os`)\nquantization_config = NVIDIAModelOptConfig(quant_type=\"FP8\", quant_method=\"modelopt\")\n\n# Load the model with the quantization configuration applied\n# A compatible NVIDIA GPU with working drivers and a CUDA setup is required\ntry:\n    print(f\"Attempting to load model {model_id} with FP8 quantization...\")\n    model = AutoModel.from_pretrained(\n        model_id,\n        subfolder=\"transformer\",\n        quantization_config=quantization_config,\n        torch_dtype=dtype,\n    )\n    print(\"Model loaded successfully with quantization enabled.\")\n    # Run inference here (replace with actual usage), e.g.:\n    # output = model(model_inputs)\n\n    # To save the quantized model (requires a path):\n    # model.save_pretrained('path/to/sana_fp8', safe_serialization=False)\nexcept Exception as e:\n    print(f\"Error loading or processing model: {e}\")\n    print(\"Ensure `diffusers` is installed, a compatible GPU is available, and that `--extra-index-url https://pypi.nvidia.com` was used during installation if issues persist.\")\n","lang":"python","description":"This quickstart demonstrates how to load a Hugging Face model and apply FP8 quantization using `NVIDIAModelOptConfig`. It shows how Model Optimizer integrates with popular deep learning libraries such as Hugging Face Diffusers to prepare models for efficient deployment."},"warnings":[{"fix":"Use `pip install \"nvidia-modelopt[all]\" --extra-index-url https://pypi.nvidia.com` for a comprehensive installation.","message":"For full functionality, especially with pre-release versions or specific NVIDIA-optimized components, `nvidia-modelopt` often must be installed with `--extra-index-url https://pypi.nvidia.com`. Without it, certain features or versions may be unavailable or incompatible.","severity":"gotcha","affected_versions":"All versions, especially when using pre-releases (e.g., 0.43.0rcX) or specific NVIDIA integrations."},{"fix":"For applications requiring `num_query_groups` in Minitron pruning, use ModelOpt 0.40.0 or earlier, or adapt your pruning strategy to the newer APIs if available in the current version.","message":"The `num_query_groups` parameter in Minitron pruning (specifically `mcore_minitron`) was deprecated. If you relied on it, you may need to use an older version of ModelOpt.","severity":"breaking","affected_versions":"0.41.0 and later."},{"fix":"Ensure your Python environment is within the supported range (`>=3.10, <3.13`). Check `requires_python` on PyPI for the most up-to-date requirements.","message":"NVIDIA Model Optimizer has specific Python version requirements (currently Python >=3.10, <3.13). Using an incompatible Python version can lead to installation failures or runtime errors.","severity":"gotcha","affected_versions":"All versions."},{"fix":"Plan for integration with NVIDIA's TensorRT ecosystem to fully realize the performance benefits of optimized models. Refer to the TensorRT and TensorRT-LLM documentation for deployment best practices.","message":"The actual inference performance gains from model optimization (quantization, pruning, distillation) depend heavily on the downstream deployment framework (e.g., TensorRT-LLM, TensorRT) and the specific hardware configuration. `nvidia-modelopt` optimizes the model, but the runtime speedup is realized by these specialized inference engines.","severity":"gotcha","affected_versions":"All versions."},{"fix":"Verify the ONNX opset version of your models if quantization fails. ModelOpt generally handles opset upgrades automatically, but manual inspection may be necessary for debugging.","message":"When working with ONNX models, certain quantization types require specific opset versions (e.g., INT8 requires opset 13+; FP8 and INT4 require opset 21+). ModelOpt can automatically upgrade lower opset versions, but awareness of these requirements helps prevent unexpected behavior or errors.","severity":"gotcha","affected_versions":"All versions involving ONNX model quantization."}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z","problems":[]}