Optimum Quanto
Optimum Quanto is a PyTorch quantization backend for Hugging Face Optimum, enabling efficient training and inference of large language models (LLMs) and other neural networks with reduced precision (e.g., 8-bit integers or 8-bit floats). It focuses on model optimization for hardware acceleration by integrating with PyTorch's native quantization functionalities. The current version is 0.2.7. As a rapidly evolving library deeply integrated with the Hugging Face ecosystem and PyTorch's quantization efforts, its release cadence is generally frequent, often tied to major Optimum or PyTorch updates.
Common errors
-
ImportError: cannot import name 'quantize' from 'optimum.quanto' (...)
cause: `optimum-quanto` is not installed, or the `optimum` package itself is too old and does not include the `quanto` subpackage.
fix: Ensure `optimum-quanto` is installed (`pip install optimum-quanto`) and that your `optimum` package is updated to a compatible version (`pip install --upgrade optimum`).
-
RuntimeError: The installed version of PyTorch (X.Y.Z) is not supported by optimum-quanto. Please install a compatible version (e.g., >=2.0.0, !=2.2.0, !=2.2.1, !=2.2.2).
cause: Attempting to use `optimum-quanto` with a PyTorch version that is explicitly blacklisted due to known compatibility issues.
fix: Uninstall the problematic PyTorch version and install a compatible one. For example: `pip uninstall torch torchvision torchaudio` followed by `pip install torch==2.1.2 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118` (adjust the CUDA version if necessary) or `pip install torch==2.3.0`.
-
ValueError: 'float8_e4m3fn' is not a valid QuantizationType for device 'cpu'
cause: Attempting to use a hardware-dependent quantization type (like `float8`) on a device that does not support it (e.g., CPU), or without the necessary software/driver configuration (e.g., a specific CUDA compute capability).
fix: Either run on compatible GPU hardware with an appropriate driver and PyTorch build, or switch to a more broadly supported quantization type for your device, such as `qint8`.
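The PyTorch version constraint from the error above can be checked up front, before importing the library. A minimal sketch using only the standard library; the known-bad version list is copied from the error message, not queried from any quanto API:

```python
from importlib import metadata

# torch versions known to be incompatible with optimum-quanto
# (taken from the error message above).
BAD_TORCH_VERSIONS = {"2.2.0", "2.2.1", "2.2.2"}

def torch_version_ok(version: str) -> bool:
    """True if the version is >= 2.0.0 and not on the known-bad list.
    Simple string check; assumes the usual X.Y.Z[+local] form."""
    base = version.split("+")[0]  # strip local tags like "+cu118"
    if base in BAD_TORCH_VERSIONS:
        return False
    return int(base.split(".")[0]) >= 2

if __name__ == "__main__":
    try:
        installed = metadata.version("torch")
        print("torch", installed, "ok:", torch_version_ok(installed))
    except metadata.PackageNotFoundError:
        print("torch is not installed")
```

Running this in your environment before upgrading or downgrading tells you whether the installed torch will trip the runtime check.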
Warnings
- breaking Optimum Quanto has strict compatibility requirements with PyTorch versions. Specifically, `torch` versions `2.2.0`, `2.2.1`, and `2.2.2` are explicitly known to be incompatible and will cause runtime errors or incorrect behavior.
- gotcha Quantization, especially to lower precision types like int8 or float8, inherently involves a trade-off where model accuracy or performance on downstream tasks might degrade. This is an expected consequence of reducing the model's precision.
- gotcha The primary performance benefits of `optimum-quanto` (e.g., speedups from INT8 or FP8) are often contingent on specific hardware acceleration (e.g., NVIDIA GPUs with Tensor Cores or specific CPU instruction sets). Running quantized models on incompatible hardware might not yield expected speedups or could even be slower than the float32 baseline.
- gotcha After applying `quantize()` to a model, it is crucial to also call `freeze()` on the model, especially when preparing it for inference. Forgetting to freeze can prevent essential backend optimizations from taking effect, leading to suboptimal performance or incorrect behavior in some scenarios.
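The accuracy trade-off mentioned above comes from rounding values onto a coarse grid. A toy, pure-Python sketch of a symmetric int8 round-trip (not quanto's actual kernels) makes the error bound concrete:

```python
# Toy symmetric int8 round-trip, pure Python (NOT quanto's actual kernels),
# illustrating why reduced precision costs accuracy.
def quantize_int8(values):
    """Map floats onto 256 integer levels; assumes at least one nonzero value."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    return [x * scale for x in q]

weights = [0.1, -0.5, 1.2, 3.14159]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Each value is recovered only up to half a quantization step (scale / 2).
for original, back in zip(weights, restored):
    print(f"{original:+.5f} -> {back:+.5f} (error {abs(original - back):.5f})")
```

The per-tensor scale is set by the largest magnitude, so outlier values widen the grid and increase the rounding error on all the smaller values, which is one reason accuracy can degrade after quantization.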
Install
-
pip install optimum-quanto
Imports
- quantize
from optimum.quanto import quantize
- freeze
from optimum.quanto import freeze
- qtypes (e.g. qint8, qfloat8)
from optimum.quanto import qint8, qfloat8
Quickstart
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import quantize, freeze, qint8
# Load a pre-trained model (using a small one for quick execution)
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Run on a GPU if available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)
# Apply quantization to the model weights (8-bit integer)
# This converts the linear weights to quantized tensors in place
quantize(model, weights=qint8)
# Freeze the quantized model for efficient inference
# This makes weights immutable and enables further backend optimizations
freeze(model)
print(f"Model quantized to qint8 and frozen on {device}.")
# Example inference with the quantized model
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50, top_p=0.95)
print("Generated text:")
print(tokenizer.decode(outputs[0].cpu(), skip_special_tokens=True))