{"id":8385,"library":"optimum-quanto","title":"Optimum Quanto","description":"Optimum Quanto is a PyTorch quantization backend for Hugging Face Optimum, enabling efficient training and inference of large language models (LLMs) and other neural networks with reduced precision (e.g., 8-bit integers or 8-bit floats). It focuses on model optimization for hardware acceleration by integrating with PyTorch's native quantization functionalities. The current version is 0.2.7. As a rapidly evolving library deeply integrated with the Hugging Face ecosystem and PyTorch's quantization efforts, its release cadence is generally frequent, often tied to major Optimum or PyTorch updates.","status":"active","version":"0.2.7","language":"en","source_language":"en","source_url":"https://github.com/huggingface/optimum","tags":["quantization","pytorch","huggingface","llm","optimization","deep-learning","model-compression"],"install":[{"cmd":"pip install optimum-quanto","lang":"bash","label":"Install core library"}],"dependencies":[{"reason":"Core Hugging Face Optimum library, providing common interfaces for model loading and optimization. Specific version requirements for compatibility.","package":"optimum","optional":false},{"reason":"Primary deep learning framework. 
Specific versions are explicitly excluded due to known compatibility issues with quanto's quantization implementations.","package":"torch","optional":false},{"reason":"Hugging Face Transformers library, commonly used for loading pre-trained models that will be quantized.","package":"transformers","optional":false}],"imports":[{"note":"The primary function to apply quantization to a PyTorch module or its submodules.","symbol":"quantize","correct":"from optimum.quanto import quantize"},{"note":"Used to 'freeze' quantized weights, making them immutable and often enabling further performance optimizations for inference.","symbol":"freeze","correct":"from optimum.quanto import freeze"},{"note":"Quantization type object for 8-bit integer weights, passed to `quantize` via its `weights` argument. Related qtypes include `qint2`, `qint4`, and `qfloat8`.","symbol":"qint8","correct":"from optimum.quanto import qint8"}],"quickstart":{"code":"import torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom optimum.quanto import quantize, freeze, qint8\n\n# Load a pre-trained model (using a small one for quick execution)\nmodel_id = \"TinyLlama/TinyLlama-1.1B-Chat-v1.0\"\ntokenizer = AutoTokenizer.from_pretrained(model_id)\n# Move the model to GPU if available and use a half-precision dtype\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\nmodel = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)\n\n# Apply quantization to the model's weights\n# This converts weights to quantized tensors using the qint8 qtype\nquantize(model, weights=qint8)\n\n# Freeze the quantized model for efficient inference\n# This makes weights immutable and enables further backend optimizations\nfreeze(model)\n\nprint(f\"Model quantized to int8 weights and frozen on {device}.\")\n\n# Example inference with the quantized model\ninputs = tokenizer(\"Hello, my name is\", return_tensors=\"pt\").to(device)\nwith torch.no_grad():\n    outputs = 
model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50, top_p=0.95)\nprint(\"Generated text:\")\nprint(tokenizer.decode(outputs[0].cpu(), skip_special_tokens=True))\n","lang":"python","description":"This quickstart demonstrates how to load a pre-trained Hugging Face Transformers model, apply 8-bit integer quantization using `optimum-quanto`'s `quantize` function, and then `freeze` the model for efficient inference. It concludes with a basic text generation example to verify functionality. Ensure a compatible PyTorch version and potentially a CUDA-enabled GPU for best results."},"warnings":[{"fix":"Before installing, ensure your `torch` installation meets the requirements specified by `optimum-quanto`. Explicitly install a compatible version (e.g., `pip install torch==2.1.2` or `pip install torch==2.3.0` for CUDA).","message":"Optimum Quanto has strict compatibility requirements with PyTorch versions. Specifically, `torch` versions `2.2.0`, `2.2.1`, and `2.2.2` are explicitly known to be incompatible and will cause runtime errors or incorrect behavior.","severity":"breaking","affected_versions":"All versions of `optimum-quanto` dependent on `torch` versions 2.2.x."},{"fix":"Always evaluate your quantized model thoroughly on a representative validation dataset. If accuracy degradation is unacceptable, consider using Quantization-Aware Training (QAT) or opting for higher precision quantization (e.g., float16 if available) if your hardware supports it.","message":"Quantization, especially to lower precision types like int8 or float8, inherently involves a trade-off where model accuracy or performance on downstream tasks might degrade. This is an expected consequence of reducing the model's precision.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Verify that your target deployment hardware supports the desired quantization precision. 
Consult `quanto` and PyTorch documentation for recommended hardware configurations and validate performance through profiling on your specific device.","message":"The primary performance benefits of `optimum-quanto` (e.g., speedups from INT8 or FP8) are often contingent on specific hardware acceleration (e.g., NVIDIA GPUs with Tensor Cores or specific CPU instruction sets). Running quantized models on incompatible hardware might not yield expected speedups or could even be slower than the float32 baseline.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Always include `freeze(model)` after `quantize(model, ...)` in your model preparation pipeline for inference. This ensures that the weights are finalized and enables the backend to apply its full set of optimizations.","message":"After applying `quantize()` to a model, it is crucial to also call `freeze()` on the model, especially when preparing it for inference. Forgetting to freeze can prevent essential backend optimizations from taking effect, leading to suboptimal performance or incorrect behavior in some scenarios.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Ensure `optimum-quanto` is installed (`pip install optimum-quanto`) and that your `optimum` package is updated to a compatible version (`pip install --upgrade optimum`).","cause":"`optimum-quanto` is not installed, or the `optimum` package itself is too old and does not include the `quanto` subpackage.","error":"ImportError: cannot import name 'quantize' from 'optimum.quanto' (...)"},{"fix":"Uninstall the problematic PyTorch version and install a compatible one. 
For example: `pip uninstall torch torchvision torchaudio` followed by `pip install torch==2.1.2 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118` (adjust CUDA version if necessary) or `pip install torch==2.3.0`.","cause":"Attempting to use `optimum-quanto` with a PyTorch version that is explicitly blacklisted due to known compatibility issues.","error":"RuntimeError: The installed version of PyTorch (X.Y.Z) is not supported by optimum-quanto. Please install a compatible version (e.g., >=2.0.0, !=2.2.0, !=2.2.1, !=2.2.2)."},{"fix":"Either ensure you are running on compatible GPU hardware with appropriate drivers and PyTorch build, or switch to a more broadly supported quantization type for your device, such as `qint8`.","cause":"Attempting to use a specific, hardware-dependent quantization type (like `float8`) on a device that does not support it (e.g., CPU) or without the necessary software/driver configurations (e.g., specific CUDA compute capability).","error":"ValueError: 'float8_e4m3fn' is not a valid QuantizationType for device 'cpu'"}]}