Optimum Quanto

0.2.7 · active · verified Thu Apr 16

Optimum Quanto is a PyTorch quantization backend for Hugging Face Optimum that enables efficient training and inference of large language models (LLMs) and other neural networks at reduced precision (e.g., 8-bit integers or 8-bit floats). It targets model optimization for hardware acceleration and integrates closely with PyTorch and the Hugging Face ecosystem. As a rapidly evolving library, its release cadence is frequent and often tied to major Optimum or PyTorch updates.
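To make "reduced precision" concrete, here is a minimal, library-free sketch of symmetric per-tensor int8 quantization, the basic idea behind 8-bit integer weights. This is illustrative only and is not optimum-quanto code; the function names are hypothetical.

```python
# Illustrative sketch (NOT optimum-quanto internals): symmetric per-tensor
# int8 quantization. Float weights are mapped to integers in [-128, 127]
# plus one float scale, roughly halving storage versus fp16.

def quantize_int8(weights):
    """Map float weights to int8 values plus a shared float scale."""
    scale = max(abs(w) for w in weights) / 127  # largest magnitude maps to +/-127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)
# Each recovered weight differs from the original by at most ~scale/2.
```

Real backends such as optimum-quanto refine this with per-channel or group-wise scales and optimized integer kernels, but the quantize/dequantize round trip above is the core mechanism.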

Install

pip install optimum-quanto

Imports

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import quantize, freeze, qint8

Quickstart

This quickstart demonstrates how to load a pre-trained Hugging Face Transformers model, apply 8-bit integer quantization using `optimum-quanto`'s `quantize` function, and then `freeze` the model for efficient inference. It concludes with a basic text generation example to verify functionality. Ensure a compatible PyTorch version and potentially a CUDA-enabled GPU for best results.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import quantize, freeze, qint8

# Load a pre-trained model (using a small one for quick execution)
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Ensure model is on a GPU if available, or compatible dtype
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to(device)

# Define the target quantization type (8-bit integer, optimum-quanto's qint8,
# not torch.int8)
qtype = qint8

# Apply weight-only quantization to the model
# This converts weights to quantized tensors of the given type
quantize(model, weights=qtype)

# Freeze the quantized model for efficient inference
# This makes weights immutable and enables further backend optimizations
freeze(model)

print(f"Model quantized to {qtype} and frozen on {device}.")

# Example inference with the quantized model
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50, top_p=0.95)
print("Generated text:")
print(tokenizer.decode(outputs[0].cpu(), skip_special_tokens=True))
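A quick back-of-envelope calculation shows why int8 weights matter for a model like the 1.1B-parameter one above. This assumes weight storage dominates memory use and ignores the small overhead of per-tensor scales; the numbers are illustrative, not measured.

```python
# Rough weight-memory estimate for a 1.1B-parameter model:
# fp16 stores 2 bytes per weight, int8 stores 1 byte per weight.
params = 1_100_000_000
fp16_gb = params * 2 / 1e9  # fp16 weight storage in GB
int8_gb = params * 1 / 1e9  # int8 weight storage in GB (plus small scale overhead)

print(f"fp16: {fp16_gb:.1f} GB, int8: {int8_gb:.1f} GB")
# Roughly a 2x reduction in weight memory versus fp16.
```

Actual savings vary with the quantization scheme (per-channel scales, group size) and with how much of the model's memory footprint is activations rather than weights.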
