{"library":"optimum-quanto","title":"Optimum Quanto","description":"Optimum Quanto is a PyTorch quantization backend for Hugging Face Optimum, enabling efficient training and inference of large language models (LLMs) and other neural networks with reduced precision (e.g., 8-bit integers or 8-bit floats). It focuses on model optimization for hardware acceleration by integrating with PyTorch's native quantization functionalities. The current version is 0.2.7. As a rapidly evolving library deeply integrated with the Hugging Face ecosystem and PyTorch's quantization efforts, its release cadence is generally frequent, often tied to major Optimum or PyTorch updates.","language":"python","status":"active","last_verified":"Mon May 18","install":{"commands":["pip install optimum-quanto"],"cli":null},"imports":["from optimum.quanto import quantize","from optimum.quanto import freeze","from optimum.quanto import qint8"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"import torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom optimum.quanto import quantize, freeze, qint8\n\n# Load a pre-trained model (using a small one for quick execution)\nmodel_id = \"TinyLlama/TinyLlama-1.1B-Chat-v1.0\"\ntokenizer = AutoTokenizer.from_pretrained(model_id)\n# Use a GPU if available; fall back to float32 on CPU, where float16 is poorly supported\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\ndtype = torch.float16 if device == \"cuda\" else torch.float32\nmodel = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype).to(device)\n\n# Define the target weight quantization type (8-bit integer)\n# quanto uses its own qtypes (qint8, qint4, qfloat8, ...), not torch dtypes\nqtype = qint8\n\n# Apply quantization to the model\n# This swaps supported modules for quantized versions with weights in qtype\nquantize(model, weights=qtype)\n\n# Freeze the quantized model for efficient inference\n# This replaces the float weights with their quantized values\nfreeze(model)\n\nprint(f\"Model quantized to {qtype} and frozen on {device}.\")\n\n# Example inference with the quantized model\ninputs = 
tokenizer(\"Hello, my name is\", return_tensors=\"pt\").to(device)\nwith torch.no_grad():\n    outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50, top_p=0.95)\nprint(\"Generated text:\")\nprint(tokenizer.decode(outputs[0].cpu(), skip_special_tokens=True))\n","lang":"python","description":"This quickstart demonstrates how to load a pre-trained Hugging Face Transformers model, apply 8-bit integer quantization using `optimum-quanto`'s `quantize` function, and then `freeze` the model for efficient inference. It concludes with a basic text generation example to verify functionality. Ensure a compatible PyTorch version and potentially a CUDA-enabled GPU for best results.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":{"tag":null,"tag_description":null,"last_tested":"2026-05-18","installed_version":"0.2.7","pypi_latest":"0.2.7","is_stale":false,"summary":{"python_range":"3.9–3.13","success_rate":40,"avg_install_s":69.6,"avg_import_s":7.18,"wheel_type":"wheel"},"results":[{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"optimum-quanto","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"optimum-quanto","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":79.8,"import_time_s":5.09,"mem_mb":73.8,"disk_size":"4.7G"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"optimum-quanto","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim 
(glibc)","variant":"optimum-quanto","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":75.5,"import_time_s":8.58,"mem_mb":79.5,"disk_size":"4.8G"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"optimum-quanto","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"optimum-quanto","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":62.7,"import_time_s":9.02,"mem_mb":78.2,"disk_size":"4.8G"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"optimum-quanto","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"optimum-quanto","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"clean","install_time_s":60.5,"import_time_s":6.04,"mem_mb":78.7,"disk_size":"4.8G"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"optimum-quanto","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"optimum-quanto","exit_code":1,"wheel_type":null,"failure_reason":"timeout","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null}]}}