AutoGPTQ

0.7.1 · verified Fri May 01

AutoGPTQ is an easy-to-use LLM quantization package based on the GPTQ algorithm. It provides user-friendly APIs for quantizing and running large language models with reduced memory usage. The current version, 0.7.1, supports loading sharded quantized checkpoints and Gemma models. The release cadence is irregular, with major features landing in point releases.
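To make the `bits` and `group_size` knobs concrete, here is a toy, pure-Python sketch of round-to-nearest group quantization. This is only an illustration of what those settings control; the real GPTQ algorithm additionally compensates quantization error column by column using second-order (Hessian) information.

```python
def quantize_group(weights, bits=4):
    """Quantize one group of weights to `bits` bits with a shared scale."""
    qmax = 2**bits - 1                        # 15 levels beyond zero for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0           # avoid division by zero
    q = [round((w - lo) / scale) for w in weights]   # ints in [0, qmax]
    dq = [v * scale + lo for v in q]          # dequantized approximation
    return q, dq

def quantize_row(row, bits=4, group_size=4):
    """Quantize a row group-by-group, each group with its own scale."""
    out = []
    for i in range(0, len(row), group_size):
        _, dq = quantize_group(row[i:i + group_size], bits)
        out.extend(dq)
    return out

row = [0.1, -0.2, 0.05, 0.3, 1.5, 1.4, 1.6, 1.45]
approx = quantize_row(row, bits=4, group_size=4)
print([round(v, 3) for v in approx])
```

Per-group scales are why a smaller `group_size` improves accuracy: each group gets a scale fitted to its own value range, at the cost of storing more scales.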

pip install auto-gptq
error AttributeError: module 'auto_gptq' has no attribute 'AutoGPTQForCausalLM'
cause Incorrect import path; trying to import from submodule instead of top-level.
fix Use `from auto_gptq import AutoGPTQForCausalLM`
error ImportError: cannot import name 'exllama_set_max_input_length' from 'auto_gptq'
cause Using an older version that doesn't have this function (introduced in v0.4.2).
fix Upgrade to auto-gptq>=0.4.2: `pip install --upgrade auto-gptq`
error RuntimeError: CUDA error: out of memory
cause Exllama kernel's default max input length is 2048; exceeding this causes memory issues.
fix Call `exllama_set_max_input_length(model, new_length)` after loading the model and before generation.
error TypeError: from_quantized() got an unexpected keyword argument 'save_dir'
cause The `save_dir` argument was removed in v0.3.2.
fix Use the `model_name_or_path` positional argument instead, e.g. `AutoGPTQForCausalLM.from_quantized('path/to/quantized-model')`.
gotcha The exllama kernel is the default backend for inference. If you do not install the optional exllama dependency, the package may fall back to a slower backend or raise an error. Install with `pip install auto-gptq[exllama]`.
fix Install auto-gptq with exllama extra: `pip install auto-gptq[exllama]`
breaking In v0.3.2, the `save_dir` argument was removed from `from_quantized`. Now only `model_name_or_path` is supported.
fix Use `model_name_or_path` instead of `save_dir` when loading quantized models.
deprecated The `use_triton` parameter in `from_quantized` is deprecated since v0.5.0. Triton backend is no longer supported; use exllama or cuda backends.
fix Remove `use_triton` argument. The backend is automatically selected (exllama preferred).
gotcha When using the exllama backend, the maximum input length defaults to 2048. For longer sequences, call `exllama_set_max_input_length` before generation. Otherwise, you may get an error like 'CUDA out of memory' or silent failures.
fix Call `from auto_gptq import exllama_set_max_input_length; model = exllama_set_max_input_length(model, 4096)` after loading.
gotcha Quantizing with both `desc_act=True` and a `group_size` set may not be supported by all kernels. The cuda kernel had a bug with this combination before v0.3.2. Ensure you use v0.3.2+ if you need it.
fix Upgrade to v0.3.2 or later.
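Putting the fixes above together, loading an already-quantized checkpoint for long-context inference looks roughly like the sketch below. The model name is a placeholder; this assumes a CUDA GPU and the exllama extra installed.

```python
from auto_gptq import AutoGPTQForCausalLM, exllama_set_max_input_length

# Load a pre-quantized checkpoint; note the model_name_or_path
# positional argument (save_dir was removed in v0.3.2).
model = AutoGPTQForCausalLM.from_quantized(
    "path/to/quantized-model",  # placeholder repo id or local path
    device="cuda:0",
)

# Raise the exllama kernel's input-length cap (default 2048) before
# generating on longer prompts; the helper returns the patched model.
model = exllama_set_max_input_length(model, 4096)
```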
pip install auto-gptq[exllama]

Quickstart for preparing a model for quantization and running it. The quantization step itself is not shown for simplicity, but the config is set.

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to 4 bits
    group_size=128,  # one scale per 128-column group
    desc_act=False,  # skip activation-order reordering for faster inference
)

# Loads the full-precision model with the quantization config attached;
# model.quantize(examples) would then perform the actual quantization.
model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config=quantize_config,
)

text = "AutoGPTQ is"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
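The quantization step omitted from the quickstart would continue from the code above roughly as follows. The two-sample calibration set and the output directory name are placeholders; in practice GPTQ calibration uses a few hundred representative samples.

```python
# Continuing from the quickstart: actually quantize, then save.
examples = [
    tokenizer(
        "AutoGPTQ is an easy-to-use LLM quantization package.",
        return_tensors="pt",
    )
    for _ in range(2)  # toy calibration set; use real data in practice
]
model.quantize(examples)               # run GPTQ against the calibration set
model.save_quantized("opt-125m-4bit")  # reload later with from_quantized
```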