AutoGPTQ
v0.7.1 · verified Fri May 01 · python
AutoGPTQ is an easy-to-use LLM quantization package based on the GPTQ algorithm. It provides user-friendly APIs for quantizing and running large language models with reduced memory usage. The current version, 0.7.1, supports loading sharded quantized checkpoints and Gemma models. The release cadence is irregular, with major features arriving in point releases.
pip install auto-gptq

Common errors
error AttributeError: module 'auto_gptq' has no attribute 'AutoGPTQForCausalLM'
cause Incorrect import path; trying to import from submodule instead of top-level.
fix Use `from auto_gptq import AutoGPTQForCausalLM`.

error ImportError: cannot import name 'exllama_set_max_input_length' from 'auto_gptq'
cause Using an older version that doesn't have this function (introduced in v0.4.2).
fix Upgrade to auto-gptq>=0.4.2: `pip install --upgrade auto-gptq`

error RuntimeError: CUDA error: out of memory
cause Exllama kernel's default max input length is 2048; exceeding this causes memory issues.
fix Call `exllama_set_max_input_length(model, new_length)` after loading the model and before generation.

error TypeError: from_quantized() got an unexpected keyword argument 'save_dir'
cause The `save_dir` argument was removed in v0.3.2.
fix Use `model_name_or_path` instead, e.g. `AutoGPTQForCausalLM.from_quantized('model_path_or_name')`.
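For reference, a minimal loading sketch with the current signature; the checkpoint name here is illustrative:

from auto_gptq import AutoGPTQForCausalLM

# The model path/name is passed directly; 'save_dir' is no longer accepted
model = AutoGPTQForCausalLM.from_quantized(
    "opt-125m-4bit-128g",   # illustrative local dir or Hub id of a quantized checkpoint
    device="cuda:0",
    use_safetensors=True,
)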

Warnings
gotcha The exllama kernel is the default backend for inference. If you do not install the optional exllama dependency, the package may fall back to a slower backend or raise an error. Install with `pip install auto-gptq[exllama]`.
fix Install auto-gptq with exllama extra: `pip install auto-gptq[exllama]`
breaking In v0.3.2, the `save_dir` argument was removed from `from_quantized`. Now only `model_name_or_path` is supported.
fix Use `model_name_or_path` instead of `save_dir` when loading quantized models.
deprecated The `use_triton` parameter in `from_quantized` is deprecated since v0.5.0. The Triton backend is no longer supported; use the exllama or cuda backends.
fix Remove `use_triton` argument. The backend is automatically selected (exllama preferred).
gotcha When using the exllama backend, the maximum input length defaults to 2048. For longer sequences, call `exllama_set_max_input_length` before generation. Otherwise, you may get an error like 'CUDA out of memory' or silent failures.
fix Call `from auto_gptq import exllama_set_max_input_length; model = exllama_set_max_input_length(model, 4096)` after loading.
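As a sketch, raising the limit on an exllama-backed model; the checkpoint path and the 4096 value are illustrative:

from auto_gptq import AutoGPTQForCausalLM, exllama_set_max_input_length

model = AutoGPTQForCausalLM.from_quantized("opt-125m-4bit-128g", device="cuda:0")  # illustrative path
model = exllama_set_max_input_length(model, 4096)  # raise the 2048 default before generating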
gotcha Quantization with `desc_act=True` and `group_size` may not be supported by all kernels. The cuda kernel before v0.3.2 had a bug when using both. Ensure you use v0.3.2+ if you need that combination.
fix Upgrade to v0.3.2 or later.
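For illustration, a config that combines both options, assuming auto-gptq v0.3.2 or later:

from auto_gptq import BaseQuantizeConfig

# desc_act (activation-order) together with grouped quantization
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=True)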

Install
pip install auto-gptq[exllama]

Imports
- AutoGPTQForCausalLM
  wrong: from auto_gptq.modeling import AutoGPTQForCausalLM
  correct: from auto_gptq import AutoGPTQForCausalLM
- BaseQuantizeConfig
  from auto_gptq import BaseQuantizeConfig
- exllama_set_max_input_length
  from auto_gptq import exllama_set_max_input_length
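Several of the fixes above are version-gated; a quick check of the installed release (assuming the package exposes `__version__`, as recent releases do):

import auto_gptq
print(auto_gptq.__version__)  # e.g. needs >= 0.4.2 for exllama_set_max_input_length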
Quickstart
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"
quantized_dir = "opt-125m-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(model_id)

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to 4 bits
    group_size=128,  # quantization group size
    desc_act=False,  # False = faster inference, slightly lower accuracy
)

# Load the fp16 model with the quantization config attached
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config=quantize_config)

# Quantize with at least one tokenized calibration example (GPU required)
examples = [tokenizer("AutoGPTQ is an easy-to-use LLM quantization package based on the GPTQ algorithm.")]
model.quantize(examples)

# Save the quantized weights and reload them for inference
model.save_quantized(quantized_dir)
model = AutoGPTQForCausalLM.from_quantized(quantized_dir, device="cuda:0")

text = "AutoGPTQ is"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
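The reloaded model also works with a standard transformers pipeline; a minimal sketch (prompt and generation settings are illustrative):

from transformers import TextGenerationPipeline

pipe = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipe("AutoGPTQ is", max_new_tokens=20)[0]["generated_text"])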