AutoGPTQ
v0.7.1 · verified Fri May 01 · python
AutoGPTQ is an easy-to-use LLM quantization package based on the GPTQ algorithm. It provides user-friendly APIs for quantizing and running large language models with reduced memory usage. The current version, 0.7.1, supports loading sharded quantized checkpoints and Gemma models. The release cadence is irregular, with major features arriving in point releases.
pip install auto-gptq

Common errors
error AttributeError: module 'auto_gptq' has no attribute 'AutoGPTQForCausalLM'
cause Incorrect import path; trying to import from submodule instead of top-level.
fix Use `from auto_gptq import AutoGPTQForCausalLM`.

error ImportError: cannot import name 'exllama_set_max_input_length' from 'auto_gptq'
cause Using an older version that doesn't have this function (introduced in v0.4.2).
fix Upgrade to auto-gptq>=0.4.2: `pip install --upgrade auto-gptq`

error RuntimeError: CUDA error: out of memory
cause Exllama kernel's default max input length is 2048; exceeding this causes memory issues.
fix Call `exllama_set_max_input_length(model, new_length)` after loading the model and before generation.

error TypeError: from_quantized() got an unexpected keyword argument 'save_dir'
cause The `save_dir` argument was removed in v0.3.2.
fix Use `model_name_or_path` instead, e.g. `AutoGPTQForCausalLM.from_quantized('model_path_or_name')`.
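For reference, a minimal loading sketch with the current signature; the checkpoint name here is illustrative:

from auto_gptq import AutoGPTQForCausalLM

# The model path/name is passed directly; 'save_dir' is no longer accepted
model = AutoGPTQForCausalLM.from_quantized(
    "opt-125m-4bit-128g",   # illustrative local dir or Hub id of a quantized checkpoint
    device="cuda:0",
    use_safetensors=True,
)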

Warnings
gotcha The exllama kernel is the default backend for inference. If you do not install the optional exllama dependency, the package may fall back to a slower backend or raise an error. Install with `pip install auto-gptq[exllama]`.
fix Install auto-gptq with exllama extra: `pip install auto-gptq[exllama]`
breaking In v0.3.2, the `save_dir` argument was removed from `from_quantized`. Now only `model_name_or_path` is supported.
fix Use `model_name_or_path` instead of `save_dir` when loading quantized models.
deprecated The `use_triton` parameter in `from_quantized` is deprecated since v0.5.0. The Triton backend is no longer supported; use the exllama or cuda backends.
fix Remove `use_triton` argument. The backend is automatically selected (exllama preferred).
gotcha When using the exllama backend, the maximum input length defaults to 2048. For longer sequences, call `exllama_set_max_input_length` before generation. Otherwise, you may get an error like 'CUDA out of memory' or silent failures.
fix Call `from auto_gptq import exllama_set_max_input_length; model = exllama_set_max_input_length(model, 4096)` after loading.
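As a sketch, raising the limit on an exllama-backed model; the checkpoint path and the 4096 value are illustrative:

from auto_gptq import AutoGPTQForCausalLM, exllama_set_max_input_length

model = AutoGPTQForCausalLM.from_quantized("opt-125m-4bit-128g", device="cuda:0")  # illustrative path
model = exllama_set_max_input_length(model, 4096)  # raise the 2048 default before generating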
gotcha Quantization with `desc_act=True` and `group_size` may not be supported by all kernels. The cuda kernel before v0.3.2 had a bug when using both. Ensure you use v0.3.2+ if you need that combination.
fix Upgrade to v0.3.2 or later.
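For illustration, a config that combines both options, assuming auto-gptq v0.3.2 or later:

from auto_gptq import BaseQuantizeConfig

# desc_act (activation-order) together with grouped quantization
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=True)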

Install
pip install auto-gptq[exllama]

Imports
- AutoGPTQForCausalLM
  wrong: from auto_gptq.modeling import AutoGPTQForCausalLM
  correct: from auto_gptq import AutoGPTQForCausalLM
- BaseQuantizeConfig
  from auto_gptq import BaseQuantizeConfig
- exllama_set_max_input_length
  from auto_gptq import exllama_set_max_input_length
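Several of the fixes above are version-gated; a quick check of the installed release (assuming the package exposes `__version__`, as recent releases do):

import auto_gptq
print(auto_gptq.__version__)  # e.g. needs >= 0.4.2 for exllama_set_max_input_length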
Quickstart
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"
quantized_dir = "opt-125m-4bit-128g"

tokenizer = AutoTokenizer.from_pretrained(model_id)

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to 4 bits
    group_size=128,  # quantization group size
    desc_act=False,  # False = faster inference, slightly lower accuracy
)

# Load the fp16 model with the quantization config attached
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config=quantize_config)

# Quantize with at least one tokenized calibration example (GPU required)
examples = [tokenizer("AutoGPTQ is an easy-to-use LLM quantization package based on the GPTQ algorithm.")]
model.quantize(examples)

# Save the quantized weights and reload them for inference
model.save_quantized(quantized_dir)
model = AutoGPTQForCausalLM.from_quantized(quantized_dir, device="cuda:0")

text = "AutoGPTQ is"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
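The reloaded model also works with a standard transformers pipeline; a minimal sketch (prompt and generation settings are illustrative):

from transformers import TextGenerationPipeline

pipe = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipe("AutoGPTQ is", max_new_tokens=20)[0]["generated_text"])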