LLM Compressor
LLM Compressor (current version 0.10.0.1) is a Python library for compressing large language models, offering both training-aware and post-training techniques. Built on PyTorch and HuggingFace Transformers, it provides a flexible, user-friendly interface for researchers and practitioners to experiment quickly with techniques like quantization and sparsity. The library is under active development, with frequent patch releases and regular feature updates.
Common errors
- ModuleNotFoundError: No module named 'autoround'
  cause: The `autoround` library is an optional dependency and is not installed with the base `llmcompressor` package.
  fix: Install `llmcompressor` with the optional extra: `pip install llmcompressor[autoround]`.
- RuntimeError: CUDA error: invalid device ordinal
  cause: GPU-accelerated operations were attempted on a system without a properly configured CUDA-enabled GPU, or with a device index that does not exist.
  fix: Verify that your system has a CUDA-enabled GPU, that PyTorch is installed with CUDA support, and that your code assigns models/tensors to an available device (e.g., `model.to("cuda")`). In CPU-only environments, keep operations explicitly on CPU.
- ValueError: Could not parse recipe YAML: Unknown modifier 'QuantizationModifier'
  cause: The modifier class named in the YAML recipe cannot be found or imported. This usually indicates a typo, an outdated recipe format, or a library version that does not support the modifier.
  fix: Check for typos in the modifier name, upgrade the library (`pip install --upgrade llmcompressor`), and verify the exact class name and module path in the official documentation when writing custom recipes.
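A cheap way to catch the "Unknown modifier" error before handing a recipe to the library is to lint the YAML yourself. The sketch below assumes the stage/`*_modifiers` recipe layout used by recent llm-compressor releases, and the `KNOWN_MODIFIERS` set is an illustrative subset, not an exhaustive list:

```python
import yaml

# Assumed subset of modifier names available in recent llm-compressor releases.
KNOWN_MODIFIERS = {
    "QuantizationModifier",
    "GPTQModifier",
    "SmoothQuantModifier",
    "SparseGPTModifier",
}

def unknown_modifiers(recipe_yaml: str) -> list[str]:
    """Return modifier names in a recipe that are not in KNOWN_MODIFIERS."""
    recipe = yaml.safe_load(recipe_yaml)
    unknown = []
    for stage in recipe.values():
        for group_name, modifiers in stage.items():
            # Modifier groups are keys ending in "_modifiers" (assumed format).
            if group_name.endswith("_modifiers"):
                unknown.extend(name for name in modifiers if name not in KNOWN_MODIFIERS)
    return unknown

bad_recipe = """
quant_stage:
  quant_modifiers:
    QuantizatoinModifier:
      targets: ["Linear"]
"""
print(unknown_modifiers(bad_recipe))  # ['QuantizatoinModifier']
```

Running the check on a recipe with a transposed-letter typo flags it immediately, instead of surfacing it later as a parse error deep inside the library.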
Warnings
- gotcha LLM Compressor frequently updates its dependency on `compressed-tensors`. Mismatched versions between `llmcompressor` and `compressed-tensors` can lead to runtime errors or unexpected behavior.
- gotcha Many advanced compression techniques, especially certain quantization methods, are highly optimized for or require specific hardware (e.g., NVIDIA GPUs with CUDA). Running on CPU may lead to significantly slower performance or limited feature availability.
- gotcha For data-dependent Post-Training Quantization (PTQ), llm-compressor needs a representative calibration dataset to collect statistics about activations. Omitting it can result in errors or poor quantization quality.
Install
- pip install llmcompressor
- pip install llmcompressor[autoround]
Imports
- AutoModelForCausalLM / AutoTokenizer (models are loaded with the standard HuggingFace classes)
from transformers import AutoModelForCausalLM, AutoTokenizer
- oneshot (entrypoint that applies a recipe to a model in one shot)
from llmcompressor import oneshot
- QuantizationModifier
from llmcompressor.modifiers.quantization import QuantizationModifier
- GPTQModifier
from llmcompressor.modifiers.quantization import GPTQModifier
Quickstart
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# 1. Load a pre-trained model and tokenizer (standard HuggingFace classes)
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")

# 2. Define a compression recipe as a modifier object.
# FP8_DYNAMIC uses static per-channel weight quantization and dynamic
# per-token activation quantization, so no calibration data is needed.
# For data-dependent schemes (e.g. GPTQModifier with scheme="W4A16"),
# pass a calibration dataset to `oneshot` instead.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

# 3. Apply the recipe; the model is modified in place.
oneshot(model=model, recipe=recipe)

# 4. Save in compressed-tensors format (loadable by vLLM).
save_dir = "TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic"
model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
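Recipes can also be written as standalone YAML files and passed to the library by path instead of being built in Python. A minimal sketch of an equivalent quantization recipe, assuming the stage/`*_modifiers` layout used by recent llm-compressor releases (verify the exact schema against the official docs):

```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]
      scheme: FP8_DYNAMIC
      ignore: ["lm_head"]
```

Keeping recipes in version-controlled YAML files makes compression runs reproducible and easy to diff across experiments.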