LLM Compressor

0.10.0.1 · active · verified Fri Apr 17

LLM Compressor (current version 0.10.0.1) is a Python library for compressing large language models, offering both training-aware and post-training techniques. Built on PyTorch and HuggingFace Transformers, it provides a flexible and user-friendly interface for researchers and practitioners to quickly experiment with techniques like quantization and sparsity. The library maintains an active development pace with frequent patch releases and regular feature updates.
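Post-training quantization, the main technique shown in the quickstart below, replaces floating-point weights with low-bit integers plus a scale and zero point. A plain-Python sketch of asymmetric 8-bit (affine) quantization illustrates the arithmetic the library automates (an illustration of the concept only, not llmcompressor's internal implementation):

```python
def quantize(values, num_bits=8):
    """Asymmetric (affine) quantization: map floats onto the unsigned int range."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid zero scale for constant inputs
    zero_point = round(qmin - lo / scale)     # integer that represents 0.0
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the integer representation."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.5, -0.2, 0.0, 0.7, 2.3]
q, scale, zp = quantize(weights)
restored = dequantize(q, scale, zp)
# Each restored value lies within one quantization step (`scale`) of the original.
```

Real schemes apply this per layer (often per channel) and pick scales from calibration statistics rather than the raw min/max, but the storage saving is the same: one int8 per weight instead of a float.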

Install

pip install llmcompressor

Imports

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

Quickstart

This quickstart demonstrates a simple post-training quantization (PTQ) workflow with `llmcompressor`: point the `oneshot` entrypoint at a Hugging Face model, define a compression recipe as a quantization modifier, and run calibrated PTQ in a single call. Weight-and-activation schemes such as W8A8 require calibration data, which `oneshot` can download by name from the Hugging Face Hub.

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# 1. Choose a pre-trained model; `oneshot` loads it from the Hugging Face Hub.
MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# 2. Define a compression recipe: GPTQ-based INT8 weight-and-activation
#    (W8A8) quantization of every Linear layer, keeping the output head
#    (`lm_head`) in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])

# 3. Apply the recipe with one-shot PTQ. The named calibration dataset is
#    downloaded automatically and used to compute quantization statistics.
oneshot(
    model=MODEL_ID,
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W8A8",
    max_seq_length=2048,
    num_calibration_samples=512,
)

# The quantized model and tokenizer are written to `output_dir` in the
# compressed-tensors format, ready to be loaded by vLLM or Transformers.
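Weight quantization schemes typically compute one scale per output channel rather than a single scale for the whole tensor. A small sketch (illustrative only, not llmcompressor internals) shows why: a shared per-tensor scale is dominated by the largest channel, while per-channel scales preserve precision for small-magnitude channels.

```python
def channel_scales(weight_rows, num_bits=8):
    """Symmetric scales: one per output channel (row) vs. one for the whole tensor."""
    qmax = 2 ** (num_bits - 1) - 1  # signed symmetric range, e.g. [-127, 127] for 8 bits
    per_channel = [max(abs(v) for v in row) / qmax for row in weight_rows]
    per_tensor = max(per_channel)   # equals global max(|w|) / qmax
    return per_channel, per_tensor

rows = [[0.01, -0.02, 0.015],   # small-magnitude channel
        [1.0, -2.0, 1.5]]       # large-magnitude channel
pc, pt = channel_scales(rows)
# The small channel's scale is 100x finer than the shared per-tensor scale,
# so per-channel quantization loses far less precision on that channel.
```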
