{"id":9899,"library":"llmcompressor","title":"LLM Compressor","description":"LLM Compressor (current version 0.10.0.1) is a Python library for compressing large language models, offering both training-aware and post-training techniques. Built on PyTorch and Hugging Face Transformers, it provides a flexible, user-friendly interface for researchers and practitioners to experiment quickly with techniques such as quantization and sparsity. It saves compressed models in the compressed-tensors format, which vLLM can load directly. The library is under active development, with frequent patch releases and regular feature updates.","status":"active","version":"0.10.0.1","language":"en","source_language":"en","source_url":"https://github.com/vllm-project/llm-compressor","tags":["LLM","compression","quantization","sparsity","transformers","pytorch","deep learning","model optimization"],"install":[{"cmd":"pip install llmcompressor","lang":"bash","label":"Base Installation"},{"cmd":"pip install llmcompressor[autoround]","lang":"bash","label":"With AutoRound (x86_64 only)"}],"dependencies":[{"reason":"Core deep learning framework dependency.","package":"torch"},{"reason":"Integration with Hugging Face models, tokenizers, and utilities.","package":"transformers"},{"reason":"Loading calibration datasets, including datasets hosted on the Hugging Face Hub.","package":"datasets"},{"reason":"Device placement and offloading when compressing models too large for a single GPU.","package":"accelerate"},{"reason":"Defines the quantization and sparsity schemes and the safetensors-based compressed checkpoint format; each llmcompressor release expects a matching version.","package":"compressed-tensors","optional":false},{"reason":"Advanced rounding-based quantization algorithm. Requires x86_64.","package":"autoround","optional":true}],"imports":[{"symbol":"AutoModelForCausalLM","correct":"from transformers import AutoModelForCausalLM","note":"LLM Compressor operates on standard Hugging Face Transformers models; no library-specific model class is needed."},{"symbol":"oneshot","correct":"from llmcompressor import oneshot","note":"Entry point for one-shot post-training compression: applies a recipe to a model, running calibration when a dataset is given."},{"symbol":"QuantizationModifier","correct":"from llmcompressor.modifiers.quantization import QuantizationModifier"},{"symbol":"GPTQModifier","correct":"from llmcompressor.modifiers.quantization import GPTQModifier"},{"symbol":"SmoothQuantModifier","correct":"from llmcompressor.modifiers.smoothquant import SmoothQuantModifier"},{"note":"Required for model tokenization, part of Hugging Face Transformers.","symbol":"AutoTokenizer","correct":"from transformers import AutoTokenizer"}],"quickstart":{"code":"from llmcompressor import oneshot\nfrom llmcompressor.modifiers.quantization import GPTQModifier\nfrom llmcompressor.modifiers.smoothquant import SmoothQuantModifier\n\n# 1. Define the recipe (W8A8 post-training quantization):\n#    * SmoothQuant migrates activation outliers into the weights so activations are easier to quantize\n#    * GPTQ quantizes Linear weights to INT8 (static, per channel); activations are\n#      quantized to INT8 dynamically per token; lm_head is left unquantized\nrecipe = [\n    SmoothQuantModifier(smoothing_strength=0.8),\n    GPTQModifier(scheme=\"W8A8\", targets=\"Linear\", ignore=[\"lm_head\"]),\n]\n\n# 2. Apply the recipe in one shot. SmoothQuant and GPTQ need calibration data;\n#    here 512 samples of the built-in \"open_platypus\" dataset are used.\n#    The model and tokenizer are loaded from the Hugging Face Hub ID.\noneshot(\n    model=\"TinyLlama/TinyLlama-1.1B-Chat-v1.0\",\n    dataset=\"open_platypus\",\n    recipe=recipe,\n    output_dir=\"TinyLlama-1.1B-Chat-v1.0-INT8\",\n    max_seq_length=2048,\n    num_calibration_samples=512,\n)\n\n# 3. The compressed model is saved to output_dir in the compressed-tensors\n#    format, which vLLM can load directly.","lang":"python","description":"This quickstart applies W8A8 (INT8 weights and activations) post-training quantization (PTQ) to a Hugging Face model with `oneshot`. SmoothQuant first rescales activations to make them easier to quantize. GPTQ then quantizes the weights using 512 calibration samples from the built-in `open_platypus` dataset. The compressed model is written to `output_dir` in the compressed-tensors format."},"warnings":[{"fix":"Install the compressed-tensors version that your llmcompressor release expects (installing llmcompressor with pip resolves it automatically) and avoid upgrading compressed-tensors on its own. Refer to the GitHub releases for the `compressed-tensors` version used in each `llmcompressor` release.","message":"LLM Compressor frequently updates its dependency on `compressed-tensors`. Mismatched versions between `llmcompressor` and `compressed-tensors` can lead to runtime errors or unexpected behavior.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Use a CUDA-enabled GPU and a PyTorch build with CUDA support for practical compression times. Check the documentation for the hardware requirements of the modifiers you plan to use.","message":"Many compression techniques, especially calibration-based quantization methods such as GPTQ, are highly optimized for or require specific hardware (e.g., NVIDIA GPUs with CUDA). Running them on CPU can be much slower, and some features may be unavailable.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Pass calibration data to `oneshot()` via `dataset` (a registered dataset name such as `\"open_platypus\"` or a `datasets.Dataset`), together with `num_calibration_samples` and `max_seq_length`. Data-free schemes, such as `FP8_DYNAMIC` with `QuantizationModifier`, do not need calibration data. See the repository examples for how to prepare a custom calibration set.","message":"Calibration-based post-training quantization (e.g., GPTQ, SmoothQuant, or static activation quantization) collects statistics from representative inputs. Running such a recipe through `oneshot()` without calibration data can fail or produce a poorly quantized model.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-17T00:00:00.000Z","next_check":"2026-07-16T00:00:00.000Z","problems":[{"fix":"Install `llmcompressor` with the optional `autoround` dependency using `pip install llmcompressor[autoround]`.","cause":"The `autoround` library is an optional dependency and was not installed with the base `llmcompressor` package.","error":"ModuleNotFoundError: No module named 'autoround'"},{"fix":"Verify that your system has a CUDA-enabled GPU and that PyTorch is installed with CUDA support. Confirm that the device index you use exists (check `torch.cuda.device_count()` and `CUDA_VISIBLE_DEVICES`), and that your code places models and tensors on an available device (e.g., `model.to(\"cuda\")`). In CPU-only environments, keep models and tensors on the CPU.","cause":"Attempting to perform GPU-accelerated operations on a system without a properly configured CUDA-enabled GPU, or trying to use a device index that does not exist.","error":"RuntimeError: CUDA error: invalid device ordinal"},{"fix":"Check the modifier name for typos and upgrade the library (`pip install --upgrade llmcompressor`). Compare your YAML against the recipe examples in the repository. Alternatively, pass modifier objects such as `QuantizationModifier(...)` directly to `oneshot(recipe=...)`, which avoids YAML parsing altogether.","cause":"The specified modifier class in the YAML recipe cannot be found or imported. This often indicates a typo, an outdated recipe format, or an older library version that doesn't support the modifier.","error":"ValueError: Could not parse recipe YAML: Unknown modifier 'QuantizationModifier'"}]}