Compressed Tensors
Compressed Tensors is a Python library for efficiently storing and using neural-network model weights as compressed safetensors files. It provides tools for quantization, compression, and handling various compression schemes. The current version is 0.15.0, and the project maintains an active release cadence, frequently shipping minor updates and bug fixes.
Warnings
- breaking The `safe_permute` utility function was removed in version 0.12.2. Any code relying on this specific utility will break.
- gotcha The `accelerate` library is an optional dependency. Features requiring `accelerate` (e.g., specific offloading or distributed capabilities) will raise a `ModuleNotFoundError` if `accelerate` is not installed.
- gotcha Between versions 0.12.2 and 0.14.0, the project repository moved from `neuralmagic/compressed-tensors` to `vllm-project/compressed-tensors`. While import paths are generally stable, users should be aware of this change in project ownership for community, support, and future development tracking.
- gotcha Version 0.14.0.1 included a patch to fix 'bugs related to file writing'. Prior versions might have had issues with the integrity or correctness of saved compressed models or related artifacts.
- gotcha Version 0.12.0 shipped a refactor of the module/parameter matching logic that was quickly reverted and then re-applied. Users upgrading through these versions may encounter instability or subtle behavioral changes in how compression strategies target specific model layers and parameters.
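Because `accelerate` is optional (see the second gotcha above), code paths that need it are best guarded up front rather than failing mid-run. A minimal sketch; the helper names are illustrative and not part of the library:

```python
import importlib.util

def accelerate_available() -> bool:
    """Return True if the optional `accelerate` dependency can be imported."""
    return importlib.util.find_spec("accelerate") is not None

def require_accelerate() -> None:
    """Fail fast with the same error class that a lazy import would raise."""
    if not accelerate_available():
        raise ModuleNotFoundError(
            "This feature needs `accelerate`; install it with "
            "pip install 'compressed-tensors[accelerate]'"
        )
```

Calling `require_accelerate()` at the top of an offloading/distributed entry point surfaces the missing dependency immediately instead of deep inside a call stack.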
Install
- `pip install compressed-tensors`
- `pip install 'compressed-tensors[accelerate]'`
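Given the breaking changes and repository move noted in the Warnings section, it can help to check which version is actually installed before relying on version-specific behavior. A small stdlib-only helper (illustrative, not part of compressed-tensors):

```python
from importlib import metadata

def installed_version(dist: str):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None

# e.g. installed_version("compressed-tensors") returns a string such as
# "0.15.0" once the package is installed, and None otherwise.
```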
Imports
- CompressionConfig
from compressed_tensors.config import CompressionConfig
- dispatch_model
from compressed_tensors.dispatch import dispatch_model
- QuantizationScheme
from compressed_tensors.quantization import QuantizationScheme
- SparseGPT
from compressed_tensors.compressors import SparseGPT
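Compressors such as `SparseGPT` produce weights with many zeros, and the storage win comes from keeping only the nonzero values plus a bitmask saying where they go. A toy codec in plain Python to illustrate the idea; this is a conceptual sketch, not the library's actual on-disk format:

```python
def bitmask_compress(values):
    """Store a dense row as (bitmask, packed nonzero values)."""
    mask = [v != 0 for v in values]
    nonzero = [v for v in values if v != 0]
    return mask, nonzero

def bitmask_decompress(mask, nonzero):
    """Rebuild the dense row by scattering the packed values where the mask is set."""
    it = iter(nonzero)
    return [next(it) if keep else 0 for keep in mask]

row = [0, 3, 0, 0, 7, 0]
mask, packed = bitmask_compress(row)
assert packed == [3, 7]                      # only the nonzeros are stored
assert bitmask_decompress(mask, packed) == row
```

The denser the zeros, the bigger the saving: a real implementation packs the mask into bits, so a 90%-sparse tensor stores roughly 10% of its values plus 1 bit per element.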
Quickstart
```python
import torch
from transformers import AutoModelForCausalLM
from compressed_tensors.config import CompressionConfig
from compressed_tensors.dispatch import dispatch_model

# 1. Define a simple model for demonstration
class DummyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(10, 20)
        self.linear2 = torch.nn.Linear(20, 10)

    def forward(self, x):
        return self.linear2(self.linear1(x))

model = DummyModel()

# For a real model, load it with transformers instead:
# model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# 2. Create a CompressionConfig. With quantization_scheme=None nothing is
# compressed; define a QuantizationScheme here to actually compress tensors.
compression_config = CompressionConfig(
    quantization_scheme=None,
    compressed_tensors_path="./compressed_model",
)

# 3. Dispatch the model. dispatch_model applies the configured
# compression/quantization schemes, for example:
# from compressed_tensors.quantization import QuantizationScheme
# quantized_config = CompressionConfig(
#     quantization_scheme=QuantizationScheme(num_bits=8, quant_method="per_tensor")
# )
# compressed_model = dispatch_model(model, quantized_config)

# A more direct compression path applies a compressor to the state dict:
# from compressed_tensors.compressors import SparseGPT
# compressor = SparseGPT()
# compressed_state_dict = compressor.compress(model.state_dict(), compression_config)
# print(f"Compressed state dict keys: {compressed_state_dict.keys()}")

print(f"Original model type: {type(model)}")
print(f"Compression config path: {compression_config.compressed_tensors_path}")
```
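The `QuantizationScheme(num_bits=8, quant_method="per_tensor")` example above means one scale is shared by every element of a tensor. What that does numerically can be shown with a toy symmetric quantizer in plain Python; this is a conceptual sketch, not the library's implementation:

```python
def quantize_per_tensor(values, num_bits=8):
    """Symmetric per-tensor quantization: one scale for the whole tensor."""
    qmax = 2 ** (num_bits - 1) - 1           # 127 for 8 bits
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0  # avoid div-by-zero on all-zero input
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the integer codes back to approximate floating-point values."""
    return [q * scale for q in quantized]

original = [-1.0, 0.0, 0.5, 1.0]
q, scale = quantize_per_tensor(original)
restored = dequantize(q, scale)
# Every element is recovered to within one quantization step.
assert all(abs(a - b) <= scale for a, b in zip(restored, original))
```

Per-tensor schemes are the cheapest to store (a single scale) but lose precision when a tensor mixes large and small values, which is why finer-grained schemes (per-channel, per-group) also exist.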