Compressed Tensors

0.15.0 · active · verified Thu Apr 09

Compressed Tensors is a Python library for efficiently storing and working with compressed safetensors checkpoints of neural network models. It provides tools for quantization, compression, and handling various compression schemes. The project maintains an active release cadence, frequently shipping minor updates and bug fixes.
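
The on-disk artifacts are ordinary safetensors files. As a point of reference, the sketch below round-trips a dict of tensors with the separate `safetensors` package; it illustrates the underlying file format only and is not a `compressed-tensors` API.

# Round-trip a dict of tensors through the safetensors file format.
import torch
from safetensors.torch import save_file, load_file

tensors = {"weight": torch.randn(20, 10), "bias": torch.zeros(20)}
save_file(tensors, "example.safetensors")  # write tensors to disk
loaded = load_file("example.safetensors")  # read them back as a dict of tensors
print(loaded["weight"].shape)              # torch.Size([20, 10])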

Warnings

Install
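
The package is published on PyPI and installs with pip:

pip install compressed-tensors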

Imports
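
The quickstart below uses the imports listed here. The `compressed_tensors` module paths follow this page's own example; treat them as assumptions and verify them against your installed version.

import torch
from compressed_tensors.config import CompressionConfig

# Referenced only in the commented-out portions of the quickstart:
# from transformers import AutoModelForCausalLM
# from compressed_tensors.dispatch import dispatch_model
# from compressed_tensors.quantization import QuantizationScheme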

Quickstart

This quickstart demonstrates how to set up a `CompressionConfig` and outlines the typical flow for using `compressed-tensors` with a model. `dispatch_model` is the entry point for applying compression, but the example below is deliberately simplified: with `quantization_scheme=None` no tensors are actually compressed. For real compression, define a `quantization_scheme` within the `CompressionConfig`.

import torch
from compressed_tensors.config import CompressionConfig

# 1. Define a simple model for demonstration
class DummyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(10, 20)
        self.linear2 = torch.nn.Linear(20, 10)
    def forward(self, x):
        return self.linear2(self.linear1(x))

model = DummyModel()

# For a real model, you would load it with transformers instead, e.g.:
# from transformers import AutoModelForCausalLM
# model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# 2. Create a CompressionConfig
compression_config = CompressionConfig(
    quantization_scheme=None, # No quantization for this example
    compressed_tensors_path="./compressed_model"
)

# 3. Dispatch the model (apply compression/quantization).
# With quantization_scheme=None this step would be a no-op, so the call is
# shown commented out below; in a real scenario you would set a
# quantization_scheme in the CompressionConfig so the tensors are actually
# compressed.

print(f"Original model type: {type(model)}")

# How dispatch_model would typically be used with quantization enabled:
# from compressed_tensors.dispatch import dispatch_model
# from compressed_tensors.quantization import QuantizationScheme
# quantized_config = CompressionConfig(
#     quantization_scheme=QuantizationScheme(num_bits=8, quant_method="per_tensor")
# )
# compressed_model = dispatch_model(model, quantized_config)

# A more direct compression path usually goes through a compressor:
# from compressed_tensors.compressors import SparseGPT
# compressor = SparseGPT()
# compressed_state_dict = compressor.compress(model.state_dict(), compression_config)
# print(f"Compressed state dict keys: {compressed_state_dict.keys()}")

print("Model preparation complete.")
print("To apply actual compression, define 'quantization_scheme' in CompressionConfig.")
print(f"Compression config path: {compression_config.compressed_tensors_path}")
