Bitsandbytes
Bitsandbytes is a Python library that provides k-bit optimizers and quantized matrix multiplication routines for PyTorch, primarily aimed at making large language models (LLMs) more accessible. It dramatically reduces memory consumption for both inference and training via 8-bit and 4-bit quantization, including techniques such as LLM.int8() and QLoRA. The library is actively maintained (currently at version 0.49.2) and frequently updated.
Warnings
- breaking Bitsandbytes v0.49.2 requires Python >=3.10 and PyTorch >=2.3.0. Support for older Python (e.g., 3.8, 3.9) and PyTorch (<2.3.0) versions has been dropped in recent releases.
- breaking PEFT users wishing to merge adapters with 8-bit weights will need to upgrade to `peft>=0.14.0` due to internal changes in `bitsandbytes` from version 0.43.
- gotcha GPU compatibility can be an issue. Older NVIDIA GPUs (Maxwell/Pascal generations, compute capability < 7.0) may not be fully supported and can require compiling `bitsandbytes` from source with specific flags or using unofficial pre-compiled DLLs. Official support covers NVIDIA GPUs with CUDA 11.8 - 13.0, Intel XPUs, and Intel Gaudi accelerators.
- gotcha The error message "The installed version of bitsandbytes was compiled without GPU support" indicates that a CPU-only version was installed, or there's an issue with the CUDA setup/detection.
- gotcha After upgrading from `bitsandbytes` v0.42 to v0.43, models using 4-bit quantization may generate slightly different outputs (approximately up to the 2nd decimal place) due to a fix in the underlying code.
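When the "compiled without GPU support" gotcha above appears, a quick stdlib-only check of what is actually importable can narrow things down. This is a minimal sketch; it only detects whether the packages are installed, not whether the CUDA binary loaded correctly:

```python
import importlib.util

def installed(pkg: str) -> bool:
    """Return True if a package is importable in the current environment."""
    return importlib.util.find_spec(pkg) is not None

# The two packages the error message usually involves.
report = {pkg: installed(pkg) for pkg in ("torch", "bitsandbytes")}
print(report)
```

If both packages are present but the error persists, running `python -m bitsandbytes` prints the library's own fuller diagnostic of the detected CUDA setup.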
Install
- pip install bitsandbytes
- pip install bitsandbytes --prefer-binary --extra-index-url https://download.pytorch.org/whl/cu121
Imports
- bnb.nn.Linear8bitLt
import bitsandbytes as bnb
from bitsandbytes.nn import Linear8bitLt
- bnb.optim.Adam8bit
import bitsandbytes as bnb
from bitsandbytes.optim import Adam8bit
- BitsAndBytesConfig
from transformers import BitsAndBytesConfig
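A typical use of the optimizer import above is as a drop-in swap for `torch.optim.Adam`. A minimal sketch, guarded so it degrades gracefully when `torch`/`bitsandbytes` or a CUDA device is unavailable (8-bit optimizer state requires CUDA parameters):

```python
try:
    import torch
    import bitsandbytes as bnb

    model = torch.nn.Linear(64, 64).cuda()  # 8-bit optimizers need CUDA tensors
    # Drop-in replacement: same constructor shape as torch.optim.Adam.
    optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)
    print("Adam8bit optimizer created.")
except Exception as exc:  # missing package or no CUDA device
    optimizer = None
    print(f"Skipping Adam8bit demo: {exc}")
```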
Quickstart
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# NOTE: In a real scenario a full-size model such as 'meta-llama/Llama-2-7b-hf'
# would be used. This quickstart uses a tiny public model instead so it stays
# runnable without a large download or gated-model access.
# Configure 8-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
)
# Dummy stand-ins for AutoModelForCausalLM/AutoTokenizer let the script run
# end to end even without a CUDA GPU or access to a real model.
class DummyModel:
    def __init__(self, config=None, **kwargs):
        print(f"Dummy model initialized with config: {config}, kwargs: {kwargs}")
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'

    def generate(self, *args, **kwargs):
        print(f"Dummy generate call with args: {args}, kwargs: {kwargs}")
        return torch.tensor([[1, 2, 3]])  # Placeholder output

class DummyTokenizer:
    def __init__(self, *args, **kwargs):
        pass

    def __call__(self, text, return_tensors):
        print(f"Dummy tokenizer called with '{text}'")
        return {'input_ids': torch.tensor([[0, 1, 2, 3]])}

    def decode(self, *args, **kwargs):
        return "dummy output"
# Use the real AutoModelForCausalLM/AutoTokenizer when a CUDA-enabled GPU is
# available; otherwise fall back to the dummy classes so the code still runs.
if torch.cuda.is_available():
    try:
        # 'hf-internal-testing/tiny-random-llama' is a tiny public model: it
        # keeps the quickstart runnable without an HF token or large download,
        # though it does not fully demonstrate bnb's memory savings. A gated
        # model like 'meta-llama/Llama-2-7b-hf' requires accepting its terms
        # of use on Hugging Face first.
        model_id = "hf-internal-testing/tiny-random-llama"
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            device_map="auto",
            quantization_config=bnb_config,
            torch_dtype=torch.float16,  # Often beneficial with bnb
        )
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        print("Model loaded using transformers and bitsandbytes.")
        inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=20)
        print("Generated text (first 50 chars):", tokenizer.decode(outputs[0])[:50])
    except Exception as e:
        print(f"Could not load actual model: {e}. Using dummy classes instead.")
        model = DummyModel(quantization_config=bnb_config, device_map="auto")
        tokenizer = DummyTokenizer()
        inputs = tokenizer("Hello, my name is", return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=20)
        print("Generated text (dummy):", tokenizer.decode(outputs[0]))
else:
    print("CUDA not available. Using dummy classes for demonstration.")
    model = DummyModel(quantization_config=bnb_config, device_map="auto")
    tokenizer = DummyTokenizer()
    inputs = tokenizer("Hello, my name is", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print("Generated text (dummy):", tokenizer.decode(outputs[0]))
print("Bitsandbytes integration quickstart completed.")
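The quickstart above uses 8-bit loading; the QLoRA-style 4-bit path mentioned in the intro uses the same `BitsAndBytesConfig` with the 4-bit parameters instead. A sketch, guarded so it runs even without `transformers`/`torch` installed (the parameter names are the standard ones from the transformers integration):

```python
try:
    import torch
    from transformers import BitsAndBytesConfig

    nf4_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA data type
        bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
        bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the actual matmuls
    )
    print("4-bit NF4 config created:", nf4_config.load_in_4bit)
except Exception as exc:
    nf4_config = None
    print(f"transformers/torch not available; skipping: {exc}")
```

This config is passed as `quantization_config=` to `from_pretrained`, exactly like `bnb_config` in the quickstart.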