{"id":1400,"library":"bitsandbytes","title":"Bitsandbytes","description":"Bitsandbytes is a Python library that provides k-bit optimizers and matrix multiplication routines, primarily designed for making large language models (LLMs) more accessible through quantization in PyTorch. It focuses on dramatically reducing memory consumption for both inference and training via 8-bit and 4-bit quantization, including techniques like LLM.int8() and QLoRA. The library is actively maintained and frequently updated; the current version is 0.49.2.","status":"active","version":"0.49.2","language":"en","source_language":"en","source_url":"https://github.com/bitsandbytes-foundation/bitsandbytes","tags":["quantization","LLM","PyTorch","GPU","deep-learning","memory-optimization","transformers"],"install":[{"cmd":"pip install bitsandbytes","lang":"bash","label":"Standard Installation"},{"cmd":"pip install bitsandbytes --prefer-binary --extra-index-url https://download.pytorch.org/whl/cu121","lang":"bash","label":"Installation with specific CUDA version (e.g., CUDA 12.1)"}],"dependencies":[{"reason":"Bitsandbytes is a lightweight wrapper around CUDA custom functions for PyTorch.","package":"torch"},{"reason":"Often used together with bitsandbytes for loading and quantizing large language models.","package":"transformers","optional":true},{"reason":"Commonly used for distributed training and model loading with transformers.","package":"accelerate","optional":true},{"reason":"Required for QLoRA fine-tuning and merging adapters.","package":"peft","optional":true}],"imports":[{"symbol":"bnb.nn.Linear8bitLt","correct":"import bitsandbytes as bnb\nfrom bitsandbytes.nn import Linear8bitLt"},{"symbol":"bnb.optim.Adam8bit","correct":"import bitsandbytes as bnb\nfrom bitsandbytes.optim import Adam8bit"},{"note":"Commonly used when integrating with Hugging Face Transformers for quantization.","symbol":"BitsAndBytesConfig","correct":"from transformers import BitsAndBytesConfig"}],"quickstart":{"code":"import 
torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n\n# Configure 8-bit quantization (LLM.int8())\nbnb_config = BitsAndBytesConfig(load_in_8bit=True)\n\n# 'hf-internal-testing/tiny-random-llama' is a tiny public model, chosen so the\n# example runs without authentication or a large download. Substitute a real\n# model (e.g. 'meta-llama/Llama-2-7b-hf', which requires accepting its terms of\n# use on Hugging Face) to see meaningful output and the memory savings.\nmodel_id = \"hf-internal-testing/tiny-random-llama\"\n\nif torch.cuda.is_available():\n    model = AutoModelForCausalLM.from_pretrained(\n        model_id,\n        device_map=\"auto\",\n        quantization_config=bnb_config,\n        torch_dtype=torch.float16,  # often beneficial alongside bitsandbytes\n    )\n    tokenizer = AutoTokenizer.from_pretrained(model_id)\n    print(\"Model loaded using transformers and bitsandbytes.\")\n\n    inputs = tokenizer(\"Hello, my name is\", return_tensors=\"pt\").to(model.device)\n    outputs = model.generate(**inputs, max_new_tokens=20)\n    print(\"Generated text (first 50 chars):\", tokenizer.decode(outputs[0])[:50])\nelse:\n    # bitsandbytes quantization requires a supported accelerator, so on a\n    # CPU-only machine we only show the configuration that would be used.\n    print(\"CUDA not available; skipping quantized model load.\")\n    print(\"Would load\", model_id, \"with\", bnb_config)\n\nprint(\"Bitsandbytes integration quickstart completed.\")","lang":"python","description":"This quickstart loads a small, publicly available model with 8-bit quantization by passing a `BitsAndBytesConfig` to the Hugging Face `transformers` loader. When no CUDA device is available, it prints the configuration instead of loading the model, since `bitsandbytes` quantization requires a supported accelerator."},"warnings":[{"fix":"Upgrade Python to 3.10+ and PyTorch to 2.3.0+ to ensure compatibility.","message":"Bitsandbytes v0.49.2 requires Python >=3.10 and PyTorch >=2.3.0. Support for older Python (e.g., 3.8, 3.9) and PyTorch (<2.3.0) versions has been dropped in recent releases.","severity":"breaking","affected_versions":">=0.49.0"},{"fix":"Ensure your `peft` library version is 0.14.0 or newer when working with 8-bit models and adapters.","message":"PEFT users who want to merge adapters into 8-bit weights must upgrade to `peft>=0.14.0` due to internal changes introduced in `bitsandbytes` 0.43.","severity":"breaking","affected_versions":">=0.43.0"},{"fix":"Check your GPU's compute capability. If you encounter issues, ensure your CUDA toolkit is compatible, or compile `bitsandbytes` from source; the official GitHub repository has detailed compilation instructions for unsupported hardware.","message":"GPU compatibility can be an issue. Older NVIDIA GPUs (Maxwell or Pascal generations, i.e. compute capability < 7.0) may not be fully supported and may require compiling `bitsandbytes` from source with specific flags, or using pre-compiled unofficial DLLs. 
Official support covers NVIDIA GPUs with CUDA 11.8 - 13.0, Intel XPUs, and Intel Gaudi accelerators.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Verify your CUDA toolkit installation and environment variables, and ensure PyTorch is installed with CUDA support. Reinstall `bitsandbytes` using the appropriate `pip install` command (e.g., with `--extra-index-url` for your CUDA version) to force a GPU-enabled build.","message":"The error message \"The installed version of bitsandbytes was compiled without GPU support\" indicates that a CPU-only build was installed, or that there is a problem with CUDA setup or detection.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Be aware of this minor output discrepancy when reproducing exact results from models quantized with older versions.","message":"After upgrading from `bitsandbytes` v0.42 to v0.43, models using 4-bit quantization may generate slightly different outputs (differences up to roughly the second decimal place) due to a fix in the underlying code.","severity":"gotcha","affected_versions":">=0.43.0 (when upgrading from <0.43.0)"}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}