Unsloth
Unsloth is a library that enables 2-5x faster training, reinforcement learning, and finetuning of large language models (LLMs) on consumer GPUs, often with reduced VRAM usage. As of version 2026.4.4 it continues to provide significant optimizations for Hugging Face Transformers models; releases follow a calendar-based versioning scheme, reflecting a frequent update cadence.
Warnings
- gotcha Unsloth performance critically depends on matching your CUDA toolkit version (if using NVIDIA GPUs) with the correct `[cuXXX]` installation extra (e.g., `[cu121]` for CUDA 12.1). Mismatching can lead to significant performance degradation or runtime errors.
- gotcha Unsloth is primarily designed and optimized for GPU acceleration. While it might run on a CPU, performance will be severely degraded and practically unusable for meaningful LLM finetuning. Ensure you have a compatible NVIDIA GPU with sufficient VRAM.
- gotcha Unsloth often relies on specific versions of `transformers` and `peft` libraries for optimal compatibility and performance. New releases of these dependencies might introduce breaking changes or require updates to Unsloth itself.
- breaking The arguments and default values for `FastLanguageModel.from_pretrained` and `FastLanguageModel.get_peft_model` can change between major Unsloth releases, especially concerning quantization, LoRA configuration, and gradient checkpointing.
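Since the warnings above hinge on having "sufficient VRAM", a back-of-envelope estimate helps before downloading weights. The helper below is a hypothetical sketch, not part of the Unsloth API; it counts quantized weight memory only and folds activations, KV cache, and LoRA optimizer state into a rough `overhead_gb` assumption.

```python
def estimate_vram_gb(n_params_billion: float, bits_per_weight: int = 4,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate (GB) for loading a quantized model.

    Counts weights only; activations, KV cache, and LoRA optimizer
    state are approximated by overhead_gb. Illustrative, not exact.
    """
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return round(weight_bytes / 1024**3 + overhead_gb, 2)

# A 7B model in 4-bit: ~3.26 GB of weights plus overhead.
print(estimate_vram_gb(7, bits_per_weight=4))   # ~4.76
print(estimate_vram_gb(7, bits_per_weight=16))  # ~14.54
```

Numbers like these explain why a 7B model finetunes comfortably in 4-bit on a 8-12 GB consumer card but not in full 16-bit precision.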
Install
- CUDA 12.1
pip install "unsloth[cu121]" --upgrade
- CUDA 11.8
pip install "unsloth[cu118]" --upgrade
- default
pip install unsloth --upgrade
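The `[cuXXX]` extra must match the local CUDA toolkit (see the gotcha above). A small hypothetical helper, not part of Unsloth, can derive the extra from a CUDA version string; in practice you would pass `torch.version.cuda`:

```python
def cuda_extra(cuda_version):
    """Map a CUDA version string (e.g. "12.1") to an Unsloth install extra.

    Hypothetical helper, not part of the Unsloth API. Returns the bare
    package name when no CUDA toolkit is detected (cuda_version is None).
    """
    if not cuda_version:
        return "unsloth"
    major, minor = cuda_version.split(".")[:2]
    return f"unsloth[cu{major}{minor}]"

print(cuda_extra("12.1"))  # unsloth[cu121]
print(cuda_extra("11.8"))  # unsloth[cu118]
print(cuda_extra(None))    # unsloth
```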
Imports
- FastLanguageModel
from unsloth import FastLanguageModel
- PatchModel
from unsloth import FastLanguageModel, PatchModel
Quickstart
from unsloth import FastLanguageModel
import torch
# 1. Load a pre-trained model and tokenizer
max_seq_length = 2048 # Max sequence length for your model
dtype = None # None for auto detection (bfloat16 preferred, float16 as fallback)
load_in_4bit = True # Use 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/mistral-7b-instruct-v0.2-bnb-4bit", # or "mistralai/Mistral-7B-Instruct-v0.2"
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
)
# 2. Prepare the model for training (add LoRA adapters)
model = FastLanguageModel.get_peft_model(
model,
r = 16, # LoRA attention dimension
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha = 16, # Alpha for LoRA scaling
lora_dropout = 0.05, # Dropout for LoRA layers
bias = "none", # "none" trains no bias parameters and is the optimized path
use_gradient_checkpointing = "unsloth", # "unsloth" enables optimized checkpointing for long context
random_state = 3407,
use_rslora = False,
loftq_config = None,
)
# 3. Define a simple dataset (for demonstration)
# In a real scenario, you would load your actual dataset here
# from datasets import Dataset
# dataset = Dataset.from_dict({"text": ["...", "..."]})
print("Model and tokenizer loaded and prepared for finetuning!")
print(f"Using dtype: {model.dtype}, 4-bit quantization: {load_in_4bit}")
# Further steps would involve data preparation and using Hugging Face Trainer
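The data preparation step the quickstart stops short of usually means turning instruction/response pairs into a single `"text"` column. A minimal sketch, assuming a Mistral-Instruct-style prompt; the hard-coded template is an illustrative assumption, and in practice `tokenizer.apply_chat_template` is the safer route:

```python
def format_example(instruction, response):
    """Format one instruction/response pair in Mistral-Instruct style.

    The template string is an illustrative assumption; prefer
    tokenizer.apply_chat_template for the model's real template.
    """
    return f"<s>[INST] {instruction} [/INST] {response}</s>"

rows = [
    {"instruction": "What is 2+2?", "response": "4"},
    {"instruction": "Name a prime number.", "response": "7"},
]
texts = [format_example(r["instruction"], r["response"]) for r in rows]
# These strings would become the "text" column of a datasets.Dataset
# handed to a Hugging Face / trl trainer.
print(texts[0])
```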