TRL
Hugging Face library for post-training LLMs: SFT, DPO, GRPO, PPO, reward modeling. Current version is 0.29.1 (Mar 2026). Requires Python >=3.10. Extremely high API churn — major parameter renames across versions. tokenizer= renamed to processing_class= in 0.12. Still pre-1.0 (Development Status: Pre-Alpha).
Warnings
- breaking tokenizer= parameter has been deprecated since TRL 0.12 and is slated for removal. All trainers (SFTTrainer, DPOTrainer, etc.) now use processing_class= instead. Passing tokenizer= raises a DeprecationWarning now and will raise a TypeError in a future release.
- breaking Training args like max_seq_length, packing, dataset_text_field have moved from SFTTrainer constructor to SFTConfig. Passing them directly to SFTTrainer raises TypeError in recent versions.
- breaking TRL has extremely high API churn — major parameter renames, removals, and behavioral changes in almost every minor release. Code written for 0.8 likely fails on 0.15+. Pin versions in production.
- gotcha DPOTrainer with PEFT does not keep a separate reference model in memory — it disables the adapter to recover reference behavior. Passing ref_model= with a PEFT model wastes memory and raises a warning about sync_ref_model incompatibility.
- gotcha GRPOTrainer reward_funcs must be a list of callables, not a single callable. Passing a single function raises TypeError.
- gotcha SFTTrainer dataset format: conversational datasets (with a 'messages' column) are handled differently from text datasets (with a 'text' column). Mixing formats or using the wrong column name causes silent empty-loss training.
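The reward_funcs gotcha above can be sketched in plain Python. The (completions, **kwargs) -> list[float] signature follows TRL's documented reward-function contract; verify against your pinned version, and treat the specific reward functions here as illustrative assumptions.

```python
# Sketch of GRPO-style reward functions, assuming TRL's documented contract:
# each function takes completions (a list of generated texts) plus **kwargs
# and returns one float score per completion.

def reward_length(completions, **kwargs):
    """Prefer shorter completions: negative length as reward."""
    return [-float(len(c)) for c in completions]

def reward_has_answer(completions, **kwargs):
    """Binary reward: 1.0 if the completion contains 'Answer:'."""
    return [1.0 if 'Answer:' in c else 0.0 for c in completions]

# GRPOTrainer expects a LIST of callables, even when there is only one:
reward_funcs = [reward_length, reward_has_answer]

scores = [f(['Answer: 42', 'no idea']) for f in reward_funcs]
print(scores)  # [[-10.0, -7.0], [1.0, 0.0]]
```

Passing `reward_funcs=reward_length` (a bare callable) is the mistake the warning describes; always wrap in a list.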
Install
- pip install trl
- pip install trl[peft]
- pip install trl[quantization]
- pip install trl[vllm]
Imports
- SFTTrainer
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
trainer = SFTTrainer(
    model='Qwen/Qwen2.5-0.5B',
    args=SFTConfig(output_dir='output', max_length=512),
    train_dataset=load_dataset('trl-lib/Capybara', split='train'),
    processing_class=tokenizer,  # not tokenizer=
)
- DPOTrainer
from trl import DPOConfig, DPOTrainer
trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir='output', beta=0.1),
    train_dataset=dataset,
    processing_class=tokenizer,
    # With PEFT: no ref_model needed — adapter is disabled to recover reference behavior
)
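The preference data DPOTrainer consumes can be sketched as plain records. The 'prompt'/'chosen'/'rejected' column names and the text-vs-conversational split follow TRL's dataset-format documentation; the detection helper below is a hypothetical stand-in for what TRL does internally, not TRL API.

```python
# Standard (text) preference record: plain strings.
standard_example = {
    'prompt': 'What color is the sky?',
    'chosen': 'The sky is blue.',
    'rejected': 'The sky is green.',
}

# Conversational preference record: lists of {'role', 'content'} messages,
# to which the tokenizer's chat template is applied.
conversational_example = {
    'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
    'chosen': [{'role': 'assistant', 'content': 'The sky is blue.'}],
    'rejected': [{'role': 'assistant', 'content': 'The sky is green.'}],
}

def looks_conversational(example):
    """Hypothetical checker: message lists vs plain strings."""
    value = example['chosen']
    return isinstance(value, list) and all(
        isinstance(m, dict) and {'role', 'content'} <= m.keys() for m in value
    )

print(looks_conversational(standard_example))        # False
print(looks_conversational(conversational_example))  # True
```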
Quickstart
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
# SFT — minimal setup
trainer = SFTTrainer(
    model='Qwen/Qwen2.5-0.5B',
    args=SFTConfig(output_dir='sft_output', num_train_epochs=1),
    train_dataset=load_dataset('trl-lib/Capybara', split='train'),
)
trainer.train()
# DPO — after SFT
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained('sft_output')
tokenizer = AutoTokenizer.from_pretrained('sft_output')
trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir='dpo_output', beta=0.1),
    train_dataset=load_dataset('trl-lib/ultrafeedback_binarized', split='train'),
    processing_class=tokenizer,
)
trainer.train()
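The SFT dataset-format gotcha from the warnings can be made concrete with plain records. The 'messages' and 'text' column names follow TRL's dataset-format documentation; the validation helper is hypothetical (a fail-loudly guard, not TRL API).

```python
# SFTTrainer detects the dataset format from its columns:
# a 'messages' column => conversational (chat template applied),
# a 'text' column => plain language-modeling text.
text_record = {'text': 'TRL is a library for post-training LLMs.'}
chat_record = {
    'messages': [
        {'role': 'user', 'content': 'What is TRL?'},
        {'role': 'assistant', 'content': 'A post-training library.'},
    ]
}

def detect_sft_format(record):
    """Hypothetical guard: fail loudly instead of training on an empty loss."""
    if 'messages' in record:
        return 'conversational'
    if 'text' in record:
        return 'text'
    raise ValueError(f'Unrecognized columns: {sorted(record)}')

print(detect_sft_format(text_record))  # text
print(detect_sft_format(chat_record))  # conversational
```

Running a check like this over a few samples before training is a cheap way to catch the wrong-column-name case up front.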