TRL
0.29.1 · verified Tue May 12 · auth: no · python install: stale · quickstart: stale
Hugging Face library for post-training LLMs: SFT, DPO, GRPO, PPO, reward modeling. Current version is 0.29.1 (Mar 2026). Requires Python >=3.10. Extremely high API churn — major parameter renames across versions. tokenizer= renamed to processing_class= in 0.12. Still pre-1.0 (Development Status: Pre-Alpha).
pip install trl

Warnings
breaking tokenizer= parameter deprecated since TRL 0.12 and being removed. All trainers (SFTTrainer, DPOTrainer, etc.) now use processing_class= instead. Passing tokenizer= raises DeprecationWarning now, TypeError in future release. ↓
fix Replace tokenizer=tokenizer with processing_class=tokenizer in all trainer constructors.
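A minimal sketch of the rename (model and dataset names here are illustrative; see the Imports section below for the fuller pattern):

from datasets import load_dataset
from transformers import AutoTokenizer
from trl import SFTConfig, SFTTrainer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')
# TRL <= 0.11: SFTTrainer(model=..., tokenizer=tokenizer, ...)
# TRL >= 0.12: same object, new keyword
trainer = SFTTrainer(
    model='Qwen/Qwen2.5-0.5B',
    args=SFTConfig(output_dir='out'),
    train_dataset=load_dataset('trl-lib/Capybara', split='train'),
    processing_class=tokenizer,  # replaces tokenizer=
)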
breaking Training args like max_seq_length, packing, dataset_text_field have moved from SFTTrainer constructor to SFTConfig. Passing them directly to SFTTrainer raises TypeError in recent versions. ↓
fix Use SFTConfig(max_length=512, ...) (the old max_seq_length becomes max_length in SFTConfig) and pass it as args=SFTConfig(...) to SFTTrainer.
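A sketch of the new call shape with the moved parameters, assuming a plain-text dataset (the model and dataset names are illustrative):

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

args = SFTConfig(
    output_dir='sft_output',
    max_length=512,             # was max_seq_length on the old SFTTrainer constructor
    packing=True,               # also moved into SFTConfig
    dataset_text_field='text',  # column that holds the raw text
)
trainer = SFTTrainer(
    model='Qwen/Qwen2.5-0.5B',
    args=args,
    train_dataset=load_dataset('stanfordnlp/imdb', split='train'),
)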
breaking TRL has extremely high API churn — major parameter renames, removals, and behavioral changes in almost every minor release. Code written for 0.8 likely fails on 0.15+. Pin versions in production. ↓
fix Pin exact TRL version in requirements. Read the GitHub releases page before upgrading: https://github.com/huggingface/trl/releases
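For example, pin the exact version this page documents (0.29.1) rather than a range:

pip install "trl==0.29.1"
# or as a requirements.txt / constraints entry:
# trl==0.29.1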
gotcha DPOTrainer with PEFT does not keep a separate reference model in memory — it disables the adapter to recover reference behavior. Passing ref_model= with a PEFT model wastes memory and raises a warning about sync_ref_model incompatibility. ↓
fix With PEFT models, omit ref_model entirely. DPOTrainer handles reference behavior automatically via adapter disable/enable.
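A sketch of the PEFT path, assuming trl[peft] is installed (the model name and LoRA hyperparameters are placeholders):

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-0.5B')
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')
trainer = DPOTrainer(
    model=model,  # no ref_model: the adapter is disabled internally to compute reference log-probs
    args=DPOConfig(output_dir='dpo_output', beta=0.1),
    train_dataset=load_dataset('trl-lib/ultrafeedback_binarized', split='train'),
    processing_class=tokenizer,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type='CAUSAL_LM'),
)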
gotcha GRPOTrainer reward_funcs must be a list of callables, not a single callable. Passing a single function raises TypeError. ↓
fix Wrap single reward function in a list: reward_funcs=[my_reward_fn].
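A sketch with a single custom reward function wrapped in a list (the length-based reward is a toy example; the dataset is assumed to have a 'prompt' column):

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def length_reward(completions, **kwargs):
    # toy reward: prefer shorter completions; must return one float per completion
    return [-float(len(c)) for c in completions]

trainer = GRPOTrainer(
    model='Qwen/Qwen2.5-0.5B',
    reward_funcs=[length_reward],  # a list, even for a single function
    args=GRPOConfig(output_dir='grpo_output'),
    train_dataset=load_dataset('trl-lib/tldr', split='train'),
)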
gotcha SFTTrainer dataset format: conversational datasets (with 'messages' column) are handled differently from text datasets (with 'text' column). Mixing formats or using wrong column name causes silent empty-loss training. ↓
fix For chat format datasets use a 'messages' column with OpenAI-style message dicts. For plain text use a 'text' column. Check dataset_text_field in SFTConfig if using a custom column name.
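A sketch of both accepted shapes, built from in-memory rows (the example rows and column names are illustrative):

from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# conversational format: a 'messages' column of OpenAI-style role/content dicts
chat_ds = Dataset.from_dict({
    'messages': [[
        {'role': 'user', 'content': 'What is TRL?'},
        {'role': 'assistant', 'content': 'A post-training library.'},
    ]]
})

# plain-text format: a 'text' column, or a custom name declared via dataset_text_field
text_ds = Dataset.from_dict({'content': ['TRL is a post-training library for LLMs.']})
args = SFTConfig(output_dir='out', dataset_text_field='content')
trainer = SFTTrainer(model='Qwen/Qwen2.5-0.5B', args=args, train_dataset=text_ds)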
breaking TRL's core dependency, `torch`, often lacks pre-built wheels for less common Python versions (e.g., Python 3.13) or non-glibc environments (e.g., Alpine Linux). This results in `pip` installation failures because `torch` cannot be resolved. ↓
fix Ensure your Python environment (version and OS distribution) has readily available pre-built `torch` wheels. Consider using a Python version officially supported by PyTorch (e.g., Python 3.10, 3.11) on a widely supported base OS (e.g., Debian/Ubuntu). If using Alpine, you may need to build `torch` from source or switch to a glibc-based image.
Install
pip install trl[peft]
pip install trl[quantization]
pip install trl[vllm]

Install compatibility (stale; last tested: 2026-05-12)
python os / libc variant status wheel install import disk
3.10 alpine (musl) trl - - - -
3.10 alpine (musl) peft - - - -
3.10 alpine (musl) quantization - - - -
3.10 alpine (musl) vllm - - - -
3.10 slim (glibc) trl - - 17.16s 5.0G
3.10 slim (glibc) peft - - - -
3.10 slim (glibc) quantization - - - -
3.10 slim (glibc) vllm - - - -
3.11 alpine (musl) trl - - - -
3.11 alpine (musl) peft - - - -
3.11 alpine (musl) quantization - - - -
3.11 alpine (musl) vllm - - - -
3.11 slim (glibc) trl - - 22.34s 5.1G
3.11 slim (glibc) peft - - 22.70s 5.1G
3.11 slim (glibc) quantization - - 21.59s 5.3G
3.11 slim (glibc) vllm - - - -
3.12 alpine (musl) trl - - - -
3.12 alpine (musl) peft - - - -
3.12 alpine (musl) quantization - - - -
3.12 alpine (musl) vllm - - - -
3.12 slim (glibc) trl - - 24.18s 5.1G
3.12 slim (glibc) peft - - 25.37s 5.1G
3.12 slim (glibc) quantization - - 24.42s 5.2G
3.12 slim (glibc) vllm - - - -
3.13 alpine (musl) trl - - - -
3.13 alpine (musl) peft - - - -
3.13 alpine (musl) quantization - - - -
3.13 alpine (musl) vllm - - - -
3.13 slim (glibc) trl - - 21.36s 5.1G
3.13 slim (glibc) peft - - 21.91s 5.1G
3.13 slim (glibc) quantization - - 21.02s 5.2G
3.13 slim (glibc) vllm - - - -
3.9 alpine (musl) trl - - - -
3.9 alpine (musl) peft - - - -
3.9 alpine (musl) quantization - - - -
3.9 alpine (musl) vllm - - - -
3.9 slim (glibc) trl - - - -
3.9 slim (glibc) peft - - - -
3.9 slim (glibc) quantization - - - -
3.9 slim (glibc) vllm - - - -
Imports
- SFTTrainer

wrong
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,   # deprecated since 0.12, removed in a future release
    max_seq_length=512,    # moved to SFTConfig, not accepted by SFTTrainer directly
)

correct
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import SFTConfig, SFTTrainer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')
trainer = SFTTrainer(
    model='Qwen/Qwen2.5-0.5B',
    args=SFTConfig(output_dir='output', max_length=512),
    train_dataset=load_dataset('trl-lib/Capybara', split='train'),
    processing_class=tokenizer,  # not tokenizer=
)

- DPOTrainer

wrong
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,   # unnecessary when using PEFT; causes extra memory usage
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,   # deprecated
)

correct
from trl import DPOConfig, DPOTrainer

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir='output', beta=0.1),
    train_dataset=dataset,
    processing_class=tokenizer,
    # with PEFT: omit ref_model; the adapter is disabled to recover reference behavior
)
Quickstart (stale; last tested: 2026-04-23)
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
# SFT — minimal setup
trainer = SFTTrainer(
model='Qwen/Qwen2.5-0.5B',
args=SFTConfig(output_dir='sft_output', num_train_epochs=1),
train_dataset=load_dataset('trl-lib/Capybara', split='train'),
)
trainer.train()
trainer.save_model()  # write the final model and tokenizer to sft_output so the DPO step can load them
# DPO — after SFT
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained('sft_output')
tokenizer = AutoTokenizer.from_pretrained('sft_output')
trainer = DPOTrainer(
model=model,
args=DPOConfig(output_dir='dpo_output', beta=0.1),
train_dataset=load_dataset('trl-lib/ultrafeedback_binarized', split='train'),
processing_class=tokenizer,
)
trainer.train()