TRL
Hugging Face library for post-training LLMs: SFT, DPO, GRPO, PPO, reward modeling. Current version is 0.29.1 (Mar 2026). Requires Python >=3.10. Extremely high API churn — major parameter renames across versions. tokenizer= renamed to processing_class= in 0.12. Still pre-1.0 (Development Status: Pre-Alpha).
Warnings
- breaking tokenizer= parameter has been deprecated since TRL 0.12 and is slated for removal. All trainers (SFTTrainer, DPOTrainer, etc.) now use processing_class= instead. Passing tokenizer= raises a DeprecationWarning now and will raise a TypeError in a future release.
- breaking Training args like max_seq_length, packing, dataset_text_field have moved from SFTTrainer constructor to SFTConfig. Passing them directly to SFTTrainer raises TypeError in recent versions.
- breaking TRL has extremely high API churn — major parameter renames, removals, and behavioral changes in almost every minor release. Code written for 0.8 likely fails on 0.15+. Pin versions in production.
- gotcha DPOTrainer with PEFT does not keep a separate reference model in memory — it disables the adapter to recover reference behavior. Passing ref_model= with a PEFT model wastes memory and raises a warning about sync_ref_model incompatibility.
- gotcha GRPOTrainer reward_funcs must be a list of callables, not a single callable. Passing a single function raises TypeError.
- gotcha SFTTrainer dataset format: conversational datasets (with a 'messages' column) are handled differently from text datasets (with a 'text' column). Mixing formats or using the wrong column name causes silent empty-loss training.
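The reward_funcs gotcha above can be sketched in plain Python. The (completions, **kwargs) -> list[float] signature follows TRL's documented reward-function contract; verify against your pinned version, and treat the specific reward functions here as illustrative assumptions.

```python
# Sketch of GRPO-style reward functions, assuming TRL's documented contract:
# each function takes completions (a list of generated texts) plus **kwargs
# and returns one float score per completion.

def reward_length(completions, **kwargs):
    """Prefer shorter completions: negative length as reward."""
    return [-float(len(c)) for c in completions]

def reward_has_answer(completions, **kwargs):
    """Binary reward: 1.0 if the completion contains 'Answer:'."""
    return [1.0 if 'Answer:' in c else 0.0 for c in completions]

# GRPOTrainer expects a LIST of callables, even when there is only one:
reward_funcs = [reward_length, reward_has_answer]

scores = [f(['Answer: 42', 'no idea']) for f in reward_funcs]
print(scores)  # [[-10.0, -7.0], [1.0, 0.0]]
```

Passing `reward_funcs=reward_length` (a bare callable) is the mistake the warning describes; always wrap in a list.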
Install
- pip install trl
- pip install trl[peft]
- pip install trl[quantization]
- pip install trl[vllm]
Imports
- SFTTrainer
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
trainer = SFTTrainer(
    model='Qwen/Qwen2.5-0.5B',
    args=SFTConfig(output_dir='output', max_length=512),
    train_dataset=load_dataset('trl-lib/Capybara', split='train'),
    processing_class=tokenizer,  # not tokenizer=
)
- DPOTrainer
from trl import DPOConfig, DPOTrainer
trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir='output', beta=0.1),
    train_dataset=dataset,
    processing_class=tokenizer,
    # With PEFT: no ref_model needed — adapter is disabled to recover reference behavior
)
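The preference data DPOTrainer consumes can be sketched as plain records. The 'prompt'/'chosen'/'rejected' column names and the text-vs-conversational split follow TRL's dataset-format documentation; the detection helper below is a hypothetical stand-in for what TRL does internally, not TRL API.

```python
# Standard (text) preference record: plain strings.
standard_example = {
    'prompt': 'What color is the sky?',
    'chosen': 'The sky is blue.',
    'rejected': 'The sky is green.',
}

# Conversational preference record: lists of {'role', 'content'} messages,
# to which the tokenizer's chat template is applied.
conversational_example = {
    'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
    'chosen': [{'role': 'assistant', 'content': 'The sky is blue.'}],
    'rejected': [{'role': 'assistant', 'content': 'The sky is green.'}],
}

def looks_conversational(example):
    """Hypothetical checker: message lists vs plain strings."""
    value = example['chosen']
    return isinstance(value, list) and all(
        isinstance(m, dict) and {'role', 'content'} <= m.keys() for m in value
    )

print(looks_conversational(standard_example))        # False
print(looks_conversational(conversational_example))  # True
```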
Quickstart
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
# SFT — minimal setup
trainer = SFTTrainer(
    model='Qwen/Qwen2.5-0.5B',
    args=SFTConfig(output_dir='sft_output', num_train_epochs=1),
    train_dataset=load_dataset('trl-lib/Capybara', split='train'),
)
trainer.train()
# DPO — after SFT
from trl import DPOConfig, DPOTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained('sft_output')
tokenizer = AutoTokenizer.from_pretrained('sft_output')
trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir='dpo_output', beta=0.1),
    train_dataset=load_dataset('trl-lib/ultrafeedback_binarized', split='train'),
    processing_class=tokenizer,
)
trainer.train()
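The SFT dataset-format gotcha from the warnings can be made concrete with plain records. The 'messages' and 'text' column names follow TRL's dataset-format documentation; the validation helper is hypothetical (a fail-loudly guard, not TRL API).

```python
# SFTTrainer detects the dataset format from its columns:
# a 'messages' column => conversational (chat template applied),
# a 'text' column => plain language-modeling text.
text_record = {'text': 'TRL is a library for post-training LLMs.'}
chat_record = {
    'messages': [
        {'role': 'user', 'content': 'What is TRL?'},
        {'role': 'assistant', 'content': 'A post-training library.'},
    ]
}

def detect_sft_format(record):
    """Hypothetical guard: fail loudly instead of training on an empty loss."""
    if 'messages' in record:
        return 'conversational'
    if 'text' in record:
        return 'text'
    raise ValueError(f'Unrecognized columns: {sorted(record)}')

print(detect_sft_format(text_record))  # text
print(detect_sft_format(chat_record))  # conversational
```

Running a check like this over a few samples before training is a cheap way to catch the wrong-column-name case up front.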