LLM Post-Training with TRL: SFT, Reward Modeling, and DPO
After a base language model is pretrained, it still needs to be shaped into something useful and aligned with human expectations. TRL (Transformer Reinforcement Learning) is Hugging Face's library for that entire post-training stage, from supervised fine-tuning through preference optimization. This tutorial walks through the modern post-training pipeline and shows how to use TRL's trainers in practice, with Direct Preference Optimization (DPO) as the centerpiece.
What TRL Is and Where It Sits
TRL is a library focused on the steps that come after pretraining. Unlike a wrapper that mainly simplifies configuration, TRL provides concrete trainer classes for the full alignment stack: supervised fine-tuning, reward modeling, and several preference-optimization and reinforcement-learning algorithms.
It is built directly on the Hugging Face ecosystem:
- Transformers for model and tokenizer loading.
- PEFT for parameter-efficient adapters such as LoRA, so you can train large models on modest hardware.
- Accelerate for distributed and mixed-precision training.
- Datasets for loading and preprocessing data.
Because TRL reuses these components, anything you already know about AutoModelForCausalLM, tokenizers, or LoRA configs carries over directly.
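As a small illustration of that carry-over, the sketch below builds an ordinary PEFT LoRA configuration of the kind TRL's trainers accept through their peft_config argument. The specific values (rank 16, alpha 32, dropout 0.05) are illustrative assumptions rather than recommendations from this tutorial, and the import is guarded so the snippet still runs where PEFT is not installed.

```python
# Illustrative LoRA hyperparameters; these values are assumptions, not tuned settings.
lora_settings = {
    "r": 16,               # adapter rank
    "lora_alpha": 32,      # scaling factor applied to the adapter update
    "lora_dropout": 0.05,  # dropout on the adapter path
    "task_type": "CAUSAL_LM",
}

try:
    from peft import LoraConfig

    # The same object you would build for any PEFT workflow.
    peft_config = LoraConfig(**lora_settings)
except ImportError:
    peft_config = None  # PEFT is not installed in this environment

print("PEFT config:", peft_config if peft_config is not None else "peft not installed")
```

Nothing here is specific to TRL, which is the point: the configuration is plain PEFT.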
The Modern LLM Post-Training Pipeline
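A typical instruction-following or chat model goes through three stages:
1. Pretraining on large text corpora, which produces the base model.
2. Supervised fine-tuning (SFT) on instruction or chat data.
3. Preference optimization, either with a reward model plus reinforcement learning or directly with DPO.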
TRL covers stages 2 and 3. A common alignment recipe today is an SFT pass followed by a preference-optimization pass.
Installation
# Core stack
pip install trl peft datasets accelerate
# Recommended extras
pip install transformers bitsandbytes
pip install wandb # optional experiment tracking
Verify the install and check that a GPU is visible:
import torch
import trl
print("TRL version:", trl.__version__)
print("CUDA available:", torch.cuda.is_available())
TRL trainers run on CPU for tiny experiments, but realistic post-training requires a GPU.
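If you want to confirm the whole stack rather than just TRL, one option is to read the installed versions from package metadata, which avoids importing the heavier libraries. This is a minimal sketch using only the standard library; the helper name installed_versions is hypothetical.

```python
from importlib import metadata

PACKAGES = ["trl", "transformers", "peft", "accelerate", "datasets"]

def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name:<14}{version or 'not installed'}")
```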
Dataset Formats in TRL
TRL standardizes on a few dataset shapes. Getting these right is most of the work.
- Conversational (chat) format for SFT: a messages column containing a list of {"role", "content"} dictionaries.
- Preference format for reward modeling and DPO: columns named prompt, chosen, and rejected.
The conversational format looks like this:
example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what a vector database is."},
        {"role": "assistant", "content": "A vector database stores embeddings..."},
    ]
}
The preference format used throughout the rest of this tutorial:
example = {
    "prompt": "Explain what a vector database is.",
    "chosen": "A vector database stores embeddings and supports similarity search...",
    "rejected": "It's just a normal database.",
}
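Since most of the work is getting these shapes right, a quick sanity check before training can catch malformed records early. The sketch below is a hypothetical helper (detect_format is not part of TRL) that recognizes only the two shapes shown above; TRL accepts other variants that this check does not cover.

```python
def detect_format(example):
    """Classify a record as 'conversational', 'preference', or 'unknown'."""
    messages = example.get("messages")
    # Conversational: a non-empty list of dicts that each carry role and content.
    if isinstance(messages, list) and messages and all(
        isinstance(m, dict) and {"role", "content"}.issubset(m) for m in messages
    ):
        return "conversational"
    # Preference: plain-string prompt, chosen, and rejected fields.
    if all(isinstance(example.get(key), str) for key in ("prompt", "chosen", "rejected")):
        return "preference"
    return "unknown"

chat_record = {"messages": [{"role": "user", "content": "Hi"}]}
pref_record = {"prompt": "Hi", "chosen": "Hello!", "rejected": "Go away."}
print(detect_format(chat_record), detect_format(pref_record))
```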
TRL applies the model's chat template automatically when a dataset is conversational, so you rarely format strings by hand.
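To make "applying a chat template" concrete, here is a toy renderer that flattens a message list into a single string. It assumes a ChatML-style layout purely for illustration; real models ship their own templates with the tokenizer, and the function name render_chatml is hypothetical.

```python
def render_chatml(messages):
    """Flatten role/content messages into one ChatML-style string (illustrative only)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    return "\n".join(parts) + "\n"

messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
text = render_chatml(messages)
print(text)
```

In real training the string comes from the tokenizer's own template, so the special tokens always match what the model expects.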
Preparing Data with the datasets Library
In practice your raw data rarely arrives in the exact shape a trainer expects. The datasets library makes reshaping cheap: Dataset.map caches its output on disk and can run across several processes via num_proc.
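from datasets import load_dataset

# Load raw records from a local JSON Lines file
raw = load_dataset("json", data_files="mypreferences.jsonl", split="train")

# Map the raw column names onto the prompt/chosen/rejected columns TRL expects
def topreference(example):
    return {
        "prompt": example["question"],
        "chosen": example["goodanswer"],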