LLM Post-Training with TRL: SFT, Reward Modeling, and DPO

After a base language model is pretrained, it still needs to be shaped into something useful and aligned with human expectations. TRL (Transformer Reinforcement Learning) is Hugging Face's library for that entire post-training stage, from supervised fine-tuning through preference optimization. This tutorial walks through the modern post-training pipeline and shows how to use TRL's trainers in practice, with Direct Preference Optimization (DPO) as the centerpiece.

What TRL Is and Where It Sits

TRL is a library focused on the steps that come after pretraining. Unlike a wrapper that mainly simplifies configuration, TRL provides concrete trainer classes for the full alignment stack: supervised fine-tuning, reward modeling, and several preference-optimization and reinforcement-learning algorithms.

It is built directly on the Hugging Face ecosystem:

Transformers for model and tokenizer loading.
PEFT for parameter-efficient adapters such as LoRA, so you can train large models on modest hardware.
Accelerate for distributed and mixed-precision training.
Datasets for loading and preprocessing data.

Because TRL reuses these components, anything you already know about AutoModelForCausalLM, tokenizers, or LoRA configs carries over directly.
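To make the "modest hardware" point concrete, here is a rough back-of-the-envelope sketch of how few parameters a LoRA adapter trains compared with the dense matrix it attaches to. The layer size and rank are hypothetical values chosen only for illustration, and the helper names are not part of PEFT.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight matrix itself."""
    return d_in * d_out


# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 16
added = lora_trainable_params(d_in, d_out, rank)
frozen = full_params(d_in, d_out)
print(f"LoRA trains {added:,} parameters instead of {frozen:,} "
      f"({100 * added / frozen:.2f}% of the matrix)")
```

Because only the small adapter matrices receive gradients and optimizer state, the memory needed for training shrinks roughly in proportion.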

The Modern LLM Post-Training Pipeline

A typical instruction-following or chat model goes through three stages:

1. Pretraining: next-token prediction on a very large, mostly unlabeled corpus. This is expensive and usually done by model providers, not end users.

2. Supervised Fine-Tuning (SFT): train the model on curated prompt/response pairs so it learns to follow instructions and adopt a chat format.

3. Preference Optimization: adjust the model so its outputs match human (or AI) preferences. This can be done with a reward model plus reinforcement learning (RLHF/PPO), or more directly with DPO and related methods.

TRL covers stages 2 and 3. In practice, alignment work typically runs an SFT pass first and a preference-optimization pass second.
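To give a feel for what "more directly" means in the DPO case, the sketch below computes the standard DPO loss for a single preference pair from four summed sequence log-probabilities. It is a plain-Python illustration, not TRL's implementation; the function name, argument names, and the beta value of 0.1 are assumptions made for this example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from summed sequence log-probabilities."""
    # How much more (or less) likely each answer is under the policy than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the margin is zero and the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen answer: the loss is lower.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

No reward model appears anywhere in the computation: the frozen reference model's log-probabilities take its place, which is why DPO needs only preference pairs and a reference copy of the SFT model.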

Installation

# Core stack
pip install trl peft datasets accelerate

# Recommended extras
pip install transformers bitsandbytes
pip install wandb  # optional experiment tracking

Verify the install and check that a GPU is visible:

import torch
import trl

print("TRL version:", trl.__version__)
print("CUDA available:", torch.cuda.is_available())

TRL trainers run on CPU for tiny experiments, but realistic post-training requires a GPU.
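It can also help to confirm which versions of the rest of the stack are installed, since the libraries evolve together. The following is a minimal sketch using only the standard library; the package list mirrors the install commands above, and the helper name is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the pip commands in this section.
PACKAGES = ["trl", "transformers", "peft", "datasets", "accelerate", "bitsandbytes"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:<14}{ver or 'not installed'}")
```

Recording this output alongside an experiment makes it easier to reproduce a run later.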

Dataset Formats in TRL

TRL standardizes on a few dataset shapes. Getting these right is most of the work.

Conversational (chat) format for SFT: a messages column containing a list of {"role", "content"} dictionaries.

Preference format for reward modeling and DPO: columns named prompt, chosen, and rejected.

The conversational format looks like this:

example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what a vector database is."},
        {"role": "assistant", "content": "A vector database stores embeddings..."},
    ]
}

The preference format used throughout the rest of this tutorial:

example = {
    "prompt": "Explain what a vector database is.",
    "chosen": "A vector database stores embeddings and supports similarity search...",
    "rejected": "It's just a normal database.",
}
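Since mismatched columns are a common source of confusing trainer errors, a small check run over a few rows before training can save time. The helper below is a hypothetical sketch, not a TRL function; it only recognizes the two shapes described in this section.

```python
def detect_format(example: dict) -> str:
    """Return 'conversational' or 'preference', or raise ValueError."""
    if "messages" in example:
        messages = example["messages"]
        if not isinstance(messages, list) or not messages:
            raise ValueError("'messages' must be a non-empty list")
        for turn in messages:
            # Every turn needs both keys, whatever else it carries.
            if not isinstance(turn, dict) or not {"role", "content"}.issubset(turn):
                raise ValueError(f"each message needs 'role' and 'content': {turn!r}")
        return "conversational"
    if {"prompt", "chosen", "rejected"}.issubset(example):
        return "preference"
    raise ValueError(f"unrecognized columns: {sorted(example)}")


chat_row = {"messages": [{"role": "user", "content": "Explain what a vector database is."}]}
pref_row = {"prompt": "Explain what a vector database is.",
            "chosen": "A vector database stores embeddings...",
            "rejected": "It's just a normal database."}
print(detect_format(chat_row))
print(detect_format(pref_row))
```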

TRL applies the model's chat template automatically when a dataset is conversational, so you rarely format strings by hand.
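To see what "applying a chat template" amounts to, the sketch below renders a message list into a single string using a ChatML-style layout. This is an illustration only: each model ships its own template, and the layout and function name here are assumptions. In real code the tokenizer's `apply_chat_template` method does this work.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render chat messages as one ChatML-style string (illustrative layout only)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference time.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what a vector database is."},
]
print(render_chatml(messages, add_generation_prompt=True))
```

With a loaded tokenizer, `tokenizer.apply_chat_template(messages, tokenize=False)` produces the equivalent string in whatever layout that model was trained on.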

Preparing Data with the datasets Library

In practice your raw data rarely arrives in the exact shape a trainer expects. The datasets library makes reshaping cheap: map applies a function to every row, caches the result on disk, and can run in parallel when you pass num_proc.

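from datasets import load_dataset

# Load raw preference records from a local JSON Lines file.
raw = load_dataset("json", data_files="my_preferences.jsonl", split="train")


def to_preference(example):
    # Rename the raw columns to the prompt/chosen/rejected names TRL expects.
    return {
        "prompt": example["question"],
        "chosen": example["good_answer"],
        # Assumed name for the rejected-answer column; adjust to your data.
        "rejected": example["bad_answer"],
    }


# Drop the original columns so only the three preference columns remain.
dataset = raw.map(to_preference, remove_columns=raw.column_names)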

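The same reshaping can be tried on plain Python dictionaries before involving any files, which makes it easy to test. The sketch below also drops pairs that cannot teach the model anything. The column names question and good_answer follow this section's example, bad_answer is an assumed name, and the sample rows are invented for illustration.

```python
def to_preference(example):
    """Rename raw columns to the prompt/chosen/rejected layout."""
    return {
        "prompt": example["question"].strip(),
        "chosen": example["good_answer"].strip(),
        "rejected": example["bad_answer"].strip(),  # assumed column name
    }


def is_usable(pair):
    """Keep pairs with non-empty fields and two distinct answers."""
    return all(pair.values()) and pair["chosen"] != pair["rejected"]


# Invented rows, for illustration only.
raw_rows = [
    {"question": "What is LoRA?", "good_answer": "A low-rank adapter method.",
     "bad_answer": "A kind of GPU."},
    {"question": "What is SFT?", "good_answer": "Supervised fine-tuning.",
     "bad_answer": "Supervised fine-tuning."},  # identical answers: no signal
    {"question": "What is DPO?", "good_answer": "  ", "bad_answer": "No idea."},
]

pairs = [p for p in map(to_preference, raw_rows) if is_usable(p)]
print(len(pairs), "usable pair(s)")
```

With a datasets object, the same two functions could be passed to `Dataset.map` and `Dataset.filter` respectively.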