TRL Tutorial: LLM Post-Training with SFT, DPO, and Reward Modeling

By Ruby Abdullah · tutorial
Tags: TRL, LLM, DPO, RLHF, Fine-Tuning, Python

LLM Post-Training with TRL: SFT, Reward Modeling, and DPO

After a base language model is pretrained, it still needs to be shaped into something useful and aligned with human expectations. TRL (Transformer Reinforcement Learning) is Hugging Face's library for that entire post-training stage, from supervised fine-tuning through preference optimization. This tutorial walks through the modern post-training pipeline and shows how to use TRL's trainers in practice, with Direct Preference Optimization (DPO) as the centerpiece.

What TRL Is and Where It Sits

TRL is a library focused on the steps that come after pretraining. Unlike a wrapper that mainly simplifies configuration, TRL provides concrete trainer classes for the full alignment stack: supervised fine-tuning, reward modeling, and several preference-optimization and reinforcement-learning algorithms.

It is built directly on the Hugging Face ecosystem:

  • Transformers for model and tokenizer loading.
  • PEFT for parameter-efficient adapters such as LoRA, so you can train large models on modest hardware.
  • Accelerate for distributed and mixed-precision training.
  • Datasets for loading and preprocessing data.

Because TRL reuses these components, anything you already know about AutoModelForCausalLM, tokenizers, or LoRA configs carries over directly.
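To make "parameter-efficient" concrete, the arithmetic below compares the trainable parameters of one full weight matrix with those of a LoRA adapter on the same layer. It is a back-of-the-envelope sketch, not PEFT's implementation: the function name, the 4096 x 4096 layer shape, and the rank of 16 are illustrative assumptions.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one linear layer.

    LoRA freezes the d_out x d_in weight W and learns a low-rank update B @ A,
    where B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Illustrative shape only; real layer sizes depend on the model.
full, lora = lora_param_counts(4096, 4096, rank=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

For this shape the adapter trains under 1% of the layer's parameters, which is why LoRA fits on hardware that full fine-tuning would not.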

The Modern LLM Post-Training Pipeline

A typical instruction-following or chat model goes through three stages:

  1. Pretraining: next-token prediction on a very large, mostly unlabeled corpus. This is expensive and usually done by model providers, not end users.
  2. Supervised Fine-Tuning (SFT): train the model on curated prompt/response pairs so it learns to follow instructions and adopt a chat format.
  3. Preference Optimization: adjust the model so its outputs match human (or AI) preferences. This can be done with a reward model plus reinforcement learning (RLHF/PPO), or more directly with DPO and related methods.

TRL covers stages 2 and 3. Most production alignment work today runs an SFT pass followed by a preference-optimization pass.
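Both routes through preference optimization rest on the same pairwise idea: the chosen response should score higher than the rejected one. The sketch below writes out the per-example losses in plain Python, using the standard formulations (a Bradley-Terry loss for the reward model, and the DPO loss over policy and reference log-probabilities). It is a minimal illustration, not TRL's code: the function names, the input numbers, and beta=0.1 are assumptions chosen for the example, and a real trainer computes batched tensor versions of these.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise loss: push the chosen reward above the rejected reward."""
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Policy identical to the reference: the loss is log(2), the starting point.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(f"start={start:.4f} improved={improved:.4f}")
```

Note that DPO needs no separate reward model: the log-probability margins against the reference model play the role of the reward.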

Installation

    # Core stack
    pip install trl peft datasets accelerate

    # Recommended extras
    pip install transformers bitsandbytes
    pip install wandb  # optional experiment tracking

Verify the install and check that a GPU is visible:

    import torch
    import trl

    print("TRL version:", trl.__version__)
    print("CUDA available:", torch.cuda.is_available())

TRL trainers run on CPU for tiny experiments, but realistic post-training requires a GPU.

Dataset Formats in TRL

TRL standardizes on a few dataset shapes. Getting these right is most of the work.

  • Conversational (chat) format for SFT: a messages column containing a list of {"role", "content"} dictionaries.
  • Preference format for reward modeling and DPO: columns named prompt, chosen, and rejected.

The conversational format looks like this:

    example = {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain what a vector database is."},
            {"role": "assistant", "content": "A vector database stores embeddings..."},
        ]
    }

The preference format used throughout the rest of this tutorial:

    example = {
        "prompt": "Explain what a vector database is.",
        "chosen": "A vector database stores embeddings and supports similarity search...",
        "rejected": "It's just a normal database.",
    }
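Because most failures at this stage come from mis-shaped records, a quick structural check before training can save a wasted run. The helper below is a hypothetical, standard-library-only sketch (`detect_format` is not a TRL function) that tests a record against only the two shapes described in this section.

```python
def detect_format(example: dict) -> str:
    """Return 'conversational' or 'preference'; raise ValueError otherwise."""
    if "messages" in example:
        messages = example["messages"]
        if not isinstance(messages, list) or not messages:
            raise ValueError("'messages' must be a non-empty list")
        for turn in messages:
            # Every turn needs at least a role and its content.
            if not isinstance(turn, dict) or not {"role", "content"} <= set(turn):
                raise ValueError(f"malformed turn: {turn!r}")
        return "conversational"
    if {"prompt", "chosen", "rejected"} <= set(example):
        return "preference"
    raise ValueError(f"unrecognized columns: {sorted(example)}")


chat_example = {
    "messages": [
        {"role": "user", "content": "Explain what a vector database is."},
        {"role": "assistant", "content": "A vector database stores embeddings..."},
    ]
}
pref_example = {
    "prompt": "Explain what a vector database is.",
    "chosen": "A vector database stores embeddings and supports similarity search...",
    "rejected": "It's just a normal database.",
}
print(detect_format(chat_example), detect_format(pref_example))
```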

TRL applies the model's chat template automatically when a dataset is conversational, so you rarely format strings by hand.
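To see what applying a chat template amounts to, the sketch below renders a message list into a single training string by hand, using a ChatML-style layout. This is an illustration only: `render_chatml` is a hypothetical helper, and the `<|im_start|>` / `<|im_end|>` markers are an assumed layout. Each model ships its own template with its tokenizer, which Transformers exposes through `tokenizer.apply_chat_template`, so the real output depends on the model you load.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = False) -> str:
    """Flatten chat messages into one string using ChatML-style markers."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what a vector database is."},
]
rendered = render_chatml(messages, add_generation_prompt=True)
print(rendered)
```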

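Preparing Data with the datasets Library

In practice, raw data rarely arrives in the exact shape a trainer expects. The datasets library makes reshaping cheap: map() applies a function to every row, can run in parallel across processes when num_proc is set, and is evaluated lazily when the dataset is loaded in streaming mode.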

    from datasets import load_dataset

    raw = load_dataset("json", data_files="mypreferences.jsonl", split="train")

    def to_preference(example):
        return {
            "prompt": example["question"],
            "chosen": example["goodanswer"],
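The listing above breaks off partway through the mapping function. As a self-contained illustration of the same reshaping step, here is a plain-Python sketch that needs no external library. The raw column name "badanswer" is an assumption (only "question" and "goodanswer" appear in the source), and `is_usable` is a hypothetical filter added for the example.

```python
def to_preference(example: dict) -> dict:
    """Map one raw record to the prompt/chosen/rejected shape."""
    return {
        "prompt": example["question"].strip(),
        "chosen": example["goodanswer"].strip(),
        "rejected": example["badanswer"].strip(),  # assumed column name
    }


def is_usable(row: dict) -> bool:
    """Drop pairs with an empty field or with identical responses."""
    has_all = all(row[key] for key in ("prompt", "chosen", "rejected"))
    return has_all and row["chosen"] != row["rejected"]


# Stand-in rows for what a JSONL file of raw preferences might contain.
raw_rows = [
    {
        "question": "Explain what a vector database is.",
        "goodanswer": "A vector database stores embeddings and supports similarity search.",
        "badanswer": "It's just a normal database.",
    },
    {"question": "What is LoRA?", "goodanswer": "A low-rank adapter method.", "badanswer": "  "},
]

pairs = [to_preference(row) for row in raw_rows]
pairs = [row for row in pairs if is_usable(row)]
print(len(pairs), sorted(pairs[0]))
```

With a datasets.Dataset, the same two functions could be passed to map() (with remove_columns to drop the raw columns) and filter() respectively.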
