Unsloth Tutorial: Fast and Memory-Efficient LLM Fine-Tuning


By Ruby Abdullah · tutorial
Tags: Unsloth, LLM, Fine-Tuning, LoRA, QLoRA, Python

Fine-Tuning LLMs Efficiently with Unsloth

Fine-tuning large language models used to demand expensive multi-GPU servers and hours of patience. Unsloth changes that equation by rewriting the hot paths of the training loop, letting you fine-tune models like Llama, Mistral, Qwen, Gemma, and Phi roughly twice as fast and with substantially less VRAM. In this tutorial we walk through a complete, realistic workflow: installing Unsloth, loading a 4-bit model, attaching LoRA adapters, preparing a dataset, training with Hugging Face's SFTTrainer, running inference, and exporting the result for production use.

What Is Unsloth and Why It Is Faster

Unsloth is an open-source library that accelerates the supervised fine-tuning (SFT) of transformer language models. It is not a new training algorithm and it does not change the math of your model. Instead, it replaces several performance-critical operations with hand-written implementations that do the same work more efficiently.

The speedups come from a few concrete engineering decisions:

  • Custom Triton kernels. Operations such as RoPE embeddings, RMSNorm, cross-entropy loss, and the LoRA matrix multiplications are reimplemented as fused Triton kernels. Fusing several steps into one kernel avoids repeated reads and writes to GPU memory, which is usually the real bottleneck.
  • Manual autograd. Rather than relying entirely on PyTorch's automatic differentiation graph, Unsloth provides hand-derived backward passes for the operations it owns. This removes intermediate tensors that PyTorch would otherwise keep around, lowering peak memory.
  • Memory-aware design. Unsloth integrates 4-bit quantization (QLoRA) and a custom gradient checkpointing strategy so that long sequences and larger models fit on a single consumer GPU.

A point worth stressing: these are exact reimplementations, not approximations. Unsloth does not trade accuracy for speed. The loss curve you get should match a standard Hugging Face plus PEFT run, just produced faster and with a smaller memory footprint.
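To illustrate what a hand-derived backward pass looks like, and why it can be exact rather than approximate, here is a minimal pure-Python sketch for RMSNorm. It is not Unsloth's Triton kernel; the function names (rmsnorm_forward, rmsnorm_backward, numerical_grad) and the sample values are hypothetical and chosen only for this illustration. The analytic gradient is compared against a finite-difference reference.

```python
import math


def rmsnorm_forward(x, w, eps=1e-6):
    """y_i = w_i * x_i / sqrt(mean(x^2) + eps)."""
    n = len(x)
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / n + eps)
    return [wi * xi * inv_rms for xi, wi in zip(x, w)]


def rmsnorm_backward(x, w, grad_out, eps=1e-6):
    """Hand-derived gradient of sum(grad_out * y) with respect to x."""
    n = len(x)
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / n + eps)
    # Scalar shared by every component: sum_i grad_out_i * w_i * x_i
    dot = sum(g * wi * xi for g, wi, xi in zip(grad_out, w, x))
    return [
        inv_rms * g * wi - (inv_rms ** 3) * xi * dot / n
        for g, wi, xi in zip(grad_out, w, x)
    ]


def numerical_grad(x, w, grad_out, h=1e-6):
    """Central finite differences on the same scalar loss, as a reference."""
    def loss(xs):
        return sum(g * y for g, y in zip(grad_out, rmsnorm_forward(xs, w)))

    grads = []
    for j in range(len(x)):
        up = x[:j] + [x[j] + h] + x[j + 1:]
        down = x[:j] + [x[j] - h] + x[j + 1:]
        grads.append((loss(up) - loss(down)) / (2 * h))
    return grads


# Hypothetical sample inputs for the check
x = [0.5, -1.2, 2.0, 0.3]
w = [1.0, 0.8, 1.1, 0.9]
grad_out = [0.1, -0.4, 0.25, 0.7]

analytic = rmsnorm_backward(x, w, grad_out)
numeric = numerical_grad(x, w, grad_out)
max_diff = max(abs(a - b) for a, b in zip(analytic, numeric))
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

The backward function needs only the input, the weights, and the upstream gradient, so no intermediate tensors have to be stored between the forward and backward passes. That is the general idea behind the memory savings, under the assumptions of this simplified example.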

What Unsloth Supports

Unsloth targets LoRA and QLoRA fine-tuning of the popular open-weight families:

  • Llama (3, 3.1, 3.2 and derivatives)
  • Mistral and Mixtral
  • Qwen (2 and 2.5)
  • Gemma (1 and 2)
  • Phi (3 and 3.5)

It works on a single NVIDIA GPU, including the free T4 offered in Google Colab. Full fine-tuning of every parameter is generally outside its sweet spot; the library is built around parameter-efficient methods.
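A rough back-of-the-envelope calculation shows why 4-bit loading matters on a single GPU. The sketch below estimates memory for the weights alone. The helper name weight_memory_gib and the round figure of 3 billion parameters are assumptions for illustration, and the estimate deliberately ignores quantization constants, activations, LoRA parameters, optimizer state, and the CUDA context.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumption: a model with roughly 3 billion parameters
num_params = 3e9

fp16_gib = weight_memory_gib(num_params, 16)  # 16-bit weights
int4_gib = weight_memory_gib(num_params, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gib:.2f} GiB")
print(f" 4-bit weights: {int4_gib:.2f} GiB")
print(f"reduction:      {fp16_gib / int4_gib:.0f}x")
```

Under these assumptions the weights shrink from about 5.6 GiB to about 1.4 GiB, which leaves far more headroom for activations and adapter training on a card the size of a T4.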

Installation

For most recent setups, a single pip command is enough:

pip install unsloth

Unsloth depends on a CUDA-enabled PyTorch build, transformers, trl, peft, accelerate, and bitsandbytes. If you are on a clean machine, install a PyTorch build matching your CUDA version first, then install Unsloth:

# Example for CUDA 12.1 (adjust for your environment)
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install unsloth

To pull the latest fixes directly from the repository:

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

You can confirm the install and inspect your GPU with:

import torch
from unsloth import FastLanguageModel

# Report whether a CUDA device is visible and, if so, which one
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU")
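Beyond the GPU check, it can help to confirm which versions of the dependencies listed earlier are actually installed. The following is a small sketch using only the standard library; the helper name installed_versions and the REQUIRED list are hypothetical, and the distribution names are assumed to match the package names mentioned above.

```python
from importlib import metadata

# Assumption: distribution names match the package names listed in this tutorial
REQUIRED = ["torch", "transformers", "trl", "peft", "accelerate", "bitsandbytes", "unsloth"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(REQUIRED)
for name, version in report.items():
    print(f"{name:<14} {version or 'NOT INSTALLED'}")
```

Because the script only reads package metadata, it runs without importing the heavy libraries themselves and works even on a machine without a GPU.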

Loading a 4-Bit Model

The entry point for almost everything is FastLanguageModel.from_pretrained. Loading in 4-bit (QLoRA) is what keeps VRAM low. Here we load a small instruct model that fits comfortably on a single GPU.

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # Context length you intend to train on
dtype = None           # None lets Unsloth auto-detect (bf16 on Ampere+, fp16 otherwise)
load_in_4bit = True    # 4-bit QLoRA; set False for 16-bit LoRA if you have the VRAM

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.2-3b-instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

A few notes on these arguments:
