Complete Tutorial: Fine-tuning LLMs with LoRA and PEFT
Fine-tuning Large Language Models (LLMs) traditionally requires massive GPU memory and long training times. LoRA (Low-Rank Adaptation) and PEFT (Parameter-Efficient Fine-Tuning) enable you to fine-tune with significantly fewer resources while maintaining competitive performance.
What are LoRA and PEFT?
LoRA (Low-Rank Adaptation)
LoRA is a technique that adds trainable low-rank matrices to frozen model layers. Instead of updating all model parameters, you:
- Freeze all original model weights
- Inject trainable rank decomposition matrices (A and B)
- Train only these new matrices (< 1% of total parameters)
Key benefits:
- Memory efficient: training memory can be roughly 10x smaller than with full fine-tuning, because gradients and optimizer state are kept only for the adapter matrices
- Faster training: far fewer parameters receive gradient updates
- No added inference latency: the adapter can be merged into the base weights after training
- Task switching: Swap LoRA adapters for different tasks
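The mechanics in the list above can be sketched on a single toy layer. This is a minimal NumPy illustration, not PEFT's implementation: the layer size, the initialisation scales, and the helper name `lora_forward` are assumptions chosen for the example. It shows the frozen weight `W`, the trainable pair `A` and `B`, the `alpha / r` scaling, and why merging leaves the output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only (assumption, not from the tutorial).
d_out, d_in, r, alpha = 1024, 1024, 8, 32
W = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))    # trainable, small random init
B = np.zeros((d_out, r))                      # trainable, zero init so the update starts at 0
scaling = alpha / r

def lora_forward(x, W, A, B, scaling):
    """Compute y = x W^T + scaling * (x A^T) B^T without modifying W."""
    return x @ W.T + scaling * (x @ A.T) @ B.T

x = rng.normal(size=(4, d_in))

# With B = 0 the adapted layer reproduces the frozen layer exactly.
assert np.allclose(lora_forward(x, W, A, B, scaling), x @ W.T)

# Stand-in for training: give B non-zero values, then merge the update into W.
B = rng.normal(scale=0.01, size=(d_out, r))
W_merged = W + scaling * (B @ A)
assert np.allclose(lora_forward(x, W, A, B, scaling), x @ W_merged.T)

full_params = W.size             # 1,048,576
lora_params = A.size + B.size    # 8 * 1024 + 1024 * 8 = 16,384
print(f"trainable fraction: {lora_params / full_params:.2%}")  # 1.56%
```

Because the merged matrix has the same shape as `W`, inference after merging costs the same as the original layer. The trainable fraction for a whole model is lower than in this toy layer, since only some of its weight matrices receive adapters.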
PEFT (Parameter-Efficient Fine-Tuning)
PEFT is a library from Hugging Face that provides various efficient fine-tuning methods:
- LoRA: Low-rank adaptation
- QLoRA: LoRA with quantization
- Prefix Tuning: Prepend trainable prefix vectors to each layer's attention
- Prompt Tuning: Learn soft prompt embeddings at the input
- IA3: Infused Adapter by Inhibiting and Amplifying Inner Activations
Installation
# Install packages
pip install transformers datasets accelerate peft bitsandbytes
pip install trl  # For training utilities
pip install wandb  # Optional: experiment tracking
# For QLoRA (4-bit quantization); quote the spec so the shell does not treat >= as a redirect
pip install "bitsandbytes>=0.41.0"
# Verify GPU
python -c "import torch; print(torch.cuda.is_available())"
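If the one-line GPU check fails with an import error, it can help to see which packages are missing before going further. The following is a small sketch using only the standard library. The `REQUIRED` list and the helper name `missing_packages` are assumptions for illustration; `torch` is included because the check above imports it, even though the pip commands do not name it explicitly.

```python
import importlib.util

# Package names taken from the install commands above, plus torch (assumed
# to be installed separately or pulled in as a dependency).
REQUIRED = ["torch", "transformers", "datasets", "accelerate", "peft", "bitsandbytes", "trl"]

def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    import torch
    print("CUDA available:", torch.cuda.is_available())
```

`find_spec` only locates a package without importing it, so the check stays fast and does not load CUDA libraries unless everything is present.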
Quick Start: Fine-tune with LoRA
1. Basic LoRA Fine-tuning
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
from trl import SFTTrainer
# Load base model and tokenizer
model_name = "meta-llama/Llama-2-7b-hf"  # Or another model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Reuse EOS as the padding token
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
# LoRA configuration
lora_config = LoraConfig(
    r=8,  # Rank of the update matrices
    lora_alpha=32,  # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Modules to apply LoRA
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
# Apply LoRA to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06%
# Load dataset
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
# Training arguments
training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    warmup_ratio=0.03,
)
# Trainer
# Note: these keyword names follow older TRL releases; newer releases move
# dataset_text_field and the sequence-length setting into SFTConfig and
# rename tokenizer to processing_class.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=512,
)
# Train
trainer.train()
# Save LoRA weights (adapter only, not the base model)
model.save_pretrained("./lora-adapter")
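The numbers in the script above can be checked by hand. The sketch below reproduces the reported trainable-parameter count and the effective batch size. The Llama-2-7B shapes it uses (32 decoder layers, hidden size 4096, square query and value projections) are assumptions not stated in the script itself.

```python
# Assumed Llama-2-7B shapes (not given in the script above).
num_layers = 32
hidden_size = 4096

r = 8                    # LoRA rank from the configuration
targets_per_layer = 2    # query and value projections

# Each adapted projection adds A (r x d_in) and B (d_out x r).
params_per_module = r * hidden_size + hidden_size * r
trainable = num_layers * targets_per_layer * params_per_module
print(f"{trainable:,}")  # 4,194,304

all_params = 6_742_609_920  # total reported above (base weights plus adapters)
print(f"{100 * trainable / all_params:.2f}%")  # 0.06%

# Effective batch size per optimizer step on one device:
# per-device batch size times gradient accumulation steps.
per_device_batch, grad_accum = 4, 4
effective_batch = per_device_batch * grad_accum
print(effective_batch)  # 16
```

Doubling `r` doubles the trainable count, and each extra target module per layer adds another `num_layers * params_per_module` parameters when its projection has the same shape.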
2. LoRA Config Parameters
from peft import LoraConfig
config = LoraConfig(
    # Core parameters
    r=8,  # Rank: dimension of low-rank matrices
    # Higher = more expressive but more params
    # Typical: 4, 8, 16, 32, 64
    lora_alpha=32,  # Scaling factor
    # Common heuristic: alpha = 2 * r (this example uses 4 * r)
    # Scaling = alpha / r
    # Target modules (varies by model architecture)
    target_modules=[
        "q_proj",  # Query projection