Stable Diffusion: A Comprehensive Tutorial
Introduction
Stable Diffusion is a latent diffusion model that generates high-quality images from text descriptions. Unlike earlier diffusion models that operated directly in pixel space, Stable Diffusion works in a compressed latent space, making it significantly faster and more memory-efficient. This tutorial covers everything from basic text-to-image generation to advanced techniques like ControlNet and DreamBooth fine-tuning.
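To make the efficiency claim concrete, the sketch below works out how much smaller the latent is than the image it represents. It assumes the values commonly cited for the Stable Diffusion v1/v2 autoencoder (8x spatial downsampling and 4 latent channels); these numbers are not stated in this tutorial, and the helper names are illustrative only.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return (channels, height, width) of the latent for a given image size.

    Assumes 8x spatial downsampling and 4 latent channels (an assumption
    about the autoencoder, not something this tutorial specifies).
    """
    if height % downsample or width % downsample:
        raise ValueError("height and width must be divisible by the downsampling factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, image_channels=3):
    """Ratio of values in the RGB image to values in its latent."""
    c, h, w = latent_shape(height, width)
    return (image_channels * height * width) / (c * h * w)


print(latent_shape(512, 512))       # (4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

Under these assumptions the denoising network processes 48 times fewer values per step than a pixel-space model at the same resolution, which is where most of the speed and memory savings come from.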
The Hugging Face Diffusers library provides the most convenient interface for working with Stable Diffusion models, and this tutorial uses it extensively.
Prerequisites
# Install required packages
pip install diffusers transformers accelerate torch torchvision
pip install safetensors xformers
pip install Pillow numpy
pip install peft # For fine-tuning with LoRA/DreamBooth
System requirements:
- Python 3.9 or higher
- NVIDIA GPU with at least 8 GB VRAM (16 GB recommended for fine-tuning)
- CUDA 11.8 or higher
- At least 16 GB system RAM
Verify your setup:
import torch
import diffusers

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A'}")
print(f"Diffusers version: {diffusers.__version__}")
Understanding Stable Diffusion Architecture
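Stable Diffusion consists of three main components:
- Text encoder (CLIP): converts the prompt into embeddings that condition generation
- U-Net: predicts the noise to remove from the latent at each denoising step
- VAE (variational autoencoder): decodes the final latent into a pixel-space image, and encodes images into latents for image-to-image workflows
The generation process works as follows: start with random noise in latent space, iteratively denoise it with the U-Net under the text conditioning from CLIP, then decode the final latent into an image with the VAE decoder.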
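The loop described above can be sketched in plain Python. This is a toy illustration of the control flow only: `toy_predictor` is a hypothetical stand-in for the U-Net, the update rule is a simple subtraction rather than a real scheduler step, and the guidance formula is the classifier-free guidance combination that the `guidance_scale` parameter typically controls. None of these names come from the Diffusers API.

```python
import random


def guided_noise(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the text-conditioned one."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


def toy_denoise(latents, predict_noise, num_steps=30, guidance_scale=7.5, step_size=0.1):
    """Skeleton of the sampling loop. `predict_noise` stands in for the U-Net;
    the subtraction below is a toy update, not a real scheduler step."""
    for _ in range(num_steps):
        noise_uncond = predict_noise(latents, None)      # empty-prompt branch
        noise_text = predict_noise(latents, "prompt")    # text-conditioned branch
        noise = guided_noise(noise_uncond, noise_text, guidance_scale)
        latents = [x - step_size * n for x, n in zip(latents, noise)]
    return latents


def toy_predictor(latents, conditioning):
    """Hypothetical stand-in for the U-Net: treats any deviation from a
    target value as noise. The target depends on whether text is supplied."""
    target = 1.0 if conditioning is not None else 0.0
    return [x - target for x in latents]


random.seed(0)
start = [random.gauss(0.0, 1.0) for _ in range(4)]  # random noise in "latent space"
result = toy_denoise(start, toy_predictor, num_steps=30, guidance_scale=1.0)
print([round(x, 3) for x in result])
```

With `guidance_scale=1.0` the guided prediction equals the text-conditioned one; larger values push the result further away from the unconditional prediction. Real schedulers apply a more involved, timestep-dependent update in place of the fixed `step_size` used here.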
Text-to-Image Generation
Basic Generation
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch

def create_pipeline(model_id="stabilityai/stable-diffusion-2-1"):
    """Initialize the Stable Diffusion pipeline."""
    pipe = StableDiffusionPipeline.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # Half precision to reduce VRAM use
        safety_checker=None,
    )
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
    pipe = pipe.to("cuda")
    # Enable memory optimizations
    pipe.enable_attention_slicing()
    # pipe.enable_xformers_memory_efficient_attention()  # If xformers installed
    return pipe

pipe = create_pipeline()

# Generate an image
prompt = "A serene mountain landscape at sunset, photorealistic, 8k resolution"
negative_prompt = "blurry, low quality, distorted, deformed"
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
    width=768,
    height=768,
).images[0]
image.save("mountain_sunset.png")
Batch Generation with Different Seeds
import torch
def generate_variations(pipe, prompt, negative_prompt="", num_images=4, seed_start=42):