Complete Guide to DETR: Object Detection with Transformers
DETR (DEtection TRansformer) is an object detection approach developed by Facebook AI Research. It replaces the complex, multi-stage pipelines of traditional detectors with an end-to-end Transformer architecture that predicts the full set of objects directly.
In this tutorial, we'll learn DETR from basic concepts to practical implementation for object detection and segmentation.
Why DETR?
DETR Advantages:
Comparison with Traditional Methods:
| Aspect | Traditional (Faster R-CNN) | DETR |
|--------|---------------------------|------|
| Anchor Boxes | Yes | No |
| NMS Post-processing | Yes | No |
| Hand-crafted Components | Many | Minimal |
| End-to-End Training | No | Yes |
| Global Context | Limited | Full |
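To make the "NMS Post-processing" row concrete, here is a minimal sketch of the greedy non-maximum suppression step that anchor-based detectors typically run after prediction, and that DETR does not need. The function names `iou` and `greedy_nms` and the 0.5 threshold are illustrative assumptions, not code from any particular detector (torchvision ships its own `torchvision.ops.nms`, which is not used here).

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept, visiting boxes from highest score down."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate boxes and one separate box (hypothetical values).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)
```

The second box overlaps heavily with the first, so only the higher-scoring one survives. The IoU threshold is exactly the kind of hand-tuned parameter the table refers to.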
DETR Architecture
Main Components:
Input Image → CNN Backbone → Transformer Encoder → Transformer Decoder → FFN → Predictions
                   ↓                  ↓                     ↓
                Features         Positional          Object Queries
                                  Encoding              (Learned)
1. CNN Backbone
# ResNet-50 or ResNet-101 as feature extractor
# Input: Image (3, H, W)
# Output: Feature map (2048, H/32, W/32)
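As a quick sanity check on these shapes, the sketch below computes the feature-map size and the resulting sequence length for a given input resolution. The helper name `backbone_feature_shape` is hypothetical, and rounding up for sizes not divisible by 32 is an assumption based on how padded strided convolutions usually behave. The stride of 32 and the 2048 channels come from the description above.

```python
import math


def backbone_feature_shape(height, width, stride=32, channels=2048):
    """Approximate (C, H', W') of the last ResNet stage for an (H, W) image.

    Assumption: spatial sizes round up when not divisible by the stride.
    """
    return channels, math.ceil(height / stride), math.ceil(width / stride)


# Hypothetical input resolution of 800 x 1066 pixels.
c, fh, fw = backbone_feature_shape(800, 1066)
seq_len = fh * fw  # number of tokens after flattening the spatial grid
print((c, fh, fw), seq_len)
```

The sequence length matters because the encoder's self-attention cost grows quadratically with it.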
2. Transformer Encoder
# Processes flattened feature map with self-attention
# Adds positional encoding
# Output: Encoded features with global context
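The flattening and positional-encoding step can be sketched in plain Python. This is a simplified 2D sinusoidal encoding in the spirit of the one DETR uses, not the repository's implementation: the function names, the zero-based positions, and `d_model=256` are assumptions for illustration.

```python
import math


def sine_encoding_1d(pos, dim, temperature=10000.0):
    """Encode a scalar position as `dim` interleaved sin/cos values (dim even)."""
    enc = []
    for i in range(0, dim, 2):
        freq = temperature ** (i / dim)
        enc.append(math.sin(pos / freq))
        enc.append(math.cos(pos / freq))
    return enc


def flatten_with_positions(feat_h, feat_w, d_model=256):
    """Return one positional vector per feature-map cell, in row-major order.

    Half of the channels encode the row (y), the other half the column (x).
    """
    half = d_model // 2
    return [
        sine_encoding_1d(y, half) + sine_encoding_1d(x, half)
        for y in range(feat_h)
        for x in range(feat_w)
    ]


pos = flatten_with_positions(25, 34)
print(len(pos), len(pos[0]))  # sequence length, channels per token
```

Because flattening discards the 2D layout, these vectors are what let self-attention tell apart two cells with identical content at different image locations.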
3. Transformer Decoder
# Uses learned object queries (default: 100)
# Cross-attention with encoded features
# Self-attention between object queries
# Output: 100 embeddings for predictions
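The cross-attention step can be illustrated with a stripped-down, single-head version in plain Python. This is only a sketch under strong simplifications: it has no learned projections, no multi-head split, and no stacked layers, the random vectors stand in for learned queries and encoder output, and all names and sizes other than the 100 queries are assumptions.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, memory):
    """Each query attends over all memory vectors (memory acts as keys and values)."""
    d = len(memory[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every memory token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in memory]
        weights = softmax(scores)
        # Weighted average of the memory vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, memory)) for j in range(d)])
    return outputs


random.seed(0)
d_model, num_queries, seq_len = 8, 100, 20  # tiny d_model and seq_len for the demo
object_queries = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(num_queries)]
memory = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(seq_len)]

decoded = cross_attention(object_queries, memory)
print(len(decoded), len(decoded[0]))  # one embedding per object query
```

The number of outputs is fixed by the number of queries, not by the image size, which is why DETR always emits the same number of candidate predictions.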
4. Prediction Heads
# FFN for class prediction
# FFN for bounding box prediction (center_x, center_y, width, height)
# Special class: "no object" for queries that don't detect objects
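To show how these head outputs turn into detections, here is a minimal post-processing sketch in plain Python. It assumes the "no object" class is the last logit, that boxes are normalized to [0, 1] relative to the image size, and a confidence threshold of 0.7. The function names and the toy logits are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def box_cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (center_x, center_y, width, height) box to pixel corners."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * img_w,
        (cy - h / 2) * img_h,
        (cx + w / 2) * img_w,
        (cy + h / 2) * img_h,
    )


def postprocess(logits, boxes, img_w, img_h, threshold=0.7):
    """Keep queries whose best real-class probability exceeds the threshold."""
    detections = []
    for query_logits, box in zip(logits, boxes):
        probs = softmax(query_logits)[:-1]  # drop the trailing "no object" class
        score = max(probs)
        if score > threshold:
            label = probs.index(score)
            detections.append((label, score, box_cxcywh_to_xyxy(box, img_w, img_h)))
    return detections


# Two queries, three real classes plus "no object" (toy values).
logits = [[0.1, 5.0, 0.2, 0.0], [0.0, 0.1, 0.2, 6.0]]
boxes = [(0.5, 0.5, 0.2, 0.4), (0.3, 0.3, 0.1, 0.1)]
dets = postprocess(logits, boxes, img_w=200, img_h=100)
print(dets)
```

The second query puts almost all of its probability on "no object", so it is discarded without any overlap-based filtering.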
Installation
Requirements
# Clone repository
git clone https://github.com/facebookresearch/detr.git
cd detr
# Install dependencies
conda create -n detr python=3.8
conda activate detr
# PyTorch and Torchvision
conda install pytorch torchvision cudatoolkit=11.3 -c pytorch
# Other dependencies
conda install cython scipy
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
# For panoptic segmentation (optional)
pip install git+https://github.com/cocodataset/panopticapi.git
Verify Installation
import torch
import torchvision
print(f"PyTorch version: {torch.__version__}")
print(f"Torchvision version: {torchvision.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
Quick Start: Inference with Pretrained Model
1. Load Model via PyTorch Hub
import torch
from PIL import Image
import requests
import matplotlib.pyplot as plt
# Load pretrained DETR model (ResNet-50 backbone) from PyTorch Hub
model = torch.hub.load('facebookresearch/detr:main', 'detr_resnet50', pretrained=True)
model.eval()  # switch to inference mode
# COCO classes
CLASSES = [
'N/A', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A',
'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse',
'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack',
'umbrella', 'N/A', 'N/A', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis',
'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove',
'skateboard', 'surfboard', 'tennis racket', 'bottle', 'N/A', 'wine glass',