Complete Guide to DETR: Object Detection with Transformers


By Ruby Abdullah · tutorial
Tags: Python, DETR, Object Detection, Transformer, Computer Vision


DETR (DEtection TRansformer) is an approach to object detection developed by Facebook AI Research that treats detection as a direct set prediction problem. It replaces complex traditional pipelines with an end-to-end Transformer architecture.

In this tutorial, we'll learn DETR from basic concepts to practical implementation for object detection and segmentation.

Why DETR?

DETR Advantages:

  • End-to-End: No need for Non-Maximum Suppression (NMS) or anchor boxes
  • Simplicity: Inference can be written in roughly 50 lines of PyTorch code
  • Performance: Matches a Faster R-CNN baseline while using about half the computation
  • Versatile: Supports object detection and panoptic segmentation
  • Global Reasoning: Attention lets the model reason about relationships between objects across the whole image
  • Parallel Prediction: All objects are predicted in a single parallel pass rather than one at a time
Comparison with Traditional Methods:

| Aspect | Traditional (Faster R-CNN) | DETR |
|--------|---------------------------|------|
| Anchor Boxes | Yes | No |
| NMS Post-processing | Yes | No |
| Hand-crafted Components | Many | Minimal |
| End-to-End Training | No | Yes |
| Global Context | Limited | Full |

DETR Architecture

Main Components:

    Input Image → CNN Backbone → Transformer Encoder → Transformer Decoder → FFN → Predictions
                        ↓                 ↓                     ↓
                    Features         Positional          Object Queries
                                      Encoding              (Learned)

1. CNN Backbone

    # ResNet-50 or ResNet-101 as feature extractor
    # Input: Image (3, H, W)
    # Output: Feature map (2048, H/32, W/32)
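To make the shape arithmetic concrete, the following minimal sketch computes the size of the final feature map and the number of feature vectors handed to the encoder. The helper `backbone_output_shape` is a hypothetical name used only for illustration, and rounding up for image sizes not divisible by 32 is an assumption based on how padded stride-2 layers behave in a standard ResNet.

```python
import math

def backbone_output_shape(height, width, stride=32, channels=2048):
    """Return (C, H', W') of the final ResNet feature map for an input image."""
    # Assumption: each stride-2 stage rounds up, so the result is ceil(size / 32).
    return channels, math.ceil(height / stride), math.ceil(width / stride)

c, h, w = backbone_output_shape(800, 1200)
print((c, h, w))  # (2048, 25, 38)
print(h * w)      # 950 feature vectors passed on to the encoder
```

For an 800×1200 image this gives 950 positions, which is the sequence length the encoder's self-attention has to process.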

2. Transformer Encoder

    # Processes flattened feature map with self-attention
    # Adds positional encoding
    # Output: Encoded features with global context
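As a rough illustration of the encoder's input, the sketch below flattens a tiny (C, H, W) feature map into a sequence and builds a fixed 2D sine/cosine positional encoding for it. It uses plain Python lists so that it runs without dependencies. The function names and toy sizes are assumptions for illustration, not the reference implementation, which also projects the 2048 channels down to a smaller hidden size before the encoder.

```python
import math

def flatten_feature_map(features):
    """Turn a (C, H, W) nested list into a list of H*W vectors of length C."""
    channels, height, width = len(features), len(features[0]), len(features[0][0])
    return [[features[c][y][x] for c in range(channels)]
            for y in range(height) for x in range(width)]

def sine_encoding(position, dim, temperature=10000.0):
    """1D encoding of one position: sine on even channels, cosine on odd ones."""
    encoding = []
    for i in range(dim):
        angle = position / temperature ** (2 * (i // 2) / dim)
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def position_encoding_2d(height, width, d_model):
    """Half of the channels encode the row (y), the other half the column (x)."""
    half = d_model // 2
    return [sine_encoding(y, half) + sine_encoding(x, half)
            for y in range(height) for x in range(width)]

# Toy feature map with C=4, H=2, W=3; each value encodes its own (c, y, x) index.
features = [[[100 * c + 10 * y + x for x in range(3)] for y in range(2)]
            for c in range(4)]

sequence = flatten_feature_map(features)           # 6 vectors of length 4
positions = position_encoding_2d(2, 3, d_model=4)  # same layout as `sequence`

# Simplification: add the encoding once to the input. The reference code adds
# it to the attention queries and keys in every encoder layer instead.
encoder_input = [[f + p for f, p in zip(feat, pos)]
                 for feat, pos in zip(sequence, positions)]

print(len(sequence), len(sequence[0]))  # 6 4
print(sequence[4])                      # [11, 111, 211, 311] -> row 1, column 1
```

Without the positional encoding, the flattened sequence would carry no information about where each vector came from in the image, since attention itself is order-agnostic.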

3. Transformer Decoder

    # Uses learned object queries (default: 100)
    # Cross-attention with encoded features
    # Self-attention between object queries
    # Output: 100 embeddings for predictions
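The two attention steps in the decoder can be sketched with single-head scaled dot-product attention. This is a simplified illustration under stated assumptions: it omits the learned projections, residual connections, layer normalization, and feed-forward layers of a real decoder layer, uses random vectors in place of learned object queries and encoder outputs, and all names are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def random_vectors(count, dim, rng):
    """Random stand-ins for learned or computed embeddings."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]

rng = random.Random(0)
d_model = 8
object_queries = random_vectors(100, d_model, rng)  # 100 queries, DETR's default
memory = random_vectors(6, d_model, rng)            # stand-in for encoder output

# Self-attention: every query attends to all object queries.
queries = attention(object_queries, object_queries, object_queries)
# Cross-attention: every query attends to all encoded image positions.
embeddings = attention(queries, memory, memory)

print(len(embeddings), len(embeddings[0]))  # 100 8
```

Each query yields its own embedding in the same pass, so all 100 candidate predictions are produced in parallel.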

4. Prediction Heads

    # FFN for class prediction
    # FFN for bounding box prediction (center_x, center_y, width, height)
    # Special class: "no object" for queries that don't detect objects
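A minimal post-processing sketch shows how the outputs of the two heads turn into detections. It assumes, as in the reference implementation, that boxes are normalized to [0, 1] relative to the image size and that "no object" is the last class index. The function names, the toy logits, and the 0.7 score threshold are illustrative choices.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def box_cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (center_x, center_y, width, height) box to pixel corners."""
    cx, cy, w, h = box
    return ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h)

def decode_predictions(class_logits, boxes, img_w, img_h, threshold=0.7):
    """Keep the queries whose best real class scores above the threshold."""
    detections = []
    for logits, box in zip(class_logits, boxes):
        probs = softmax(logits)[:-1]  # drop the trailing "no object" class
        score = max(probs)
        if score > threshold:
            detections.append((probs.index(score), score,
                               box_cxcywh_to_xyxy(box, img_w, img_h)))
    return detections

# Two toy queries over 3 classes plus "no object" (last position).
class_logits = [[0.0, 5.0, 0.0, 0.0],  # confident about class 1
                [0.0, 0.0, 0.0, 5.0]]  # confident there is no object
boxes = [(0.5, 0.5, 0.25, 0.5), (0.1, 0.1, 0.05, 0.05)]

for label, score, corners in decode_predictions(class_logits, boxes, 640, 480):
    print(label, round(score, 2), corners)  # 1 0.98 (240.0, 120.0, 400.0, 360.0)
```

Filtering by score is the only post-processing step in this sketch; no NMS pass is applied afterwards.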

Installation

Requirements

    # Clone repository
    git clone https://github.com/facebookresearch/detr.git
    cd detr

    # Install dependencies
    conda create -n detr python=3.8
    conda activate detr

    # PyTorch and Torchvision
    conda install pytorch torchvision cudatoolkit=11.3 -c pytorch

    # Other dependencies
    conda install cython scipy
    pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

    # For panoptic segmentation (optional)
    pip install git+https://github.com/cocodataset/panopticapi.git

Verify Installation

    import torch
    import torchvision

    print(f"PyTorch version: {torch.__version__}")
    print(f"Torchvision version: {torchvision.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")

Quick Start: Inference with Pretrained Model

1. Load Model via PyTorch Hub

    import torch
    from PIL import Image
    import requests
    import matplotlib.pyplot as plt

    # Load pretrained DETR model
    model = torch.hub.load('facebookresearch/detr:main', 'detr_resnet50', pretrained=True)
    model.eval()

    # COCO classes
    CLASSES = [
        'N/A', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
        'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A',
        'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse',
        'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack',
        'umbrella', 'N/A', 'N/A', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis',
        'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove',
        'skateboard', 'surfboard', 'tennis racket', 'bottle', 'N/A', 'wine glass',
