Segment Anything (SAM): A Comprehensive Tutorial

Table of Contents

Introduction

Prerequisites

Understanding SAM Architecture

Installation and Setup

Point-based Prompting

Box-based Prompting

Text-based Prompting with Grounding

Automatic Mask Generation

Integration with Other Models

Introduction

The Segment Anything Model (SAM), developed by Meta AI, is a foundation model for image segmentation. Its defining capability is zero-shot segmentation: given a prompt, it can segment objects in images it has never seen, without being trained on specific object classes. SAM was trained on the SA-1B dataset, which contains over 1 billion masks from 11 million images, making it one of the most versatile segmentation models available.

This tutorial provides a comprehensive guide to using SAM, from basic prompting to advanced integration, fine-tuning, and production deployment.

Prerequisites

pip install segment-anything
pip install torch torchvision
pip install opencv-python numpy matplotlib
pip install Pillow
pip install onnxruntime  # For ONNX-based deployment

System requirements:

Python 3.8 or higher
GPU with at least 8 GB VRAM (for ViT-H model; smaller models need less)
CUDA 11.7 or higher
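
The requirements above can be checked from Python before loading a model. The following is a minimal sketch, not an official setup script: the helper names `python_ok` and `cuda_status` are illustrative, and the check only reports what it finds instead of enforcing the CUDA version.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 8)  # minimum version taken from the requirements list


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the interpreter meets the minimum (major, minor) version."""
    return tuple(version_info[:2]) >= minimum


def cuda_status():
    """Return 'cuda', 'cpu', or 'torch-missing' without failing if torch is absent."""
    if importlib.util.find_spec("torch") is None:
        return "torch-missing"
    import torch  # imported lazily so the check also runs before installation

    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Python OK:", python_ok())
    print("Device status:", cuda_status())
```

If the status is `cpu`, SAM still runs, but the image encoder will be noticeably slower than on a GPU.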

Download model checkpoints:

# ViT-H (default, highest quality) - 2.4 GB
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

# ViT-L (large) - 1.2 GB
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_l_0b3195.pth

# ViT-B (base, smallest) - 375 MB
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth
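
A script that loads SAM needs to know which checkpoint file belongs to which model type. The sketch below shows one way to keep that mapping in a single place. `CHECKPOINTS`, `checkpoint_url`, and `find_checkpoint` are hypothetical helpers written for this tutorial and are not part of the `segment_anything` package; the keys `vit_h`, `vit_l`, and `vit_b` are the model-type names used by the library's model registry.

```python
from pathlib import Path

BASE_URL = "https://dl.fbaipublicfiles.com/segment_anything"

# Model type -> official checkpoint file name
CHECKPOINTS = {
    "vit_h": "sam_vit_h_4b8939.pth",
    "vit_l": "sam_vit_l_0b3195.pth",
    "vit_b": "sam_vit_b_01ec64.pth",
}


def checkpoint_url(model_type):
    """Return the download URL for a model type (raises KeyError if unknown)."""
    return f"{BASE_URL}/{CHECKPOINTS[model_type]}"


def find_checkpoint(model_type, directory="."):
    """Return the local checkpoint path, or None if it has not been downloaded."""
    path = Path(directory) / CHECKPOINTS[model_type]
    return path if path.is_file() else None


if __name__ == "__main__":
    for model_type in CHECKPOINTS:
        local = find_checkpoint(model_type)
        status = str(local) if local else "missing, download from " + checkpoint_url(model_type)
        print(f"{model_type}: {status}")
```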

Understanding SAM Architecture

SAM consists of three components:

Image Encoder: A Vision Transformer (ViT) that produces image embeddings. The image is processed once, and the embeddings are reused for all prompts. This is the most computationally expensive step.

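Prompt Encoder: Encodes the user-provided prompts (points, boxes, masks, or text). Points and boxes are represented as positional encodings combined with learned embeddings for each prompt type; masks are embedded using convolutions. Text prompts are explored in the SAM paper, but the released code accepts only points, boxes, and masks.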

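Mask Decoder: A lightweight transformer decoder that combines image embeddings with prompt embeddings to produce segmentation masks. For an ambiguous prompt, such as a single point, it outputs three candidate masks, each with a predicted quality (IoU) score, so the caller can pick the most suitable one.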
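
Because the decoder returns several candidate masks with scores, a common follow-up step is to keep the highest-scoring one. The sketch below illustrates that selection on synthetic arrays shaped like the output of `SamPredictor.predict(..., multimask_output=True)`, which returns masks of shape (K, H, W) and scores of shape (K,). The helper `select_best_mask` is an illustrative name, not a library function, and picking by score alone is only one possible policy.

```python
import numpy as np


def select_best_mask(masks, scores):
    """Return (mask, score, index) for the candidate with the highest score.

    masks:  array of shape (K, H, W), one binary mask per candidate
    scores: array of shape (K,), the model's predicted quality per candidate
    """
    masks = np.asarray(masks)
    scores = np.asarray(scores)
    if masks.shape[0] != scores.shape[0]:
        raise ValueError("masks and scores must have the same number of candidates")
    best = int(np.argmax(scores))
    return masks[best], float(scores[best]), best


# Synthetic stand-ins for three candidate masks on a 4x4 image
masks = np.zeros((3, 4, 4), dtype=bool)
masks[0, 1:2, 1:2] = True  # smallest candidate: 1 pixel
masks[1, 1:3, 1:3] = True  # middle candidate: 4 pixels
masks[2, :, :] = True      # largest candidate: 16 pixels
scores = np.array([0.71, 0.93, 0.88])

best_mask, best_score, best_idx = select_best_mask(masks, scores)
print(best_idx, best_score, int(best_mask.sum()))  # 1 0.93 4
```

In practice the three candidates often correspond to nested interpretations of the prompt (a part, a sub-object, the whole object), so an application may prefer a fixed index or a size heuristic over the raw score.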

The key insight of SAM's architecture is the decoupled design: the heavy image encoder runs once, while the lightweight prompt encoder and mask decoder can run many times for different prompts on the same image.
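
The decoupled design translates into a simple usage pattern: call `set_image` once per image, then call `predict` once per prompt. The sketch below shows that pattern. `segment_with_prompts` is an illustrative helper, and `FakePredictor` is a hypothetical stand-in that mimics only the two `SamPredictor` methods used here so the example runs without a checkpoint; it counts calls instead of running a network.

```python
import numpy as np


def segment_with_prompts(predictor, image, point_prompts):
    """Encode the image once, then decode one set of masks per point prompt.

    predictor:     object with set_image(image) and predict(...), e.g. SamPredictor
    image:         HxWx3 uint8 RGB array
    point_prompts: list of (points, labels); points are [[x, y], ...] in pixels,
                   labels are 1 for foreground and 0 for background
    """
    predictor.set_image(image)  # expensive step: image embedding computed once
    results = []
    for points, labels in point_prompts:
        masks, scores, _ = predictor.predict(  # cheap step: repeated per prompt
            point_coords=np.asarray(points, dtype=np.float32),
            point_labels=np.asarray(labels, dtype=np.int64),
            multimask_output=True,
        )
        results.append((masks, scores))
    return results


class FakePredictor:
    """Hypothetical stand-in for SamPredictor that only counts calls."""

    def __init__(self):
        self.encoder_calls = 0
        self.decoder_calls = 0
        self._hw = None

    def set_image(self, image):
        self.encoder_calls += 1
        self._hw = image.shape[:2]

    def predict(self, point_coords=None, point_labels=None, multimask_output=True):
        self.decoder_calls += 1
        k = 3 if multimask_output else 1
        h, w = self._hw
        # Same return structure as the real call: masks, scores, low-res logits
        return (np.zeros((k, h, w), dtype=bool),
                np.zeros(k, dtype=np.float32),
                np.zeros((k, 256, 256), dtype=np.float32))


fake = FakePredictor()
image = np.zeros((32, 48, 3), dtype=np.uint8)
prompts = [([[10, 12]], [1]), ([[10, 12], [40, 5]], [1, 0])]
results = segment_with_prompts(fake, image, prompts)
print(fake.encoder_calls, fake.decoder_calls)  # 1 2
```

With the real library, a `SamPredictor` built from a loaded SAM model would take the place of `FakePredictor`; the helper itself would not need to change.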

Installation and Setup

import torch
import numpy as np
import cv2
import matplotlib.pyplot as plt
from segment_anything import sam_model_registry, SamPredictor, SamAutomaticMaskGenerator


def load_sam_model(checkpoint_path, model_type="vit_h", device="cuda"):
    """
    Load the SAM model.
    model_type options: 'vit_h', 'vit_l', 'vit_b'
    """
    # The registry maps each model type to a builder that loads the checkpoint
    sam = sam_model_registry[model_type](checkpoint=checkpoint_path)
    sam.to(device=device)
    return sam

def display_mask(mask, ax, random_color=False):
    """Utility function to display a segmentation mask."""
    if random_color:
        # Random RGB color with a fixed alpha of 0.6
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    else:
        color = np.array([30 / 255, 144 / 255, 255 / 255, 0.6])
    h, w = mask.shape[-2:]
    # Broadcast the (h, w, 1) mask against the (1, 1, 4) RGBA color
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)


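def display_points(coords, labels, ax, marker_size=375):
    """Display prompt points on the image."""
    pos_points = coords[labels == 1]
    neg_points = coords[labels == 0]
    # Foreground points are drawn in green, background points in red
    ax.scatter(pos_points[:, 0], pos_points[:, 1], color="green",
               marker="*", s=marker_size, edgecolor="white", linewidth=1.25)
    ax.scatter(neg_points[:, 0], neg_points[:, 1], color="red",
               marker="*", s=marker_size, edgecolor="white", linewidth=1.25)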
