Florence-2: Microsoft's Multi-Task Vision Foundation Model

By Ruby Abdullah · tutorial
Florence-2 · Computer Vision · Microsoft · Object Detection · Python

Table of Contents

  • Introduction
  • Prerequisites
  • Installation
  • Loading Florence-2 Models
  • Understanding Task Prompts
  • Image Captioning
  • Object Detection
  • OCR (Optical Character Recognition)
  • Region Proposal and Segmentation
  • Visual Grounding and Referring Expression
  • Batch Processing
  • Fine-tuning on Custom Dataset
  • Integration with Supervision for Visualization
  • Practical Project: Document Analysis Pipeline
  • Best Practices
  • Conclusion

    Introduction

    Florence-2 is a vision foundation model from Microsoft designed to handle a wide variety of computer vision tasks with a single unified model. Unlike traditional models trained for one specific task, Florence-2 uses a sequence-to-sequence approach that enables captioning, object detection, OCR, segmentation, visual grounding, and more simply by changing the text prompt.

    Key advantages of Florence-2:

    • Multi-task in one model: A single model for captioning, detection, OCR, segmentation, and more
    • Prompt-based: Simply change the text prompt to switch between tasks
    • Lightweight and efficient: Available in base (232M parameters) and large (771M parameters) variants
    • Zero-shot capability: Can be used directly without fine-tuning for many tasks
    • Fine-tunable: Can be fine-tuned on custom datasets for improved performance

    In this tutorial, you will learn to use Florence-2 for various vision tasks, perform batch processing, fine-tune on custom datasets, and build a practical document analysis pipeline.
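
    To make the prompt-based design concrete, the sketch below maps a few task names to the task tokens documented on the Florence-2 model card. The `TASK_PROMPTS` table and the `build_prompt` helper are illustrative names, not part of any library; only the token strings come from the model itself.

```python
# Task tokens as documented on the Florence-2 model card.
# TASK_PROMPTS and build_prompt are hypothetical helper names for illustration.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
    "region_proposal": "<REGION_PROPOSAL>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "referring_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}

# Tasks that take extra free text after the task token.
TASKS_WITH_TEXT_INPUT = {"phrase_grounding", "referring_segmentation"}


def build_prompt(task: str, text_input: str = "") -> str:
    """Return the prompt string for a task, appending text input when the task needs it."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"Unknown task: {task!r}")
    token = TASK_PROMPTS[task]
    if task in TASKS_WITH_TEXT_INPUT:
        if not text_input:
            raise ValueError(f"Task {task!r} requires text_input")
        return token + text_input
    return token


print(build_prompt("object_detection"))               # <OD>
print(build_prompt("phrase_grounding", "a red car"))  # <CAPTION_TO_PHRASE_GROUNDING>a red car
```

    Keeping the tokens in one table means the rest of a pipeline can switch tasks by name while the model, processor, and generation code stay unchanged.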

    Prerequisites

    • Python 3.9 or higher
    • GPU with at least 8GB VRAM (for base model) or 16GB VRAM (for large model)
    • Basic understanding of PyTorch and Hugging Face Transformers
    • Familiarity with basic computer vision concepts (bounding boxes, segmentation, etc.)
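
    As a rough sanity check on the VRAM figures, the sketch below estimates how much memory the weights alone occupy, using the parameter counts quoted in the introduction. `weight_memory_gb` is a hypothetical helper; it ignores activations, generation buffers, and framework overhead, so real usage will be higher, and fine-tuning higher still.

```python
# Hypothetical helper: estimates memory for model weights only.
# Activations, generation buffers, and CUDA overhead are NOT included.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params: int, dtype: str = "float16") -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in [("base", 232_000_000), ("large", 771_000_000)]:
    fp16 = weight_memory_gb(params, "float16")
    fp32 = weight_memory_gb(params, "float32")
    print(f"Florence-2-{name}: ~{fp16:.2f} GB in float16, ~{fp32:.2f} GB in float32")
```

    The gap between these numbers and the recommended VRAM is headroom for inference at higher image resolutions, batching, and training.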

    Installation

    # Install main dependencies
    pip install torch torchvision torchaudio

    # Install transformers and Florence-2 dependencies
    pip install transformers einops timm flash_attn

    # Install additional libraries for image processing
    pip install Pillow opencv-python matplotlib

    # Install supervision for visualization
    pip install supervision

    # Install libraries for fine-tuning
    pip install datasets peft accelerate

    Verify the installation:

    import torch
    import transformers

    print(f"PyTorch version: {torch.__version__}")
    print(f"Transformers version: {transformers.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")

    # Report GPU details only when a CUDA device is present
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
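
    If one of the imports fails, it helps to see which packages are missing before debugging further. The sketch below uses only the standard library to check whether each module can be located, without importing it. `REQUIRED_MODULES` and `check_modules` are illustrative names; note that some import names differ from the pip package names (`PIL` for Pillow, `cv2` for opencv-python).

```python
from importlib.util import find_spec

# Import names (not pip names) for the packages installed earlier.
REQUIRED_MODULES = ["torch", "transformers", "einops", "timm", "PIL", "cv2", "supervision"]


def check_modules(module_names):
    """Map each top-level module name to True if it can be located, else False."""
    return {name: find_spec(name) is not None for name in module_names}


for name, found in check_modules(REQUIRED_MODULES).items():
    print(f"{name:<14} {'OK' if found else 'MISSING'}")
```

    Any module reported as MISSING can be installed with the matching `pip install` command from the installation step.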

    Loading Florence-2 Models

    Florence-2 comes in two size variants. Choose based on your needs and hardware capacity.

    Florence-2-base

    from transformers import AutoProcessor, AutoModelForCausalLM
    import torch

    # Load base model (232M parameters)
    model_id = "microsoft/Florence-2-base"

    # trust_remote_code=True is required because Florence-2 ships its own modeling code
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        trust_remote_code=True
    ).to("cuda")

    print("Florence-2-base loaded successfully!")

    Florence-2-large

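    # Load large model (771M parameters) for higher accuracy
    model_id = "microsoft/Florence-2-large"

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    # Same loading arguments as the base variant
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        trust_remote_code=True
    ).to("cuda")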
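
    Because both checkpoints load the same way, the choice of variant can be reduced to a single argument. The following is a minimal sketch, not an official API: `MODEL_IDS`, `resolve_model_id`, and `load_florence2` are hypothetical names, and the fallback to CPU with float32 is an assumption for machines without CUDA. The heavy imports live inside the loader so the ID lookup works even where PyTorch is not installed.

```python
# Hugging Face model IDs for the two Florence-2 variants.
MODEL_IDS = {
    "base": "microsoft/Florence-2-base",
    "large": "microsoft/Florence-2-large",
}


def resolve_model_id(variant: str) -> str:
    """Return the model ID for 'base' or 'large'."""
    try:
        return MODEL_IDS[variant]
    except KeyError:
        raise ValueError(
            f"variant must be one of {sorted(MODEL_IDS)}, got {variant!r}"
        ) from None


def load_florence2(variant: str = "base"):
    """Load processor and model; downloads weights on first call."""
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = resolve_model_id(variant)
    use_cuda = torch.cuda.is_available()
    device = "cuda" if use_cuda else "cpu"
    # Assumption: half precision on GPU, full precision on CPU.
    dtype = torch.float16 if use_cuda else torch.float32

    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=dtype, trust_remote_code=True
    ).to(device)
    return processor, model, device


print(resolve_model_id("large"))  # microsoft/Florence-2-large
```

    With this in place, `processor, model, device = load_florence2("large")` would be the only line to change when switching variants.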
