Florence-2 is a vision foundation model from Microsoft designed to handle a wide variety of computer vision tasks with a single unified model. Unlike traditional models trained for one specific task, Florence-2 uses a sequence-to-sequence approach: captioning, object detection, OCR, segmentation, visual grounding, and more are all selected just by changing the text prompt.

Key advantages of Florence-2:

Multi-task in one model: A single model for captioning, detection, OCR, segmentation, and more
Prompt-based: Simply change the text prompt to switch between tasks
Lightweight and efficient: Available in base (232M parameters) and large (771M parameters) variants
Zero-shot capability: Can be used directly without fine-tuning for many tasks
Fine-tunable: Can be fine-tuned on custom datasets for improved performance
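
To make the prompt-based design concrete, here is a minimal sketch of how task switching comes down to choosing a different prompt string. The task tokens are the ones listed on the Florence-2 model card; the friendly task names and the build_prompt helper are hypothetical names introduced only for this illustration.

```python
# Task tokens as listed on the Florence-2 model card.
# The dictionary keys are hypothetical, human-friendly aliases.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "referring_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}

# Tasks that expect free text appended after the task token.
TASKS_WITH_TEXT_INPUT = {"phrase_grounding", "referring_segmentation"}


def build_prompt(task, text_input=None):
    """Return the prompt string for a task alias (illustrative helper)."""
    if task not in TASK_PROMPTS:
        raise KeyError(f"Unknown task: {task}")
    token = TASK_PROMPTS[task]
    if task in TASKS_WITH_TEXT_INPUT:
        if not text_input:
            raise ValueError(f"Task '{task}' needs a text input")
        # The extra text is concatenated directly after the task token.
        return token + text_input
    return token


if __name__ == "__main__":
    print(build_prompt("caption"))
    print(build_prompt("phrase_grounding", "a green car"))
```

The same model weights serve every entry in the table; only the string passed to the processor changes.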

In this tutorial, you will learn to use Florence-2 for various vision tasks, perform batch processing, fine-tune on custom datasets, and build a practical document analysis pipeline.

Prerequisites

Python 3.9 or higher
GPU with at least 8GB VRAM (for base model) or 16GB VRAM (for large model)
Basic understanding of PyTorch and Hugging Face Transformers
Familiarity with basic computer vision concepts (bounding boxes, segmentation, etc.)
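
To put the VRAM figures in context, the sketch below estimates how much memory the model weights alone occupy at a given precision. It is back-of-the-envelope arithmetic using the parameter counts quoted above; weight_memory_gb is a hypothetical helper, and the estimate deliberately ignores activations, generation buffers, and optimizer state.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gb(num_params, dtype="float16"):
    """Estimate the memory used by the weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Parameter counts as quoted in this tutorial.
    for name, params in [("base", 232_000_000), ("large", 771_000_000)]:
        print(f"Florence-2-{name}: {weight_memory_gb(params):.2f} GB in float16")
```

The weights themselves are small (roughly 0.46 GB and 1.54 GB in float16), so most of the recommended VRAM is presumably headroom for image activations, beam search, batching, and fine-tuning.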

Installation

# Install main dependencies
pip install torch torchvision torchaudio

# Install transformers and Florence-2 dependencies
# (flash_attn requires a CUDA-capable environment)
pip install transformers einops timm flash_attn

# Install additional libraries for image processing
pip install Pillow opencv-python matplotlib

# Install supervision for visualization
pip install supervision

# Install libraries for fine-tuning
pip install datasets peft accelerate

Verify the installation:

import torch
import transformers

# Report library versions and GPU details
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
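
As a complement to the version check, the following sketch reports which of the installed packages cannot be found, without importing them. It uses only the standard library; the module list mirrors the pip commands from the installation step (import names, not pip names), and missing_modules is a hypothetical helper.

```python
import importlib.util

# Import names for the packages installed in this tutorial.
# Note that Pillow imports as PIL and opencv-python as cv2.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio",
    "transformers", "einops", "timm", "flash_attn",
    "PIL", "cv2", "matplotlib",
    "supervision",
    "datasets", "peft", "accelerate",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

Because find_spec only locates a module rather than importing it, the check stays fast even for heavy packages such as torch.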

Loading Florence-2 Models

Florence-2 comes in two size variants. Choose based on your needs and hardware capacity.
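
One way to encode that choice is a small helper that picks a checkpoint from the available VRAM. This is only a sketch: the 16 GB threshold is taken from the prerequisites in this tutorial rather than from any official guidance, and choose_model_id is a hypothetical function name.

```python
BASE_MODEL_ID = "microsoft/Florence-2-base"
LARGE_MODEL_ID = "microsoft/Florence-2-large"


def choose_model_id(vram_gb, large_min_vram_gb=16.0):
    """Pick the large checkpoint when enough VRAM is available, else the base one."""
    if vram_gb >= large_min_vram_gb:
        return LARGE_MODEL_ID
    return BASE_MODEL_ID


if __name__ == "__main__":
    for vram in (8, 12, 16, 24):
        print(f"{vram} GB -> {choose_model_id(vram)}")
```

In practice the vram_gb argument could come from torch.cuda.get_device_properties(0).total_memory / 1e9, as in the verification step.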

Florence-2-base

from transformers import AutoProcessor, AutoModelForCausalLM
import torch

# Load base model (232M parameters)
model_id = "microsoft/Florence-2-base"

# trust_remote_code=True is needed because Florence-2 ships its own modeling code
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load the weights in half precision and move the model to the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True
).to("cuda")

print("Florence-2-base loaded successfully!")
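
Once the model and processor are loaded, a quick smoke test helps confirm that everything works end to end. The sketch below follows the inference pattern shown on the Florence-2 model card (generate, decode, then post_process_generation); run_task is a hypothetical wrapper, and the generation settings are the model card defaults rather than tuned values.

```python
def run_task(model, processor, image, task_prompt, text_input=None, max_new_tokens=1024):
    """Run one Florence-2 task on a PIL image and return the parsed result."""
    # Some tasks take extra text appended directly after the task token.
    prompt = task_prompt if text_input is None else task_prompt + text_input

    # Move inputs to the model's device; floating tensors are cast to its dtype.
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(
        model.device, model.dtype
    )

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=max_new_tokens,
        do_sample=False,
        num_beams=3,
    )

    # Keep special tokens: the post-processor parses location tokens from them.
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    return processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height),
    )
```

For example, after opening an image with PIL (Image.open(path).convert("RGB"), where path is any local file), calling run_task(model, processor, image, "<CAPTION>") should return a dictionary keyed by the task token.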

Florence-2-large

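# Load large model (771M parameters) for higher accuracy
model_id = "microsoft/Florence-2-large"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# The source is cut off after the opening parenthesis; the arguments below
# are assumed to mirror the base model example
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True
).to("cuda")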

Florence-2: Microsoft's Multi-Task Vision Foundation Model
