Florence-2: Microsoft's Multi-Task Vision Foundation Model
Introduction
Florence-2 is a vision foundation model from Microsoft designed to handle a wide variety of computer vision tasks with a single unified model. Unlike traditional models trained for one specific task, Florence-2 uses a sequence-to-sequence approach: captioning, object detection, OCR, segmentation, visual grounding, and other tasks are selected simply by changing the text prompt.
Key advantages of Florence-2:
- Multi-task in one model: A single model for captioning, detection, OCR, segmentation, and more
- Prompt-based: Simply change the text prompt to switch between tasks
- Lightweight and efficient: Available in base (232M parameters) and large (771M parameters) variants
- Zero-shot capability: Can be used directly without fine-tuning for many tasks
- Fine-tunable: Can be fine-tuned on custom datasets for improved performance
In this tutorial, you will learn to use Florence-2 for various vision tasks, perform batch processing, fine-tune on custom datasets, and build a practical document analysis pipeline.
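To make the prompt-based switching concrete, the sketch below maps a few task names to the task tokens listed on the Florence-2 model card (`<CAPTION>`, `<OD>`, `<OCR>`, and so on). The `TASK_PROMPTS` table and the `build_prompt` helper are illustrative names, not part of any library. Tasks such as phrase grounding take extra text appended directly after the token.

```python
# Task tokens as listed on the Florence-2 model card; the dictionary keys
# and the helper below are hypothetical names chosen for this sketch.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "referring_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}


def build_prompt(task: str, text_input: str = "") -> str:
    """Return the prompt string for a task, appending optional extra text."""
    if task not in TASK_PROMPTS:
        raise KeyError(f"Unknown task: {task!r}")
    return TASK_PROMPTS[task] + text_input


print(build_prompt("object_detection"))
print(build_prompt("phrase_grounding", "a red car"))
```

The same model and processor are used for every entry; only the string returned by `build_prompt` changes.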
Prerequisites
- Python 3.9 or higher
- GPU with at least 8GB VRAM (for base model) or 16GB VRAM (for large model)
- Basic understanding of PyTorch and Hugging Face Transformers
- Familiarity with basic computer vision concepts (bounding boxes, segmentation, etc.)
Installation
# Install main dependencies
pip install torch torchvision torchaudio
# Install transformers and Florence-2 dependencies
pip install transformers einops timm flash_attn
# Install additional libraries for image processing
pip install Pillow opencv-python matplotlib
# Install supervision for visualization
pip install supervision
# Install libraries for fine-tuning
pip install datasets peft accelerate
Verify the installation:
import torch
import transformers
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
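The check above covers only PyTorch and Transformers. As a complementary sketch, the remaining packages can be probed without importing them, using the standard library's `importlib.util.find_spec`. The `REQUIRED_MODULES` list and the `missing_modules` helper are hypothetical names for this example. Note that import names can differ from pip names (Pillow imports as `PIL`, opencv-python as `cv2`).

```python
import importlib.util

# Import names (not pip names) for the packages installed above.
# Pillow imports as PIL and opencv-python as cv2.
REQUIRED_MODULES = [
    "einops", "timm", "flash_attn", "PIL", "cv2",
    "matplotlib", "supervision", "datasets", "peft", "accelerate",
]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```

An empty result only shows that the modules are discoverable; it does not confirm that each one imports cleanly against the installed CUDA and PyTorch versions.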
Loading Florence-2 Models
Florence-2 comes in two size variants. Choose based on your needs and hardware capacity.
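One way to make that choice programmatically is sketched below. The thresholds are the 8 GB and 16 GB figures from the prerequisites. The `pick_variant` function is a hypothetical helper, and real memory needs also depend on batch size, image resolution, and generation settings.

```python
from typing import Optional

# VRAM guidance taken from the prerequisites: 8 GB for base, 16 GB for large.
BASE_MIN_GB = 8.0
LARGE_MIN_GB = 16.0


def pick_variant(vram_gb: float) -> Optional[str]:
    """Return a Florence-2 model ID that fits the given VRAM, or None."""
    if vram_gb >= LARGE_MIN_GB:
        return "microsoft/Florence-2-large"
    if vram_gb >= BASE_MIN_GB:
        return "microsoft/Florence-2-base"
    return None  # below the tutorial's stated minimum


for vram in (6, 12, 24):
    print(f"{vram} GB -> {pick_variant(vram)}")
```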
Florence-2-base
from transformers import AutoProcessor, AutoModelForCausalLM
import torch
# Load base model (232M parameters)
model_id = "microsoft/Florence-2-base"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True
).to("cuda")
print("Florence-2-base loaded successfully!")
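Loading in float16 stores each parameter in 2 bytes instead of the 4 used by float32. As a rough back-of-the-envelope check, the sketch below estimates the memory taken by the weights alone, using the parameter counts quoted in this tutorial. `weight_memory_gb` is a hypothetical helper. Activations, generation caches, and the CUDA context add to the real footprint, which is why the VRAM guidance in the prerequisites is much higher than these numbers.

```python
# Bytes used to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gb(num_params: int, dtype: str = "float16") -> float:
    """Estimate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in [("base", 232_000_000), ("large", 771_000_000)]:
    fp16 = weight_memory_gb(params, "float16")
    fp32 = weight_memory_gb(params, "float32")
    print(f"Florence-2-{name}: {fp16:.2f} GB (float16), {fp32:.2f} GB (float32)")
```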
Florence-2-large
# Load large model (771M parameters) for higher accuracy
model_id = "microsoft/Florence-2-large"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(