Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
About
We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform diverse tasks from simple instructions, a capability that requires handling varying levels of spatial hierarchy and semantic granularity. Florence-2 was designed to take text prompts as task instructions and generate the desired results in text form, whether the task is captioning, object detection, grounding, or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B, which consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks show Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.
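To make the idea of producing detection or grounding results "in text form" concrete, the sketch below serializes a bounding box as a short string of location tokens and parses it back. It is only an illustration: the `<loc_k>` token format, the 1000-bin quantization, and the helper names `box_to_tokens` and `tokens_to_box` are assumptions introduced here, not details stated in this text.

```python
import re

NUM_BINS = 1000  # assumption: number of quantization bins per image axis


def box_to_tokens(box, width, height, num_bins=NUM_BINS):
    """Serialize a pixel-space box (x0, y0, x1, y1) as location-token text."""
    sizes = (width, height, width, height)
    tokens = []
    for value, size in zip(box, sizes):
        # Map a coordinate in [0, size] to an integer bin in [0, num_bins - 1].
        bin_index = int(value * num_bins / size)
        bin_index = min(num_bins - 1, max(0, bin_index))
        tokens.append(f"<loc_{bin_index}>")
    return "".join(tokens)


def tokens_to_box(text, width, height, num_bins=NUM_BINS):
    """Parse four location tokens back into an approximate pixel-space box."""
    bins = [int(m) for m in re.findall(r"<loc_(\d+)>", text)]
    if len(bins) != 4:
        raise ValueError("expected exactly four location tokens")
    sizes = (width, height, width, height)
    # Decode each bin to its center, so the round-trip error stays below one bin.
    return tuple((b + 0.5) / num_bins * s for b, s in zip(bins, sizes))


if __name__ == "__main__":
    text = box_to_tokens((64, 48, 320, 240), width=640, height=480)
    print(text)
    print(tokens_to_box(text, width=640, height=480))
```

Under this kind of encoding, a sequence-to-sequence model can emit boxes with the same decoder it uses for captions, which is one way a single text interface can cover both language and localization tasks.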
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 43.4 | 2454 |
| Visual Grounding | RefCOCO+ (val) | Accuracy | 88.3 | 171 |
| Visual Grounding | RefCOCO (testB) | Accuracy | 92 | 125 |
| Visual Grounding | RefCOCO (val) | Accuracy | 93.4 | 119 |
| Visual Grounding | RefCOCO (testA) | Accuracy | 95.3 | 117 |
| Visual Grounding | RefCOCOg (test) | Accuracy | 91.7 | 96 |
| Visual Grounding | RefCOCOg (val) | Accuracy | 91.2 | 93 |
| Handwriting Retrieval | Handwriting In-Domain Set | Accuracy@1 | 65.75 | 30 |
| Handwriting Retrieval | Handwriting Spanish synthetic disjoint fonts (Out-of-Domain (OOD)) | Top-1 Accuracy | 41.13 | 30 |
| Referring Expression Comprehension | RefCOCO | Precision@0.5 | 93.4 | 12 |