Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

About

We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a diverse range of tasks with simple instructions, a capability that implies handling the complexity of varied spatial hierarchies and semantic granularities. Florence-2 was designed to take text prompts as task instructions and generate the desired results in text form, whether the task is captioning, object detection, grounding or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B, which consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.
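
The abstract states that every result, including object detection, is returned as text. As a rough illustration only, the sketch below shows how a detection result encoded as text might be decoded back into pixel-space boxes. The `<loc_N>` token format, the 1000-bin coordinate quantization, and the names `Box` and `parse_detections` are assumptions made for this example; none of them are specified on this page.

```python
import re
from typing import List, NamedTuple


class Box(NamedTuple):
    """A labeled bounding box in pixel coordinates (hypothetical helper type)."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


# Assumed text format: a label followed by four quantized location tokens,
# e.g. "car<loc_100><loc_200><loc_500><loc_800>".
_DETECTION = re.compile(r"([^<>]+)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>")


def parse_detections(text: str, width: int, height: int, bins: int = 1000) -> List[Box]:
    """Decode label/location-token text into pixel boxes.

    Each token index is assumed to be a coordinate quantized into `bins`
    equal steps along the image width (x) or height (y).
    """
    boxes = []
    for label, *coords in _DETECTION.findall(text):
        x1, y1, x2, y2 = (int(c) for c in coords)
        boxes.append(
            Box(
                label.strip(),
                x1 * width / bins,
                y1 * height / bins,
                x2 * width / bins,
                y2 * height / bins,
            )
        )
    return boxes


if __name__ == "__main__":
    sample = "car<loc_100><loc_200><loc_500><loc_800>"
    for box in parse_detections(sample, width=640, height=480):
        print(box)
```

Because the output is plain text, the same decoding step could in principle serve grounding or region-level tasks by changing only the prompt, which is the appeal of a unified text representation.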

Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 43.4 | 2454 |
| Visual Grounding | RefCOCO+ (val) | Accuracy | 88.3 | 171 |
| Visual Grounding | RefCOCO (testB) | Accuracy | 92 | 125 |
| Visual Grounding | RefCOCO (val) | Accuracy | 93.4 | 119 |
| Visual Grounding | RefCOCO (testA) | Accuracy | 95.3 | 117 |
| Visual Grounding | RefCOCOg (test) | Accuracy | 91.7 | 96 |
| Visual Grounding | RefCOCOg (val) | Accuracy | 91.2 | 93 |
| Handwriting Retrieval | Handwriting In-Domain Set | Accuracy@1 | 65.75 | 30 |
| Handwriting Retrieval | Handwriting Spanish synthetic disjoint fonts (Out-of-Domain (OOD)) | Top-1 Accuracy | 41.13 | 30 |
| Referring Expression Comprehension | RefCOCO | Precision@0.5 | 93.4 | 12 |
Showing 10 of 26 rows
