UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface
About
Generalist models have achieved remarkable success in both language and vision-language tasks, showcasing the potential of unified modeling. However, effectively integrating fine-grained perception tasks like detection and segmentation into these models remains a significant challenge. This is primarily because these tasks often rely heavily on task-specific designs and architectures that can complicate the modeling process. To address this challenge, we present UFO, a framework that Unifies Fine-grained visual perception tasks through an Open-ended language interface. By transforming all perception targets into the language space, UFO unifies object-level detection, pixel-level segmentation, and image-level vision-language tasks into a single model. Additionally, we introduce a novel embedding retrieval approach that relies solely on the language interface to support segmentation tasks. Our framework bridges the gap between fine-grained perception and vision-language tasks, significantly simplifying architectural design and training strategies while achieving comparable or superior performance to methods with intricate task-specific designs. After multi-task training on five standard visual perception datasets, UFO outperforms the previous state-of-the-art generalist models by 12.3 mAP on COCO instance segmentation and 3.3 mIoU on ADE20K semantic segmentation. Furthermore, our method seamlessly integrates with existing MLLMs, effectively combining fine-grained perception capabilities with their advanced language abilities, thereby enabling more challenging tasks such as reasoning segmentation. Code and models are available at https://github.com/nnnth/UFO.
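The abstract does not spell out how embedding retrieval produces a mask, so the following is only a minimal sketch of the general idea under stated assumptions: a single query embedding (standing in for whatever the model emits through its language interface for a mask) is compared with every spatial image feature by dot product, and positions whose similarity exceeds a threshold form the mask. The names `retrieve_mask`, `query`, `features`, and `threshold`, as well as the toy values, are hypothetical and are not taken from the UFO code base.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def retrieve_mask(
    query: Vector,
    features: List[List[Vector]],
    threshold: float = 0.0,
) -> List[List[int]]:
    """Mark each spatial position whose feature is similar to the query.

    `query` is an assumed stand-in for a mask embedding produced by the
    model; `features` is an H x W grid of image feature vectors.
    A position is set to 1 when its dot product with the query exceeds
    `threshold`, and to 0 otherwise.
    """
    return [
        [1 if dot(query, feat) > threshold else 0 for feat in row]
        for row in features
    ]


# Toy 2 x 3 feature grid with 2-dimensional features (made-up values).
features = [
    [[1.0, 0.0], [0.5, 0.5], [-1.0, 0.0]],
    [[0.0, -1.0], [2.0, 1.0], [-0.5, -0.5]],
]
query = [1.0, 0.5]
mask = retrieve_mask(query, features)
# mask == [[1, 1, 0], [0, 1, 0]]
```

In this sketch the mask is obtained purely by similarity lookup, with no separate mask decoder, which is the property the abstract attributes to the approach; the actual model may differ in how the query embeddings are produced, how many are used, and how the similarities are turned into a mask.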
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP 48.9 | 2454 |
| Semantic Segmentation | ADE20K | mIoU 54.5 | 936 |
| Object Detection | COCO (val) | mAP 48.9 | 613 |
| Referring Expression Comprehension | RefCOCO+ (val) | -- | 345 |
| Referring Expression Comprehension | RefCOCO (val) | -- | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | -- | 333 |
| Referring Expression Comprehension | RefCOCOg (val) | -- | 291 |
| Referring Expression Comprehension | RefCOCOg (test) | -- | 291 |
| Referring Expression Comprehension | RefCOCO+ (testB) | -- | 235 |
| Referring Expression Segmentation | RefCOCO (testA) | cIoU 79.4 | 217 |