Osprey: Pixel Understanding with Visual Instruction Tuning
About
Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs primarily focus on image-level or box-level understanding, falling short of fine-grained vision-language alignment at the pixel level. Moreover, the lack of mask-based instruction data limits their advancement. In this paper, we propose Osprey, a mask-text instruction tuning approach that extends MLLMs by incorporating fine-grained mask regions into language instructions, aiming at pixel-wise visual understanding. To achieve this goal, we first meticulously curate a mask-based region-text dataset with 724K samples, and then design a vision-language model that injects pixel-level representations into the LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high-resolution inputs. Experimental results demonstrate Osprey's superiority in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning. In particular, Osprey can be seamlessly integrated with the Segment Anything Model (SAM) to obtain multi-granularity semantics. The source code, dataset and demo can be found at https://github.com/CircleRadon/Osprey.
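The core idea behind the mask-aware visual extractor is to pool vision-encoder features inside a binary mask so that each region yields its own embedding for the LLM. The snippet below is a minimal NumPy sketch of that mask-pooling step under simplified assumptions (a single feature map already at mask resolution); the function name `mask_pooling` and the shapes are illustrative, not the repository's actual API.

```python
import numpy as np

def mask_pooling(feature_map, mask):
    """Average visual features inside a binary mask region.

    feature_map: (C, H, W) array of vision-encoder features.
    mask: (H, W) binary array marking the region of interest.
    Returns a (C,) region embedding (zeros if the mask is empty).
    """
    m = mask.astype(bool)
    if not m.any():
        return np.zeros(feature_map.shape[0])
    # Select the feature vectors of masked spatial positions, then average.
    return feature_map[:, m].mean(axis=1)

# Toy example: a 4-channel 8x8 feature map and a mask over the top-left quadrant.
feats = np.arange(4 * 8 * 8, dtype=float).reshape(4, 8, 8)
mask = np.zeros((8, 8))
mask[:4, :4] = 1
emb = mask_pooling(feats, mask)
print(emb.shape)  # (4,)
```

In the full model, pooled region features like this are projected and interleaved with text tokens in the instruction, which is what lets the LLM answer questions about an arbitrary mask rather than a whole image or box.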
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination | POPE (Random) | F1 Score | 88.97 | 200 |
| Object Hallucination | POPE Adversarial | Accuracy | 85.33 | 196 |
| Object Hallucination | POPE Popular | F1 Score | 87.5 | 188 |
| Panoptic Segmentation | ADE20K 150 59 (val) | Panoptic Quality (PQ) | 41.89 | 35 |
| Referring expression generation | RefCOCOg (val) | METEOR | 16.6 | 31 |
| Instance Segmentation | ADE20K 150 59 (val) | AP | 41.24 | 30 |
| Region-level captioning | RefCOCOg (test) | CIDEr | 108.3 | 18 |
| Semantic segmentation | Cityscapes 11 (val) | mIoU | 49.78 | 16 |
| Region Captioning | VideoRefer-D (test) | Average Score | 2.41 | 16 |
| Region Captioning | RefCOCOg (val) | METEOR | 16.6 | 14 |