Osprey: Pixel Understanding with Visual Instruction Tuning

About

Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs primarily focus on image-level or box-level understanding and fall short of fine-grained vision-language alignment at the pixel level. In addition, the lack of mask-based instruction data limits their advancement. In this paper, we propose Osprey, a mask-text instruction tuning approach that extends MLLMs by incorporating fine-grained mask regions into language instructions, with the aim of achieving pixel-wise visual understanding. To achieve this goal, we first meticulously curate a mask-based region-text dataset with 724K samples, and then design a vision-language model by injecting pixel-level representations into the LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high-resolution input. Experimental results demonstrate Osprey's superiority in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning. In particular, Osprey can be integrated seamlessly with the Segment Anything Model (SAM) to obtain multi-granularity semantics. The source code, dataset and demo can be found at https://github.com/CircleRadon/Osprey.
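
The abstract does not spell out how the mask-aware visual extractor works. As a rough illustration of the general idea of turning a feature map plus a binary mask into a single region feature, the sketch below uses masked average pooling, a common technique that is assumed here and not confirmed as Osprey's actual method. The function name `masked_average_pool` and the toy inputs are hypothetical.

```python
def masked_average_pool(features, mask):
    """Average the feature vectors of all positions inside a binary mask.

    features: H x W x C nested lists (a toy stand-in for an encoder feature map).
    mask:     H x W nested lists of 0/1 marking the region of interest.
    Returns a list of C floats. This is an illustrative assumption, not
    the extractor described in the paper.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for feat_vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += feat_vec[c]
    if count == 0:
        # Empty mask: return the all-zero vector instead of dividing by zero.
        return total
    return [value / count for value in total]


# Toy 2 x 2 feature map with 2 channels, and a mask selecting the diagonal.
features = [
    [[1.0, 10.0], [2.0, 20.0]],
    [[3.0, 30.0], [4.0, 40.0]],
]
mask = [
    [1, 0],
    [0, 1],
]
region_feature = masked_average_pool(features, mask)
```

In a real model the pooled vector would typically be projected into the language model's embedding space and inserted alongside the text tokens; that step is omitted here because the abstract gives no details about it.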

Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, Jianke Zhu • 2023

Related benchmarks

Task                             Dataset              Metric                 Result  Rank
Object Hallucination             POPE (Random)        F1 Score               88.97   200
Object Hallucination             POPE Adversarial     Accuracy               85.33   196
Object Hallucination             POPE Popular         F1 Score               87.5    188
Panoptic Segmentation            ADE20K 150 59 (val)  Panoptic Quality (PQ)  41.89   35
Referring expression generation  RefCOCOg (val)       METEOR                 16.6    31
Instance Segmentation            ADE20K 150 59 (val)  AP                     41.24   30
Region-level captioning          RefCOCOg (test)      CIDEr                  108.3   18
Semantic segmentation            Cityscapes 11 (val)  mIoU                   49.78   16
Region Captioning                VideoRefer-D (test)  Average Score          2.41    16
Region Captioning                RefCOCOg (val)       METEOR                 16.6    14
Showing 10 of 24 rows
