
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs

About

Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities in various multi-modal tasks. Nevertheless, their performance in fine-grained image understanding tasks is still limited. To address this issue, this paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs. Specifically, we present a new method for constructing the instruction-tuning dataset at low cost by leveraging annotations in existing datasets. A self-consistent bootstrapping method is also introduced to extend existing dense object annotations into high-quality referring-expression–bounding-box pairs. These methods enable the generation of high-quality instruction data covering a wide range of fundamental abilities essential for fine-grained image perception. Moreover, we argue that the visual encoder should be tuned during instruction tuning to mitigate the gap between full-image perception and fine-grained image perception. Experimental results demonstrate the superior performance of our method. For instance, our model exhibits a 5.2% accuracy improvement over Qwen-VL on GQA and surpasses the accuracy of Kosmos-2 by 24.7% on RefCOCO_val. We have also attained the top rank on the leaderboard of MMBench. This promising performance is achieved by training on only publicly available data, making it easily reproducible. The models, datasets, and code are publicly available at https://github.com/SY-Xuan/Pink.
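The self-consistent bootstrapping idea described above can be illustrated with a small sketch. This is not the authors' released code: `describe` and `ground` are hypothetical stand-ins for the MLLM's two roles (caption a box, then ground the caption back to a box), and a pair is kept only when the round-trip box agrees with the source annotation by IoU.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def bootstrap_pairs(image, boxes, describe, ground, iou_threshold=0.5):
    """Keep (expression, box) pairs whose grounded box agrees with the source box.

    `describe(image, box) -> str` and `ground(image, expr) -> box` are
    hypothetical model calls; the filter is the self-consistency check.
    """
    kept = []
    for box in boxes:
        expr = describe(image, box)        # propose a referring expression
        regressed = ground(image, expr)    # ground it back to a box
        if iou(box, regressed) >= iou_threshold:
            kept.append((expr, box))       # round trip succeeded: keep pair
    return kept
```

The threshold of 0.5 here is an illustrative choice, not a value taken from the paper; the key point is that only annotations the model can both describe and re-localize survive into the training set.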

Shiyu Xuan, Qingpei Guo, Ming Yang, Shiliang Zhang • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Visual Question Answering | GQA | – | 963 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy: 81.4 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy: 88.3 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy: 91.7 | 333 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy: 83.7 | 291 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy: 83.7 | 291 |
| Visual Question Answering | OKVQA | Top-1 Accuracy: 59.5 | 283 |
| Referring Expression Comprehension | RefCOCO+ (test-A) | Accuracy: 87.5 | 172 |
| Referring Expression Comprehension | RefCOCO+ (test-B) | Accuracy: 73.7 | 167 |
| Referring Expression Comprehension | RefCOCO (test-B) | Accuracy: 84.0 | 160 |

Showing 10 of 12 rows.
