CoLLaVO: Crayon Large Language and Vision mOdel
About
The remarkable success of Large Language Models (LLMs) and instruction tuning is driving the evolution of Vision Language Models (VLMs) toward versatile general-purpose models. Yet it remains underexplored whether current VLMs genuinely possess object-level image understanding — the ability to answer questions such as 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt, a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present Dual QLoRA, a learning strategy that preserves object-level image understanding so it is not forgotten during visual instruction tuning, thereby achieving a significant leap on numerous VL benchmarks in a zero-shot setting.
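The core idea behind Dual QLoRA — two low-rank adapter sets over a frozen backbone, where the adapter trained for object-level understanding is frozen during the later visual instruction tuning stage — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation; the class and method names (`DualLoRALinear`, `apply_grads`) and the plain-NumPy formulation are assumptions for clarity.

```python
import numpy as np

class DualLoRALinear:
    """Illustrative sketch: a linear layer with a frozen base weight and
    two low-rank (LoRA-style) adapters. Adapter 1 stands in for the
    object-level ("crayon") adapter; adapter 2 for the instruction-tuning
    adapter. Names and structure are assumptions, not CoLLaVO's code."""

    def __init__(self, d_in, d_out, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen backbone weight
        # Adapter 1: trained first (object-level stage), frozen afterwards.
        self.A1 = rng.standard_normal((rank, d_in)) * 0.02
        self.B1 = np.zeros((d_out, rank))  # B starts at zero, as in LoRA
        # Adapter 2: trained during visual instruction tuning.
        self.A2 = rng.standard_normal((rank, d_in)) * 0.02
        self.B2 = np.zeros((d_out, rank))
        self.adapter1_frozen = False

    def forward(self, x):
        # Effective weight = frozen base + both low-rank updates.
        W_eff = self.W + self.B1 @ self.A1 + self.B2 @ self.A2
        return x @ W_eff.T

    def apply_grads(self, gA1, gB1, gA2, gB2, lr=1e-2):
        # In the second stage adapter 1 is frozen, so its gradients are
        # discarded and the object-level knowledge it encodes is preserved.
        if not self.adapter1_frozen:
            self.A1 -= lr * gA1
            self.B1 -= lr * gB1
        self.A2 -= lr * gA2
        self.B2 -= lr * gB2
```

Freezing one adapter while the other continues to train is what lets the second tuning stage add instruction-following ability without overwriting the first stage's object-level understanding.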
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 64.2 | 1117 |
| Visual Question Answering | GQA | Accuracy | 61.4 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 87.2 | 935 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 40.3 | 418 |
| Mathematical Reasoning | MathVista | Score | 57.6 | 322 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 40.3 | 281 |
| Multi-discipline Multimodal Understanding | MMMU | -- | -- | 266 |
| Science Question Answering | ScienceQA IMG | Accuracy | 80.7 | 256 |
| Visual Grounding | RefCOCO+ (val) | Accuracy | 80.87 | 171 |
| Visual Grounding | RefCOCO+ (testB) | Accuracy | 73.2 | 169 |