
CoLLaVO: Crayon Large Language and Vision mOdel

About

The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting.

Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro • 2024

Related benchmarks

Task                                       Dataset           Metric        Result  Rank
Object Hallucination Evaluation            POPE              Accuracy      87.2    2019
Visual Question Answering                  TextVQA           Accuracy      64.2    1453
Visual Question Answering                  GQA               Accuracy      61.4    1425
Multimodal Understanding                   MM-Vet            MM-Vet Score  40.3    631
Multimodal Reasoning                       MM-Vet            MM-Vet Score  40.3    517
Mathematical Reasoning                     MathVista         Score         57.6    474
Multi-discipline Multimodal Understanding  MMMU              --            --      363
Science Question Answering                 ScienceQA IMG     Accuracy      80.7    335
Visual Grounding                           RefCOCO+ (val)    Accuracy      80.87   253
Visual Grounding                           RefCOCO+ (testA)  Accuracy      86.36   245
Showing 10 of 27 rows
