LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

About

Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However, most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge. To address this issue, we devise a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge in two levels. 1) Progressive incorporation of fine-grained spatial-aware visual knowledge. We design a vision aggregator cooperated with region-level vision-language (VL) tasks to incorporate fine-grained spatial-aware visual knowledge into the MLLM. To alleviate the conflict between image-level and region-level VL tasks during incorporation, we devise a dedicated stage-wise instruction-tuning strategy with mixture-of-adapters. This progressive incorporation scheme contributes to the mutual promotion between these two kinds of VL tasks. 2) Soft prompting of high-level semantic visual evidence. We facilitate the MLLM with high-level semantic visual evidence by leveraging diverse image tags. To mitigate the potential influence caused by imperfect predicted tags, we propose a soft prompting method by embedding a learnable token into the tailored text instruction. Comprehensive experiments on several multi-modal benchmarks demonstrate the superiority of our model (e.g., improvement of 5% accuracy on VSR and 3% CIDEr on TextCaps over InstructBLIP, 5% accuracy on RefCOCOg over Kosmos-2).

Gongwei Chen, Leyang Shen, Rui Shao, Xiang Deng, Liqiang Nie• 2023

Related benchmarks

Task	Dataset	Result
Visual Question Answering	GQA	--	1425
Object Hallucination	POPE Popular	F1 Score85.94	372
Referring Expression Comprehension	RefCOCO+ (val)	Accuracy83.95	354
Object Hallucination	POPE Adversarial	Accuracy85.37	353
Referring Expression Comprehension	RefCOCO (val)	Accuracy89.8	348
Referring Expression Comprehension	RefCOCO (testA)	Accuracy0.9302	346
Object Hallucination	POPE (Random)	F1 Score88.33	324
Referring Expression Comprehension	RefCOCOg (test)	Accuracy85.74	300
Referring Expression Comprehension	RefCOCOg (val)	Accuracy85.69	300
Visual Question Answering	OKVQA	Top-1 Accuracy57.33	283

Showing 10 of 24 rows

Other info

Code

Follow for update

@wizwand_team Discord