
LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

About

Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However, most existing MLLMs adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge. To address this issue, we devise a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels. 1) Progressive incorporation of fine-grained spatial-aware visual knowledge. We design a vision aggregator that cooperates with region-level vision-language (VL) tasks to incorporate fine-grained spatial-aware visual knowledge into the MLLM. To alleviate the conflict between image-level and region-level VL tasks during incorporation, we devise a dedicated stage-wise instruction-tuning strategy with mixture-of-adapters. This progressive incorporation scheme lets the two kinds of VL tasks promote each other. 2) Soft prompting of high-level semantic visual evidence. We equip the MLLM with high-level semantic visual evidence by leveraging diverse image tags. To mitigate the potential influence of imperfectly predicted tags, we propose a soft prompting method that embeds a learnable token into the tailored text instruction. Comprehensive experiments on several multi-modal benchmarks demonstrate the superiority of our model (e.g., improvements of 5% accuracy on VSR and 3% CIDEr on TextCaps over InstructBLIP, and 5% accuracy on RefCOCOg over Kosmos-2).

Gongwei Chen, Leyang Shen, Rui Shao, Xiang Deng, Liqiang Nie • 2023
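The first enhancement, the mixture-of-adapters used during stage-wise instruction tuning, can be pictured as LoRA-style adapters blended by a task-conditioned router, so image-level and region-level VL tasks each get their own low-rank branch. The sketch below is a minimal PyTorch illustration under that assumption; the adapter rank, router design, and class names are ours, not LION's exact implementation, and the per-stage freezing schedule of stage-wise tuning is omitted.

```python
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    """Low-rank residual branch: x -> up(down(x)) * (alpha / rank)."""
    def __init__(self, d_in: int, d_out: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)
        self.up = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.up.weight)  # adapters start as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x)) * self.scale

class MixtureOfAdapters(nn.Module):
    """A frozen linear layer plus per-task adapters mixed by a soft router.
    Here task_id 0 stands for image-level VL tasks, 1 for region-level ones
    (an illustrative convention, not the paper's)."""
    def __init__(self, base: nn.Linear, num_tasks: int = 2, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # backbone stays frozen
            p.requires_grad_(False)
        self.adapters = nn.ModuleList(
            LoRAAdapter(base.in_features, base.out_features, rank)
            for _ in range(num_tasks)
        )
        # Router producing mixing weights, conditioned on the task id.
        self.router = nn.Embedding(num_tasks, num_tasks)

    def forward(self, x: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.router(task_id), dim=-1)          # (B, K)
        delta = torch.stack([a(x) for a in self.adapters], -1)   # (B, T, D, K)
        return self.base(x) + (delta * w[:, None, None, :]).sum(-1)

# Example: route two image-level and two region-level samples.
layer = MixtureOfAdapters(nn.Linear(768, 768))
x = torch.randn(4, 32, 768)
task_id = torch.tensor([0, 0, 1, 1])
print(layer(x, task_id).shape)  # torch.Size([4, 32, 768])
```

Because only the adapters and router carry gradients, each stage of instruction tuning can specialize one branch without overwriting what the other branch learned, which is the intuition behind using a mixture rather than a single shared adapter.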
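The second enhancement, soft prompting of tag-based visual evidence, amounts to splicing a trainable embedding in with the tokenized image tags so the model can learn how much to trust that (possibly noisy) evidence. Again a hedged sketch: the splice position, the tokenizer interface, and all names below are assumptions rather than the paper's published code, and the tags themselves are assumed to come from an external tagger.

```python
import torch
import torch.nn as nn

class SoftTagPrompt(nn.Module):
    """Concatenates a learnable soft token between the instruction and the
    predicted image tags; gradients adapt it during instruction tuning so
    unreliable tags can be down-weighted."""
    def __init__(self, embed: nn.Embedding):
        super().__init__()
        self.embed = embed  # the LLM's token embedding table
        self.soft_token = nn.Parameter(
            torch.randn(1, 1, embed.embedding_dim) * 0.02
        )

    def forward(self, instr_ids: torch.Tensor, tag_ids: torch.Tensor) -> torch.Tensor:
        instr = self.embed(instr_ids)  # (B, L_i, D) tokenized instruction
        tags = self.embed(tag_ids)     # (B, L_t, D) tokenized tag list
        soft = self.soft_token.expand(instr.size(0), -1, -1)
        # Soft token sits between instruction and tag evidence.
        return torch.cat([instr, soft, tags], dim=1)  # (B, L_i + 1 + L_t, D)

# Example with a toy vocabulary; real use would share the LLM's embeddings.
embed = nn.Embedding(32000, 768)
prompt = SoftTagPrompt(embed)
instr_ids = torch.randint(0, 32000, (2, 16))  # e.g. "According to <tags>, ..."
tag_ids = torch.randint(0, 32000, (2, 5))     # e.g. "cat, sofa, window"
print(prompt(instr_ids, tag_ids).shape)       # torch.Size([2, 22, 768])
```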

Related benchmarks

Task | Dataset | Metric | Result | Rank
Visual Question Answering | GQA | -- | -- | 1249
Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 83.95 | 354
Referring Expression Comprehension | RefCOCO (val) | Accuracy | 89.80 | 344
Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 93.02 | 342
Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 85.74 | 300
Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 85.69 | 300
Object Hallucination | POPE (Adversarial) | Accuracy | 85.37 | 288
Object Hallucination | POPE (Random) | F1 Score | 88.33 | 285
Visual Question Answering | OKVQA | Top-1 Accuracy | 57.33 | 283
Object Hallucination | POPE (Popular) | F1 Score | 85.94 | 273

Showing 10 of 24 rows.

Other info

Code
