
LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

About

Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However, most existing MLLMs adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction of, and reasoning over, visual knowledge. To address this issue, we devise a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge at two levels. 1) Progressive incorporation of fine-grained spatial-aware visual knowledge. We design a vision aggregator that cooperates with region-level vision-language (VL) tasks to incorporate fine-grained spatial-aware visual knowledge into the MLLM. To alleviate the conflict between image-level and region-level VL tasks during incorporation, we devise a dedicated stage-wise instruction-tuning strategy with mixture-of-adapters. This progressive incorporation scheme contributes to the mutual promotion between these two kinds of VL tasks. 2) Soft prompting of high-level semantic visual evidence. We provide the MLLM with high-level semantic visual evidence by leveraging diverse image tags. To mitigate the potential influence of imperfectly predicted tags, we propose a soft prompting method that embeds a learnable token into the tailored text instruction. Comprehensive experiments on several multi-modal benchmarks demonstrate the superiority of our model (e.g., improvements of 5% accuracy on VSR and 3% CIDEr on TextCaps over InstructBLIP, 5% accuracy on RefCOCOg over Kosmos-2).
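
The soft prompting idea in point 2 can be illustrated with a small sketch. This is not the authors' implementation: the placeholder name `<soft>`, the toy instruction, the embedding size, and the helper functions are all hypothetical, and the gradient is a stand-in. It only shows the general pattern of reserving one position in the instruction for a trainable vector while the word embeddings stay fixed.

```python
import random

EMBED_DIM = 4
SOFT_TOKEN = "<soft>"  # hypothetical placeholder marking the learnable position


def build_frozen_embeddings(tokens, dim, rng):
    """Toy stand-in for a frozen word-embedding table."""
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for tok in tokens}


def embed_instruction(tokens, frozen, soft_prompt):
    """Embed an instruction; the SOFT_TOKEN slot uses the learnable vector."""
    return [soft_prompt if tok == SOFT_TOKEN else frozen[tok] for tok in tokens]


def sgd_update(vector, grad, lr):
    """In-place gradient step applied to the soft prompt only."""
    for i, g in enumerate(grad):
        vector[i] -= lr * g


rng = random.Random(0)
# Toy instruction that passes predicted image tags to the model (illustrative only).
instruction = ["use", "the", SOFT_TOKEN, "tags", ":", "dog", "grass"]
frozen = build_frozen_embeddings(
    [t for t in instruction if t != SOFT_TOKEN], EMBED_DIM, rng
)
soft_prompt = [0.0] * EMBED_DIM  # the only trainable vector in this sketch
sequence = embed_instruction(instruction, frozen, soft_prompt)
```

In a real model the gradient would come from the training loss. Because only `soft_prompt` is updated, the model can learn how much weight to give the tag text without changing the meaning of the other instruction words.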

Gongwei Chen, Leyang Shen, Rui Shao, Xiang Deng, Liqiang Nie • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | GQA | -- | 963 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy 83.95 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy 89.8 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy 0.9302 | 333 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy 85.74 | 291 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy 85.69 | 291 |
| Visual Question Answering | OKVQA | Top-1 Accuracy 57.33 | 283 |
| Referring Expression Comprehension | RefCOCO+ (testA) | Accuracy 89.22 | 207 |
| Object Hallucination | POPE (Random) | F1 Score 88.33 | 200 |
| Object Hallucination | POPE Adversarial | Accuracy 85.37 | 196 |
Showing 10 of 24 rows
