
Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment

About

Evaluating and rethinking the current landscape of Large Multimodal Models (LMMs), we observe that widely used visual-language projection approaches (e.g., Q-Former or MLP) focus on the alignment of image-text descriptions yet ignore alignment along the visual knowledge dimension, i.e., connecting visuals to their relevant knowledge. Visual knowledge plays a significant role in analyzing, inferring, and interpreting information from visuals, helping improve the accuracy of answers to knowledge-based visual questions. In this paper, we explore improving LMMs with visual-language knowledge alignment, aimed especially at challenging knowledge-based visual question answering (VQA). To this end, we present a Cognitive Visual-Language Mapper (CVLM), which contains a pretrained Visual Knowledge Aligner (VKA) and a Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction tuning stage. Specifically, we design the VKA based on the interaction between a small language model and a visual encoder, training it on collected image-knowledge pairs to achieve visual knowledge acquisition and projection. The FKA is employed to distill the fine-grained visual knowledge of an image and inject it into Large Language Models (LLMs). We conduct extensive experiments on knowledge-based VQA benchmarks, and the results show that CVLM significantly improves the performance of LMMs on knowledge-based VQA (an average gain of 5.0%). Ablation studies also verify the effectiveness of the VKA and FKA, respectively. The code is available at https://github.com/HITsz-TMG/Cognitive-Visual-Language-Mapper
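The abstract's VKA/FKA dataflow can be sketched roughly as follows: the VKA lets language-model query states attend over visual-encoder features to produce knowledge embeddings, and the FKA-style step injects those embeddings into the LLM's input sequence. This is a minimal illustrative sketch, not the paper's implementation; all names, dimensions, and the single-head attention design are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features, wq, wk, wv):
    # Single-head cross-attention: query states attend over visual features.
    q = queries @ wq
    k = features @ wk
    v = features @ wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v

rng = np.random.default_rng(0)
d = 32  # hidden size (illustrative)

# Visual encoder output: 16 patch features (dimensions are assumptions).
patches = rng.standard_normal((16, d))
# Small-LM hidden states acting as knowledge queries inside the VKA.
lm_queries = rng.standard_normal((4, d))
wq, wk, wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

# VKA: align visuals with knowledge via LM-visual interaction.
knowledge_emb = cross_attention(lm_queries, patches, wq, wk, wv)

# FKA-style injection: prepend distilled knowledge embeddings to the
# LLM's text-token embeddings so the LLM can condition on them.
text_tokens = rng.standard_normal((8, d))
llm_input = np.concatenate([knowledge_emb, text_tokens], axis=0)
print(llm_input.shape)  # (12, 32): 4 knowledge slots + 8 text tokens
```

In practice the real modules would be trained (the VKA on image-knowledge pairs, the FKA during instruction tuning); the sketch only shows how the pieces connect.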

Yunxin Li, Xinyu Chen, Baotian Hu, Haoyuan Shi, Min Zhang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 80.88 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 50.48 | 1117 |
| Object Hallucination Evaluation | POPE | -- | -- | 935 |
| Visual Question Answering | OK-VQA | Accuracy | 58.91 | 224 |
| Multimodal Reasoning | MMBench | Accuracy | 59.78 | 50 |
| Multi-choice Visual Question Answering | A-OKVQA | Accuracy | 82.71 | 49 |
| Visual Question Answering | InfoSeek | Accuracy | 15.45 | 38 |
| Spatial Understanding | SEED-Bench Spatial | Accuracy | 66.28 | 15 |
| Visual Question Answering | A-OKVQA Open-Ended | Accuracy | 72.14 | 15 |
| Scientific Question Answering | ScienceQA I | Accuracy | 69.96 | 8 |

Other info

Code: https://github.com/HITsz-TMG/Cognitive-Visual-Language-Mapper