
mKG-RAG: Leveraging Multimodal Knowledge Graphs in Retrieval-Augmented Generation for Knowledge-intensive VQA

About

Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for expanding the knowledge capacity of Multimodal Large Language Models (MLLMs) by incorporating external knowledge sources into the generation process, and has been widely adopted for knowledge-based Visual Question Answering (VQA). Despite impressive advancements, vanilla RAG-based VQA methods that rely on unstructured documents and overlook the structural relations among knowledge elements frequently introduce irrelevant or misleading content, degrading answer accuracy and reliability. To overcome these challenges, a promising solution is to integrate multimodal knowledge graphs (KGs) into RAG-based VQA frameworks, thereby enhancing generation through structured multimodal knowledge. To this end, this paper proposes mKG-RAG, a novel retrieval-augmented generation framework built upon multimodal KGs for knowledge-intensive VQA tasks. Specifically, mKG-RAG leverages MLLM-driven graph extraction and vision-text matching to distill semantically consistent, modality-complementary entities and relations from multimodal documents, constructing high-quality multimodal KGs as structured knowledge representations. Furthermore, a dual-stage retrieval strategy equipped with a query-aware multimodal retriever is introduced to improve retrieval efficiency while progressively refining precision. Comprehensive experiments demonstrate that our approach significantly outperforms existing approaches and sets new state-of-the-art results for knowledge-based VQA. The code is available at https://github.com/xandery-geek/mKG-RAG.
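The dual-stage retrieval idea in the abstract can be illustrated with a minimal coarse-to-fine sketch. This is not the authors' implementation (see the linked repository for that). It assumes that stage one ranks whole documents against the query and stage two ranks only the knowledge-graph triples extracted from the surviving documents. The names `Triple`, `Document`, `cosine`, and `dual_stage_retrieve`, the toy two-dimensional embeddings, and the use of cosine similarity in place of the paper's query-aware multimodal retriever are all hypothetical choices made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Triple:
    """One KG edge (head, relation, tail) with a toy embedding (assumed)."""
    head: str
    relation: str
    tail: str
    embedding: list[float]


@dataclass
class Document:
    """A source document, its toy embedding, and the triples extracted from it."""
    doc_id: str
    embedding: list[float]
    triples: list[Triple] = field(default_factory=list)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dual_stage_retrieve(query_emb, documents, top_docs=2, top_triples=3):
    """Hypothetical coarse-to-fine retrieval over documents and their triples."""
    # Stage 1 (coarse): rank whole documents and keep a small candidate set,
    # so the finer stage never has to score the full graph.
    candidates = sorted(
        documents, key=lambda d: cosine(query_emb, d.embedding), reverse=True
    )[:top_docs]
    # Stage 2 (fine): rank only the triples that belong to the candidates.
    pool = [t for d in candidates for t in d.triples]
    pool.sort(key=lambda t: cosine(query_emb, t.embedding), reverse=True)
    return pool[:top_triples]


# Toy data: embeddings are made up purely to exercise the two stages.
docs = [
    Document("eiffel", [1.0, 0.0], [
        Triple("Eiffel Tower", "located_in", "Paris", [0.9, 0.1]),
        Triple("Eiffel Tower", "completed_in", "1889", [0.2, 0.98]),
    ]),
    Document("louvre", [0.8, 0.6], [
        Triple("Louvre", "located_in", "Paris", [0.7, 0.7]),
    ]),
    Document("everest", [0.0, 1.0], [
        Triple("Mount Everest", "located_in", "Nepal", [0.0, 1.0]),
    ]),
]

for t in dual_stage_retrieve([1.0, 0.1], docs, top_docs=2, top_triples=2):
    print(t.head, t.relation, t.tail)
```

In this sketch the first stage trades recall for speed: any triple whose document is filtered out is never scored, which is why a later, more precise ranking over a smaller pool can be affordable.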

Xu Yuan, Liangbo Ning, Qingqing Ye, Wenqi Fan, Qing Li • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Visual Question Answering | Enc-VQA (test) | Single-Hop Accuracy: 38.4 | 84 |
| Knowledge-Intensive Visual Question Answering | InfoSeek (val) | Accuracy (All): 40.5 | 50 |
| Visual Question Answering | InfoSeek | Unseen-Q Score: 41.4 | 49 |
| Visual Question Answering | InfoSeek (val) | Overall Accuracy: 40.5 | 45 |
| Knowledge-Intensive Visual Question Answering | E-VQA (test) | Accuracy (All): 36.3 | 34 |
| Visual Question Answering | E-VQA | Accuracy (All): 36.3 | 19 |
| Knowledge-based VQA | InfoSeek | Unseen-Q Performance: 32.9 | 18 |
| Visual Question Answering | E-VQA | Accuracy: 38.4 | 15 |
| Retrieval | InfoSeek | Recall@1: 49.7 | 12 |
