mKG-RAG: Leveraging Multimodal Knowledge Graphs in Retrieval-Augmented Generation for Knowledge-intensive VQA

About

Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for expanding the knowledge capacity of Multimodal Large Language Models (MLLMs) by incorporating external knowledge sources into the generation process, and has been widely adopted for knowledge-based Visual Question Answering (VQA). Despite impressive advancements, vanilla RAG-based VQA methods that rely on unstructured documents and overlook the structural relations among knowledge elements frequently introduce irrelevant or misleading content, degrading answer accuracy and reliability. To overcome these challenges, a promising solution is to integrate multimodal knowledge graphs (KGs) into RAG-based VQA frameworks, thereby enhancing generation through structured multimodal knowledge. To this end, this paper proposes mKG-RAG, a novel retrieval-augmented generation framework built upon multimodal KGs for knowledge-intensive VQA tasks. Specifically, mKG-RAG leverages MLLM-driven graph extraction and vision-text matching to distill semantically consistent, modality-complementary entities and relations from multimodal documents, constructing high-quality multimodal KGs as structured knowledge representations. Furthermore, a dual-stage retrieval strategy equipped with a query-aware multimodal retriever is introduced to improve retrieval efficiency while progressively refining precision. Comprehensive experiments demonstrate that our approach significantly outperforms existing approaches and sets new state-of-the-art results for knowledge-based VQA. The code is available at https://github.com/xandery-geek/mKG-RAG.
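The dual-stage retrieval idea mentioned in the abstract can be illustrated with a toy coarse-to-fine sketch. This is not the authors' implementation: it assumes the first stage ranks whole documents and the second stage ranks graph items (entities and relations) drawn only from the surviving documents. The function names, the cosine scoring, and the 2-D vectors below are all hypothetical stand-ins for a learned multimodal retriever.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, items, k):
    """Return the ids of the k items most similar to the query.

    items is a list of (id, vector) pairs.
    """
    ranked = sorted(items, key=lambda it: cosine(query_vec, it[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]

def dual_stage_retrieve(query_vec, doc_vecs, graph_items, k_docs=2, k_items=2):
    """Hypothetical coarse-to-fine retrieval (an assumption, not the paper's code)."""
    # Stage 1 (coarse): keep only the k_docs documents closest to the query.
    docs = top_k(query_vec, list(doc_vecs.items()), k_docs)
    # Stage 2 (fine): rank graph items taken only from the kept documents.
    candidates = [pair for doc in docs for pair in graph_items[doc]]
    return docs, top_k(query_vec, candidates, k_items)

# Toy data: 2-D vectors stand in for real multimodal embeddings.
doc_vecs = {"doc_a": [1.0, 0.0], "doc_b": [0.8, 0.6], "doc_c": [0.0, 1.0]}
graph_items = {
    "doc_a": [("a:entity1", [1.0, 0.1]), ("a:rel1", [0.2, 1.0])],
    "doc_b": [("b:entity1", [0.9, 0.2])],
    "doc_c": [("c:entity1", [1.0, 0.0])],
}
query = [1.0, 0.0]

docs, items = dual_stage_retrieve(query, doc_vecs, graph_items)
print(docs, items)
```

In this toy setup, `c:entity1` matches the query exactly but is never scored, because `doc_c` is dropped in the first stage. That is the trade-off of any coarse-to-fine scheme: the fine stage searches a much smaller candidate set, at the cost of depending on the coarse stage's recall.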

Xu Yuan, Liangbo Ning, Qingqing Ye, Wenqi Fan, Qing Li • 2025

Related benchmarks

Task	Dataset	Result
Visual Question Answering	Enc-VQA (test)	Single-Hop Accuracy: 38.4	84
Visual Question Answering	InfoSeek	Unseen-Q Score: 41.4	67
Knowledge-Intensive Visual Question Answering	InfoSeek (val)	Accuracy (All): 40.5	50
Knowledge-Intensive Visual Question Answering	E-VQA (test)	Accuracy (All): 36.3	49
Visual Question Answering	InfoSeek (val)	Overall Accuracy: 40.5	45
Visual Question Answering	E-VQA	Accuracy (Single-Hop): 38.4	34
Knowledge-based VQA	InfoSeek	Unseen-Q Performance: 32.9	18
Visual Question Answering	E-VQA	Accuracy: 38.4	15
Retrieval	InfoSeek	Recall@1: 49.7	12
