MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs
About
Large Language Models (LLMs) often suffer from hallucinations, which Retrieval-Augmented Generation (RAG) and GraphRAG mitigate by incorporating external knowledge and knowledge graphs (KGs). However, GraphRAG remains text-centric due to the difficulty of constructing fine-grained Multimodal KGs (MMKGs). Existing fusion methods, such as shared embeddings or captioning, require task-specific training and fail to preserve visual structural knowledge or cross-modal reasoning paths. To bridge this gap, we propose MMGraphRAG, which integrates visual scene graphs with text KGs via a novel cross-modal fusion approach. It introduces SpecLink, a method leveraging spectral clustering for accurate cross-modal entity linking and path-based retrieval to guide generation. We also release the CMEL dataset, specifically designed for fine-grained multi-entity alignment in complex multimodal scenarios. Evaluations on CMEL, DocBench, and MMLongBench demonstrate that MMGraphRAG achieves state-of-the-art performance, showing robust domain adaptability and superior multimodal information processing capabilities.
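The paper does not spell out SpecLink's implementation, but its core idea — spectral clustering over a joint similarity graph of text-KG and scene-graph entities, then linking cross-modal entities that land in the same cluster — can be sketched as follows. Everything here (the `spectral_link` function, the Gaussian affinity, the toy embeddings) is a hypothetical illustration, not the actual SpecLink algorithm:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def spectral_link(X, modality, k, sigma=1.0):
    """Sketch of spectral-clustering-based cross-modal entity linking.

    X        : (n, d) entity embeddings from BOTH modalities (assumed given)
    modality : length-n list, e.g. "text" or "image", per entity
    k        : number of clusters
    Returns cluster labels and candidate cross-modal links.
    """
    # Pairwise Gaussian affinity graph over all entities.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt

    # The k smallest-eigenvalue eigenvectors span the cluster-indicator space.
    _, U = eigh(L, subset_by_index=[0, k - 1])
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)

    # k-means in the spectral embedding assigns cluster labels.
    _, labels = kmeans2(U, k, seed=0, minit="++")

    # Entities of different modalities sharing a cluster are link candidates.
    links = [(i, j) for i in range(len(X)) for j in range(i + 1, len(X))
             if labels[i] == labels[j] and modality[i] != modality[j]]
    return labels, links
```

In this sketch, a visual entity and a text entity are proposed as the same real-world entity whenever the spectral embedding places them in one cluster; a real system would additionally verify candidates (e.g. with an LLM) before merging nodes in the MMKG.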
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Knowledge-based VQA | InfoSeek | Unseen-Q Performance | 0.69 | 18 |
| Multimodal Document QA | DocBench | Accuracy (Academia) | 60.7 | 17 |
| Multimodal Reasoning | ScienceQA | Natural Science Accuracy | 81.08 | 17 |
| Multimodal Classification | CrisisMMD | BC Accuracy | 68.4 | 16 |
| Knowledge-based VQA | E-VQA | Single-Hop Accuracy | 19.12 | 16 |
| Multimodal Document QA | MMLongBench (test) | Chart Acc. | 34.7 | 12 |
| Multimodal Document QA | VisDoMBench FetaTab (full) | Accuracy | 72.4 | 11 |
| Multimodal Document QA | VisDoMBench PaperTab (full) | Accuracy | 56.36 | 11 |
| Multimodal Document QA | VisDoMBench SciGraphQA (full) | Accuracy | 64.11 | 11 |
| Multimodal Document QA | VisDoMBench SPIQA (full) | Accuracy | 69.91 | 11 |