
MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs

About

Large Language Models (LLMs) often suffer from hallucinations, which Retrieval-Augmented Generation (RAG) and GraphRAG mitigate by incorporating external knowledge and knowledge graphs (KGs). However, GraphRAG remains text-centric due to the difficulty of constructing fine-grained Multimodal KGs (MMKGs). Existing fusion methods, such as shared embeddings or captioning, require task-specific training and fail to preserve visual structural knowledge or cross-modal reasoning paths. To bridge this gap, we propose MMGraphRAG, which integrates visual scene graphs with text KGs via a novel cross-modal fusion approach. It introduces SpecLink, a method leveraging spectral clustering for accurate cross-modal entity linking and path-based retrieval to guide generation. We also release the CMEL dataset, specifically designed for fine-grained multi-entity alignment in complex multimodal scenarios. Evaluations on CMEL, DocBench, and MMLongBench demonstrate that MMGraphRAG achieves state-of-the-art performance, showing robust domain adaptability and superior multimodal information processing capabilities.
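The paper does not detail SpecLink's implementation here, but the core idea it names, spectral clustering over a cross-modal similarity graph to link entities, can be illustrated with a toy sketch. Everything below (the entity names, the 2-D embeddings, the sign-based two-way cut) is an assumption for illustration, not the paper's actual method or features:

```python
import numpy as np

# Toy embeddings for entities from the two modalities (illustrative only;
# SpecLink's real entity features are not specified in the abstract).
text_entities = {"dog": [1.0, 0.1], "park": [0.1, 1.0]}
image_entities = {"img:dog": [0.9, 0.2], "img:tree": [0.2, 0.9]}

names = list(text_entities) + list(image_entities)
X = np.array(list(text_entities.values()) + list(image_entities.values()))

# Cosine-similarity affinity matrix over all entities (zero diagonal).
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
A = Xn @ Xn.T
np.fill_diagonal(A, 0.0)

# Symmetric normalized graph Laplacian: L = I - D^{-1/2} A D^{-1/2}.
d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(names)) - D_inv_sqrt @ A @ D_inv_sqrt

# Spectral embedding: eigenvectors of L, sorted by ascending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(L)

# Classic two-way spectral cut: split on the sign of the Fiedler vector
# (second-smallest eigenvector). Entities sharing a cluster across
# modalities are proposed as cross-modal links.
labels = (eigvecs[:, 1] > 0).astype(int)
clusters = {}
for name, lab in zip(names, labels):
    clusters.setdefault(lab, []).append(name)
print(clusters)
```

On this toy graph the strong dog/img:dog and park/img:tree similarities dominate the weak cross edges, so the cut groups each text entity with its visual counterpart. A real system would cluster into more than two groups (e.g. k-means on several eigenvectors) and add path-based retrieval on top.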

Xueyao Wan, Hang Yu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Knowledge-based VQA | InfoSeek | Unseen-Q Performance | 0.69 | 18 |
| Multimodal Document Question Answering | DocBench | Accuracy (Academia) | 60.7 | 17 |
| Multimodal Reasoning | ScienceQA | Natural Science Accuracy | 81.08 | 17 |
| Multimodal Classification | CrisisMMD | BC Accuracy | 68.4 | 16 |
| Knowledge-based VQA | E-VQA | Single-Hop Accuracy | 19.12 | 16 |
| Multimodal Document Question Answering | MMLongBench (test) | Chart Acc. | 34.7 | 12 |
| Multimodal Document QA | VisDoMBench FetaTab (full) | Accuracy | 72.4 | 11 |
| Multimodal Document QA | VisDoMBench PaperTab (full) | Accuracy | 56.36 | 11 |
| Multimodal Document QA | VisDoMBench SciGraphQA (full) | Accuracy | 64.11 | 11 |
| Multimodal Document QA | VisDoMBench SPIQA (full) | Accuracy | 69.91 | 11 |

Showing 10 of 12 rows.
