MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs
About
Large Language Models (LLMs) often suffer from hallucinations, which Retrieval-Augmented Generation (RAG) and GraphRAG mitigate by incorporating external knowledge and knowledge graphs (KGs). However, GraphRAG remains text-centric due to the difficulty of constructing fine-grained Multimodal KGs (MMKGs). Existing fusion methods, such as shared embeddings or captioning, require task-specific training and fail to preserve visual structural knowledge or cross-modal reasoning paths. To bridge this gap, we propose MMGraphRAG, which integrates visual scene graphs with text KGs via a novel cross-modal fusion approach. It introduces SpecLink, a method leveraging spectral clustering for accurate cross-modal entity linking and path-based retrieval to guide generation. We also release the CMEL dataset, specifically designed for fine-grained multi-entity alignment in complex multimodal scenarios. Evaluations on CMEL, DocBench, and MMLongBench demonstrate that MMGraphRAG achieves state-of-the-art performance, showing robust domain adaptability and superior multimodal information processing capabilities.
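The paper does not spell out SpecLink's implementation, but its core idea — spectral clustering over a joint similarity graph of text-KG and scene-graph entities, then linking cross-modal entities that land in the same cluster — can be sketched as follows. Everything here (the `spectral_link` function, the Gaussian affinity, the toy embeddings) is a hypothetical illustration, not the actual SpecLink algorithm:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def spectral_link(X, modality, k, sigma=1.0):
    """Sketch of spectral-clustering-based cross-modal entity linking.

    X        : (n, d) entity embeddings from BOTH modalities (assumed given)
    modality : length-n list, e.g. "text" or "image", per entity
    k        : number of clusters
    Returns cluster labels and candidate cross-modal links.
    """
    # Pairwise Gaussian affinity graph over all entities.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt

    # The k smallest-eigenvalue eigenvectors span the cluster-indicator space.
    _, U = eigh(L, subset_by_index=[0, k - 1])
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)

    # k-means in the spectral embedding assigns cluster labels.
    _, labels = kmeans2(U, k, seed=0, minit="++")

    # Entities of different modalities sharing a cluster are link candidates.
    links = [(i, j) for i in range(len(X)) for j in range(i + 1, len(X))
             if labels[i] == labels[j] and modality[i] != modality[j]]
    return labels, links
```

In this sketch, a visual entity and a text entity are proposed as the same real-world entity whenever the spectral embedding places them in one cluster; a real system would additionally verify candidates (e.g. with an LLM) before merging nodes in the MMKG.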
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Knowledge-based VQA | InfoSeek | Unseen-Q Performance | 0.69 | 18 |
| Multimodal Document QA | DocBench | Accuracy (Academia) | 60.7 | 17 |
| Multimodal Reasoning | ScienceQA | Natural Science Accuracy | 81.08 | 17 |
| Multimodal Classification | CrisisMMD | BC Accuracy | 68.4 | 16 |
| Knowledge-based VQA | E-VQA | Single-Hop Accuracy | 19.12 | 16 |
| Multimodal Document QA | MMLongBench (test) | Chart Acc. | 34.7 | 12 |
| Multimodal Document QA | VisDoMBench FetaTab (full) | Accuracy | 72.4 | 11 |
| Multimodal Document QA | VisDoMBench PaperTab (full) | Accuracy | 56.36 | 11 |
| Multimodal Document QA | VisDoMBench SciGraphQA (full) | Accuracy | 64.11 | 11 |
| Multimodal Document QA | VisDoMBench SPIQA (full) | Accuracy | 69.91 | 11 |