Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation
About
With advances in multimodal research and deep learning, Multimodal Large Language Models (MLLMs) have emerged as a powerful paradigm for a wide range of multimodal tasks. Visual Question Answering (VQA), a core problem in vision-language research, increasingly relies on MLLMs to improve performance, particularly in open-domain settings where external knowledge is essential. In this work, we aim to further enhance retrieval-based VQA by integrating MLLMs more effectively with structured reasoning and knowledge acquisition. We introduce CoVQD, a logical prompting strategy that fuses Chain-of-Thought (CoT) reasoning with Visual Question Decomposition (VQD) to guide retrieval toward more accurate and relevant knowledge for MLLM inference. Building on this idea, we propose a new framework, CoVQD-guided RAG (CgRAG), which gives MLLMs access to more comprehensive and coherent external knowledge together with structured visual-text reasoning guidance, thereby improving generalization and reliability in complex cross-domain VQA scenarios. Extensive experiments on the E-VQA, InfoSeek, and OK-VQA benchmarks demonstrate the effectiveness of the proposed method.
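The abstract gives no implementation details, so the following is only a minimal illustrative sketch of the general idea of decomposition-guided retrieval: split the question into ordered sub-questions, retrieve evidence for each, and assemble a step-by-step prompt. Everything in it is an assumption made for illustration, not the authors' CgRAG pipeline: the names `Passage`, `decompose`, `retrieve`, and `build_prompt` are hypothetical, the decomposition is a fixed template standing in for an MLLM call, a text caption stands in for the image, and the retriever is a toy word-overlap ranker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    title: str
    text: str


def decompose(question: str, caption: str) -> list[str]:
    # Hypothetical stand-in for the MLLM decomposition step: a real system
    # would prompt the model with the image; a caption string is used here.
    return [
        f"Which entity is shown in the image described as: {caption}?",
        question,
    ]


def tokens(text: str) -> set[str]:
    # Lowercased word set with surrounding punctuation stripped.
    return {w.strip(".,:;?!").lower() for w in text.split()} - {""}


def retrieve(query: str, corpus: list[Passage], k: int = 1) -> list[Passage]:
    # Toy lexical-overlap ranking (an assumption, not the paper's retriever).
    q = tokens(query)
    ranked = sorted(
        corpus,
        key=lambda p: len(q & tokens(p.title + " " + p.text)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, caption: str, corpus: list[Passage]) -> str:
    # Retrieve per sub-question, then lay out the steps and evidence in order.
    sub_questions = decompose(question, caption)
    evidence: list[Passage] = []
    for sq in sub_questions:
        for passage in retrieve(sq, corpus):
            if passage not in evidence:  # keep the first occurrence only
                evidence.append(passage)
    lines = ["Answer step by step using the evidence."]
    lines += [f"Step {i}: {sq}" for i, sq in enumerate(sub_questions, 1)]
    lines += [f"Evidence [{p.title}]: {p.text}" for p in evidence]
    lines.append(f"Final question: {question}")
    return "\n".join(lines)


# Made-up example corpus and query, for illustration only.
corpus = [
    Passage("Eiffel Tower",
            "The Eiffel Tower is an iron lattice tower in Paris, completed in 1889."),
    Passage("Golden Gate Bridge",
            "The Golden Gate Bridge is a suspension bridge in San Francisco, opened in 1937."),
]
print(build_prompt("When was this tower completed?",
                   "an iron lattice tower at night", corpus))
```

Retrieving once per sub-question, rather than once for the original question, is the design point the sketch tries to convey: the entity-identification step and the fact-lookup step can each pull in different evidence before the final answer is generated.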
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | OK-VQA | Accuracy | 77.8 | 272 |
| Visual Question Answering | InfoSeek (Full) | Accuracy | 43 | 61 |
| Knowledge-based Visual Question Answering | E-VQA Single-Hop | Accuracy | 40.4 | 52 |
| Knowledge-based Visual Question Answering | INFOSEEK Unseen Question | Accuracy | 43.5 | 42 |
| Visual Question Answering | E-VQA All | Accuracy | 39.5 | 23 |
| Visual Question Answering | INFOSEEK Unseen-E | Accuracy | 42 | 23 |
| Visual Question Answering | GQA OOD (test) | Accuracy | 57.5 | 20 |
| Explanation Generation | GQA-REX | BLEU-4 | 61.4 | 6 |
| Question Answering | GQA-REX (val) | Accuracy | 84.3 | 6 |
| Question Answering | GQA-REX (test) | Accuracy | 64.4 | 6 |