Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation

About

With advances in multimodal research and deep learning, Multimodal Large Language Models (MLLMs) have emerged as a powerful paradigm for a wide range of multimodal tasks. As a core problem in vision-language research, Visual Question Answering (VQA) has increasingly employed MLLMs to improve performance, particularly in open-domain settings where external knowledge is essential. In this work, we aim to further enhance retrieval-based VQA by more effectively integrating MLLMs with structured reasoning and knowledge acquisition. We introduce a logical prompting strategy that fuses Chain-of-Thought (CoT) reasoning with Visual Question Decomposition (VQD), termed CoVQD, to guide retrieval toward more accurate and relevant knowledge for MLLM inference. Building on this idea, we propose a new framework, CoVQD-guided RAG (CgRAG), which enables MLLMs to access more comprehensive and coherent external knowledge while benefiting from structured visual-text reasoning guidance, thereby improving generalization and reliability in complex cross-domain VQA scenarios. Extensive experiments on E-VQA, InfoSeek, and OKVQA benchmarks demonstrate the effectiveness of the proposed method.
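The flow the abstract describes (decompose the question, retrieve knowledge for each sub-question, then prompt the model with both the reasoning steps and the evidence) can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`Passage`, `retrieve`, `build_prompt`, `toy_decompose`) is hypothetical, the word-overlap retriever is a stand-in for whatever retriever the paper uses, a text caption stands in for the image, and the decomposition step is a fixed stub where the paper would use an MLLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    """One knowledge-base entry (hypothetical structure)."""
    title: str
    text: str


def tokenize(s: str) -> set:
    # Lowercase words with basic trailing/leading punctuation stripped.
    return {w.strip(".,?!").lower() for w in s.split() if w.strip(".,?!")}


def retrieve(query: str, kb: List[Passage], k: int = 1) -> List[Passage]:
    """Toy lexical retriever: rank passages by word overlap with the query."""
    q = tokenize(query)
    ranked = sorted(
        kb,
        key=lambda p: len(q & tokenize(p.title + " " + p.text)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(
    question: str,
    image_caption: str,
    decompose: Callable[[str, str], List[str]],
    kb: List[Passage],
    k: int = 1,
) -> str:
    """Decompose the question, retrieve per sub-question, assemble a prompt."""
    sub_questions = decompose(question, image_caption)
    evidence: List[Passage] = []
    for sq in sub_questions:
        for p in retrieve(sq, kb, k):
            if p not in evidence:  # keep each passage only once
                evidence.append(p)
    lines = [f"Image: {image_caption}", "Sub-questions:"]
    lines += [f"{i}. {sq}" for i, sq in enumerate(sub_questions, 1)]
    lines.append("Retrieved knowledge:")
    lines += [f"- {p.title}: {p.text}" for p in evidence]
    lines.append(f"Question: {question}")
    lines.append("Answer step by step, then give a final answer.")
    return "\n".join(lines)


def toy_decompose(question: str, image_caption: str) -> List[str]:
    # Stub: a real system would ask a model to produce these sub-questions.
    return [f"What landmark is shown in: {image_caption}?", question]


kb = [
    Passage("Eiffel Tower",
            "The Eiffel Tower is a wrought-iron tower in Paris, completed in 1889."),
    Passage("Golden Gate Bridge",
            "The Golden Gate Bridge is a suspension bridge in San Francisco, opened in 1937."),
]

prompt = build_prompt(
    "When was this tower completed?",
    "a tall iron tower in Paris",
    toy_decompose,
    kb,
)
print(prompt)
```

In this toy setup, retrieval runs once per sub-question rather than once for the original question, so the entity-identification step and the factual step can each pull their own evidence before the final prompt is assembled.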

Quanxing Xu, Ling Zhou, Xian Zhong, Xiaohua Huang, Rubing Huang, Chia-Wen Lin • 2026

Related benchmarks

Task | Dataset | Result | Rank
Visual Question Answering | OK-VQA | Accuracy 77.8 | 272
Visual Question Answering | InfoSeek (Full) | Accuracy 43 | 61
Knowledge-based Visual Question Answering | E-VQA Single-Hop | Accuracy 40.4 | 52
Knowledge-based Visual Question Answering | INFOSEEK Unseen Question | Accuracy 43.5 | 42
Visual Question Answering | E-VQA All | Accuracy 39.5 | 23
Visual Question Answering | INFOSEEK Unseen-E | Accuracy 42 | 23
Visual Question Answering | GQA OOD (test) | Accuracy 57.5 | 20
Explanation Generation | GQA-REX | BLEU-4 61.4 | 6
Question Answering | GQA-REX (val) | Accuracy 84.3 | 6
Question Answering | GQA-REX (test) | Accuracy 64.4 | 6
Showing 10 of 12 rows
