Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering

About

Visual Question Answering (VQA) focuses on answering natural language questions using information from images. Although cutting-edge multimodal large language models (MLLMs) such as GPT-4o achieve strong performance on VQA tasks, they frequently fall short in accessing domain-specific or up-to-date knowledge. To mitigate this issue, retrieval-augmented generation (RAG) that leverages external knowledge bases (KBs), referred to as KB-VQA, has emerged as a promising approach. Nevertheless, conventional unimodal retrieval techniques, which translate images into textual descriptions, often lose critical visual details. To address these challenges, this study presents two key innovations. First, we introduce fine-grained knowledge units that organize multimodal data fragments (e.g., text fragments and entity images) in a structured manner. Rather than merely refining retrieval mechanisms, we prioritize the systematic organization and management of these knowledge units, so that the structuring process itself enhances retrieval quality. Second, we propose a knowledge unit retrieval-augmented generation framework (KU-RAG) that integrates fine-grained retrieval with MLLMs. The KU-RAG framework not only ensures precise retrieval of relevant knowledge but also enhances reasoning capabilities through a knowledge correction chain. Experimental results demonstrate that our approach consistently outperforms existing KB-VQA methods across four benchmarks, achieving an average improvement of approximately 3% and up to 11% in the best case.
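
To make the idea of a knowledge unit more concrete, the sketch below shows one way such a unit and a nearest-neighbour lookup over units could be represented. It is an illustration only and not the authors' implementation: the `KnowledgeUnit` class, its field names, the `retrieve` helper, and the three-dimensional toy embeddings are all assumptions introduced here. A real system would presumably obtain embeddings from a multimodal encoder rather than hand-written vectors.

```python
from dataclasses import dataclass, field
import math


@dataclass
class KnowledgeUnit:
    """Hypothetical container for one fine-grained knowledge unit."""
    entity: str
    text_fragments: list[str]
    image_paths: list[str] = field(default_factory=list)
    # Toy joint embedding; assumed to come from some multimodal encoder.
    embedding: list[float] = field(default_factory=list)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding: list[float], units: list[KnowledgeUnit], k: int = 1) -> list[KnowledgeUnit]:
    """Return the k units whose embeddings are most similar to the query."""
    ranked = sorted(units, key=lambda u: cosine(query_embedding, u.embedding), reverse=True)
    return ranked[:k]


# Toy knowledge base with made-up embeddings (illustrative values only).
units = [
    KnowledgeUnit("Eiffel Tower", ["Completed in 1889."], ["eiffel.jpg"], [0.9, 0.1, 0.0]),
    KnowledgeUnit("Golden Gate Bridge", ["Opened in 1937."], ["ggb.jpg"], [0.1, 0.9, 0.2]),
]

# A query embedding that lies close to the first unit.
top = retrieve([0.8, 0.2, 0.0], units, k=1)
print(top[0].entity, top[0].text_fragments)
```

The retrieved unit's text fragments and images could then be passed to an MLLM as additional context; the knowledge correction chain mentioned in the abstract is not modelled in this sketch.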

Zhengxuan Zhang, Yin Wu, Yuyu Luo, Nan Tang • 2025

Related benchmarks

Task	Dataset	Result
Visual Question Answering	OK-VQA	Accuracy 77.2	331
Visual Question Answering	InfoSeek (Full)	Accuracy 26.1	61
Knowledge-based Visual Question Answering	E-VQA Single-Hop	Accuracy 38.3	52
Knowledge-based Visual Question Answering	InfoSeek	FR (All) 26.1	18
Knowledge-based Visual Question Answering	E-VQA	Final Fidelity Rate 7.1	18
