
Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering

About

Visual Question Answering (VQA) focuses on answering natural language questions using information from images. Although cutting-edge multimodal large language models (MLLMs) such as GPT-4o achieve strong performance on VQA tasks, they frequently fall short in accessing domain-specific or up-to-date knowledge. To mitigate this issue, retrieval-augmented generation (RAG) leveraging external knowledge bases (KBs), referred to as KB-VQA, emerges as a promising approach. Nevertheless, conventional unimodal retrieval techniques, which translate images into textual descriptions, often lose critical visual details. To address these challenges, this study presents two key innovations. First, we introduce fine-grained knowledge units, which organize multimodal data fragments (e.g., text fragments and entity images) in a structured manner. Rather than merely refining retrieval mechanisms, we prioritize the systematic organization and management of these knowledge units, so that the structuring process itself enhances retrieval quality. Second, we propose a knowledge unit retrieval-augmented generation framework (KU-RAG) that integrates fine-grained retrieval with MLLMs. Our KU-RAG framework not only ensures precise retrieval of relevant knowledge but also enhances reasoning capabilities through a knowledge correction chain. Experimental results demonstrate that our approach consistently outperforms existing KB-VQA methods across four benchmarks, achieving an average improvement of approximately 3% and up to 11% in the best case.
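
The abstract does not spell out how a knowledge unit is represented or retrieved, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a knowledge unit pairs an entity name with a text fragment and an image embedding, and that retrieval ranks units by cosine similarity to a query image embedding. The names `KnowledgeUnit`, `retrieve`, and `build_prompt`, the toy three-dimensional vectors, and the prompt layout are all hypothetical. The knowledge correction chain is left out because the abstract gives no detail on it.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class KnowledgeUnit:
    """Hypothetical knowledge unit: an entity with a text fragment and an image embedding."""
    entity: str
    text: str
    image_vec: list[float]  # assumed to come from some image encoder (not specified in the source)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, units, k=1):
    """Return the k units whose image embeddings are most similar to the query embedding."""
    ranked = sorted(units, key=lambda u: cosine(query_vec, u.image_vec), reverse=True)
    return ranked[:k]


def build_prompt(question, units):
    """Assemble retrieved text fragments into a prompt for a multimodal model (assumed layout)."""
    facts = "\n".join(f"- {u.entity}: {u.text}" for u in units)
    return f"Known facts:\n{facts}\nQuestion: {question}"


# Toy knowledge base with made-up embeddings, for illustration only.
units = [
    KnowledgeUnit("Eiffel Tower", "Completed in 1889 in Paris.", [0.9, 0.1, 0.0]),
    KnowledgeUnit("Golden Gate Bridge", "Opened in 1937 in San Francisco.", [0.1, 0.9, 0.2]),
]

# The query vector stands in for the embedding of the image region being asked about.
top = retrieve([0.8, 0.2, 0.1], units, k=1)
prompt = build_prompt("When was this landmark completed?", top)
```

Because the query is matched against image embeddings directly, this sketch avoids the image-to-caption step that the abstract identifies as a source of lost visual detail in unimodal retrieval.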

Zhengxuan Zhang, Yin Wu, Yuyu Luo, Nan Tang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | OK-VQA | Accuracy 77.2 | 272 |
| Visual Question Answering | InfoSeek (Full) | Accuracy 26.1 | 61 |
| Knowledge-based Visual Question Answering | E-VQA Single-Hop | Accuracy 38.3 | 52 |
| Knowledge-based Visual Question Answering | InfoSeek | FR (All) 26.1 | 18 |
| Knowledge-based Visual Question Answering | E-VQA | Final Fidelity Rate 7.1 | 18 |