Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent

About

Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs). Although promising, existing heuristic mRAG approaches typically rely on predefined, fixed retrieval processes, which causes two issues: (1) non-adaptive retrieval queries and (2) overloaded retrieval queries. However, these flaws cannot be adequately reflected by current knowledge-seeking visual question answering (VQA) datasets, since most of the knowledge they require can be readily obtained with a standard two-step retrieval. To bridge this dataset gap, we first construct the Dyn-VQA dataset, consisting of three types of "dynamic" questions that require complex knowledge-retrieval strategies variable in query, tool, and time: (1) questions with rapidly changing answers, (2) questions requiring multimodal knowledge, and (3) multi-hop questions. Experiments on Dyn-VQA reveal that existing heuristic mRAGs struggle to provide sufficient and precisely relevant knowledge for dynamic questions due to their rigid retrieval processes. Hence, we further propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch. The underlying idea is to emulate human question-solving behavior by dynamically decomposing complex multimodal questions into chains of sub-questions, each paired with a retrieval action. Extensive experiments demonstrate the effectiveness of OmniSearch and also point to directions for advancing mRAG. The code and dataset will be open-sourced at https://github.com/Alibaba-NLP/OmniSearch.
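A minimal sketch of the self-adaptive planning loop the abstract describes, assuming a planner that alternates between proposing sub-questions paired with retrieval actions and deciding the evidence suffices. All names here (`solve`, `Step`, `AgentState`, the `plan`/`retrieve`/`answer` callables) are hypothetical illustrations, not the actual OmniSearch API from the linked repository:

```python
# Sketch of a self-adaptive mRAG planning loop in the spirit of OmniSearch.
# The planner/retriever/answerer callables are stand-ins for the paper's
# MLLM planner and search tools, not the real OmniSearch implementation.

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    sub_question: str
    action: str               # e.g. "text_search", "image_search", "answer"
    evidence: str = ""


@dataclass
class AgentState:
    question: str
    image: str                # path or URL of the query image
    steps: list[Step] = field(default_factory=list)


Planner = Callable[[AgentState], Step]
Retriever = Callable[[Step, str], str]
Answerer = Callable[[AgentState], str]


def solve(question: str, image: str, plan: Planner, retrieve: Retriever,
          answer: Answerer, max_steps: int = 5) -> str:
    """Iteratively decompose the question into sub-question/retrieval pairs
    until the planner decides the gathered evidence is sufficient."""
    state = AgentState(question=question, image=image)
    for _ in range(max_steps):
        step = plan(state)
        if step.action == "answer":   # planner judges evidence sufficient
            break
        step.evidence = retrieve(step, state.image)
        state.steps.append(step)
    return answer(state)


# Toy usage with stub components (no real MLLM or search engine):
ans = solve(
    "Who is the current coach of the team on this jersey?",
    "jersey.jpg",
    plan=lambda s: Step("answer", "answer") if s.steps
        else Step("Which team does the jersey belong to?", "image_search"),
    retrieve=lambda step, img: f"stub evidence for: {step.sub_question}",
    answer=lambda s: f"answered using {len(s.steps)} retrieval step(s)",
)
print(ans)  # -> "answered using 1 retrieval step(s)"
```

In the abstract's framing, the planner role is played by an MLLM conditioned on the original question, the image, and the evidence gathered so far, and the retrieval actions span different tools (e.g., text and image search), which is what makes the process variable in query, tool, and time.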

Yangning Li, Yinghui Li, Xinyu Wang, Yong Jiang, Zhen Zhang, Xinran Zheng, Hui Wang, Hai-Tao Zheng, Philip S. Yu, Fei Huang, Jingren Zhou · 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Visual Question Answering | LiveVQA | Accuracy | 40.9 | 108
Multimodal Search-based Question Answering | MMSearch | Accuracy | 49.7 | 54
Geolocation | GeoBrowse Level 2 (BrowseComp-style) | City Accuracy | 15.8 | 24
Geolocation | GeoBrowse Level 1 (Visual Cues) | City Accuracy | 9.6 | 24
Multimodal Question Answering | Dyn-VQA | F1-Recall | 18.94 | 22
Multimodal Question Answering | 2WikiMQA | F1-Recall | 31.02 | 22
Multimodal Question Answering | WebQA | F1-Recall | 58.02 | 22
Multimodal Question Answering | Aggregate (Open-WikiTable, 2WikiMQA, InfoSeek, Dyn-VQA, TabFact, WebQA) | Average Score | 23.24 | 22
Visual Question Answering | InfoSeek | F1-Recall | 24.45 | 22
Multimodal Question Answering | Open-WikiTable | F1-Recall | 7.72 | 22

(Showing 10 of 18 rows.)
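Several rows above report F1-Recall. A common token-overlap formulation from open-ended QA evaluation is sketched below; whether these benchmarks compute the score exactly this way is an assumption, since the page does not define the metric:

```python
# Token-level recall and F1 between a predicted and a gold answer.
# This is the standard SQuAD-style token-overlap formulation; that the
# Dyn-VQA/WebQA scores above use exactly this recipe is an assumption.

from collections import Counter


def token_f1_recall(prediction: str, gold: str) -> tuple[float, float]:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return f1, recall


# Example: prediction covers the gold token, so recall = 1.0, while the
# four extra tokens drag precision to 0.2 and F1 to 2*(0.2*1)/(1.2) = 1/3.
print(token_f1_recall("the 2024 olympics in paris", "paris"))
```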
