
Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation

About

Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in mitigating hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge. However, existing methods typically adhere to rigid retrieval paradigms that mimic fixed retrieval trajectories, and thus fail to fully exploit the knowledge of different retrieval experts through dynamic interaction driven by the model's knowledge needs or evolving reasoning states. To overcome this limitation, we introduce Mixture-of-Retrieval Experts (MoRE), a novel framework that enables MLLMs to collaboratively interact with diverse retrieval experts for more effective knowledge exploitation. Specifically, MoRE learns to dynamically determine which expert to engage, conditioned on the evolving reasoning state. To effectively train this capability, we propose Stepwise Group Relative Policy Optimization (Step-GRPO), which goes beyond sparse outcome-based supervision by encouraging MLLMs to interact with multiple retrieval experts and synthesize fine-grained rewards, thereby teaching the MLLM to fully coordinate all experts when answering a given query. Experimental results on diverse open-domain QA benchmarks demonstrate the effectiveness of MoRE, which achieves average performance gains of over 7% compared to competitive baselines. Notably, MoRE exhibits strong adaptability by dynamically coordinating heterogeneous experts to precisely locate relevant information, validating its capability for robust, reasoning-driven expert collaboration. All code and data are released at https://github.com/OpenBMB/MoRE.
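To make the interaction pattern concrete, here is a minimal Python sketch of a reasoning-state-conditioned expert-selection loop in the spirit of MoRE. The `mllm.next_action` / `mllm.generate_answer` interface, the action format, and the three placeholder experts are illustrative assumptions, not the released API.

```python
# Minimal sketch of a MoRE-style interaction loop, assuming a policy object
# `mllm` exposing `next_action(state)` and `generate_answer(state)` (both
# hypothetical names) plus a small pool of retrieval experts.
from dataclasses import dataclass, field

@dataclass
class ReasoningState:
    query: str
    evidence: list[tuple[str, str]] = field(default_factory=list)

# Placeholder experts; in MoRE these would be heterogeneous retrieval
# backends (e.g. text, image, and table search).
EXPERTS = {
    "text": lambda q: f"text passages for {q!r}",
    "image": lambda q: f"image hits for {q!r}",
    "table": lambda q: f"table rows for {q!r}",
}

def answer(mllm, query: str, max_steps: int = 4) -> str:
    state = ReasoningState(query=query)
    for _ in range(max_steps):
        # The MLLM reads the evolving reasoning state and either names the
        # next expert to consult or signals that it can answer directly.
        action = mllm.next_action(state)  # e.g. {"expert": "table", "subquery": "..."}
        if action["expert"] == "answer":
            break
        state.evidence.append(
            (action["expert"], EXPERTS[action["expert"]](action["subquery"]))
        )
    return mllm.generate_answer(state)
```

The key design point is that expert choice is a per-step decision conditioned on accumulated evidence, not a fixed trajectory decided up front.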
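Step-GRPO builds on GRPO's critic-free, group-relative advantage estimation, with rewards assigned at the level of individual expert interactions rather than only at the final outcome. Below is a hedged sketch of that advantage computation, assuming step rewards are summed into a per-rollout return; the paper's exact reward shaping may differ.

```python
import numpy as np

def step_grpo_advantages(step_rewards: list[list[float]], eps: float = 1e-6):
    """Group-relative advantages from fine-grained step rewards.

    step_rewards holds one inner list per rollout in the sampled group;
    each inner list contains that rollout's per-step rewards (e.g. one per
    expert interaction). Summation into a scalar return is an assumption.
    """
    returns = np.array([sum(r) for r in step_rewards], dtype=float)
    # Standard GRPO normalization: center and scale within the group,
    # so no learned value network is needed.
    return (returns - returns.mean()) / (returns.std() + eps)

# Example: a group of 4 rollouts with 2-3 expert interactions each.
group = [[0.2, 0.5], [0.1, 0.0, 0.3], [0.6, 0.4], [0.0, 0.1]]
print(step_grpo_advantages(group))
```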

Chunyi Peng, Zhipeng Xu, Zhenghao Liu, Yishan Li, Yukun Yan, Shuo Wang, Yu Gu, Minghe Yu, Ge Yu, Maosong Sun • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Question Answering | SlideVQA | Overall Accuracy | 63.43 | 46 |
| Multimodal Question Answering | Open-WikiTable | F1 Recall | 53.9 | 22 |
| Multimodal Question Answering | 2WikiMQA | F1 Recall | 55.47 | 22 |
| Multimodal Question Answering | TabFact | F1 Recall | 52.6 | 22 |
| Multimodal Question Answering | WebQA | F1 Recall | 90.92 | 22 |
| Multimodal Question Answering | Aggregate (Open-WikiTable, 2WikiMQA, InfoSeek, Dyn-VQA, TabFact, WebQA) | Average Score | 55.93 | 22 |
| Visual Question Answering | InfoSeek | F1 Recall | 43.6 | 22 |
| Multimodal Question Answering | Dyn-VQA | F1 Recall | 39.24 | 22 |
| Visual Information Retrieval and Reasoning | ViDoSeek | Overall Accuracy | 66.11 | 18 |
| Long-context Multi-modal Understanding | MMLongBench | Text Accuracy | 26.8 | 17 |
