Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation
About
Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in mitigating hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge. However, existing methods typically adhere to rigid retrieval paradigms, mimicking fixed retrieval trajectories and thus failing to fully exploit the knowledge of different retrieval experts through dynamic interaction driven by the model's knowledge needs or evolving reasoning state. To overcome this limitation, we introduce Mixture-of-Retrieval Experts (MoRE), a novel framework that enables MLLMs to collaboratively interact with diverse retrieval experts for more effective knowledge exploitation. Specifically, MoRE learns to dynamically determine which expert to engage, conditioned on the evolving reasoning state. To train this capability effectively, we propose Stepwise Group Relative Policy Optimization (Step-GRPO), which goes beyond sparse outcome-based supervision by encouraging the MLLM to interact with multiple retrieval experts and by synthesizing fine-grained stepwise rewards, thereby teaching the MLLM to coordinate all experts when answering a given query. Experimental results on diverse open-domain QA benchmarks demonstrate the effectiveness of MoRE, with average performance gains of over 7% compared to competitive baselines. Notably, MoRE exhibits strong adaptability, dynamically coordinating heterogeneous experts to precisely locate relevant information and validating its capability for robust, reasoning-driven expert collaboration. All code and data are released at https://github.com/OpenBMB/MoRE.
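To make the "stepwise" part of Step-GRPO concrete, the sketch below shows how group-relative advantages can be computed at the granularity of individual reasoning steps instead of only at the final outcome, which is the standard GRPO setting. This is a minimal illustration under assumptions: the function name `step_grpo_advantages`, the per-step z-score normalization across the rollout group, and the toy reward values are all hypothetical and not taken from the paper.

```python
# Hypothetical sketch: per-step group-relative advantage estimation.
# Standard GRPO normalizes a single outcome reward across a group of
# sampled rollouts; here each reasoning step (e.g., a retrieval-expert
# call) gets its own reward, normalized across the group at that step.
from statistics import mean, pstdev

def step_grpo_advantages(step_rewards, eps=1e-8):
    """step_rewards[g][t]: reward of rollout g at reasoning step t.

    Returns advantages[g][t], the step-t reward of rollout g
    standardized against the other rollouts' step-t rewards.
    All rollouts are assumed to have the same number of steps.
    """
    num_steps = len(step_rewards[0])
    advantages = [[0.0] * num_steps for _ in step_rewards]
    for t in range(num_steps):
        group = [rollout[t] for rollout in step_rewards]  # step-t rewards
        mu, sigma = mean(group), pstdev(group)
        for g, rollout in enumerate(step_rewards):
            advantages[g][t] = (rollout[t] - mu) / (sigma + eps)
    return advantages

# Three sampled rollouts, two scored steps each (toy values, e.g.
# retrieval-call usefulness at step 0, answer correctness at step 1).
adv = step_grpo_advantages([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Because the advantages are standardized within each step's group, they sum to (approximately) zero per step; rollouts that chose a more useful expert at a given step receive a positive advantage there even if their final answers tie, which is the fine-grained signal outcome-only supervision cannot provide.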
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | SlideVQA | Overall Accuracy | 63.43 | 46 |
| Multimodal Question Answering | Open-WikiTable | F1 Recall | 53.9 | 22 |
| Multimodal Question Answering | 2WikiMQA | F1 Recall | 55.47 | 22 |
| Multimodal Question Answering | TabFact | F1 Recall | 52.6 | 22 |
| Multimodal Question Answering | WebQA | F1 Recall | 90.92 | 22 |
| Multimodal Question Answering | Aggregate (Open-WikiTable, 2WikiMQA, InfoSeek, Dyn-VQA, TabFact, WebQA) | Average Score | 55.93 | 22 |
| Visual Question Answering | InfoSeek | F1 Recall | 43.6 | 22 |
| Multimodal Question Answering | Dyn-VQA | F1 Recall | 39.24 | 22 |
| Visual Information Retrieval and Reasoning | ViDoSeek | Overall Accuracy | 66.11 | 18 |
| Long-context Multi-modal Understanding | MMLongBench | Text Accuracy | 26.8 | 17 |