Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation
About
Multimodal Retrieval-Augmented Generation (MRAG) has shown promise in mitigating hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge. However, existing methods typically adhere to rigid retrieval paradigms, mimicking fixed retrieval trajectories and thus failing to fully exploit the knowledge of different retrieval experts through dynamic interaction driven by the model's knowledge needs or evolving reasoning state. To overcome this limitation, we introduce Mixture-of-Retrieval Experts (MoRE), a novel framework that enables MLLMs to collaboratively interact with diverse retrieval experts for more effective knowledge exploitation. Specifically, MoRE learns to dynamically determine which expert to engage, conditioned on the evolving reasoning state. To train this capability effectively, we propose Stepwise Group Relative Policy Optimization (Step-GRPO), which goes beyond sparse outcome-based supervision by encouraging the MLLM to interact with multiple retrieval experts and by synthesizing fine-grained stepwise rewards, thereby teaching the MLLM to coordinate all experts when answering a given query. Experimental results on diverse open-domain QA benchmarks demonstrate the effectiveness of MoRE, with average performance gains of over 7% compared to competitive baselines. Notably, MoRE exhibits strong adaptability, dynamically coordinating heterogeneous experts to precisely locate relevant information and validating its capability for robust, reasoning-driven expert collaboration. All code and data are released at https://github.com/OpenBMB/MoRE.
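To make the "stepwise" part of Step-GRPO concrete, the sketch below shows how group-relative advantages can be computed at the granularity of individual reasoning steps instead of only at the final outcome, which is the standard GRPO setting. This is a minimal illustration under assumptions: the function name `step_grpo_advantages`, the per-step z-score normalization across the rollout group, and the toy reward values are all hypothetical and not taken from the paper.

```python
# Hypothetical sketch: per-step group-relative advantage estimation.
# Standard GRPO normalizes a single outcome reward across a group of
# sampled rollouts; here each reasoning step (e.g., a retrieval-expert
# call) gets its own reward, normalized across the group at that step.
from statistics import mean, pstdev

def step_grpo_advantages(step_rewards, eps=1e-8):
    """step_rewards[g][t]: reward of rollout g at reasoning step t.

    Returns advantages[g][t], the step-t reward of rollout g
    standardized against the other rollouts' step-t rewards.
    All rollouts are assumed to have the same number of steps.
    """
    num_steps = len(step_rewards[0])
    advantages = [[0.0] * num_steps for _ in step_rewards]
    for t in range(num_steps):
        group = [rollout[t] for rollout in step_rewards]  # step-t rewards
        mu, sigma = mean(group), pstdev(group)
        for g, rollout in enumerate(step_rewards):
            advantages[g][t] = (rollout[t] - mu) / (sigma + eps)
    return advantages

# Three sampled rollouts, two scored steps each (toy values, e.g.
# retrieval-call usefulness at step 0, answer correctness at step 1).
adv = step_grpo_advantages([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Because the advantages are standardized within each step's group, they sum to (approximately) zero per step; rollouts that chose a more useful expert at a given step receive a positive advantage there even if their final answers tie, which is the fine-grained signal outcome-only supervision cannot provide.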
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | SlideVQA | Overall Accuracy | 63.43 | 46 |
| Multimodal Question Answering | Open-WikiTable | F1 Recall | 53.9 | 22 |
| Multimodal Question Answering | 2WikiMQA | F1 Recall | 55.47 | 22 |
| Multimodal Question Answering | TabFact | F1 Recall | 52.6 | 22 |
| Multimodal Question Answering | WebQA | F1 Recall | 90.92 | 22 |
| Multimodal Question Answering | Aggregate (Open-WikiTable, 2WikiMQA, InfoSeek, Dyn-VQA, TabFact, WebQA) | Average Score | 55.93 | 22 |
| Visual Question Answering | InfoSeek | F1 Recall | 43.6 | 22 |
| Multimodal Question Answering | Dyn-VQA | F1 Recall | 39.24 | 22 |
| Visual Information Retrieval and Reasoning | ViDoSeek | Overall Accuracy | 66.11 | 18 |
| Long-context Multi-modal Understanding | MMLongBench | Text Accuracy | 26.8 | 17 |