MMSearch-R1: Incentivizing LMMs to Search
About
Robust deployment of large multimodal models (LMMs) in real-world scenarios requires access to external knowledge sources, given the complexity and dynamic nature of real-world information. Existing approaches such as retrieval-augmented generation (RAG) and prompt-engineered search agents rely on rigid pipelines, often leading to inefficient or excessive search behaviors. We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them, guided by an outcome-based reward with a search penalty. To support training, we collect a multimodal search VQA dataset through a semi-automated pipeline covering diverse visual and textual knowledge needs, and curate a search-balanced subset with both search-required and search-free samples, which proves essential for shaping efficient, on-demand search behavior. Extensive experiments on knowledge-intensive and information-seeking VQA tasks show that our model not only outperforms RAG-based baselines of the same model size, but also matches the performance of a larger RAG-based model while reducing search calls by over 30%. We further analyze key empirical findings to offer actionable insights for advancing research in multimodal search.
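The abstract mentions an outcome-based reward with a search penalty. As a minimal sketch of that idea (the exact reward formula and penalty value are not given above, so the names and constants here are illustrative assumptions, not the paper's implementation):

```python
def outcome_reward(answer_correct: bool, used_search: bool,
                   search_penalty: float = 0.1) -> float:
    """Hypothetical outcome-based reward with a search penalty.

    The rollout earns a reward only when the final answer is correct;
    if the model invoked a search tool along the way, a penalty is
    subtracted so that unnecessary searches reduce the return. This
    pressures the policy toward answering from internal knowledge
    when it can, and searching only when it must.
    """
    reward = 1.0 if answer_correct else 0.0
    if answer_correct and used_search:
        reward -= search_penalty  # correct-but-searched earns slightly less
    return reward
```

Under this shaping, a correct answer without search is strictly preferred over a correct answer with search, which is one way to obtain the "on-demand" behavior described above.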
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Multimodal Search-based Question Answering | MMSearch | Accuracy | 53.8 | 42 |
| Visual Question Answering | LiveVQA | Accuracy | 48.4 | 42 |
| Visual Question Answering | InfoSeek | Accuracy | 24.65 | 38 |
| Multimodal Document Question Answering | MMLongBench-Doc | Acc (TXT Evidence) | 35.57 | 30 |
| Document Visual Question Answering | MMLongBench-Doc | Accuracy | 29.92 | 29 |
| Search-oriented Visual Question Answering | Search-oriented Benchmarks 1.0 (test val) | InfoSeek Score | 55.1 | 28 |
| Visual Question Answering | SlideVQA | Single Accuracy | 76.64 | 28 |
| Visual Question Answering | SimpleVQA | Accuracy | 0.5479 | 23 |
| Fact-based Visual Question Answering | FVQA | Accuracy | 58.4 | 21 |
| Visual Question Answering | FVQA | Accuracy | 42.39 | 16 |