MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline

About

We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis, enabling it to conduct deep research tasks. However, we observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search trajectories, and (3) prohibitive cost of training with online search APIs. To tackle them, we first propose Hyper-Search, a hypergraph-based QA generation method that models and connects visual and textual nodes within and across modalities, enabling the generation of search-intensive multimodal QA pairs that require invoking various search tools to solve. Second, we introduce DR-TTS, which first decomposes search-involved tasks into several categories according to search tool type and optimizes a specialized search tool expert for each tool. It then recomposes the tool experts to jointly explore search trajectories via tree search, producing trajectories that successfully solve complex tasks using various search tools. Third, we build an offline search engine supporting multiple search tools, enabling agentic reinforcement learning without using costly online search APIs. With these three designs, we develop MM-DeepResearch, a powerful multimodal deep research agent, and extensive results show its superiority across benchmarks. Code is available at https://github.com/HJYao00/MM-DeepResearch
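To make the hypergraph idea behind Hyper-Search more concrete, here is a minimal Python sketch of a hypergraph whose nodes are tagged as visual or textual and whose hyperedges may link nodes within or across modalities. It is only an illustration under stated assumptions: the class names (`Node`, `Hypergraph`), the method names, and the toy node identifiers are all hypothetical and are not taken from the MM-DeepResearch code. The actual QA generation procedure is not described in enough detail here to reproduce.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """Hypothetical node: one image or one text snippet."""
    node_id: str
    modality: str  # "visual" or "textual"


@dataclass
class Hypergraph:
    """Toy hypergraph: each hyperedge connects any number of nodes."""
    nodes: dict = field(default_factory=dict)
    hyperedges: dict = field(default_factory=dict)

    def add_node(self, node_id, modality):
        self.nodes[node_id] = Node(node_id, modality)

    def add_hyperedge(self, edge_id, node_ids):
        # Refuse edges that mention nodes we have not registered.
        missing = [n for n in node_ids if n not in self.nodes]
        if missing:
            raise KeyError(f"unknown nodes: {missing}")
        self.hyperedges[edge_id] = frozenset(node_ids)

    def is_cross_modal(self, edge_id):
        # An edge is cross-modal if its members span more than one modality.
        modalities = {self.nodes[n].modality for n in self.hyperedges[edge_id]}
        return len(modalities) > 1

    def neighbors(self, node_id):
        # All nodes that share at least one hyperedge with node_id.
        out = set()
        for members in self.hyperedges.values():
            if node_id in members:
                out |= members
        out.discard(node_id)
        return out


def build_toy_graph():
    """Assumed example: an image linked to text, which links to more text."""
    g = Hypergraph()
    g.add_node("img_landmark", "visual")
    g.add_node("txt_architect", "textual")
    g.add_node("txt_birthplace", "textual")
    g.add_hyperedge("e1", ["img_landmark", "txt_architect"])    # across modalities
    g.add_hyperedge("e2", ["txt_architect", "txt_birthplace"])  # within one modality
    return g
```

In this toy graph, a question that starts from the image and asks about the birthplace would have to traverse both hyperedges, which loosely mirrors the idea of a QA pair that needs more than one kind of search to answer.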

Huanjin Yao, Qixiang Yin, Min Yang, Ziwang Zhao, Yibo Wang, Haotian Luo, Jingyi Zhang, Jiaxing Huang • 2026

Related benchmarks

Task	Dataset	Result
Visual Question Answering	SimpleVQA	Accuracy: 0.676	164
Visual Question Answering	LiveVQA	Accuracy: 68	116
Multimodal Search	MMSearch	Accuracy: 69	85
Fact-based Visual Question Answering	FVQA	Accuracy: 69.2	67
Multimodal Deep Search	BC-VL	Accuracy: 43	37
Agentic Search	MMSearch v1.0 (test)	Accuracy: 67.8	21
Multimodal Deep Search	MMSrch	Accuracy: 69	18
Agentic Search	BC-VL	Accuracy: 37.9	18
Information Seeking Question Answering	InfoSeek	Accuracy: 73.9	17
Web Browsing and Comparison	BrowseComp-VL	Accuracy: 43	17

