How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1

About

Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. While reinforcement learning (RL) has been shown to improve performance in this paradigm, its contributions remain underexplored. To fully understand the role of RL, we conduct a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. Our study reveals that: 1) the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work; 2) the F1-based reward underperforms the exact-match (EM) reward because answer avoidance drives training collapse; incorporating action-level penalties mitigates this collapse, and the penalized F1 reward ultimately surpasses EM; 3) REINFORCE outperforms PPO while requiring fewer search actions, whereas GRPO shows the poorest stability among the policy optimization methods. Building on these insights, we then introduce Search-R1++, a strong baseline that improves the performance of Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and from 0.289 to 0.331 (Qwen2.5-3B). We hope that our findings can pave the way for more principled and reliable RL training strategies in Deep Research systems.
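
To make the reward comparison in the abstract concrete, the following is a minimal sketch of an EM reward, a token-level F1 reward, and an F1 reward with a penalty for not answering. It assumes the common SQuAD-style answer normalization. The function names, the `answered` flag, and the `penalty` value are hypothetical illustrations, not the paper's implementation; the abstract does not specify the exact form of its action-level penalties.

```python
import re
import string
from collections import Counter
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def em_reward(pred: str, gold: str) -> float:
    """Exact match: 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def f1_reward(pred: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_toks = normalize(pred).split()
    gold_toks = normalize(gold).split()
    if not pred_toks or not gold_toks:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_toks == gold_toks)
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def penalized_f1_reward(
    pred: Optional[str], gold: str, answered: bool, penalty: float = 0.5
) -> float:
    """Hypothetical action-level penalty: a rollout that never emits a final
    answer receives -penalty instead of an F1 score, so avoiding the answer
    is always worse than attempting one."""
    if not answered or pred is None:
        return -penalty
    return f1_reward(pred, gold)
```

For example, `em_reward("The Eiffel Tower", "eiffel tower")` returns `1.0`, while `em_reward("Paris, France", "Paris")` returns `0.0` even though `f1_reward` gives the same pair partial credit. Under this sketch, a rollout with no final answer scores below any attempted answer, which is one plausible way to discourage the answer avoidance described above.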

Yinuo Xu, Shuo Lu, Jianjie Cheng, Meng Wang, Qianlong Xie, Xingxing Wang, Ran He, Jian Liang • 2026

Related benchmarks

Task	Dataset	Result
Multi-hop Question Answering	2Wiki	--	215
Multi-hop Question Answering	MuSiQue	--	209
Single-hop Question Answering	PopQA	--	186
Single-hop Question Answering	TriviaQA	--	133
Multi-hop Question Answering	Bamboogle	Accuracy 44.8	62
Multi-hop Question Answering	Multi-Hop QA (HotpotQA, 2Wiki, Musique, Bamboogle)	HotpotQA Score 32.5	54
Multi-hop Question Answering	HotpotQA	Accuracy 42.3	29
Single-hop Question Answering	Single-Hop QA (NQ, TriviaQA, PopQA)	NQ Score 42.7	13
Question Answering	QA Benchmark Suite Aggregate	Average Score 0.331	4

