How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1
About
Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. While reinforcement learning (RL) has been shown to improve performance in this paradigm, its contributions remain underexplored. To understand the role of RL, we conduct a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. Our study reveals that: 1) the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work; 2) the F1-based reward underperforms the exact-match (EM) reward because training collapses through answer avoidance; adding action-level penalties mitigates this collapse and lets the F1-based reward ultimately surpass EM; 3) REINFORCE outperforms PPO while requiring fewer search actions, whereas GRPO shows the poorest stability among the policy optimization methods studied. Building on these insights, we introduce Search-R1++, a strong baseline that improves the performance of Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and from 0.289 to 0.331 (Qwen2.5-3B). We hope that our findings can pave the way for more principled and reliable RL training strategies in Deep Research systems.
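To make the reward comparison concrete, here is a minimal sketch of an EM reward, a token-level F1 reward, and an F1 reward with a penalty for rollouts that never emit an answer. The normalization follows the common SQuAD-style convention, which is an assumption here rather than something the abstract specifies. The function `penalized_f1_reward` and its `no_answer_penalty` value are hypothetical illustrations of an action-level penalty, not the paper's actual formulation.

```python
import re
import string
from collections import Counter
from typing import List, Optional


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def em_reward(pred: str, gold: str) -> float:
    """Exact match: 1.0 only if the normalized token sequences are identical."""
    return float(normalize(pred) == normalize(gold))


def f1_reward(pred: str, gold: str) -> float:
    """Token-level F1 between prediction and gold answer (partial credit)."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def penalized_f1_reward(
    pred: Optional[str],
    gold: str,
    no_answer_penalty: float = 0.5,  # hypothetical value, chosen for illustration
) -> float:
    """F1 reward with an assumed action-level penalty for answer avoidance.

    pred is None when the rollout ended without emitting an answer action.
    """
    if pred is None:
        return -no_answer_penalty
    return f1_reward(pred, gold)


if __name__ == "__main__":
    print(em_reward("Paris, France", "Paris"))
    print(f1_reward("Paris, France", "Paris"))
    print(penalized_f1_reward(None, "Paris"))
```

In this sketch a rollout that never answers scores strictly below one that answers incorrectly, which is one plausible way to discourage answer avoidance. The penalties used in the study may be defined differently.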
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-hop Question Answering | MuSiQue | -- | -- | 106 |
| Single-hop Question Answering | TriviaQA | -- | -- | 62 |
| Single-hop Question Answering | PopQA | -- | -- | 55 |
| Multi-hop Question Answering | Bamboogle | Accuracy | 44.8 | 52 |
| Multi-hop Question Answering | 2Wiki | -- | -- | 41 |
| Multi-hop Question Answering | Multi-Hop QA (HotpotQA, 2Wiki, MuSiQue, Bamboogle) | HotpotQA Score | 32.5 | 39 |
| Multi-hop Question Answering | HotpotQA | Accuracy | 42.3 | 24 |
| Question Answering | QA Benchmark Suite Aggregate | Average Score | 0.331 | 4 |
| Single-hop Question Answering | Single-Hop QA (NQ, TriviaQA, PopQA) | NQ Score | 42.7 | 4 |