DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning
About
Information seeking demands iterative evidence gathering and reflective reasoning, yet large language models (LLMs) still struggle with it in open-web question answering. Existing prompting and supervised fine-tuning (SFT) methods remain fixed by prompt rules or training corpora, and are usually benchmarked only on well-structured wiki sources, limiting real-world adaptability. We introduce WebPuzzle, a 24k-sample training and 275-sample test benchmark that evaluates information seeking on the live internet, across both wiki and open-domain queries. Leveraging 7k WebPuzzle instances, we develop DeepDiver, a reinforcement-learning (RL) framework that cultivates Search Intensity Scaling (SIS), an emergent ability to escalate search frequency and depth instead of settling on overconfident, under-evidenced answers. With SIS, Qwen2.5-7B-Instruct and Pangu-7B-Reasoner attain performance on real-web tasks comparable to the 671B-parameter DeepSeek-R1. We detail DeepDiver's curriculum from cold-start SFT to a well-designed RL procedure, and show that its seeking policy generalizes from closed-ended queries to open-ended generation such as long-form writing. Our results advance adaptive information seeking in LLMs and provide a rigorous benchmark for future work.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Sort Edge | fundamental dynamic graph tasks Level 0 | Accuracy | 32 | 20 |
| Motif Classification | LLMTM 1.0 (test) | 3-star | 55 | 12 |
| Reverse Graph | fundamental dynamic graph tasks Level 0 | Accuracy | 31 | 10 |
| When Link and Dislink | fundamental dynamic graph tasks Level 0 | Accuracy | 45 | 10 |
| Motif Construction | Motif Construction various temporal motifs | 4-Chordal Cycle Accuracy | 60 | 9 |
| Motif Detection | Motif Detection | 3-star | 25 | 9 |
| Motif Occurrence Prediction | LLMTM Level 2 1.0 (test) | Accuracy | 0.75 | 9 |
| Multi-Motif Counting | LLMTM Level 2 1.0 (test) | Accuracy | 0.19 | 9 |
| Multi-Motif Detection | LLMTM Level 2 1.0 (test) | Accuracy | 10.71 | 9 |