DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning

About

Information seeking demands iterative evidence gathering and reflective reasoning, yet large language models (LLMs) still struggle with it in open-web question answering. Existing prompting and supervised fine-tuning (SFT) methods remain fixed by prompt rules or training corpora, and are usually benchmarked only on well-structured wiki sources, limiting real-world adaptability. We introduce WebPuzzle, a 24k-sample training and 275-sample test benchmark that evaluates information seeking on the live internet, across both wiki and open-domain queries. Leveraging 7k WebPuzzle instances, we develop DeepDiver, a reinforcement-learning (RL) framework that cultivates Search Intensity Scaling (SIS), an emergent ability to escalate search frequency and depth instead of settling on overconfident, under-evidenced answers. With SIS, Qwen2.5-7B-Instruct and Pangu-7B-Reasoner attain performance on real-web tasks comparable to the 671B-parameter DeepSeek-R1. We detail DeepDiver's curriculum from cold-start SFT to a well-designed RL procedure, and show that its seeking policy generalizes from closed-ended queries to open-ended generation such as long-form writing. Our results advance adaptive information seeking in LLMs and provide a rigorous benchmark for future work.

Wenxuan Shi, Haochen Tan, Chuqiao Kuang, Xiaoguang Li, Xiaozhe Ren, Chen Zhang, Hanting Chen, Yasheng Wang, Lu Hou, Lifeng Shang • 2025

Related benchmarks

Task	Dataset	Result
Sort Edge	fundamental dynamic graph tasks Level 0	Accuracy: 32	20
Motif Classification	LLMTM 1.0 (test)	3-star: 55	12
Reverse Graph	fundamental dynamic graph tasks Level 0	Accuracy: 31	10
When Link and Dislink	fundamental dynamic graph tasks Level 0	Accuracy: 45	10
Motif Construction	Motif Construction various temporal motifs	4-Chordal Cycle Accuracy: 60	9
Motif Detection	Motif Detection	3-star: 25	9
Motif Occurrence Prediction	LLMTM Level 2 1.0 (test)	Accuracy: 0.75	9
Multi-Motif Counting	LLMTM Level 2 1.0 (test)	Accuracy: 0.19	9
Multi-Motif Detection	LLMTM Level 2 1.0 (test)	Accuracy: 10.71	9
