DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning

About

Information seeking demands iterative evidence gathering and reflective reasoning, yet large language models (LLMs) still struggle with it in open-web question answering. Existing prompting and supervised fine-tuning (SFT) methods remain constrained by fixed prompt rules or training corpora, and are usually benchmarked only on well-structured wiki sources, limiting real-world adaptability. We introduce WebPuzzle, a benchmark with 24k training samples and 275 test samples that evaluates information seeking on the live internet, across both wiki and open-domain queries. Leveraging 7k WebPuzzle instances, we develop DeepDiver, a reinforcement-learning (RL) framework that cultivates Search Intensity Scaling (SIS), an emergent ability to escalate search frequency and depth instead of settling on overconfident, under-evidenced answers. With SIS, Qwen2.5-7B-Instruct and Pangu-7B-Reasoner attain performance on real-web tasks comparable to the 671B-parameter DeepSeek-R1. We detail DeepDiver's curriculum from cold-start SFT to a well-designed RL procedure, and show that its seeking policy generalizes from closed-ended queries to open-ended generation such as long-form writing. Our results advance adaptive information seeking in LLMs and provide a rigorous benchmark for future work.

Wenxuan Shi, Haochen Tan, Chuqiao Kuang, Xiaoguang Li, Xiaozhe Ren, Chen Zhang, Hanting Chen, Yasheng Wang, Lu Hou, Lifeng Shang • 2025

Related benchmarks

Task | Dataset | Result | Rank
Sort Edge | fundamental dynamic graph tasks Level 0 | Accuracy: 32 | 20
Motif Classification | LLMTM 1.0 (test) | 3-star: 55 | 12
Reverse Graph | fundamental dynamic graph tasks Level 0 | Accuracy: 31 | 10
When Link and Dislink | fundamental dynamic graph tasks Level 0 | Accuracy: 45 | 10
Motif Construction | Motif Construction various temporal motifs | 4-Chordal Cycle Accuracy: 60 | 9
Motif Detection | Motif Detection | 3-star: 25 | 9
Motif Occurrence Prediction | LLMTM Level 2 1.0 (test) | Accuracy: 0.75 | 9
Multi-Motif Counting | LLMTM Level 2 1.0 (test) | Accuracy: 0.19 | 9
Multi-Motif Detection | LLMTM Level 2 1.0 (test) | Accuracy: 10.71 | 9
