WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research

About

This paper tackles open-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. Current approaches suffer from two limitations: static research pipelines that decouple planning from evidence acquisition, and monolithic generation paradigms that include redundant, irrelevant evidence and therefore suffer from hallucination and low citation accuracy. To address these challenges, we introduce WebWeaver, a novel dual-agent framework that emulates the human research process. The planner operates in a dynamic cycle, iteratively interleaving evidence acquisition with outline optimization to produce a comprehensive, citation-grounded outline that links to a memory bank of evidence. The writer then executes a hierarchical retrieval and writing process, composing the report section by section. For each part, the writer retrieves only the necessary evidence from the memory bank via the outline's citations, which mitigates long-context issues and citation hallucinations. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym. These results validate our human-centric, iterative methodology, demonstrating that adaptive planning and focused synthesis are crucial for producing comprehensive, trusted, and well-structured reports.
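The planner/writer split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `MemoryBank`, `Section`, `plan`, and `write` are hypothetical, the search and composition steps are plain callables standing in for the web search tool and the language model, and the planner's outline optimization is reduced to attaching citation ids to fixed sections. It only shows the data flow: evidence goes into a memory bank under citation ids, the outline stores those ids, and the writer fetches only the cited evidence for each section.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemoryBank:
    """Hypothetical evidence store keyed by citation id."""
    items: dict[int, str] = field(default_factory=dict)

    def add(self, text: str) -> int:
        cid = len(self.items) + 1  # citation ids start at 1
        self.items[cid] = text
        return cid

    def fetch(self, cids: list[int]) -> list[str]:
        return [self.items[c] for c in cids]


@dataclass
class Section:
    title: str
    citations: list[int] = field(default_factory=list)


def plan(queries_by_section, search, bank):
    """Planner stand-in: acquire evidence, then update the outline with citations.

    Assumption: a real planner would also revise section titles and structure
    between search rounds; this sketch only attaches citation ids.
    """
    outline = []
    for title, queries in queries_by_section.items():
        section = Section(title)
        for query in queries:
            for snippet in search(query):
                section.citations.append(bank.add(snippet))
        outline.append(section)
    return outline


def write(outline, bank, compose):
    """Writer stand-in: compose section by section from cited evidence only."""
    parts = []
    for section in outline:
        evidence = bank.fetch(section.citations)  # targeted retrieval
        parts.append(compose(section, evidence))
    return "\n\n".join(parts)


# Deterministic stubs so the sketch runs without a search API or a model.
def fake_search(query):
    return [f"snippet about {query}"]


def fake_compose(section, evidence):
    cites = "".join(f"[{c}]" for c in section.citations)
    return f"{section.title}: {len(evidence)} source(s) {cites}"


bank = MemoryBank()
outline = plan(
    {"Background": ["OEDR"], "Methods": ["planner", "writer"]},
    fake_search,
    bank,
)
report = write(outline, bank, fake_compose)
print(report)
```

In this sketch each call to `compose` sees only the evidence cited by its own section, never the whole memory bank. That per-section scoping is the property the abstract credits with reducing long-context issues and citation hallucinations.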

Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, Jun Zhang, Jingren Zhou • 2025

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Deep Research Report Generation	DeepResearch Bench	Comprehensiveness	51.45	89
Comparative Performance Evaluation	DeepConsult	Win Rate	66.86	24
Report Generation	DeepResearch Bench	Overall Score	43.52	20
Report Generation	DeepResearch Bench 2025 (test)	Comprehensiveness	45.2	16
Deep Research	DeepResearch Bench (test)	Comprehensiveness	51.29	14
Report Generation	DeepResearch Gym (test)	Clarity	62.1	10
Open-Ended Deep Research	DeepResearchGym	Clarity	90.71	9
Open-Ended Deep Research	DeepConsult	Win Rate	61.27	9
Open-ended deep research evaluation	DeepResearch Bench 100 PhD-level research tasks	Comprehensiveness	51.45	9
Deep Research	DeepConsult (test)	Win Rate	66.16	8

