WebWalker: Benchmarking LLMs in Web Traversal

About

Retrieval-augmented generation (RAG) demonstrates remarkable performance across open-domain question-answering tasks. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal: it evaluates their capacity to systematically traverse a website's subpages and extract high-quality data. We also propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of combining RAG with WebWalker through horizontal and vertical integration in real-world scenarios.

Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang • 2025
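
The abstract describes traversing a website's subpages under an explore-critic paradigm. The sketch below is only a toy illustration of that general idea, not the authors' implementation: the in-memory `SITE`, the breadth-first `explore` function, and the keyword-matching `critic` are all hypothetical stand-ins for what would be LLM-driven agents in a real system.

```python
from collections import deque

# Hypothetical in-memory "website": page path -> (page text, links to subpages).
SITE = {
    "/": ("Welcome to the conference site.", ["/program", "/venue"]),
    "/program": ("Program overview.", ["/program/workshops"]),
    "/venue": ("The venue is in the city centre.", []),
    "/program/workshops": ("Workshop submission deadline: 1 March.", []),
}


def critic(query_terms, page_text, memory):
    """Stand-in critic (assumption: keyword matching replaces an LLM judgment).

    Stores pages that mention any query term, then reports whether every
    term is covered by the observations stored so far.
    """
    text = page_text.lower()
    if any(term in text for term in query_terms):
        memory.append(page_text)
    combined = " ".join(memory).lower()
    return all(term in combined for term in query_terms)


def explore(site, root, query_terms, max_steps=10):
    """Stand-in explorer (assumption: breadth-first order replaces an LLM's
    choice of which link to click).

    Returns (answered, memory, visited pages in visit order).
    """
    memory, visited = [], []
    queue, seen = deque([root]), {root}
    while queue and len(visited) < max_steps:
        page = queue.popleft()
        visited.append(page)
        text, links = site[page]
        # After each page, the critic decides whether to stop exploring.
        if critic(query_terms, text, memory):
            return True, memory, visited
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return False, memory, visited


answered, memory, visited = explore(SITE, "/", ["workshop", "deadline"])
```

In this toy site the relevant fact sits two clicks below the root page, so reading the landing page alone would not answer the query. This is the kind of depth the benchmark is meant to probe.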

Related benchmarks

Task	Dataset	Metric	Result	
Web navigation	WebVoyager	Success Rate	16.28	81
Web navigation	WebWalker Easy	Success Rate	58.75	25
Web navigation	WebWalker Medium	Success Rate (SR)	50	25
Web navigation	WebWalker Hard	Success Rate (SR)	30	25
Web Navigation Question Answering	WebWalker QA		--	23
Web-based Question Answering	WebWalkerQA Multi-source	Success Rate (Easy)	33.75	15
Web-based Question Answering	WebWalkerQA Single-source	Success Rate (Easy)	35	15
Web-based Question Answering	WebWalkerQA Full Set	Overall Success Rate	25.74	15
Web Navigation QA	WebVoyager	Average Action Count	7	15

