
WebWalker: Benchmarking LLMs in Web Traversal

About

Retrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal: it evaluates their capacity to systematically traverse a website's subpages and extract high-quality data. We also propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of combining RAG with WebWalker through horizontal and vertical integration in real-world scenarios.
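As a rough illustration of what an explore-critic traversal loop could look like, the sketch below walks a toy in-memory site breadth-first. After each visited page, a critic checks whether the accumulated memory is enough to answer. This is not the authors' implementation: the `SITE` dictionary and the `explore` and `critic` functions are hypothetical, and keyword matching stands in for what would be LLM calls in a real system.

```python
from collections import deque

# Hypothetical toy site: URL -> (page text, links to subpages).
SITE = {
    "/": ("Welcome to the conference site.", ["/program", "/venue"]),
    "/program": ("Program overview.", ["/program/deadlines"]),
    "/venue": ("The venue is in Vienna.", []),
    "/program/deadlines": ("Paper submission deadline: March 1.", []),
}


def critic(memory, keywords):
    """Return the first remembered text mentioning every keyword, else None.

    Assumption: a real critic would be an LLM judging whether the memory
    suffices to answer the query; keyword matching is a stand-in.
    """
    for text in memory:
        if all(k.lower() in text.lower() for k in keywords):
            return text
    return None


def explore(site, root, keywords, max_steps=10):
    """Breadth-first traversal; the critic inspects memory after each page."""
    queue, seen, memory = deque([root]), {root}, []
    steps = 0
    while queue and steps < max_steps:
        url = queue.popleft()
        steps += 1
        text, links = site[url]
        memory.append(text)
        answer = critic(memory, keywords)
        if answer is not None:
            return answer, steps
        # Enqueue unseen subpages so deeper content becomes reachable.
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return None, steps


if __name__ == "__main__":
    print(explore(SITE, "/", ["deadline"]))
```

The answer here sits two clicks below the root page, which is the kind of depth a single search-engine hit may miss. The `max_steps` budget is an assumed parameter that bounds how many pages the explorer may visit.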

Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, Fei Huang • 2025

Related benchmarks

Task | Dataset | Result | Rank
Web navigation | WebVoyager | Success Rate: 16.28 | 68
Web Navigation Question Answering | WebWalker QA | -- | 23
Web-based Question Answering | WebWalkerQA Multi-source | Success Rate (Easy): 33.75 | 15
Web-based Question Answering | WebWalkerQA Single-source | Success Rate (Easy): 35 | 15
Web-based Question Answering | WebWalkerQA Full Set | Overall Success Rate: 25.74 | 15
Web Navigation QA | WebVoyager | Average Action Count: 7 | 15
