WebWalker: Benchmarking LLMs in Web Traversal
About
Retrieval-augmented generation (RAG) demonstrates remarkable performance across open-domain question-answering tasks. However, traditional search engines may retrieve only shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to systematically traverse a website's subpages and extract high-quality data. We also propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of combining RAG with WebWalker through horizontal and vertical integration in real-world scenarios.
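To make the idea of traversing subpages under an explore-critic split more concrete, here is a minimal sketch. It is not the WebWalker implementation: the toy `SITE` dictionary, the `walk` and `critic` functions, the breadth-first click order, and the keyword matching that stands in for LLM calls are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy site: page -> (visible text, outgoing links). Not from the paper.
SITE = {
    "/": ("Welcome to the conference site.", ["/program", "/venue"]),
    "/program": ("Program overview.", ["/program/workshops"]),
    "/venue": ("The venue is the City Hall.", []),
    "/program/workshops": ("Workshop papers are due on May 5.", []),
}

def critic(query_terms, memory):
    """Stand-in critic: the answer is ready once every query term appears in memory."""
    text = " ".join(memory).lower()
    return all(term in text for term in query_terms)

def walk(root, query_terms, max_actions=10):
    """Stand-in explorer: breadth-first clicks from the root, bounded by max_actions."""
    queue, seen, memory, actions = deque([root]), {root}, [], 0
    while queue and actions < max_actions:
        page = queue.popleft()
        actions += 1  # one page visit counts as one action
        text, links = SITE[page]
        # Critic side: keep only observations that mention a query term.
        if any(term in text.lower() for term in query_terms):
            memory.append(text)
        if critic(query_terms, memory):
            return memory, actions
        # Explorer side: queue unseen subpages for later clicks.
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return memory, actions

memory, actions = walk("/", ["workshop", "due"])
print(memory, actions)
```

In this sketch the answer sits two clicks below the root, which is the kind of depth a single search-engine hit on the landing page would miss. A real system would replace both stand-ins with model calls and fetch live pages.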
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Web Navigation | WebVoyager | Success Rate | 16.28 | 68 |
| Web Navigation Question Answering | WebWalker QA | -- | -- | 23 |
| Web-based Question Answering | WebWalkerQA Multi-source | Success Rate (Easy) | 33.75 | 15 |
| Web-based Question Answering | WebWalkerQA Single-source | Success Rate (Easy) | 35 | 15 |
| Web-based Question Answering | WebWalkerQA Full Set | Overall Success Rate | 25.74 | 15 |
| Web Navigation QA | WebVoyager | Average Action Count | 7 | 15 |