A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis
About
Pre-trained large language models (LLMs) have recently achieved better generalization and sample efficiency in autonomous web automation. However, performance on real-world websites still suffers from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those. We design WebAgent with Flan-U-PaLM for grounded code generation, and with HTML-T5, new pre-trained LLMs for long HTML documents that use local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks, achieving an 18.7% higher success rate than the prior method on the MiniWoB web automation benchmark and SoTA performance on Mind2Web, an offline task planning evaluation.
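The plan, summarize, and act loop described above can be sketched as a small modular pipeline. This is a minimal illustration, not the authors' implementation: `run_episode`, `Step`, and the `toy_*` functions are hypothetical names, and the toy components use simple string matching where WebAgent would call HTML-T5 (planning and summarization) and Flan-U-PaLM (program synthesis).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    sub_instruction: str  # one decomposed sub-instruction
    snippet: str          # task-relevant part of the page HTML
    program: str          # generated program text for this step


def run_episode(
    instruction: str,
    get_html: Callable[[], str],
    plan: Callable[[str, List[Step], str], Optional[str]],
    summarize: Callable[[str, str], str],
    synthesize: Callable[[str, str], str],
    execute: Callable[[str], None],
    max_steps: int = 5,
) -> List[Step]:
    """Hypothetical control loop: plan -> summarize -> synthesize -> execute."""
    history: List[Step] = []
    for _ in range(max_steps):
        html = get_html()
        sub = plan(instruction, history, html)
        if sub is None:  # planner signals that the task is finished
            break
        snippet = summarize(sub, html)
        program = synthesize(sub, snippet)
        execute(program)
        history.append(Step(sub, snippet, program))
    return history


# Toy stand-ins (assumptions for illustration only, not learned models).
def toy_plan(instruction: str, history: List[Step], html: str) -> Optional[str]:
    # Split on " then " and return the next sub-instruction not yet handled.
    parts = [p.strip() for p in instruction.split(" then ")]
    return parts[len(history)] if len(history) < len(parts) else None


def toy_summarize(sub: str, html: str) -> str:
    # Keep only HTML lines that mention a word of the sub-instruction.
    words = sub.lower().split()
    kept = [ln for ln in html.splitlines() if any(w in ln.lower() for w in words)]
    return "\n".join(kept)


def toy_synthesize(sub: str, snippet: str) -> str:
    # Emit program text; a real agent would generate browser-automation code.
    return f"click({snippet!r})  # {sub}"


if __name__ == "__main__":
    page = "<button id=search>Search</button>\n<a id=login>Login</a>"
    executed: List[str] = []
    steps = run_episode(
        "search then login", lambda: page,
        toy_plan, toy_summarize, toy_synthesize, executed.append,
    )
    for s in steps:
        print(s.sub_instruction, "->", s.program)
```

Passing each module in as a callable mirrors the modular recipe: the planner, summarizer, and code generator can each be swapped without touching the control loop.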
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| GUI Navigation | Multimodal-Mind2Web Cross-Website | Step Success Rate | 62.2 | 32 |
| GUI Navigation | Multimodal-Mind2Web Cross-Task | Step Success Rate | 71.5 | 27 |
| GUI Navigation | Multimodal-Mind2Web Cross-Domain | Step Success Rate | 67.1 | 27 |
| Question Answering | WebSRC (dev) | EM | 76.91 | 26 |
| Web automation | MiniWoB++ 56 tasks (test) | Success Rate | 85.6 | 15 |
| Action Prediction | MIND2WEB Cross-Task 1.0 | Element Accuracy | 60.6 | 11 |
| Action Prediction | MIND2WEB Cross-Website 1.0 | Element Accuracy | 47.6 | 11 |
| Description Generation | Description Generation (test) | Accuracy | 98.9 | 9 |
| Description Generation | Description Generation (dev) | Accuracy | 98.4 | 9 |
| Offline Action Prediction | Mind2Web Cross-Domain v1.0 (test) | Element Accuracy | 50.2 | 4 |