Scaling Web Agent Training through Automatic Data Generation and Fine-grained Evaluation

About

We present a scalable pipeline for automatically generating high-quality training data for web agents. A major challenge in identifying high-quality training instances is trajectory evaluation: quantifying how much progress a trajectory made towards task completion. We introduce a novel constraint-based evaluation framework that provides a fine-grained assessment of this progress. This enables us to leverage partially successful trajectories, which significantly expands the amount of usable training data. We evaluate our method on BookingArena, a new benchmark we propose that consists of complex booking tasks across 20 popular websites, and demonstrate that our distilled student model outperforms open-source approaches and matches or exceeds commercial systems while being significantly smaller. Our work addresses the challenge of efficiently creating diverse, realistic web interaction datasets and provides a systematic evaluation methodology for complex structured web tasks.
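
The abstract does not spell out how the constraint-based evaluation is computed, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a task can be written as a list of checkable constraints and that a trajectory's progress is the fraction of constraints its final state satisfies. The names `Constraint`, `progress_score`, and `select_trajectories`, the example booking fields, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration; none of these names come from the paper.
State = Dict[str, object]


@dataclass(frozen=True)
class Constraint:
    """One checkable requirement of a task, e.g. 'destination is Paris'."""
    name: str
    check: Callable[[State], bool]


def progress_score(constraints: List[Constraint], state: State) -> float:
    """Return the fraction of constraints satisfied by the final state (assumed metric)."""
    if not constraints:
        return 0.0
    satisfied = sum(1 for c in constraints if c.check(state))
    return satisfied / len(constraints)


def select_trajectories(scored: List[Tuple[str, float]], threshold: float = 0.5) -> List[str]:
    """Keep trajectories whose score meets the threshold, including partial successes."""
    return [traj_id for traj_id, score in scored if score >= threshold]


# A made-up booking task with four constraints.
constraints = [
    Constraint("destination", lambda s: s.get("destination") == "Paris"),
    Constraint("guests", lambda s: s.get("guests") == 2),
    Constraint("check_in", lambda s: s.get("check_in") == "2026-03-01"),
    Constraint("check_out", lambda s: s.get("check_out") == "2026-03-05"),
]

# A made-up final state in which the agent got the check-out date wrong.
final_state = {
    "destination": "Paris",
    "guests": 2,
    "check_in": "2026-03-01",
    "check_out": "2026-03-04",
}

score = progress_score(constraints, final_state)  # 3 of 4 constraints hold
kept = select_trajectories([("traj_a", score), ("traj_b", 0.25)])
```

Under a binary success criterion the trajectory above would be discarded; a graded score of this kind is what allows a partially successful trajectory to be retained as training data.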

Lajanugen Logeswaran, Jaekyeom Kim, Sungryull Sohn, Creighton Glasscock, Honglak Lee • 2026

Related benchmarks

Task	Dataset	Result
Web navigation	WebVoyager OOD	Allrecipes 6.6	30
Web navigation	WebVoyager 1.0 (test)	Allrecipes 62.5	12
Web Agent Navigation	BookingArena 1.0 (test)	Booking.com Success Rate 50	5
