Structured Distillation of Web Agent Capabilities Enables Generalization

About

Frontier LLMs can navigate complex websites, but their cost and reliance on third-party APIs make local deployment impractical. We introduce Agent-as-Annotators, a framework that structures synthetic trajectory generation for web agents by analogy to human annotation roles, replacing the Task Designer, Annotator, and Supervisor with modular LLM components. Using Gemini 3 Pro as teacher, we generate 3,000 trajectories across six web environments and fine-tune a 9B-parameter student with pure supervised learning on the 2,322 that pass quality filtering. The resulting model achieves 41.5% on WebArena, surpassing closed-source models such as Claude 3.5 Sonnet (36.0%) and GPT-4o (31.5%) under the same evaluation protocol, and nearly doubling the previous best open-weight result (Go-Browse, 21.7%). Capabilities transfer to unseen environments, with an 18.2 percentage point gain on WorkArena L1 (an enterprise platform never seen during training) and consistent improvements across three additional benchmarks. Ablations confirm that each pipeline component contributes meaningfully, with Judge filtering, evaluation hints, and reasoning traces each accounting for measurable gains. These results demonstrate that structured trajectory synthesis from a single frontier teacher is sufficient to produce competitive, locally deployable web agents. Project page: https://agent-as-annotators.github.io
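The generate-then-filter structure described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Trajectory`, `synthesize`, `design_task`, `annotate`, and `judge` are hypothetical and not taken from the paper's code, and the toy functions stand in for the LLM components the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    environment: str   # web environment the task targets
    task: str          # natural-language task produced by the task designer
    steps: List[str]   # actions (and reasoning traces) produced by the annotator


def synthesize(
    environments: List[str],
    tasks_per_env: int,
    design_task: Callable[[str, int], str],
    annotate: Callable[[str, str], List[str]],
    judge: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    """Generate one trajectory per task and keep only those the judge accepts."""
    kept: List[Trajectory] = []
    for env in environments:
        for i in range(tasks_per_env):
            task = design_task(env, i)
            steps = annotate(env, task)
            traj = Trajectory(environment=env, task=task, steps=steps)
            if judge(traj):  # quality filtering before fine-tuning
                kept.append(traj)
    return kept


# Hypothetical toy stand-ins for the LLM components (assumptions, not the paper's).
def toy_designer(env: str, i: int) -> str:
    return f"{env} task {i}"


def toy_annotator(env: str, task: str) -> List[str]:
    # Simulate a failed rollout for tasks whose index ends in 0.
    return [] if task.endswith("0") else ["click(link)", "stop()"]


def toy_judge(traj: Trajectory) -> bool:
    # Accept only trajectories that contain at least one step.
    return len(traj.steps) > 0


kept = synthesize(["shopping", "forum"], 10, toy_designer, toy_annotator, toy_judge)
retention = len(kept) / 20
```

For scale, the figures in the abstract (2,322 trajectories kept out of 3,000 generated) correspond to a retention rate of 77.4%.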

Xing Han Lù, Siva Reddy • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Web navigation | WebArena | Overall Success Rate | 41.5 | 48 |
| Enterprise interface task completion | WorkArena L1 | Task Success Rate | 51.5 | 14 |
| Web-based interaction | MiniWoB | Success Rate | 69 | 14 |
| Web task automation | VisualWebArena full | SR | 33.9 | 12 |
| Web Agent Navigation | WebArena (test) | Success Rate | 41.5 | 10 |
| Web Agent Navigation | MiniWoB (full) | Success Rate | 69 | 10 |
| Web Agent Navigation | WorkArena L1 (full) | Success Rate | 51.5 | 10 |
| Web Agent Navigation | WorkArena L2 147-task (test) | Success Rate | 9.7 | 10 |
| Enterprise interface task completion | WorkArena++ L2 | Success Rate | 9.7 | 9 |
| Web Agent Navigation | VisualWebArena (test) | Success Rate | 33.9 | 7 |
Showing 10 of 14 rows
