Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection

About

Large language models (LLMs) have enabled web agents that follow natural language goals through multi-step browser interactions. However, agents fine-tuned on specific trajectories and domains often struggle to generalize out of domain, and offline training can be compute-inefficient due to noisy, redundant trajectories and long accessibility-tree (AXTree) states. To address both issues, we propose Weasel, a trajectory selection method for offline training of web agents. Weasel selects a fixed-budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns; the objective is solved efficiently with a greedy algorithm. We further improve efficiency with target-centered AXTree pruning that keeps only content around the ground-truth action target, and we mitigate style mismatch for reasoning-native models by replacing expert traces with model-generated, style-consistent rationales. Across the AgentTrek and NNetNav training datasets, evaluations on WebArena, WorkArena, and MiniWob, and experiments with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B, Weasel improves out-of-domain performance while reducing training cost, producing roughly 9.7-12.5$\times$ training speedups over standard fine-tuning. We make the code available at https://github.com/fatemehpesaran310/weasel.
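
The abstract does not state the exact objective, so the following is only a minimal sketch of greedy importance-diversity selection under assumed definitions. It assumes each trajectory step has a scalar importance score and that a pairwise similarity matrix is available, and it greedily maximizes the sum of importance minus a weighted sum of pairwise similarity within the selected set. The names `greedy_select`, `importance`, `similarity`, `budget`, and `lam` are hypothetical and are not taken from the Weasel code base.

```python
from typing import List, Sequence


def greedy_select(
    importance: Sequence[float],
    similarity: Sequence[Sequence[float]],
    budget: int,
    lam: float = 1.0,
) -> List[int]:
    """Pick up to `budget` item indices, trading importance against redundancy.

    Assumed objective (illustrative only): sum of importance over the chosen
    set minus `lam` times the sum of pairwise similarity inside the set.
    """
    n = len(importance)
    budget = min(budget, n)
    selected: List[int] = []
    # penalty[i] accumulates similarity between item i and all chosen items.
    penalty = [0.0] * n
    remaining = set(range(n))

    for _ in range(budget):
        # Marginal gain of adding i: its importance minus redundancy so far.
        # Iterating in sorted order makes ties resolve to the lowest index.
        best = max(
            sorted(remaining),
            key=lambda i: importance[i] - lam * penalty[i],
        )
        selected.append(best)
        remaining.remove(best)
        # Update redundancy penalties for the items still available.
        for i in remaining:
            penalty[i] += similarity[best][i]

    return selected


# Toy example: items 0 and 1 are near-duplicates, item 2 is distinct.
toy_importance = [1.0, 0.9, 0.5]
toy_similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
with_diversity = greedy_select(toy_importance, toy_similarity, budget=2, lam=1.0)
importance_only = greedy_select(toy_importance, toy_similarity, budget=2, lam=0.0)
```

In this toy setting the diversity term steers the second pick away from the near-duplicate item, whereas setting `lam` to zero reduces the procedure to plain top-k selection by importance.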

Fatemeh Pesaran Zadeh, Seyeon Choi, Xing Han Lù, Siva Reddy, Gunhee Kim • 2026

Related benchmarks

Task	Dataset	Result
Web Agent	WebArena	Success Rate 19.2	56
Web-based interaction	MiniWoB	Success Rate 61.9	32
Web Agent	WebArena Lite	SR 21.2	18
Web Agent	WorkArena L1	Success Rate 38.8	18
Web Agent	WorkArena L2	Success Rate 4.7	18
Web navigation	WorkArena L1	Success Rate 7.6	12
Web navigation	WebArena Lite	Success Rate (SR) 12.1	5
Web navigation	WorkArena L2	Success Rate 6.8	5
Web navigation	MiniWoB	Success Rate 41.8	5
Multimodal GUI-agent action execution	Android in the Wild (AITW) held-out 500-example (test)	AITW Accuracy 6.6	3
