Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection
About
Large language models (LLMs) have enabled web agents that follow natural language goals through multi-step browser interactions. However, agents fine-tuned on specific trajectories and domains often struggle to generalize out of domain, and offline training can be compute-inefficient due to noisy, redundant trajectories and long accessibility-tree (AXTree) states. To address both issues, we propose Weasel, a trajectory selection method for offline training of web agents. Weasel selects a fixed-budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns, and it solves this objective efficiently with a greedy algorithm. We further improve efficiency with target-centered AXTree pruning that keeps only content around the ground-truth action target, and we mitigate style mismatch for reasoning-native models by replacing expert traces with model-generated, style-consistent rationales. Across the AgentTrek and NNetNav training datasets, evaluations on WebArena, WorkArena, and MiniWoB, and experiments with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B, Weasel improves out-of-domain performance while reducing training cost, yielding roughly 9.7-12.5$\times$ training speedups over standard fine-tuning. We make the code available at https://github.com/fatemehpesaran310/weasel.
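The greedy selection described above can be illustrated with a minimal sketch. It is not the released implementation. It assumes a simple objective in which each step has a precomputed unary importance score and a pairwise similarity to every other step, and the marginal gain of a candidate is its importance minus a weighted sum of its similarities to the steps already chosen. The names `greedy_select`, `importance`, `similarity`, and `diversity_weight` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def greedy_select(
    importance: Sequence[float],
    similarity: Sequence[Sequence[float]],
    budget: int,
    diversity_weight: float = 1.0,
) -> List[int]:
    """Greedily pick up to `budget` indices, trading importance against redundancy.

    Assumed objective (illustrative only): sum of importance over the selected
    set minus diversity_weight times the sum of pairwise similarities in it.
    """
    n = len(importance)
    budget = min(budget, n)
    selected: List[int] = []
    # penalty[i] = summed similarity between candidate i and the items chosen so far
    penalty = [0.0] * n
    remaining = set(range(n))
    for _ in range(budget):
        # Marginal gain of adding i; ties are broken in favor of the lowest index.
        best = max(
            remaining,
            key=lambda i: (importance[i] - diversity_weight * penalty[i], -i),
        )
        selected.append(best)
        remaining.remove(best)
        # Update redundancy penalties incrementally instead of recomputing them.
        for i in remaining:
            penalty[i] += similarity[best][i]
    return selected


# Toy example: steps 0 and 1 are near-duplicates, step 2 is distinct.
importance = [1.0, 0.9, 0.5]
similarity = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.0, 0.1, 1.0],
]
with_diversity = greedy_select(importance, similarity, budget=2)
importance_only = greedy_select(importance, similarity, budget=2, diversity_weight=0.0)
```

In this toy example, the diversity term causes the less important but distinct step 2 to be preferred over the near-duplicate step 1, whereas setting `diversity_weight=0.0` reduces the procedure to picking the top-scoring steps.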
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Web Agent | WebArena | Success Rate | 19.2 | 36 |
| Web-based interaction | MiniWoB | Success Rate | 61.9 | 32 |
| Web Agent | WebArena Lite | SR | 21.2 | 18 |
| Web Agent | WorkArena L1 | Success Rate | 38.8 | 18 |
| Web Agent | WorkArena L2 | Success Rate | 4.7 | 18 |
| Web navigation | WebArena Lite | Success Rate (SR) | 12.1 | 5 |
| Web navigation | WorkArena L1 | Success Rate | 7.6 | 5 |
| Web navigation | WorkArena L2 | Success Rate | 6.8 | 5 |
| Web navigation | MiniWoB | Success Rate | 41.8 | 5 |
| Multimodal GUI-agent action execution | Android in the Wild (AITW) held-out 500-example (test) | AITW Accuracy | 6.6 | 3 |