
Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection

About

Large language models (LLMs) have enabled web agents that follow natural language goals through multi-step browser interactions. However, agents fine-tuned on specific trajectories and domains often struggle to generalize out of domain, and offline training can be compute-inefficient due to noisy, redundant trajectories and long accessibility-tree (AXTree) states. To address both issues, we propose Weasel, a trajectory selection method for offline training of web agents. Weasel selects a fixed-budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns; the objective is solved efficiently with a greedy algorithm. We further improve efficiency with target-centered AXTree pruning that keeps only content around the ground-truth action target, and we mitigate style mismatch for reasoning-native models by replacing expert traces with model-generated, style-consistent rationales. Across the AgentTrek and NNetNav training datasets, evaluations on WebArena, WorkArena, and MiniWob, and experiments with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B, Weasel improves out-of-domain performance while reducing training cost, producing roughly 9.7-12.5× training speedups over standard fine-tuning. We make the code available at https://github.com/fatemehpesaran310/weasel.

Fatemeh Pesaran Zadeh, Seyeon Choi, Xing Han Lù, Siva Reddy, Gunhee Kim • 2026
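
The abstract describes selecting a fixed-budget subset by balancing unary importance against pairwise diversity with a greedy algorithm. The following is a minimal illustrative sketch of that general idea, not the authors' implementation. The function names, the cosine-distance diversity measure, the trade-off weight `lam`, and the additive objective (sum of importance scores plus `lam` times the sum of pairwise distances) are all assumptions made here for illustration; the abstract does not specify them.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two vectors (assumed diversity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0  # treat a zero vector as maximally distant
    return 1.0 - dot / (na * nb)


def greedy_select(importance, features, budget, lam=1.0):
    """Greedily pick up to `budget` item indices.

    Assumed objective: sum of importance over the selected set plus
    lam * sum of pairwise distances within the selected set.
    Each round adds the item with the largest marginal gain.
    """
    n = len(importance)
    budget = min(budget, n)
    selected = []
    remaining = set(range(n))
    # div_gain[i] = total distance from item i to everything selected so far
    div_gain = [0.0] * n
    for _ in range(budget):
        # Iterate in index order so ties resolve to the lowest index.
        best = max(sorted(remaining),
                   key=lambda i: importance[i] + lam * div_gain[i])
        selected.append(best)
        remaining.remove(best)
        for i in remaining:
            div_gain[i] += cosine_distance(features[i], features[best])
    return selected


# Hypothetical toy data: items 0 and 1 are near-duplicates, item 2 differs.
scores = [1.0, 0.9, 0.5]
vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(greedy_select(scores, vectors, budget=2, lam=1.0))  # -> [0, 2]
print(greedy_select(scores, vectors, budget=2, lam=0.0))  # -> [0, 1]
```

With `lam=0.0` the selection reduces to a plain top-k by importance, so the redundant item 1 is kept; with `lam=1.0` the diversity term steers the second pick toward the dissimilar item 2.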

Related benchmarks

Task | Dataset | Metric | Result | Rank
Web Agent | WebArena | Success Rate | 19.2 | 36
Web-based interaction | MiniWoB | Success Rate | 61.9 | 32
Web Agent | WebArena Lite | SR | 21.2 | 18
Web Agent | WorkArena L1 | Success Rate | 38.8 | 18
Web Agent | WorkArena L2 | Success Rate | 4.7 | 18
Web navigation | WebArena Lite | Success Rate (SR) | 12.1 | 5
Web navigation | WorkArena L1 | Success Rate | 7.6 | 5
Web navigation | WorkArena L2 | Success Rate | 6.8 | 5
Web navigation | MiniWoB | Success Rate | 41.8 | 5
Multimodal GUI-agent action execution | Android in the Wild (AITW) held-out 500-example (test) | AITW Accuracy | 6.6 | 3
