Scaling Agents for Computer Use

About

Computer-use agents (CUAs) hold promise for automating everyday digital tasks, but their performance on long-horizon, complex problems remains unreliable. Single-rollout execution is brittle: small errors compound over time and lead to high variance in outcomes. Prior work has attempted to scale within a single rollout, but such approaches have yielded limited gains. Scaling over multiple rollouts offers a more promising alternative, but doing so effectively is challenging because long-horizon agent behaviors are difficult to evaluate and select among. We introduce Behavior Judge (BJudge), which addresses this challenge by representing agent executions as behavior narratives and comparing candidate behaviors at this level, substantially improving robustness and success rates. Using multiple rollouts, BJudge establishes a new state of the art (SoTA) on OSWorld at 72.6%, significantly outperforming prior methods and surpassing human-level performance at 72.36%, with comprehensive ablations validating key design choices. We further demonstrate strong generalization to different operating systems on WindowsAgentArena and AndroidWorld. Crucially, our results highlight how effective scaling CUAs can be when done right: effective scaling requires structured trajectory understanding and selection, and BJudge provides a practical framework to achieve this.
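The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the narrative format or how the judge compares candidates, so `Rollout`, `narrate`, `select_best`, and `toy_judge` are hypothetical names, and the keyword-based judge is only a stand-in for whatever model would do the comparison in practice.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Rollout:
    """One complete attempt at a task, as (action, observed change) steps."""
    steps: List[Tuple[str, str]]


def narrate(rollout: Rollout) -> str:
    """Condense a rollout into a short behavior narrative (assumed format)."""
    return "; ".join(f"{action} -> {change}" for action, change in rollout.steps)


def select_best(
    rollouts: Sequence[Rollout],
    judge: Callable[[str, List[str]], int],
    task: str,
) -> Rollout:
    """Return the rollout whose narrative the judge prefers for the task."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    narratives = [narrate(r) for r in rollouts]
    index = judge(task, narratives)  # judge compares all candidates together
    if not 0 <= index < len(rollouts):
        raise ValueError("judge returned an out-of-range index")
    return rollouts[index]


def toy_judge(task: str, narratives: List[str]) -> int:
    """Stand-in judge: picks the first narrative containing 'saved', else index 0."""
    for i, narrative in enumerate(narratives):
        if "saved" in narrative:
            return i
    return 0


# Two independent attempts at the same task; only the second achieves the goal.
rollouts = [
    Rollout([("click File", "menu opened"), ("click Close", "document closed")]),
    Rollout([("click File", "menu opened"), ("click Save", "document saved")]),
]
best = select_best(rollouts, toy_judge, "Save the document")
print(narrate(best))
```

The point of the sketch is the shape of the pipeline: run several rollouts independently, summarize each one, and let a comparative judge choose among the summaries rather than among raw trajectories.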

Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, Xin Eric Wang • 2025

Related benchmarks

Task                              Dataset                         Result                         Rank
GUI Agent Task                    AndroidWorld                    Success Rate: 68.1             104
GUI Navigation and Action         OS World (test)                 Success Rate (OS): 79.17       26
OS GUI Agentic Task Execution     OSWorld 361 tasks (Verified)    Average Success Rate: 62.63    21
Windows UI Navigation             WindowsAgentArena (WAA)         Success Rate: 56.6             14
GUI Automation                    WindowsAgentArena               --                             11
Operating System Agent Control    WindowsAgentArena               Success Rate: 0.566            8
