Scaling Agents for Computer Use

About

Computer-use agents (CUAs) hold promise for automating everyday digital tasks, but their performance on long-horizon, complex problems remains unreliable. Single-rollout execution is brittle: small errors compound over time and lead to high variance in outcomes. Prior work has attempted to scale within a single rollout, but such approaches have yielded limited gains. Scaling over multiple rollouts offers a more promising alternative, but doing so effectively is challenging because long-horizon agent behaviors are difficult to evaluate and select among. We introduce Behavior Judge (BJudge), which addresses this challenge by representing agent executions as behavior narratives and comparing candidate behaviors at this level, substantially improving robustness and success rates. Using multiple rollouts, BJudge establishes a new state of the art (SoTA) on OSWorld at 72.6%, significantly outperforming prior methods and surpassing human-level performance (72.36%), with comprehensive ablations validating key design choices. We further demonstrate strong generalization to other operating systems on WindowsAgentArena and AndroidWorld. Crucially, our results highlight how effective scaling CUAs can be when done right: it requires structured trajectory understanding and selection, and BJudge provides a practical framework to achieve this.
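The selection idea in the abstract can be illustrated with a minimal sketch: run several rollouts, turn each into a narrative, and let a judge choose among the narratives. Everything named here (`Rollout`, `to_narrative`, `select_best`, `toy_judge`) is hypothetical and not taken from the paper. The abstract does not say how narratives are produced or how candidates are compared, so the narrative builder and the judge below are deliberately trivial stand-ins. The last function only states the standard probability that at least one of `n` independent rollouts succeeds, which is an upper bound on what any selector could achieve under that independence assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rollout:
    """One complete agent attempt at a task (hypothetical structure)."""
    rollout_id: int
    steps: List[str]  # human-readable descriptions of what the agent did


def to_narrative(rollout: Rollout) -> str:
    """Summarize a rollout as a single narrative string.

    Assumption: a plain join of step descriptions. The paper's actual
    behavior narratives are not described in the abstract.
    """
    return " -> ".join(rollout.steps)


# A judge receives the task and all candidate narratives and returns
# the index of the narrative it prefers (hypothetical interface).
Judge = Callable[[str, Sequence[str]], int]


def select_best(rollouts: Sequence[Rollout], judge: Judge, task: str) -> Rollout:
    """Return the rollout whose narrative the judge prefers."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    narratives = [to_narrative(r) for r in rollouts]
    index = judge(task, narratives)
    if not 0 <= index < len(rollouts):
        raise ValueError(f"judge returned out-of-range index {index}")
    return rollouts[index]


def toy_judge(task: str, narratives: Sequence[str]) -> int:
    """Stand-in judge: prefers the narrative mentioning 'save' most often.

    A real judge would presumably be a model comparing narratives
    against the task; this keyword count is only for illustration.
    """
    scores = [n.count("save") for n in narratives]
    return scores.index(max(scores))  # first index wins on ties


def any_success_probability(p: float, n: int) -> float:
    """Probability that at least one of n independent rollouts succeeds,
    given per-rollout success probability p."""
    if not 0.0 <= p <= 1.0 or n < 1:
        raise ValueError("require 0 <= p <= 1 and n >= 1")
    return 1.0 - (1.0 - p) ** n
```

For example, given one rollout that only types text and another that also saves the file, `select_best(rollouts, toy_judge, "Write a note and save it")` returns the second. The gap between `any_success_probability(p, n)` and the success rate actually achieved reflects how often the judge fails to pick a successful rollout when one exists.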

Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, Xin Eric Wang • 2025

Related benchmarks

Task	Dataset	Metric	Result	
GUI Agent Task	AndroidWorld	Success Rate	68.1	200
OS GUI Agentic Task Execution	OSWorld 361 tasks (Verified)	OS Success Rate	77.5	43
GUI Navigation and Action	OS World (test)	Success Rate (Avg)	72.58	41
Windows UI Navigation	WindowsAgentArena (WAA)	Success Rate	56.6	33
GUI Navigation	OSWorld (Verified)	OS Success Rate	77.5	32
GUI Automation	OSWorld	Overall Success Rate	72.6	28
Grounding	OSWorld	Overall Score	47.5	22
GUI Automation	WindowsAgentArena	--	--	14
Desktop automation	WindowsAgentArena (WAA) v1 (test)	Overall Score	56.6	13
Operating System Agent Control	WindowsAgentArena	Success Rate	0.566	11

Showing 10 of 18 rows
