
Scaling Agents for Computer Use

About

Computer-use agents (CUAs) hold promise for automating everyday digital tasks, but their performance on long-horizon, complex problems remains unreliable. Single-rollout execution is brittle, with small errors compounding over time and leading to high variance in outcomes. While prior work has attempted to scale within a single rollout, such approaches have yielded limited gains. Scaling over multiple rollouts offers a more promising alternative but doing so effectively is challenging due to the difficulty of evaluating and selecting among long-horizon agent behaviors. We introduce Behavior Judge (BJudge), which addresses this challenge by representing agent executions as behavior narratives and comparing candidate behaviors at this level, substantially improving robustness and success rates. Using multiple rollouts, BJudge establishes a new state of the art (SoTA) in OSWorld at 72.6%, significantly outperforming prior methods and surpassing human-level performance at 72.36%, with comprehensive ablations validating key design choices. We further demonstrate strong generalization results to different operating systems on WindowsAgentArena and AndroidWorld. Crucially, our results highlight the strong effectiveness of scaling CUAs, when you do it right: effective scaling requires structured trajectory understanding and selection, and BJudge provides a practical framework to achieve this.

Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, Xin Eric Wang • 2025
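
The abstract describes scaling over multiple rollouts and selecting among them by comparing behavior narratives. As a rough illustration only, the selection step might be sketched as below. This is not the authors' implementation: the `Rollout` container, the `Judge` signature, `select_best`, and the toy `keyword_judge` are all hypothetical names introduced here, and the keyword-overlap judge is merely a stand-in for whatever model-based comparison BJudge actually performs.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Rollout:
    """One complete agent attempt at a task (hypothetical container)."""
    rollout_id: int
    narrative: str  # text summary of what the agent did during the attempt


# Assumed interface: a judge receives the task description and all candidate
# narratives, and returns the index of the narrative it prefers.
Judge = Callable[[str, Sequence[str]], int]


def select_best(task: str, rollouts: Sequence[Rollout], judge: Judge) -> Rollout:
    """Pick one rollout out of several by comparing their narratives."""
    if not rollouts:
        raise ValueError("need at least one rollout to select from")
    index = judge(task, [r.narrative for r in rollouts])
    if not 0 <= index < len(rollouts):
        raise ValueError(f"judge returned an out-of-range index: {index}")
    return rollouts[index]


def keyword_judge(task: str, narratives: Sequence[str]) -> int:
    """Toy judge: prefers the narrative sharing the most words with the task.

    A real judge would presumably be a model reasoning over the narratives;
    this stand-in exists only so the sketch runs end to end.
    """
    task_words = set(task.lower().split())
    scores = [len(task_words & set(n.lower().split())) for n in narratives]
    return scores.index(max(scores))  # ties resolve to the earliest candidate


if __name__ == "__main__":
    task = "rename report.txt to final.txt"
    candidates = [
        Rollout(0, "opened the file manager and closed it"),
        Rollout(1, "selected report.txt and chose rename then typed final.txt"),
    ]
    print(select_best(task, candidates, keyword_judge).rollout_id)
```

Keeping the judge behind a plain callable means the selection logic stays the same whether the comparison is a heuristic, as here, or a model call.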

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| GUI Agent Task | AndroidWorld | Success Rate | 68.1 | 136 |
| OS GUI Agentic Task Execution | OSWorld 361 tasks (Verified) | OS Success Rate | 77.5 | 43 |
| GUI Navigation and Action | OS World (test) | Success Rate (OS) | 79.17 | 26 |
| Grounding | OSWorld | Overall Score | 47.5 | 22 |
| Windows UI Navigation | WindowsAgentArena (WAA) | Success Rate | 56.6 | 14 |
| GUI Automation | WindowsAgentArena | -- | -- | 11 |
| Fine-Grained Action Execution | OSExpert-Eval | GIMP Execution Time (s) | 702 | 10 |
| Long-Horizon Composite Skills | OSExpert-Eval | Execution Time (GIMP) | 998 | 10 |
| Unseen UI Generalization | OSExpert-Eval | Execution Time (Tableau, s) | 623 | 10 |
| Operating System Agent Control | WindowsAgentArena | Success Rate | 0.566 | 8 |
Showing 10 of 14 rows
