ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
About
Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| GUI Grounding | ScreenSpot Pro | Average Score | 40.8 | 458 |
| GUI Agent Task | AndroidWorld | Success Rate | 23.7 | 188 |
| Mobile Task Automation | AndroidWorld (test) | Average Success Rate | 0.237 | 119 |
| GUI Automation | OSWorld Verified (test) | Overall Success Rate | 15 | 40 |
| GUI Web Agent Navigation | Mind2web Online | Overall Average Score | 23.7 | 37 |
| GUI Navigation | AndroidWorld latest (test) | Success Rate | 23.7 | 35 |
| Windows UI Navigation | WindowsAgentArena (WAA) | Success Rate | 24.2 | 33 |
| GUI Agent Task Success | AndroidWorld (online) | Task Success Rate | 32.2 | 25 |
| Action Prediction | AndroidControl Low v2 | -- | -- | 22 |
| Step Accuracy | AndroidControl High Level v2 | Pass@1 | 56.5 | 20 |