OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

About

Existing efforts in building GUI agents rely heavily on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas, a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, macOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| GUI Grounding | ScreenSpot v2 | Avg Accuracy | 87.11 | 203 |
| GUI Grounding | ScreenSpot Pro | Average Score | 1.89e+3 | 169 |
| GUI Grounding | ScreenSpot | Avg Acc | 85.14 | 76 |
| GUI Grounding | OSWorld-G | Average Score | 27.7 | 74 |
| GUI Action Execution | GUI-EDA | Acoustic Score (COMSOL) | 51 | 60 |
| GUI Grounding | OSWorld-G (test) | Element Accuracy | 29.4 | 52 |
| Mobile GUI Automation | GUI-Odyssey | Success Rate (SR) | 62 | 50 |
| GUI Grounding | MMBench-GUI L2 (test) | Error (Windows, Basic) | 36.9 | 46 |
| GUI Grounding | UI-Vision (test) | Basic Score | 12.2 | 43 |
| GUI Grounding | UI-Vision | Basic Score | 12.2 | 38 |
Showing 10 of 68 rows
