
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

About

Existing efforts in building GUI agents rely heavily on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas, a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, macOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| GUI Grounding | ScreenSpot Pro | Average Score 1.89e+3 | 307 |
| GUI Grounding | ScreenSpot v2 | Avg Accuracy 87.11 | 283 |
| GUI Grounding | ScreenSpot Pro | Accuracy 18.9 | 163 |
| GUI Grounding | ScreenSpot | Avg Acc 85.14 | 133 |
| GUI Grounding | OSWorld-G | Average Score 27.7 | 107 |
| GUI Grounding | MMBench-GUI L2 (test) | Average Error 41.4 | 67 |
| Mobile GUI Automation | GUI-Odyssey | Success Rate (SR) 62 | 62 |
| GUI Action Execution | GUI-EDA | Acoustic Score (COMSOL) 51 | 60 |
| GUI Grounding | UI-Vision | Average Score 9.02 | 59 |
| GUI Grounding | ScreenSpot Desktop V2 | Text Accuracy 92.8 | 55 |
Showing 10 of 116 rows
