OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

About

Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs due to their significant performance lag compared to their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas, a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, macOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao • 2024

Related benchmarks

Task	Dataset	Result
GUI Grounding	ScreenSpot Pro	Average Score: 1.89e+3	482
GUI Grounding	ScreenSpot v2	Avg Accuracy: 87.11	447
GUI Grounding	ScreenSpot Pro	Accuracy: 18.9	221
GUI Grounding	ScreenSpot	Avg Acc: 85.14	169
GUI Grounding	OSWorld-G	Average Score: 27.7	164
GUI Grounding	MMBench-GUI L2 (test)	Average Error: 41.4	87
Grounding	ScreenSpot Pro	Average Grounding Accuracy: 33.1	82
GUI Grounding	ScreenSpot Desktop V2	Text Accuracy: 92.8	78
GUI Grounding	ScreenSpot Mobile V2	Text Accuracy: 95.2	78
GUI Grounding	ScreenSpot Web V2	Text Accuracy: 90.6	78

Showing 10 of 159 rows
