OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
About
Existing efforts in building GUI agents rely heavily on robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs because they lag significantly behind their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas, a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, macOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements. This dataset, combined with innovations in model training, provides a solid foundation for OS-Atlas to understand GUI screenshots and generalize to unseen interfaces. Through extensive evaluation across six benchmarks spanning three different platforms (mobile, desktop, and web), OS-Atlas demonstrates significant performance improvements over previous state-of-the-art models. Our evaluation also uncovers valuable insights into continuously improving and scaling the agentic capabilities of open-source VLMs.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| GUI Grounding | ScreenSpot v2 | Avg Accuracy | 87.11 | 203 |
| GUI Grounding | ScreenSpot Pro | Average Score | 1.89e+3 | 169 |
| GUI Grounding | ScreenSpot | Avg Acc | 85.14 | 76 |
| GUI Grounding | OSWorld-G | Average Score | 27.7 | 74 |
| GUI Action Execution | GUI-EDA | Acoustic Score (COMSOL) | 51 | 60 |
| GUI Grounding | OSWorld-G (test) | Element Accuracy | 29.4 | 52 |
| Mobile GUI Automation | GUI-Odyssey | Success Rate (SR) | 62 | 50 |
| GUI Grounding | MMBench-GUI L2 (test) | Error (Windows, Basic) | 36.9 | 46 |
| GUI Grounding | UI-Vision (test) | Basic Score | 12.2 | 43 |
| GUI Grounding | UI-Vision | Basic Score | 12.2 | 38 |
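
The page does not say how the grounding accuracies in the table are computed. As a rough illustration only, the sketch below assumes a convention common to GUI grounding benchmarks: a prediction counts as correct when the predicted click point falls inside the target element's bounding box. The names `Box` and `grounding_accuracy` are hypothetical and are not taken from OS-Atlas or from any benchmark's evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a target GUI element (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Points on the border are counted as inside (an assumption).
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(predicted_points, target_boxes) -> float:
    """Percentage of predicted (x, y) points that land inside their target box."""
    if len(predicted_points) != len(target_boxes):
        raise ValueError("need exactly one predicted point per target box")
    if not target_boxes:
        return 0.0
    hits = sum(
        box.contains(x, y)
        for (x, y), box in zip(predicted_points, target_boxes)
    )
    return 100.0 * hits / len(target_boxes)


# Toy data: two of the four predicted points fall inside their target boxes.
points = [(10, 10), (50, 50), (5, 5), (100, 100)]
boxes = [
    Box(0, 0, 20, 20),
    Box(0, 0, 20, 20),
    Box(0, 0, 4, 4),
    Box(90, 90, 110, 110),
]
accuracy = grounding_accuracy(points, boxes)
```

Individual benchmarks may average per platform or per element type before reporting a single number, so a figure produced this way is not directly comparable to the values listed above.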