MAI-UI Technical Report: Real-World Centric Foundation GUI Agents

About

The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system that routes execution by task state, and an online RL framework with optimizations for scaling parallel environments and context length. MAI-UI establishes new state-of-the-art results across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench-GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro, and Seed1.8. On MobileWorld, MAI-UI obtains a 41.7% success rate, significantly outperforming end-to-end GUI models and remaining competitive with Gemini-3-Pro-based agentic frameworks. Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and from increasing the environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.
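The report describes the device-cloud collaboration only at a high level, saying execution is routed "by task state." The minimal sketch below shows one way such state-based routing could work; every name in it (TaskState, Screenshot, classify, route_action, predict) is hypothetical and not MAI-UI's actual API.

```python
from dataclasses import dataclass
from enum import Enum, auto

# Hypothetical task states; the report only says execution is routed by task state.
class TaskState(Enum):
    SIMPLE_UI = auto()        # routine taps/swipes the on-device model handles well
    NEEDS_REASONING = auto()  # long-horizon planning better served by the cloud model
    SENSITIVE = auto()        # screens with private data that should stay on device

@dataclass
class Screenshot:
    pixels: bytes
    has_private_fields: bool

def classify(task: str, screen: Screenshot) -> TaskState:
    """Placeholder heuristic; MAI-UI presumably learns this routing decision."""
    if screen.has_private_fields:
        return TaskState.SENSITIVE
    if len(task.split()) > 20:  # crude proxy for task complexity
        return TaskState.NEEDS_REASONING
    return TaskState.SIMPLE_UI

def route_action(task: str, screen: Screenshot, device_model, cloud_model) -> str:
    """Route each step to the on-device or cloud model based on task state."""
    state = classify(task, screen)
    if state in (TaskState.SIMPLE_UI, TaskState.SENSITIVE):
        return device_model.predict(task, screen)  # stays on device
    return cloud_model.predict(task, screen)       # escalate to cloud
```

Escalating only the reasoning-heavy states while keeping sensitive screens local is one plausible way the reported >40% reduction in cloud model calls and the privacy preservation could both be realized.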

Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi • 2025

Related benchmarks

Task                      Dataset                  Metric                  Result  Rank
GUI Grounding             ScreenSpot Pro           Average Score           73.5    307
GUI Grounding             ScreenSpot v2            Avg Accuracy            96.5    283
GUI Grounding             ScreenSpot Pro           Accuracy                67.9    163
Mobile Task Automation    AndroidWorld (test)      Average Success Rate    76.7    119
GUI Grounding             OSWorld-G                Average Score           67.6    107
GUI Grounding             MMBench-GUI L2 (test)    Average Score           82.6    67
GUI Grounding             UI-Vision                Average Score           40.7    59
GUI Grounding             ScreenSpot Desktop V2    Text Accuracy           99.0    55
GUI Grounding             ScreenSpot Web V2        Text Accuracy           97.9    55
GUI Grounding             ScreenSpot Mobile V2     Text Accuracy           99.3    55

Showing 10 of 26 rows; Result values are percentages.
