UI-TARS: Pioneering Automated GUI Interaction with Native Agents

About

This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception, which leverages a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, and milestone recognition; (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi • 2025
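
The abstract's "unified action space across platforms" can be pictured with a small sketch. This is only an illustration under stated assumptions, not the UI-TARS action schema: the `UnifiedAction` record, the helper names, and the choice of screen-relative coordinates are all hypothetical. The idea shown is that a desktop click and a mobile tap at the same relative position map to one shared representation, which can then be projected back onto any screen size.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class UnifiedAction:
    """Hypothetical platform-independent action record (not the paper's schema)."""
    kind: str                  # e.g. "click" or "type"
    x: Optional[float] = None  # horizontal position as a fraction of screen width
    y: Optional[float] = None  # vertical position as a fraction of screen height
    text: Optional[str] = None # payload for "type" actions


def from_desktop_click(px: int, py: int, width: int, height: int) -> UnifiedAction:
    # Assumption: a desktop mouse click becomes a "click" at relative coordinates.
    return UnifiedAction("click", px / width, py / height)


def from_mobile_tap(px: int, py: int, width: int, height: int) -> UnifiedAction:
    # Assumption: a mobile tap is treated as the same "click" primitive.
    return UnifiedAction("click", px / width, py / height)


def to_pixels(action: UnifiedAction, width: int, height: int) -> Tuple[int, int]:
    # Project a unified action back to pixel coordinates on a target screen.
    if action.x is None or action.y is None:
        raise ValueError("action has no coordinates")
    return round(action.x * width), round(action.y * height)


desktop = from_desktop_click(960, 540, 1920, 1080)
mobile = from_mobile_tap(540, 1200, 1080, 2400)
same_action = desktop == mobile  # both are a click at the screen centre
```

Relative coordinates are only one possible design; a real system might instead keep absolute pixels plus the screenshot resolution.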

Related benchmarks

Task | Dataset | Metric | Result | Rank
GUI Grounding | ScreenSpot Pro | Average Score | 6.16e+3 | 307
GUI Grounding | ScreenSpot v2 | Avg Accuracy | 94.2 | 283
GUI Grounding | ScreenSpot Pro | Accuracy | 61.6 | 163
Web navigation and task completion | WebArena (test) | Average Task Completion | 23.45 | 137
GUI Agent Task | AndroidWorld | Success Rate | 64.2 | 136
GUI Grounding | ScreenSpot | Avg Acc | 89.5 | 133
Mobile Task Automation | AndroidWorld (test) | Average Success Rate | 0.466 | 119
GUI Grounding | OSWorld-G | Average Score | 57.1 | 107
GUI Grounding | MMBench-GUI L2 (test) | Average Error | 64.3 | 67
Mobile GUI Automation | GUI-Odyssey | Success Rate (SR) | 87 | 62
Showing 10 of 141 rows