
UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

About

The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Despite its success in language models, its application in multi-modal domains, particularly in graphical user interface (GUI) agent tasks, remains under-explored. To address this issue, we propose UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks. Specifically, UI-R1 introduces a novel rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). For efficient training, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. Experimental results demonstrate that our proposed UI-R1-3B achieves significant improvements over the base model (i.e., Qwen2.5-VL-3B) on both in-domain (ID) and out-of-domain (OOD) tasks, with average accuracy gains of 22.1% on ScreenSpot, 6.0% on ScreenSpot-Pro, and 12.7% on ANDROIDCONTROL. Furthermore, UI-R1-3B delivers competitive performance compared to larger models (e.g., OS-Atlas-7B) trained via supervised fine-tuning (SFT) on 76K samples. We additionally develop an optimized version, UI-R1-E-3B, which significantly improves both grounding efficiency and accuracy. These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain. Code: https://github.com/lll6gg/UI-R1.
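
The abstract names a rule-based action reward and GRPO without giving details, so here is a minimal illustrative sketch of how the two could fit together. It is not the authors' implementation. The `action_reward` rule (one point for a matching action type, one more if a predicted click lands inside the target box) and the dictionary keys `action`, `point`, and `bbox` are assumptions made for illustration. The group-relative normalization follows the usual GRPO form of subtracting the group mean and dividing by the group standard deviation; using the population standard deviation here is also an assumption, as implementations differ.

```python
from statistics import mean, pstdev


def action_reward(pred, target):
    """Hypothetical rule-based reward (not the paper's exact definition).

    +1 if the predicted action type matches the target; for clicks,
    +1 more if the predicted point falls inside the target bounding box.
    """
    reward = 0.0
    if pred["action"] == target["action"]:
        reward += 1.0
        if target["action"] == "click":
            x, y = pred["point"]
            x1, y1, x2, y2 = target["bbox"]
            if x1 <= x <= x2 and y1 <= y <= y2:
                reward += 1.0
    return reward


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for the same (hypothetical) task.
target = {"action": "click", "bbox": (100, 200, 180, 240)}
samples = [
    {"action": "click", "point": (120, 220)},   # right type, inside box
    {"action": "click", "point": (10, 10)},     # right type, outside box
    {"action": "scroll", "point": (120, 220)},  # wrong type
    {"action": "click", "point": (500, 500)},   # right type, outside box
]
rewards = [action_reward(s, target) for s in samples]
advantages = group_relative_advantages(rewards)
```

Because the advantages are computed relative to the other samples in the same group, no learned value model is needed; responses that score above the group average get a positive advantage and those below get a negative one.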

Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Guanjing Xiong, Hongsheng Li • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| GUI Grounding | ScreenSpot Pro | Average Score: 33.5 | 307 |
| GUI Grounding | ScreenSpot v2 | Avg Accuracy: 90 | 283 |
| GUI Grounding | ScreenSpot Pro | -- | 163 |
| Mobile GUI Automation | GUI-Odyssey | Success Rate (SR): 32.5 | 62 |
| GUI Grounding | ScreenSpot Mobile V2 | Text Accuracy: 99.6 | 55 |
| GUI Grounding | ScreenSpot Web V2 | Text Accuracy: 94.8 | 55 |
| GUI Grounding | ScreenSpot Desktop V2 | Text Accuracy: 95.6 | 55 |
| Mobile GUI Automation | AndroidWorld | Overall Success Rate: 15.1 | 41 |
| Grounding | ScreenSpot Pro | Average Grounding Accuracy: 43.3 | 33 |
| GUI Interaction Control | GUI-Odyssey | SR: 43.5 | 31 |
Showing 10 of 37 rows
