UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning
About
The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Despite its success in language models, its application in multimodal domains, particularly in graphical user interface (GUI) agent tasks, remains under-explored. To address this issue, we propose UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks. Specifically, UI-R1 introduces a novel rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). For efficient training, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. Experimental results demonstrate that our proposed UI-R1-3B achieves significant improvements over the base model (i.e., Qwen2.5-VL-3B) on both in-domain (ID) and out-of-domain (OOD) tasks, with average accuracy gains of 22.1% on ScreenSpot, 6.0% on ScreenSpot-Pro, and 12.7% on ANDROIDCONTROL. Furthermore, UI-R1-3B delivers competitive performance compared to larger models (e.g., OS-Atlas-7B) trained via supervised fine-tuning (SFT) on 76K samples. We additionally develop an optimized version, UI-R1-E-3B, which significantly improves both grounding efficiency and accuracy. These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain. Code: https://github.com/lll6gg/UI-R1.
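To make the idea of a rule-based action reward feeding a group-relative policy update more concrete, here is a minimal Python sketch. The abstract does not specify the exact form of the reward, so everything below is an assumption for illustration: the `Action` type, the `action_reward` and `group_relative_advantages` helpers, the reward values (1.0 for a matching action type, plus 1.0 when a predicted point falls inside the target box), and the example coordinates are all hypothetical and are not taken from the UI-R1 code. The advantage step shows only the group-normalization idea commonly associated with GRPO, not a full training loop.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional, Sequence, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Action:
    """A predicted GUI action (illustrative type, not from the paper)."""
    kind: str                                    # e.g. "click", "scroll"
    point: Optional[Tuple[float, float]] = None  # click location, if any


def action_reward(pred: Action, gold_kind: str,
                  gold_box: Optional[Box] = None) -> float:
    """Rule-based reward: assumed 1.0 for the right action type,
    plus 1.0 if the predicted point lies inside the target box."""
    if pred.kind != gold_kind:
        return 0.0
    reward = 1.0
    if gold_box is not None and pred.point is not None:
        x, y = pred.point
        x0, y0, x1, y1 = gold_box
        if x0 <= x <= x1 and y0 <= y <= y1:
            reward += 1.0
    return reward


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and standard deviation
    of its own sampled group; eps avoids division by zero."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one task whose target is a click in the box.
target_box: Box = (40.0, 30.0, 80.0, 70.0)
group = [
    Action("click", (50.0, 50.0)),    # right type, inside the box
    Action("click", (500.0, 500.0)),  # right type, outside the box
    Action("scroll"),                 # wrong action type
    Action("click", (60.0, 40.0)),    # right type, inside the box
]
rewards = [action_reward(a, "click", target_box) for a in group]
advantages = group_relative_advantages(rewards)
```

In this sketch, responses that score above the group average receive positive advantages and those below receive negative ones, so no separately learned reward model or value function is needed to rank the samples.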
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| GUI Grounding | ScreenSpot Pro | Average Score | 33.5 | 307 |
| GUI Grounding | ScreenSpot v2 | Avg Accuracy | 90 | 283 |
| GUI Grounding | ScreenSpot Pro | -- | -- | 163 |
| Mobile GUI Automation | GUI-Odyssey | Success Rate (SR) | 32.5 | 62 |
| GUI Grounding | ScreenSpot Mobile V2 | Text Accuracy | 99.6 | 55 |
| GUI Grounding | ScreenSpot Web V2 | Text Accuracy | 94.8 | 55 |
| GUI Grounding | ScreenSpot Desktop V2 | Text Accuracy | 95.6 | 55 |
| Mobile GUI Automation | AndroidWorld | Overall Success Rate | 15.1 | 41 |
| Grounding | ScreenSpot Pro | Average Grounding Accuracy | 43.3 | 33 |
| GUI Interaction Control | GUI-Odyssey | SR | 43.5 | 31 |