GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

About

Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities of large language models in real-world settings, we propose GUI-R1, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. By leveraging a small amount of carefully curated high-quality data across multiple platforms (including Windows, Linux, macOS, Android, and Web) and employing policy optimization algorithms such as Group Relative Policy Optimization (GRPO) to update the model, GUI-R1 achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web). These results demonstrate the immense potential of reinforcement learning based on unified action space rule modeling in improving the execution capabilities of LVLMs for real-world GUI agent tasks.
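To make the GRPO step mentioned above more concrete, here is a minimal sketch of how a rule-based reward could feed group-relative advantages. The abstract does not spell out GUI-R1's actual reward rules, so `rule_reward`, the action dictionary layout, and the bounding-box check are all hypothetical illustrations. Only the normalization of each reward against its sampling group's mean and standard deviation reflects the general GRPO idea, and the use of the population standard deviation here is an arbitrary choice.

```python
from statistics import mean, pstdev


def rule_reward(pred, target):
    """Hypothetical rule-based reward (not the paper's exact rules).

    Returns 1.0 only when the predicted action type matches the target and,
    for clicks, the predicted point falls inside the target bounding box.
    """
    if pred["action"] != target["action"]:
        return 0.0
    if target["action"] == "click":
        x, y = pred["point"]
        x1, y1, x2, y2 = target["bbox"]
        return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0
    return 1.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group, GRPO-style."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, a group of four sampled responses (made-up values).
target = {"action": "click", "bbox": (10, 10, 50, 50)}
group = [
    {"action": "click", "point": (20, 30)},   # inside the box
    {"action": "click", "point": (80, 30)},   # outside the box
    {"action": "scroll", "point": (20, 30)},  # wrong action type
    {"action": "click", "point": (50, 50)},   # on the box edge, counted as inside
]

rewards = [rule_reward(p, target) for p in group]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because advantages are computed relative to the group rather than from a learned value model, responses that beat their siblings get a positive signal and the rest get a negative one, which is what lets a simple verifiable rule drive the policy update.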

Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, Xiaobo Xia • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| GUI Grounding | ScreenSpot Pro | Average Score | 1.60e+3 | 307 |
| GUI Grounding | ScreenSpot v2 | Avg Accuracy | 88.2 | 283 |
| GUI Grounding | ScreenSpot Pro | -- | -- | 163 |
| GUI Grounding | ScreenSpot | Avg Acc | 77 | 133 |
| Mobile GUI Automation | GUI-Odyssey | Success Rate (SR) | 47.7 | 62 |
| GUI Grounding | ScreenSpot Mobile V2 | Text Accuracy | 98.8 | 55 |
| GUI Grounding | ScreenSpot Web V2 | Text Accuracy | 93.3 | 55 |
| GUI Grounding | ScreenSpot Desktop V2 | Text Accuracy | 94 | 55 |
| GUI State Control | State Control Benchmark | Operational Task Success Rate (O-TMR) | 90.24 | 54 |
| Grounding | ScreenSpot Pro | Average Grounding Accuracy | 39.3 | 33 |
Showing 10 of 70 rows
