Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning

About

Graphical User Interface (GUI) agents have made substantial strides in understanding and executing user instructions across diverse platforms. Yet, grounding these instructions to precise interface elements remains challenging, especially in complex, high-resolution, professional environments. Traditional supervised finetuning (SFT) methods often require large volumes of diverse data and exhibit weak generalization. To overcome these limitations, we introduce a reinforcement learning (RL) based framework that incorporates three core strategies: (1) seed data curation to ensure high quality training samples, (2) a dense policy gradient that provides continuous feedback based on prediction accuracy, and (3) a self evolutionary reinforcement finetuning mechanism that iteratively refines the model using attention maps. With only 3k training samples, our 7B-parameter model achieves state-of-the-art results among similarly sized models on three grounding benchmarks. Notably, it attains 47.3\% accuracy on the ScreenSpot-Pro dataset, outperforming much larger models, such as UI-TARS-72B, by a margin of 24.2\%. These findings underscore the effectiveness of RL-based approaches in enhancing GUI agent performance, particularly in high-resolution, complex environments.

Xinbin Yuan, Jian Zhang, Kaixin Li, Zhuoxuan Cai, Lujian Yao, Jie Chen, Enguang Wang, Qibin Hou, Jinwei Chen, Peng-Tao Jiang, Bo Li• 2025

Related benchmarks

Task	Dataset	Result
GUI Grounding	ScreenSpot Pro	Average Score4.73e+3	458
GUI Grounding	ScreenSpot v2	Avg Accuracy93.8	371
GUI Grounding	ScreenSpot Pro	Accuracy56.6	195
GUI Grounding	ScreenSpot	Avg Acc88.2	160
GUI Grounding	OSWorld-G	--	144
GUI Grounding	MMBench-GUI L2 (test)	Average Error76.6	67
GUI Grounding	ScreenSpot Mobile V2	Text Accuracy99.3	60
GUI Grounding	ScreenSpot Desktop V2	Text Accuracy96.4	60
GUI Grounding	ScreenSpot Web V2	Text Accuracy92.7	60
GUI Grounding	MMBench-GUI-L2	Accuracy83.3	43

Showing 10 of 16 rows

Other info

Follow for update

@wizwand_team Discord