Enhancing Trustworthy GUI Grounding via Self-Critiqued Reinforcement Learning

About

Autonomous graphical user interface (GUI) agents rely on accurate GUI grounding, which maps language instructions to on-screen coordinates, to execute user commands. However, current models, whether trained via supervised fine-tuning (SFT) or reinforcement learning (RL), often provide confidence signals that are poorly aligned with actual grounding correctness, leading to overconfident and unreliable predictions. To address this, we propose HyperClick, a novel framework that enhances trustworthy GUI grounding through self-critiqued reinforcement learning (SCRL). HyperClick combines a correctness reward and a confidence alignment reward, training the policy model to output both a click prediction and an explicit confidence estimate. This approach jointly optimizes grounding accuracy and confidence reliability through confidence-based self-assessment. Extensive experiments on challenging benchmarks show that HyperClick maintains strong grounding performance while providing better-aligned confidence estimates. By exposing uncertainty alongside GUI actions, HyperClick supports confidence-based abstention in GUI automation. Code will be released here.
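The abstract does not give the reward formulas, so the following is only a minimal sketch of how a correctness reward and a confidence alignment reward could be combined, and how the resulting confidence could drive abstention. Everything here is an assumption made for illustration: the `Box` type, the binary "click inside the target box" correctness check, the Brier-style alignment term, the `alignment_weight` parameter, and the abstention threshold are hypothetical and are not taken from HyperClick.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned target region in pixel coordinates (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def correctness_reward(x: float, y: float, target: Box) -> float:
    # Assumption: a click counts as correct when it lands inside the target box.
    return 1.0 if target.contains(x, y) else 0.0


def confidence_alignment_reward(confidence: float, correct: float) -> float:
    # Assumption: Brier-style term, highest when confidence equals correctness.
    return 1.0 - (confidence - correct) ** 2


def total_reward(x: float, y: float, confidence: float, target: Box,
                 alignment_weight: float = 1.0) -> float:
    # Assumption: the two rewards are combined as a weighted sum.
    correct = correctness_reward(x, y, target)
    return correct + alignment_weight * confidence_alignment_reward(confidence, correct)


def should_abstain(confidence: float, threshold: float = 0.5) -> bool:
    # Assumption: the agent skips the action when confidence is below a fixed threshold.
    return confidence < threshold
```

Under these assumptions, a missed click reported with high confidence earns less than the same miss reported with low confidence, which is the kind of pressure toward aligned confidence that the abstract describes.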

Shaojie Zhang, Pei Fu, Ruoceng Zhang, Jiahui Yang, Anan Du, Xiuwen Xi, Shaokang Wang, Ying Huang, Bin Qin, Zhenbo Luo, Jian Luan • 2025

Related benchmarks

Task	Dataset	Result
GUI Grounding	ScreenSpot Pro	Average Score: 48.2	482
GUI Grounding	ScreenSpot v2	Avg Accuracy: 93.7	447
GUI Grounding	ScreenSpot Pro	Accuracy: 48.2	221
GUI Grounding	ScreenSpot	Avg Acc: 91.5	169
GUI Grounding	MMBench-GUI-L2	Accuracy: 79.6	63
GUI Grounding	UI-Vision (test)	Basic Score: 35.3	59
GUI Grounding	UI-Vision	Accuracy: 25.7	38
Visual Grounding	ScreenSpot-Pro 1.0 (test)	Development Score: 46.9	27
GUI Grounding	UI-I2E-Bench	--	17
GUI Grounding	CAGUI (test)	Fun2Point: 82.7	10
