
Enhancing Trustworthy GUI Grounding via Self-Critiqued Reinforcement Learning

About

Autonomous graphical user interface (GUI) agents rely on accurate GUI grounding, which maps language instructions to on-screen coordinates, to execute user commands. However, current models, whether trained via supervised fine-tuning (SFT) or reinforcement learning (RL), often provide confidence signals that are poorly aligned with actual grounding correctness, leading to overconfident and unreliable predictions. To address this, we propose HyperClick, a novel framework that enhances trustworthy GUI grounding through self-critiqued reinforcement learning (SCRL). HyperClick combines a correctness reward and a confidence alignment reward, training the policy model to output both a click prediction and an explicit confidence estimate. This approach jointly optimizes grounding accuracy and confidence reliability through confidence-based self-assessment. Extensive experiments on challenging benchmarks show that HyperClick maintains strong grounding performance while providing better-aligned confidence estimates. By exposing uncertainty alongside GUI actions, HyperClick supports confidence-based abstention in GUI automation. Code will be released here.
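The abstract describes a reward that combines a correctness term with a confidence alignment term, but it does not give the exact formulas. The following is a minimal sketch under stated assumptions, not the authors' implementation: correctness is taken to be a point-in-box check, the alignment term is a Brier-style squared error between the stated confidence and the correctness, and the names `Prediction`, `total_reward`, `weight`, and `threshold` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    x: float           # predicted click x-coordinate, in pixels
    y: float           # predicted click y-coordinate, in pixels
    confidence: float  # self-reported confidence in [0, 1]


def correctness_reward(pred, box):
    """Return 1.0 if the click lands inside the target box (x1, y1, x2, y2), else 0.0."""
    x1, y1, x2, y2 = box
    inside = x1 <= pred.x <= x2 and y1 <= pred.y <= y2
    return 1.0 if inside else 0.0


def confidence_alignment_reward(pred, correct):
    """Brier-style term (an assumption): largest when confidence equals correctness."""
    return 1.0 - (pred.confidence - correct) ** 2


def total_reward(pred, box, weight=0.5):
    """Sum the two terms; the weighting scheme is illustrative only."""
    correct = correctness_reward(pred, box)
    return correct + weight * confidence_alignment_reward(pred, correct)


def should_abstain(pred, threshold=0.5):
    """Confidence-based abstention: skip the action when confidence is low."""
    return pred.confidence < threshold
```

Under this sketch, a confident hit earns more than a confident miss, and an overconfident miss is penalized by the alignment term. That is the behavior the abstract attributes to joint optimization of accuracy and confidence reliability.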

Shaojie Zhang, Pei Fu, Ruoceng Zhang, Jiahui Yang, Anan Du, Xiuwen Xi, Shaokang Wang, Ying Huang, Bin Qin, Zhenbo Luo, Jian Luan • 2025

Related benchmarks

Task              Dataset                    Metric             Result  Rank
GUI Grounding     ScreenSpot Pro             Average Score      48.2    458
GUI Grounding     ScreenSpot v2              Avg Accuracy       93.7    371
GUI Grounding     ScreenSpot Pro             Accuracy           48.2    195
GUI Grounding     ScreenSpot                 Avg Acc            91.5    160
GUI Grounding     UI-Vision (test)           Basic Score        35.3    59
GUI Grounding     MMBench-GUI-L2             Accuracy           79.6    43
Visual Grounding  ScreenSpot-Pro 1.0 (test)  Development Score  46.9    27
GUI Grounding     UI-I2E-Bench               Accuracy (Mobile)  80.4    13
GUI Grounding     UI-Vision                  Accuracy           25.7    11
GUI Grounding     CAGUI (test)               Fun2Point          82.7    10
