Enhancing Trustworthy GUI Grounding via Self-Critiqued Reinforcement Learning
About
Autonomous graphical user interface (GUI) agents rely on accurate GUI grounding, which maps language instructions to on-screen coordinates, to execute user commands. However, current models, whether trained via supervised fine-tuning (SFT) or reinforcement learning (RL), often provide confidence signals that are poorly aligned with actual grounding correctness, leading to overconfident and unreliable predictions. To address this, we propose HyperClick, a novel framework that enhances trustworthy GUI grounding through self-critiqued reinforcement learning (SCRL). HyperClick combines a correctness reward and a confidence alignment reward, training the policy model to output both a click prediction and an explicit confidence estimate. This approach jointly optimizes grounding accuracy and confidence reliability through confidence-based self-assessment. Extensive experiments on challenging benchmarks show that HyperClick maintains strong grounding performance while providing better-aligned confidence estimates. By exposing uncertainty alongside GUI actions, HyperClick supports confidence-based abstention in GUI automation. Code will be released here.
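The abstract does not give the reward formulas, so the following is only a minimal sketch of how a correctness reward and a confidence alignment reward might be combined, and how the resulting confidence could drive abstention. Everything in it is an assumption made for illustration: the `Box` type, the function names, the Brier-style alignment term, the `weight` parameter, and the `threshold` value are hypothetical and are not taken from HyperClick.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Box:
    """Axis-aligned target element in screen coordinates (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, point: Point) -> bool:
        x, y = point
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def correctness_reward(click: Point, target: Box) -> float:
    """Binary reward: 1.0 if the click lands inside the target, else 0.0."""
    return 1.0 if target.contains(click) else 0.0


def confidence_alignment_reward(confidence: float, correct: float) -> float:
    """Assumed Brier-style term: highest when confidence matches correctness."""
    return 1.0 - (confidence - correct) ** 2


def total_reward(click: Point, confidence: float, target: Box,
                 weight: float = 1.0) -> float:
    """Sum of both terms; `weight` is an illustrative trade-off knob."""
    correct = correctness_reward(click, target)
    return correct + weight * confidence_alignment_reward(confidence, correct)


def act_or_abstain(click: Point, confidence: float,
                   threshold: float = 0.5) -> Optional[Point]:
    """Return the click if confidence clears the threshold, else abstain (None)."""
    return click if confidence >= threshold else None


if __name__ == "__main__":
    target = Box(left=10, top=10, right=50, bottom=30)
    # A confident hit earns more than an equally confident miss.
    print(total_reward((20, 20), 0.9, target))
    print(total_reward((100, 100), 0.9, target))
    # A low-confidence prediction is withheld.
    print(act_or_abstain((100, 100), 0.2))
```

Under this assumed scheme, an overconfident miss is penalized by the alignment term even though the correctness term is already zero, which is the kind of pressure toward calibrated confidence that the abstract describes.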
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| GUI Grounding | ScreenSpot Pro | Average Score | 48.2 | 458 |
| GUI Grounding | ScreenSpot v2 | Avg Accuracy | 93.7 | 371 |
| GUI Grounding | ScreenSpot Pro | Accuracy | 48.2 | 195 |
| GUI Grounding | ScreenSpot | Avg Acc | 91.5 | 160 |
| GUI Grounding | UI-Vision (test) | Basic Score | 35.3 | 59 |
| GUI Grounding | MMBench-GUI-L2 | Accuracy | 79.6 | 43 |
| Visual Grounding | ScreenSpot-Pro 1.0 (test) | Development Score | 46.9 | 27 |
| GUI Grounding | UI-I2E-Bench | Accuracy (Mobile) | 80.4 | 13 |
| GUI Grounding | UI-Vision | Accuracy | 25.7 | 11 |
| GUI Grounding | CAGUI (test) | Fun2Point | 82.7 | 10 |