
GTA1: GUI Test-time Scaling Agent

About

Graphical user interface (GUI) agents autonomously complete tasks across platforms (e.g., Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment. However, two main challenges arise: (i) planning (i.e., choosing the action proposal sequence) under an expansive action space, where selecting an appropriate plan is non-trivial because many valid ones may exist; and (ii) accurately grounding actions in complex, high-resolution interfaces, i.e., precisely interacting with visual targets. This paper investigates these challenges with our GUI Test-time Scaling Agent, GTA1. First, we apply test-time scaling to select the most appropriate action proposal: at each step, multiple candidate proposals are sampled concurrently, and a judge model evaluates them and selects one. This trades additional computation for better decision quality. Second, we propose a model that improves the grounding of each selected action proposal to its corresponding visual element. Our key insight is that reinforcement learning (RL) facilitates grounding through inherent objective alignment, rewarding successful clicks on interface elements. Experimentally, GTA1 achieves state-of-the-art performance on both grounding and agent task execution benchmarks. The code and models are released here.
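
The two ideas in the abstract, judge-based selection among concurrently sampled proposals and a click-based grounding reward, can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the GTA1 code: `select_action`, `click_reward`, the toy planner and judge, and in particular the point-in-bounding-box reward rule, which is only an assumed reading of "rewarding successful clicks on interface elements".

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def select_action(
    propose: Callable[[str, int], str],
    judge: Callable[[str, Sequence[str]], int],
    instruction: str,
    num_candidates: int = 4,
) -> str:
    """Sample candidate proposals concurrently, then return the judge's pick.

    `propose` and `judge` are stand-ins (assumptions) for a planner model
    and a judge model; `judge` returns the index of its chosen candidate.
    """
    with ThreadPoolExecutor(max_workers=num_candidates) as pool:
        # pool.map preserves input order, so indices match the seeds.
        candidates = list(
            pool.map(lambda seed: propose(instruction, seed), range(num_candidates))
        )
    return candidates[judge(instruction, candidates)]


def click_reward(point: Point, box: Box) -> float:
    """Assumed binary reward: 1.0 if the click lands inside the target box."""
    x, y = point
    left, top, right, bottom = box
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0


# Toy stand-ins used only to exercise the sketch.
def toy_propose(instruction: str, seed: int) -> str:
    options = ["click 'File'", "click 'Save'", "press Escape", "click 'Save As'"]
    return options[seed % len(options)]


def toy_judge(instruction: str, candidates: Sequence[str]) -> int:
    # Picks the first candidate that mentions "Save", else the first one.
    for index, candidate in enumerate(candidates):
        if "Save" in candidate:
            return index
    return 0


if __name__ == "__main__":
    print(select_action(toy_propose, toy_judge, "save the document"))
    print(click_reward((50, 20), (40, 10, 80, 30)))
```

Raising `num_candidates` spends more computation per step in exchange for a wider pool for the judge to choose from, which is the trade-off the abstract describes. A real system would replace the toy functions with model calls.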

Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Silvio Savarese, Caiming Xiong, Junnan Li • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
GUI Grounding | ScreenSpot Pro | Average Score | 6.36e+3 | 307
GUI Grounding | ScreenSpot v2 | Avg Accuracy | 95.2 | 283
GUI Grounding | ScreenSpot Pro | Accuracy | 63.6 | 163
GUI Grounding | OSWorld-G | Average Score | 66.7 | 107
GUI Grounding | MMBench-GUI L2 (test) | Average Error | 78.5 | 67
GUI Grounding | UI-Vision | Average Score | 25.7 | 59
GUI Grounding | ScreenSpot Mobile V2 | Text Accuracy | 99.7 | 55
GUI Grounding | ScreenSpot Desktop V2 | Text Accuracy | 99 | 55
GUI Grounding | ScreenSpot Web V2 | Text Accuracy | 95.7 | 55
GUI Grounding | OSWorld-G (test) | Element Accuracy | 78.4 | 52
Showing 10 of 23 rows
