Learning GUI Grounding with Spatial Reasoning from Visual Feedback

About

Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task -- given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent Vision Language Models (VLMs) often fail to predict accurate numeric coordinates when processing GUI images with high resolutions and complex layouts. To address this issue, we reframe GUI grounding as an interactive search task, where the VLM generates actions to move a cursor in the GUI to locate UI elements. At each step, the model determines the target object, evaluates the spatial relations between the cursor and the target, and moves the cursor closer to the target conditioned on the movement history. In this interactive process, the rendered cursor provides visual feedback to help the model align its predictions with the corresponding on-screen locations. We train our GUI grounding model, GUI-Cursor, using multi-step online reinforcement learning with a dense trajectory-based reward function. Experimental results demonstrate that GUI-Cursor surpasses strong baselines in GUI grounding and agentic tasks, achieving superior performance with the same base models while requiring less training data. Further analysis shows that GUI-Cursor learns to adaptively conduct more steps on more difficult examples, and it obtains better spatial reasoning capability on out-of-distribution domains.
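The interactive search described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the names `ground`, `toy_policy`, and `Step`, the `"move"`/`"stop"` action vocabulary, and the step budget are all hypothetical. In particular, `toy_policy` is an oracle that reads the target box directly, whereas GUI-Cursor would have to infer the target from the screenshot with the rendered cursor and from the instruction.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), hypothetical layout


@dataclass
class Step:
    cursor: Point  # cursor position the policy observed at this step
    action: str    # "move" or "stop" (assumed action vocabulary)


def inside(p: Point, box: Box) -> bool:
    """Return True if point p lies within the box, edges included."""
    left, top, right, bottom = box
    return left <= p[0] <= right and top <= p[1] <= bottom


def step_toward(a: int, b: int) -> int:
    """Move a about halfway toward b, always by at least 1 when a != b."""
    delta = b - a
    if delta == 0:
        return a
    move = (abs(delta) + 1) // 2  # ceiling of |delta| / 2, never overshoots
    return a + move if delta > 0 else a - move


def toy_policy(cursor: Point, history: List[Step], target: Box) -> Tuple[str, Point]:
    """Oracle stand-in for the VLM: it sees the target box, a real model would not."""
    if inside(cursor, target):
        return "stop", cursor
    cx = (target[0] + target[2]) // 2
    cy = (target[1] + target[3]) // 2
    return "move", (step_toward(cursor[0], cx), step_toward(cursor[1], cy))


Policy = Callable[[Point, List[Step], Box], Tuple[str, Point]]


def ground(policy: Policy, start: Point, target: Box, max_steps: int = 10):
    """Run the cursor-movement loop until the policy stops or the budget ends."""
    cursor, history = start, []
    for _ in range(max_steps):
        action, new_cursor = policy(cursor, history, target)
        history.append(Step(cursor, action))
        if action == "stop":
            break
        cursor = new_cursor  # in the real setting the cursor is re-rendered here
    return cursor, history


final, trace = ground(toy_policy, start=(0, 0), target=(900, 500, 960, 540))
print(final, len(trace))
```

The loop keeps the movement history so that each decision can be conditioned on earlier steps, and the final cursor position serves as the grounding prediction. The training signal (the dense trajectory-based reward) is not modeled here because the abstract does not specify its form.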

Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang, Lukas Wutschitz, Fangkai Yang, Chaoyun Zhang, Pasquale Minervini, Saravan Rajmohan, Robert Sim • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| GUI Grounding | ScreenSpot Pro | Average Score | 58.1 | 458 |
| GUI Grounding | ScreenSpot v2 | Avg Accuracy | 93.9 | 371 |
| GUI Grounding | OSWorld-G | Average Score | 65.6 | 144 |
| GUI Grounding | UI-Vision | Average Score | 27.3 | 68 |
| Visual Grounding | ScreenSpot-Pro 1.0 (test) | Development Score | 57.5 | 27 |
| Spatial Reasoning | SpatialMQA | Accuracy | 43.4 | 9 |
| GUI Navigation | OSWorld Online evaluation | Accuracy (50 steps) | 57.1 | 3 |
| Spatial Reasoning | Sphere | Single Skill Performance | 71.2 | 3 |