\textsc{GUI-Spotlight}: Adaptive Iterative Focus Refinement for Enhanced GUI Visual Grounding

About

Multimodal large language models (MLLMs) have markedly expanded the competence of graphical user-interface (GUI) systems, propelling them beyond controlled simulations into complex, real-world environments across diverse platforms. However, their practical usefulness is still bounded by the reliability of visual grounding, i.e., mapping textual references to exact on-screen elements. Unreliable grounding prevents a system from accurately performing pointer-level actions such as clicking or dragging. To address this, we introduce GUI-Spotlight, a model trained for image-grounded reasoning that dynamically invokes multiple specialized tools to iteratively narrow its focus to the relevant region of the screen, thereby substantially improving visual grounding accuracy. On the ScreenSpot-Pro benchmark, GUI-Spotlight, trained on only 18.5K samples, achieves 52.8\% accuracy, surpassing V2P-7B (50.6\% with 9.6M training samples) and GTA-1-7B (50.1\% with 1.56M training samples).

Bin Lei, Nuo Xu, Ali Payani, Mingyi Hong, Chunhua Liao, Yu Cao, Caiwen Ding • 2025
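
The iterative narrowing described in the abstract can be illustrated with a small coordinate-bookkeeping sketch. This is not the authors' implementation: the names `Box`, `zoom`, `refine`, and the `locate` callback are hypothetical, the fixed zoom factor and step count are assumptions, and the real model chooses its tools dynamically rather than following a fixed loop. The sketch only shows how a prediction made inside a cropped region can be mapped back to full-screen coordinates after several rounds of zooming.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned region of the screen, in full-screen pixel coordinates."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.bottom - self.top


def to_screen(box: Box, rel_x: float, rel_y: float) -> Tuple[float, float]:
    """Map a point given relative to `box` (0..1 on each axis) to screen pixels."""
    return (box.left + rel_x * box.width, box.top + rel_y * box.height)


def zoom(box: Box, rel_x: float, rel_y: float, factor: float = 0.5) -> Box:
    """Return a sub-region of `box`, scaled by `factor` per side and centred on
    the relative point, shifted if necessary so it stays inside `box`."""
    cx, cy = to_screen(box, rel_x, rel_y)
    w, h = box.width * factor, box.height * factor
    left = min(max(cx - w / 2, box.left), box.right - w)
    top = min(max(cy - h / 2, box.top), box.bottom - h)
    return Box(left, top, left + w, top + h)


def refine(
    screen: Box,
    locate: Callable[[Box], Tuple[float, float]],
    steps: int = 3,
    factor: float = 0.5,
) -> Tuple[float, float]:
    """Repeatedly ask `locate` for a guess inside the current region, zoom in
    around that guess, and finally return the last guess in screen pixels.

    `locate` is a stand-in (assumption) for a model call that looks at the
    cropped region and returns a relative (x, y) guess in the range 0..1.
    """
    box = screen
    for _ in range(steps):
        rel_x, rel_y = locate(box)
        box = zoom(box, rel_x, rel_y, factor)
    rel_x, rel_y = locate(box)
    return to_screen(box, rel_x, rel_y)
```

With a 1920×1080 screen, `refine(Box(0, 0, 1920, 1080), locate)` would call `locate` four times under the default settings, each time on a region half as wide and half as tall as the previous one. Clamping in `zoom` keeps every crop inside its parent, so the final coordinates always fall on the original screen.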

Related benchmarks

Task	Dataset	Result
GUI Grounding	ScreenSpot Pro	Average Score 52.8	482
GUI Grounding	OSWorld-G	Average Score 62.7	164
GUI Grounding	UI-Vision (test)	Basic Score 32.1	59
Visual Grounding	ScreenSpot-Pro 1.0 (test)	Development Score 53.3	27
GUI Grounding	UI-Vision Basic	Top-1 Accuracy 32.1	14
GUI Grounding	UI-Vision Functional	Top-1 Accuracy 30.2	14
GUI Grounding	UI-Vision Spatial	Top-1 Accuracy 9.1	14
