
SpiritSight Agent: Advanced GUI Agent with One Look

About

Graphical User Interface (GUI) agents show strong abilities in assisting human-computer interaction by automating users' navigation on digital devices. An ideal GUI agent is expected to achieve high accuracy, low latency, and compatibility across different GUI platforms. Recent vision-based approaches have shown promise by leveraging advanced Vision Language Models (VLMs). While they generally meet the requirements of compatibility and low latency, these vision-based GUI agents tend to have low accuracy due to their limitations in element grounding. To address this issue, we propose SpiritSight, a vision-based, end-to-end GUI agent that excels in GUI navigation tasks across various GUI platforms. First, we create a multi-level, large-scale, high-quality GUI dataset called GUI-Lasagne using scalable methods, giving SpiritSight robust GUI understanding and grounding capabilities. Second, we introduce the Universal Block Parsing (UBP) method to resolve the ambiguity problem in dynamic high-resolution visual inputs, further enhancing SpiritSight's ability to ground GUI objects. Through these efforts, the SpiritSight agent outperforms other advanced methods on diverse GUI benchmarks, demonstrating its superior capability and compatibility in GUI navigation tasks. Models and datasets are available at https://hzhiyuan.github.io/SpiritSight-Agent.

Zhiyuan Huang, Ziming Cheng, Junting Pan, Zhaohui Hou, Mingjie Zhan • 2025

Related benchmarks

Task                     Dataset                             Result                            Rank
GUI Grounding            ScreenSpot                          --                                76
Mobile GUI Automation    GUI-Odyssey                         Success Rate (SR) 75.8            50
GUI Navigation           Multimodal-Mind2Web Cross-Website   Step Success Rate 48.1            32
GUI Navigation           Multimodal-Mind2Web Cross-Task      Step Success Rate 54.7            27
GUI Navigation           Multimodal-Mind2Web Cross-Domain    Step Success Rate 49.2            27
GUI Agent                AC Low                              Success Rate 87.6                 16
GUI Agent                AC High                             SR 68.1                           16
Mobile Agent Navigation  GUI Odyssey 1.0 (test)              Step Success Rate (SR) 75.8       15
Mobile Agent Navigation  AndroidControl 1.0 (test)           Step Success Rate (SR) 68.1       15
GUI Navigation           GUI-Odyssey High                    Action Matching Score (AMS) 75.8  4
Showing 10 of 15 rows
