
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

About

Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities. We introduce Aguvis, a unified vision-based framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions, and incorporates structured reasoning via inner monologue. To enable this, we construct the Aguvis Data Collection, a large-scale dataset with multimodal grounding and reasoning annotations, and develop a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show that Aguvis achieves state-of-the-art performance across offline and real-world online benchmarks, marking the first fully autonomous vision-based GUI agent that operates without relying on closed-source models. We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research.
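The standardized cross-platform action space can be illustrated with a minimal sketch. The class names and the normalized-coordinate convention below are illustrative assumptions, not the paper's actual interface: the idea is that actions are defined over the screenshot alone, with coordinates normalized to [0, 1] so the same command transfers across desktop and mobile screen resolutions.

```python
from dataclasses import dataclass

# Hypothetical sketch of a unified, platform-agnostic GUI action space.
# Coordinates are normalized to [0, 1] over the screenshot, so one
# action vocabulary serves any platform and resolution.

@dataclass
class Click:
    x: float  # normalized horizontal position in [0, 1]
    y: float  # normalized vertical position in [0, 1]

@dataclass
class TypeText:
    text: str  # text entered at the current focus

@dataclass
class Scroll:
    dy: float  # positive scrolls down, negative scrolls up

def to_pixels(action: Click, width: int, height: int) -> tuple[int, int]:
    """Map a normalized click onto a concrete screen resolution."""
    return round(action.x * width), round(action.y * height)

# The same normalized action lands proportionally on different displays.
a = Click(x=0.5, y=0.25)
print(to_pixels(a, 1920, 1080))  # desktop -> (960, 270)
print(to_pixels(a, 1170, 2532))  # phone   -> (585, 633)
```

A platform-specific executor (e.g. a desktop or mobile driver) would then translate these pixel coordinates into native input events, keeping the agent's policy itself platform-independent.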

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong• 2024

Related benchmarks

| Task                   | Dataset               | Metric                  | Result | Rank |
|------------------------|-----------------------|-------------------------|--------|------|
| GUI Grounding          | ScreenSpot Pro        | Average Score           | 22.9   | 307  |
| GUI Grounding          | ScreenSpot v2         | Avg Accuracy            | 86     | 283  |
| GUI Grounding          | ScreenSpot Pro        | Accuracy                | 23.6   | 163  |
| GUI Agent Task         | AndroidWorld          | Success Rate            | 37.1   | 136  |
| GUI Grounding          | ScreenSpot            | Avg Acc                 | 89.2   | 133  |
| Mobile Task Automation | AndroidWorld (test)   | Average Success Rate    | 0.371  | 119  |
| GUI Grounding          | OSWorld-G             | Average Score           | 38.7   | 107  |
| GUI Grounding          | MMBench-GUI L2 (test) | Average Error           | 45.7   | 67   |
| Mobile GUI Automation  | GUI-Odyssey           | Success Rate (SR)       | 13.5   | 62   |
| GUI Action Execution   | GUI-EDA               | Acoustic Score (COMSOL) | 53     | 60   |

Showing 10 of 57 rows.
