Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

About

Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities. We introduce Aguvis, a unified vision-based framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. To enable this, we construct Aguvis Data Collection, a large-scale dataset with multimodal grounding and reasoning annotations, and develop a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show that Aguvis achieves state-of-the-art performance across offline and real-world online benchmarks, marking the first fully autonomous vision-based GUI agent that operates without closed-source models. We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research.

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong• 2024

Related benchmarks

Task	Dataset	Result
GUI Grounding	ScreenSpot Pro	Average Score22.9	458
GUI Grounding	ScreenSpot v2	Avg Accuracy87.3	371
GUI Grounding	ScreenSpot Pro	Accuracy23.6	195
GUI Agent Task	AndroidWorld	Success Rate37.1	188
GUI Grounding	ScreenSpot	Avg Acc89.2	160
GUI Grounding	OSWorld-G	Average Score38.7	144
Mobile Task Automation	AndroidWorld (test)	Average Success Rate0.371	119
Grounding	ScreenSpot Pro	Average Grounding Accuracy36.5	82
GUI Grounding	UI-Vision	Average Score13.7	68
Web navigation	WebVoyager	Success Rate0.00e+0	68

Showing 10 of 91 rows

...

Other info

Follow for update

@wizwand_team Discord