Aria-UI: Visual Grounding for GUI Instructions

About

Digital agents for automating tasks across different platforms by directly manipulating the GUIs are increasingly important. For these agents, grounding from language instructions to target elements remains a significant challenge due to reliance on HTML or AXTree inputs. In this paper, we introduce Aria-UI, a large multimodal model specifically designed for GUI grounding. Aria-UI adopts a pure-vision approach, eschewing reliance on auxiliary inputs. To adapt to heterogeneous planning instructions, we propose a scalable data pipeline that synthesizes diverse and high-quality instruction samples for grounding. To handle dynamic contexts in task performing, Aria-UI incorporates textual and text-image interleaved action histories, enabling robust context-aware reasoning for grounding. Aria-UI sets new state-of-the-art results across offline and online agent benchmarks, outperforming both vision-only and AXTree-reliant baselines. We release all training data and model checkpoints to foster further research at https://ariaui.github.io.

Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, Junnan Li • 2024

Related benchmarks

Task                   | Dataset                    | Metric                     | Result  | Rank
GUI Grounding          | ScreenSpot Pro             | Average Score              | 1.13e+3 | 307
GUI Agent Task         | AndroidWorld               | Success Rate               | 44.8    | 136
Mobile Task Automation | AndroidWorld (test)        | Average Success Rate       | 0.448   | 119
Mobile GUI Automation  | GUI-Odyssey                | Success Rate (SR)          | 36.5    | 62
GUI Action Execution   | GUI-EDA                    | Acoustic Score (COMSOL)    | 49      | 60
Computer Use           | OSWorld                    | OS Success Rate            | 25      | 42
GUI Navigation         | AndroidWorld latest (test) | Success Rate               | 44.8    | 35
Grounding              | ScreenSpot Pro             | Average Grounding Accuracy | 11.3    | 33
GUI planning           | AndroidControl Low         | SR (%)                     | 67.3    | 31
UI Grounding           | UI-Vision                  | Basic Overall Score        | 12.2    | 24
Showing 10 of 22 rows
