GUICourse: From General Vision Language Models to Versatile GUI Agents

About

Using Graphical User Interfaces (GUIs) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advances in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents that help humans complete GUI navigation tasks. However, current VLMs fall short in fundamental abilities (OCR and grounding) and GUI knowledge (the functions and control methods of GUI elements), which prevents them from becoming practical GUI agents. To address these challenges, we contribute GUICourse, a suite of datasets for training visual-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents perform better on common GUI tasks than their baseline VLMs. Even the small GUI agent (with 3.1B parameters) works well on single-step and multi-step GUI tasks. Finally, we analyze different variations in the training stage of this agent through an ablation study. Our source code and datasets are released at https://github.com/yiye3/GUICourse.

Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, Yuan Yao, Yankai Lin, Zhiyuan Liu, Maosong Sun • 2024

Related benchmarks

Task             Dataset                            Metric                Result  Rank
Web agent tasks  Mind2Web Cross-Task                Element Accuracy      23.8    49
Web agent tasks  Mind2Web (Cross-Website)           Element Accuracy      20.3    40
Web agent tasks  Mind2Web Cross-Domain              Ele.Acc               17.9    37
GUI Navigation   Multimodal-Mind2Web Cross-Website  Step Success Rate     29.7    32
GUI Navigation   AITW (test)                        Install Success Rate  46.5    27
GUI Navigation   Multimodal-Mind2Web Cross-Task     Step Success Rate     20.8    27
GUI Navigation   Multimodal-Mind2Web Cross-Domain   Step Success Rate     14.6    27
GUI Navigation   Mind2Web (Cross-Website)           Element Accuracy      20.3    23
GUI Navigation   Web-Multi (test)                   Type EM               73.1    14
GUI Navigation   Smartphone (test)                  Type EM               76.1    14
Showing 10 of 20 rows
