
GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

About

We present MM-Navigator, a GPT-4V-based agent for the smartphone graphical user interface (GUI) navigation task. MM-Navigator can interact with a smartphone screen as human users do and determine subsequent actions to fulfill given instructions. Our findings demonstrate that large multimodal models (LMMs), specifically GPT-4V, excel in zero-shot GUI navigation through their advanced screen interpretation, action reasoning, and precise action localization capabilities. We first benchmark MM-Navigator on our collected iOS screen dataset. According to human assessments, the system exhibited a 91% accuracy rate in generating reasonable action descriptions and a 75% accuracy rate in executing the correct actions for single-step instructions on iOS. Additionally, we evaluate the model on a subset of an Android screen navigation dataset, where the model outperforms previous GUI navigators in a zero-shot fashion. Our benchmark and detailed analyses aim to lay a robust groundwork for future research into the GUI navigation task. The project page is at https://github.com/zzxslp/MM-Navigator.

An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, Zicheng Liu, Lijuan Wang • 2023

Related benchmarks

Task                         Dataset                        Result                         Rank
GUI Navigation               AITW (test)                    Install Success Rate: 46.14    27
GUI Grounding                ScreenSpot 1.0 (full)          Mobile Text Acc: 0.226         6
Mobile GUI action matching   AITW instruction-wise (test)   Overall Error: 50.5            5
UI Task Completion           AITW                           Overall Task Time: 50.5        5
