Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning
About
Recent advancements in Multimodal Large Language Models (MLLMs) have generated significant interest in their ability to autonomously interact with and interpret Graphical User Interfaces (GUIs). A major challenge in these systems is grounding: accurately identifying critical GUI components, such as text or icons, given a GUI image and a corresponding text query. Traditionally, this task has relied on fine-tuning MLLMs with specialized training data so that they predict component locations directly. In this paper, we instead propose a Tuning-free Attention-driven Grounding (TAG) method that leverages the inherent attention patterns of pretrained MLLMs to accomplish this task without any additional fine-tuning. Our method identifies and aggregates the attention maps associated with specific tokens in a carefully constructed query prompt. Applied to MiniCPM-Llama3-V 2.5, a state-of-the-art MLLM, our tuning-free approach achieves performance comparable to tuning-based methods, with notable success in text localization. We further show that our attention-map-based grounding significantly outperforms the direct localization predictions of MiniCPM-Llama3-V 2.5, highlighting the potential of attention maps from pretrained MLLMs and paving the way for future work in this direction.
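The core idea of aggregating attention over query tokens and reading off a location can be sketched as follows. This is a minimal illustration, not the paper's implementation: the attention-tensor layout, the helper name `attention_grounding`, and the argmax-based coordinate readout are all assumptions made here for clarity.

```python
import numpy as np

def attention_grounding(attn, query_token_ids, grid_h, grid_w, img_h, img_w):
    """Locate a GUI element by aggregating attention maps (illustrative sketch).

    attn: array of shape (num_heads, num_text_tokens, num_image_patches),
          attention weights from a pretrained MLLM (hypothetical layout).
    query_token_ids: indices of the prompt tokens describing the target element.
    """
    # Aggregate: average the attention of the selected query tokens over heads.
    maps = attn[:, query_token_ids, :]            # (heads, |query|, patches)
    agg = maps.mean(axis=(0, 1))                  # (patches,)
    # Reshape the flat patch scores back onto the image's patch grid.
    grid = agg.reshape(grid_h, grid_w)
    # Take the highest-attention patch and map its center to pixel coordinates.
    py, px = np.unravel_index(np.argmax(grid), grid.shape)
    x = (px + 0.5) * img_w / grid_w
    y = (py + 0.5) * img_h / grid_h
    return x, y
```

In practice the choice of which tokens, heads, and layers to aggregate is what distinguishes a working grounding method from a naive argmax; the sketch only conveys the overall flow from attention weights to a predicted click point.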
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| GUI Grounding | ScreenSpot v2 | Avg Accuracy | 51.2 | 203 |
| GUI Grounding | OSWorld-G | Average Score | 25.3 | 74 |
| GUI Grounding | OSWorld-G (test) | Element Accuracy | 25.3 | 52 |
| GUI Grounding | ScreenSpot-Pro (test) | Element Accuracy | 3 | 43 |
| GUI Grounding | ScreenSpot v1 (test) | Mobile Text Acc | 88.3 | 25 |
| GUI Grounding | ScreenSpot (test) | Element Accuracy | 57.5 | 13 |
| GUI Grounding | ScreenSpot v2 (test) | Element Accuracy | 51.2 | 9 |