Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning
About
Recent advancements in Multimodal Large Language Models (MLLMs) have generated significant interest in their ability to autonomously interact with and interpret Graphical User Interfaces (GUIs). A major challenge in these systems is grounding: accurately identifying critical GUI components, such as text or icons, given a GUI image and a corresponding text query. Traditionally, this task has relied on fine-tuning MLLMs with specialized training data so that they predict component locations directly. In this paper, we instead propose a Tuning-free Attention-driven Grounding (TAG) method that leverages the inherent attention patterns of pretrained MLLMs to accomplish this task without any additional fine-tuning. Our method identifies and aggregates the attention maps associated with specific tokens in a carefully constructed query prompt. Applied to MiniCPM-Llama3-V 2.5, a state-of-the-art MLLM, our tuning-free approach achieves performance comparable to tuning-based methods, with notable success in text localization. We further show that our attention-map-based grounding significantly outperforms the direct localization predictions of MiniCPM-Llama3-V 2.5, highlighting the potential of attention maps from pretrained MLLMs and paving the way for future work in this direction.
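The core idea of aggregating attention over query tokens and reading off a location can be sketched as follows. This is a minimal illustration, not the paper's implementation: the attention-tensor layout, the helper name `attention_grounding`, and the argmax-based coordinate readout are all assumptions made here for clarity.

```python
import numpy as np

def attention_grounding(attn, query_token_ids, grid_h, grid_w, img_h, img_w):
    """Locate a GUI element by aggregating attention maps (illustrative sketch).

    attn: array of shape (num_heads, num_text_tokens, num_image_patches),
          attention weights from a pretrained MLLM (hypothetical layout).
    query_token_ids: indices of the prompt tokens describing the target element.
    """
    # Aggregate: average the attention of the selected query tokens over heads.
    maps = attn[:, query_token_ids, :]            # (heads, |query|, patches)
    agg = maps.mean(axis=(0, 1))                  # (patches,)
    # Reshape the flat patch scores back onto the image's patch grid.
    grid = agg.reshape(grid_h, grid_w)
    # Take the highest-attention patch and map its center to pixel coordinates.
    py, px = np.unravel_index(np.argmax(grid), grid.shape)
    x = (px + 0.5) * img_w / grid_w
    y = (py + 0.5) * img_h / grid_h
    return x, y
```

In practice the choice of which tokens, heads, and layers to aggregate is what distinguishes a working grounding method from a naive argmax; the sketch only conveys the overall flow from attention weights to a predicted click point.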
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| GUI Grounding | ScreenSpot v2 | Avg Accuracy | 51.2 | 203 |
| GUI Grounding | OSWorld-G | Average Score | 25.3 | 74 |
| GUI Grounding | OSWorld-G (test) | Element Accuracy | 25.3 | 52 |
| GUI Grounding | ScreenSpot-Pro (test) | Element Accuracy | 3 | 43 |
| GUI Grounding | ScreenSpot v1 (test) | Mobile Text Acc | 88.3 | 25 |
| GUI Grounding | ScreenSpot (test) | Element Accuracy | 57.5 | 13 |
| GUI Grounding | ScreenSpot v2 (test) | Element Accuracy | 51.2 | 9 |