MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
About
Recently, mobile AI agents based on VLMs have been gaining increasing attention. These works typically use a VLM as a foundation and fine-tune it on instruction-based mobile datasets. However, such VLMs are usually pre-trained on general-domain data and therefore lack fundamental capabilities specific to the mobile domain: they may struggle to recognize specific UI elements and to understand fine-grained intra-UI information. In addition, current fine-tuning tasks focus on interacting with the single element most relevant to a given instruction, so the fine-tuned VLMs may still ignore the relationships between UI pages, neglect the roles elements play in page transitions, and lack inter-UI understanding. To address these issues, we propose MobileVLM, a VLM with two additional pre-training stages that enhance both intra- and inter-UI understanding. We define four UI-based pre-training tasks that enable the model to better perceive fine-grained elements and capture page-transition actions. To address the lack of mobile pre-training data, we built a large Chinese mobile dataset, Mobile3M, from scratch; it contains 3 million UI pages and real-world transition actions, forming a directed graph structure. Experimental results show that MobileVLM excels on both our test set and public mobile benchmarks, outperforming existing VLMs.
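The directed-graph structure of Mobile3M can be pictured as UI pages for nodes and transition actions for labeled edges. The sketch below is illustrative only; the names (`UIGraph`, `add_transition`, the `"tap:..."` action strings) are assumptions for the example, not Mobile3M's actual released format.

```python
from dataclasses import dataclass, field

@dataclass
class UIGraph:
    # Adjacency map: source page id -> list of (action label, target page id).
    # Edges are directed, since an action moves from one UI page to another.
    edges: dict = field(default_factory=dict)

    def add_transition(self, src: str, action: str, dst: str) -> None:
        """Record a real-world transition action from page `src` to page `dst`."""
        self.edges.setdefault(src, []).append((action, dst))

    def successors(self, src: str) -> list:
        """Pages reachable from `src` in one action."""
        return [dst for _, dst in self.edges.get(src, [])]

g = UIGraph()
g.add_transition("home", "tap:search_box", "search")
g.add_transition("search", "tap:result_0", "detail")
print(g.successors("home"))  # ['search']
```

A graph like this lets pre-training tasks sample (page, action, next page) triples, which is what "capturing page-transition actions" requires.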
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy | 25.3 | 204 |
| Diagram Understanding | AI2D (test) | Accuracy | 36.6 | 131 |
| Object Hallucination Evaluation | POPE (test) | Accuracy | 84.5 | 79 |
| Science Question Answering | ScienceQA IMG (test) | Accuracy | 57.3 | 74 |
| Multimodal Evaluation | MME (test) | Perception Score | 1200 | 13 |
| Multimodal Benchmarking | MMBench (dev) | Accuracy | 59.6 | 6 |