
Harnessing Webpage UIs for Text-Rich Visual Understanding

About

Text-rich visual understanding, the ability to process environments where dense textual content is integrated with visuals, is crucial for multimodal large language models (MLLMs) to interact effectively with structured environments. To enhance this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). Despite lacking direct visual input, text-based LLMs can process structured text representations from webpage accessibility trees. These instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset of 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on MultiUI not only excel in web UI tasks, achieving up to a 48% improvement on VisualWebBench and a 19.1% boost in element accuracy on the web agent dataset Mind2Web, but also generalize surprisingly well to non-web UI tasks and even to non-UI domains such as document understanding, OCR, and chart interpretation. These results highlight the broad applicability of web UI data for advancing text-rich visual understanding across a range of scenarios.

Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue • 2024
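
The synthesis pipeline the abstract describes can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' released code: the function names (get_accessibility_tree, call_text_llm, synthesize_samples), the prompt wording, and the output record format are all assumptions standing in for whatever browser tooling and text-based LLM the actual pipeline uses.

```python
# Sketch of a MultiUI-style data synthesis pipeline: a text-only LLM reads a
# webpage's accessibility tree and writes instruction-answer pairs, which are
# then paired with the page screenshot to form multimodal training samples.
# Every name below is a hypothetical placeholder, not the paper's actual code.

import json


PROMPT_TEMPLATE = """You are given the accessibility tree of a webpage.
Generate question-answer pairs that test understanding of its content.

Accessibility tree:
{tree}

Return a JSON list of objects with "question" and "answer" fields."""


def get_accessibility_tree(url: str) -> str:
    """Placeholder: dump the page's accessibility tree as structured text,
    e.g. via a headless browser."""
    raise NotImplementedError


def call_text_llm(prompt: str) -> str:
    """Placeholder: query any text-based LLM and return its completion."""
    raise NotImplementedError


def synthesize_samples(url: str, screenshot_path: str) -> list[dict]:
    """Generate instructions from text alone, then attach the screenshot
    so the resulting samples can train a multimodal model."""
    tree = get_accessibility_tree(url)
    raw = call_text_llm(PROMPT_TEMPLATE.format(tree=tree))
    pairs = json.loads(raw)
    return [
        {
            "image": screenshot_path,
            "instruction": p["question"],
            "response": p["answer"],
        }
        for p in pairs
    ]
```

The key design point, per the abstract, is that the LLM never sees pixels: it works entirely from the accessibility tree, and the screenshot is only joined to the generated instructions afterward.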

Related benchmarks

Task             Dataset                    Result                  Rank
Web agent tasks  Mind2Web (Cross-Task)      Element Accuracy 43.4   49
Web agent tasks  Mind2Web (Cross-Website)   Element Accuracy 39.2   40
Web agent tasks  Mind2Web (Cross-Domain)    Element Accuracy 40.4   37
