Harnessing Webpage UIs for Text-Rich Visual Understanding

About

Text-rich visual understanding, the ability to process environments where dense textual content is integrated with visuals, is crucial for multimodal large language models (MLLMs) to interact effectively with structured environments. To enhance this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). Despite lacking direct visual input, text-based LLMs are able to process structured text representations from webpage accessibility trees. These instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on MultiUI not only excel in web UI tasks (achieving up to a 48% improvement on VisualWebBench and a 19.1% boost in element accuracy on the web agent dataset Mind2Web) but also generalize surprisingly well to non-web UI tasks and even to non-UI domains, such as document understanding, OCR, and chart interpretation. These results highlight the broad applicability of web UI data for advancing text-rich visual understanding across various scenarios.

Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue • 2024

Related benchmarks

Task	Dataset	Result
Web agent tasks	Mind2Web Cross-Task	Step Success Rate 38.2	64
Web agent tasks	Mind2Web (Cross-Website)	Element Accuracy 39.2	40
Web agent tasks	Mind2Web Cross-Domain	Element Accuracy 40.4	37
Webpage Question Answering	VisualWebBench MultiUI-WQA	Accuracy 86.6	11
