mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

About

Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure information in Visual Document Understanding and propose Unified Structure Learning to boost the performance of MLLMs. Our Unified Structure Learning comprises structure-aware parsing tasks and multi-grained text localization tasks across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and effective vision-to-text module, H-Reducer, which not only maintains the layout information but also reduces the length of visual features by merging horizontally adjacent patches through convolution, enabling the LLM to understand high-resolution images more efficiently. Furthermore, by constructing structure-aware text sequences and multi-grained pairs of texts and bounding boxes for publicly available text-rich images, we build a comprehensive training set, DocStruct4M, to support structure learning. Finally, we construct a small but high-quality reasoning tuning dataset, DocReason25K, to trigger the detailed explanation ability in the document domain. Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the SOTA performance of MLLMs with a 7B LLM by more than 10 points on 5 of the 10 benchmarks. Our code, models, and datasets are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5.
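The merging step attributed to H-Reducer in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `h_reduce` and `mean_weights`, the merge width of 4, and the averaging kernel are all assumptions chosen for illustration. The sketch treats a convolution whose kernel spans several horizontally adjacent patches, with a matching stride, as a linear projection of their concatenated features. Plain Python lists are used so that the example has no dependencies.

```python
def h_reduce(features, weights, merge=4):
    """Merge `merge` horizontally adjacent patch features into one.

    features: grid of shape [rows][cols][dim] (nested lists).
    weights:  projection of shape [out_dim][merge * dim], playing the role
              of a 1 x merge convolution kernel applied with stride 1 x merge.
    Returns a grid of shape [rows][cols // merge][out_dim].
    """
    reduced = []
    for row in features:
        if len(row) % merge != 0:
            raise ValueError("row width must be divisible by merge")
        out_row = []
        for start in range(0, len(row), merge):
            # Concatenate the features of `merge` neighbouring patches in this row.
            window = [x for patch in row[start:start + merge] for x in patch]
            # Projecting the window yields one convolution output position.
            out_row.append(
                [sum(w * x for w, x in zip(w_row, window)) for w_row in weights]
            )
        reduced.append(out_row)
    return reduced


def mean_weights(dim, merge=4):
    """Hypothetical fixed kernel: average the merged patches per channel."""
    return [
        [1.0 / merge if j % dim == c else 0.0 for j in range(merge * dim)]
        for c in range(dim)
    ]


# Toy input: 2 rows x 8 columns of 2-dimensional patch features [row, col].
grid = [[[float(r), float(c)] for c in range(8)] for r in range(2)]
out = h_reduce(grid, mean_weights(dim=2), merge=4)

# The row count is preserved; the column count shrinks by the merge width.
print(len(out), len(out[0]), len(out[0][0]))
```

Because merging happens only along the horizontal axis, the number of rows is unchanged, which is one way to read the abstract's claim that layout information is maintained while the visual sequence gets shorter. A trained module would learn the kernel weights rather than use a fixed average.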

Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou • 2024

Related benchmarks

Task	Dataset	Result
Visual Question Answering	TextVQA	Accuracy 58.6	1455
Text-based Visual Question Answering	TextVQA	Accuracy 68.6	984
Mathematical Reasoning	MathVista	Score 50.7	566
Visual Mathematical Reasoning	MathVista	Accuracy 50.7	448
Chart Question Answering	ChartQA	Accuracy 70.2	404
Visual Question Answering	TextVQA (val)	VQA Score 68.6	371
OCR Evaluation	OCRBench	Score 599	350
Document Visual Question Answering	DocVQA	ANLS 82.2	301
Document Visual Question Answering	DocVQA (test)	ANLS 82.2	292
Text-based Visual Question Answering	TextVQA (val)	Accuracy 68.8	276

Showing 10 of 76 rows
