mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

About

Multimodal Large Language Models (MLLMs) have achieved promising OCR-free Document Understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory usage and slower inference, particularly in multi-page document comprehension. To address these challenges, we propose a High-resolution DocCompressor module that compresses each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension and balance token efficiency against question-answering performance, we develop DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state of the art across multi-page document understanding benchmarks and reduces first-token latency by more than 50%, demonstrating advanced capabilities in multi-page question answering, explanation with evidence pages, and cross-page structure understanding. Additionally, compared to single-image MLLMs trained on similar data, DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our code, models, and data are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl2.
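To make the token-efficiency claim concrete, the following is a minimal back-of-the-envelope sketch, not code from the DocOwl2 repository. It uses only two figures from the abstract (324 tokens per page image, and "less than 20%" of the visual tokens of comparable single-image MLLMs). The function names and the example page counts are hypothetical, chosen for illustration only.

```python
# Figure taken from the abstract: each page image is compressed to 324 visual tokens.
TOKENS_PER_PAGE = 324


def visual_tokens(num_pages: int, tokens_per_page: int = TOKENS_PER_PAGE) -> int:
    """Total visual tokens passed to the language model for a multi-page document."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * tokens_per_page


def implied_baseline_floor(compressed_tokens: int, max_percent: int = 20) -> float:
    """Lower bound on a baseline's token count.

    If compressed_tokens is less than max_percent of the baseline's count,
    the baseline must use more than the value returned here.
    """
    return compressed_tokens * 100 / max_percent


if __name__ == "__main__":
    # Page counts below are illustrative assumptions, not values from the paper.
    for pages in (1, 10, 20):
        print(pages, visual_tokens(pages))  # 324, 3240, 6480 tokens
    # "Less than 20%" implies the compared models use more than this many tokens per page.
    print(implied_baseline_floor(TOKENS_PER_PAGE))  # 1620.0
```

Read this way, the abstract's "less than 20%" figure implies that the compared single-image models spend more than 1,620 visual tokens on one page, which is consistent with the "thousands of visual tokens" mentioned at the start of the abstract.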

Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou • 2024

Related benchmarks

Task	Dataset	Result
Visual Question Answering	TextVQA	Accuracy 66.7	1453
Visual Question Answering	ChartQA	--	519
Chart Question Answering	ChartQA	Accuracy 70	371
Document Visual Question Answering	DocVQA	ANLS 80.7	301
Document Visual Question Answering	DocVQA (test)	ANLS 80.7	292
Text-based Visual Question Answering	TextVQA (val)	Accuracy 66.7	276
Chart Question Answering	ChartQA (test)	--	190
Information Visual Question Answering	InfoVQA (test)	ANLS 46.4	130
Infographic Question Answering	InfoVQA	ANLS 46.4	117
Image Captioning	TextCaps	CIDEr 131.8	112

Showing 10 of 38 rows
