mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

About

Multimodal Large Language Models (MLLMs) have achieved promising OCR-free document understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory use and slower inference, particularly in multi-page document comprehension. To address these challenges, we propose a High-resolution DocCompressor module that compresses each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension and balance token efficiency against question-answering performance, we develop DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state of the art across multi-page document understanding benchmarks and reduces first-token latency by more than 50%, demonstrating advanced capabilities in multi-page question answering, explanation with evidence pages, and cross-page structure understanding. Additionally, compared to single-image MLLMs trained on similar data, DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our code, models, and data are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl2.
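To make the token figures in the abstract concrete, here is a minimal arithmetic sketch. The 324 tokens per page and the "less than 20%" ratio come from the abstract; the function names and the page counts are hypothetical and chosen only for illustration. The sketch counts visual tokens only and says nothing about text tokens, memory, or latency.

```python
# Figures taken from the abstract.
TOKENS_PER_PAGE = 324          # visual tokens per compressed page image
MAX_SHARE_OF_BASELINE_PCT = 20 # "less than 20% of the visual tokens"


def visual_tokens(num_pages: int, tokens_per_page: int = TOKENS_PER_PAGE) -> int:
    """Total visual tokens for a document with `num_pages` page images."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * tokens_per_page


def implied_baseline_lower_bound(
    compressed: int = TOKENS_PER_PAGE,
    share_pct: int = MAX_SHARE_OF_BASELINE_PCT,
) -> int:
    """If `compressed` is less than `share_pct` percent of a baseline count B,
    then B must exceed compressed * 100 / share_pct. Returns that bound."""
    return compressed * 100 // share_pct


if __name__ == "__main__":
    for pages in (1, 10, 20):  # hypothetical document lengths
        print(pages, "pages ->", visual_tokens(pages), "visual tokens")
    print("compared baselines use more than",
          implied_baseline_lower_bound(), "visual tokens per image")
```

Read this way, the "less than 20%" claim implies that the single-image models in the comparison spend more than 1620 visual tokens on each image, which is consistent with the abstract's remark about thousands of tokens per document image.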

Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou • 2024

Related benchmarks

Task | Dataset | Result | Rank
Visual Question Answering | TextVQA | Accuracy 66.7 | 1285
Chart Question Answering | ChartQA | Accuracy 70 | 356
Document Visual Question Answering | DocVQA | ANLS 80.7 | 263
Text-based Visual Question Answering | TextVQA (val) | Accuracy 66.7 | 262
Document Visual Question Answering | DocVQA (test) | ANLS 80.7 | 213
Chart Question Answering | ChartQA (test) | -- | 176
Information Visual Question Answering | InfoVQA (test) | ANLS 46.4 | 130
Table Question Answering | WTQ | Accuracy 36.5 | 101
Image Captioning | TextCaps | CIDEr 131.8 | 96
Fact Verification | TabFact | Accuracy 78.2 | 83
Showing 10 of 33 rows
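
Several rows above report ANLS (Average Normalized Levenshtein Similarity). As a rough guide to reading those numbers, the sketch below computes ANLS in its commonly used form: each prediction is scored against every reference answer as 1 minus the normalized edit distance, scores whose normalized distance reaches a threshold of 0.5 are set to 0, the best score per question is kept, and the results are averaged. The lowercasing and whitespace stripping are assumptions made here; official evaluation scripts may normalize differently, and benchmark pages usually report the value multiplied by 100.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via a two-row DP table."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Average over questions of the best thresholded similarity.

    `references[i]` holds the acceptable answers for `predictions[i]`.
    Returns a value in [0, 1].
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()  # assumed normalization
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Because of the threshold, an answer that is mostly wrong contributes nothing, while a near miss such as a small recognition error still earns partial credit.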
