
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

About

Multimodal Large Language Models (MLLMs) have achieved promising OCR-free document understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory usage and slower inference, particularly in multi-page document comprehension. To address these challenges, we propose a High-resolution DocCompressor module that compresses each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension and balance token efficiency against question-answering performance, we develop DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state of the art across multi-page document understanding benchmarks and reduces first-token latency by more than 50%, demonstrating advanced capabilities in multi-page question answering, explanation with evidence pages, and cross-page structure understanding. Additionally, compared to single-image MLLMs trained on similar data, DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our code, models, and data are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl2.
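
To make the compression idea concrete, here is a minimal sketch in which low-resolution global features act as queries in a single cross-attention step over the high-resolution features, so the output has one token per query. This is an illustration under stated assumptions, not the DocCompressor itself: the abstract does not specify the layer design, and the function name, the feature dimension (64), the number of high-resolution tokens (2916), and the absence of learned projections are all hypothetical choices. Only the figure of 324 output tokens comes from the abstract.

```python
import numpy as np


def compress_visual_tokens(global_feats, highres_feats):
    """Single-head cross-attention without learned projections (illustrative only).

    global_feats:  (num_queries, dim) low-resolution global features, used as queries
    highres_feats: (num_tokens, dim)  high-resolution features, used as keys and values
    Returns the compressed tokens (num_queries, dim) and the attention weights.
    """
    dim = global_feats.shape[-1]
    scores = global_feats @ highres_feats.T / np.sqrt(dim)
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over the high-resolution tokens
    return attn @ highres_feats, attn


rng = np.random.default_rng(0)
global_feats = rng.standard_normal((324, 64))     # 324 queries, matching the abstract's token budget
highres_feats = rng.standard_normal((2916, 64))   # hypothetical high-resolution token count
compressed, attn = compress_visual_tokens(global_feats, highres_feats)
print(compressed.shape)  # (324, 64)
```

Whatever the number of high-resolution tokens, the output length is fixed by the number of queries, which is what keeps the per-page token count constant.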

Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou • 2024

Related benchmarks

Task                                    Dataset    Metric    Result  Rank
Visual Question Answering               TextVQA    Accuracy  66.7    1117
Chart Question Answering                ChartQA    Accuracy  70      229
Document Visual Question Answering      DocVQA     ANLS      80.7    164
Table Question Answering                WTQ        Accuracy  36.5    101
Image Captioning                        TextCaps   CIDEr     131.8   96
Fact Verification                       TabFact    Accuracy  78.2    73
Document Visual Question Answering      InfoVQA    ANLS      46.4    32
Multi-page Document Question Answering  MP-DocVQA  ANLS      69.42   11
Multi-page Document Question Answering  DUDE       ANLS      46.77   11
Form Understanding                      DeepForm   Accuracy  66.8    8
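
Several rows above report ANLS (Average Normalized Levenshtein Similarity). The sketch below shows how the metric is commonly computed: per question, take the best similarity over the ground-truth answers, zero it out when the normalized edit distance reaches the threshold (0.5 is the usual value), and average over questions. The function names and example strings are hypothetical, and the official evaluation scripts for each benchmark may normalize answers differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Mean over questions of the best thresholded similarity to any reference answer."""
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for answer in answers:
            g = answer.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions)


# Hypothetical predictions: an exact match (after lowercasing) and a one-character miss.
score = anls(["mPLUG-DocOwl2", "2023"], [["mplug-docowl2"], ["2024"]])
print(score)  # 0.875
```

The function returns a value in [0, 1]; the results in the table appear to be the same quantity scaled to 0-100.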
