
Docopilot: Improving Multimodal Models for Document-Level Understanding

About

Despite significant progress in multimodal large language models (MLLMs), their performance on complex, multi-page document comprehension remains inadequate, largely due to the lack of high-quality, document-level datasets. While current retrieval-augmented generation (RAG) methods offer partial solutions, they suffer from issues such as fragmented retrieval contexts, multi-stage error accumulation, and the extra time cost of retrieval. In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG. Experiments demonstrate that Docopilot achieves superior coherence, accuracy, and efficiency in document understanding tasks and multi-turn interactions, setting a new baseline for document-level multimodal understanding. Data, code, and models are released at https://github.com/OpenGVLab/Docopilot

Yuchen Duan, Zhe Chen, Yusong Hu, Weiyun Wang, Shenglong Ye, Botian Shi, Lewei Lu, Qibin Hou, Tong Lu, Hongsheng Li, Jifeng Dai, Wenhai Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | MathVista | Score 45 | 474 |
| Visual Mathematical Reasoning | MathVista | Accuracy 45 | 366 |
| Massive Multi-discipline Multimodal Understanding | MMMU | -- | 216 |
| Information Visual Question Answering | InfoVQA | Accuracy 75 | 110 |
| Long-context document understanding | MMLongBench-Doc | Accuracy 28.8 | 58 |
| Multi-modal Reasoning | EMMA | Accuracy 12.1 | 57 |
| Document Visual Question Answering | SlideVQA | Accuracy 0.357 | 53 |
| Visual Perception | V* | Score 40.1 | 42 |
| Multi-page Document Question Answering | MP-DocVQA | ANLS 81.3 | 38 |
| Document Understanding | DUDE | Accuracy 40.7 | 32 |
Showing 10 of 21 rows
