Docopilot: Improving Multimodal Models for Document-Level Understanding
About
Despite significant progress in multimodal large language models (MLLMs), their performance on complex, multi-page document comprehension remains inadequate, largely due to the lack of high-quality, document-level datasets. While current retrieval-augmented generation (RAG) methods offer partial solutions, they suffer from fragmented retrieval contexts, multi-stage error accumulation, and the extra time cost of retrieval. In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG. Experiments demonstrate that Docopilot achieves superior coherence, accuracy, and efficiency in document understanding tasks and multi-turn interactions, setting a new baseline for document-level multimodal understanding. Data, code, and models are released at https://github.com/OpenGVLab/Docopilot
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MathVista | Score | 45 | 474 |
| Visual Mathematical Reasoning | MathVista | Accuracy | 45 | 366 |
| Massive Multi-discipline Multimodal Understanding | MMMU | -- | -- | 216 |
| Information Visual Question Answering | InfoVQA | Accuracy | 75 | 110 |
| Long-context document understanding | MMLongBench-Doc | Accuracy | 28.8 | 58 |
| Multi-modal Reasoning | EMMA | Accuracy | 12.1 | 57 |
| Document Visual Question Answering | SlideVQA | Accuracy | 0.357 | 53 |
| Visual Perception | V* | Score | 40.1 | 42 |
| Multi-page Document Question Answering | MP-DocVQA | ANLS | 81.3 | 38 |
| Document Understanding | DUDE | Accuracy | 40.7 | 32 |