VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
About
We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which directly understands varied documents and modalities in a unified image format, avoiding the information loss that occurs when documents are parsed to obtain text. To improve performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with the textual content of the documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and generalizes well, highlighting the potential of an effective RAG paradigm for real-world documents.
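The retrieval step described above can be illustrated with a minimal sketch. It assumes that each document page image and each question have already been encoded into dense vectors; in VDocRAG this encoding is done by a vision-language model, which is not reproduced here. The function names `cosine` and `retrieve_top_k`, the toy vectors, and the use of cosine similarity as the scoring function are all assumptions made for illustration, not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, page_vecs, k=2):
    """Return the indices of the k pages whose embeddings best match the query.

    query_vec: dense embedding of the question (hypothetical, precomputed).
    page_vecs: list of dense embeddings, one per document page image
               (hypothetical, precomputed).
    """
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(page_vecs)]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy 2-D embeddings standing in for real page-image embeddings.
page_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [1.0, 0.1]
top_pages = retrieve_top_k(query_vec, page_vecs, k=2)
```

In a full pipeline, the retrieved page images would then be passed together with the question to the generator, which produces the answer.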
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long-context document understanding | MMLongBench-Doc | Accuracy | 18.4 | 58 |
| Document Visual Question Answering | SlideVQA | Accuracy | 0.8 | 53 |
| Multi-page Document Question Answering | MP-DocVQA | ANLS | 62.6 | 38 |
| Slide Question Answering | SlideVQA | Overall Score | 65.2 | 29 |
| End-to-end Question Answering | FinSlides | Overall Score | 83.5 | 25 |
| End-to-end Question Answering | TechSlides | Overall Score | 67 | 25 |
| Document Question Answering | M3DocVQA | Exact Match | 24.4 | 24 |
| Multi-page Document Understanding | DUDE | ANLS | 44 | 21 |
| Document Question Answering | SlideVQA (test) | -- | -- | 19 |
| Document Understanding | MPDocVQA | ANLS | 62.6 | 15 |
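
Several rows above report ANLS (Average Normalized Levenshtein Similarity). As a rough guide to how that number is computed, the sketch below follows the commonly used definition: for each question, take the best similarity between the prediction and any reference answer, set it to zero when the normalized edit distance reaches a threshold of 0.5, and average over questions. The lowercasing and whitespace stripping, and the helper names `levenshtein`, `anls_score`, and `anls`, are assumptions; official evaluation scripts for individual benchmarks may differ in such details.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def anls_score(prediction, references, threshold=0.5):
    """Best thresholded similarity between one prediction and its references."""
    pred = prediction.strip().lower()
    best = 0.0
    for ref in references:
        gold = ref.strip().lower()
        longest = max(len(pred), len(gold))
        if longest == 0:
            similarity = 1.0  # both strings empty: treat as a perfect match
        else:
            distance = levenshtein(pred, gold) / longest
            similarity = 1.0 - distance if distance < threshold else 0.0
        best = max(best, similarity)
    return best

def anls(predictions, references_per_question, threshold=0.5):
    """Average the per-question scores over the whole evaluation set."""
    scores = [anls_score(p, refs, threshold)
              for p, refs in zip(predictions, references_per_question)]
    return sum(scores) / len(scores)
```

The table reports ANLS on a 0 to 100 scale, so a value returned by `anls` would be multiplied by 100 for comparison.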