VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

About

We aim to develop a retrieval-augmented generation (RAG) framework that answers questions over a corpus of visually-rich documents presented in mixed modalities (e.g., charts, tables) and diverse formats (e.g., PDF, PPTX). In this paper, we introduce a new RAG framework, VDocRAG, which directly understands varied documents and modalities in a unified image format, avoiding the information loss that occurs when documents are parsed into text. To improve performance, we propose novel self-supervised pre-training tasks that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with the textual content of documents. Furthermore, we introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats. OpenDocVQA provides a comprehensive resource for training and evaluating retrieval and question answering models on visually-rich documents in an open-domain setting. Experiments show that VDocRAG substantially outperforms conventional text-based RAG and generalizes well, highlighting the potential of an effective RAG paradigm for real-world documents.
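The retrieval step described above can be illustrated with a minimal sketch. It assumes each document page image and each question have already been encoded into fixed-size dense vectors, and that pages are ranked by cosine similarity. The similarity function, the toy vectors, the page identifiers, and the names `cosine` and `retrieve_top_k` are all hypothetical and are not taken from the VDocRAG implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(query_vec, page_vecs, k=2):
    """Return the ids of the k pages whose embeddings are most similar to the query."""
    scored = [(cosine(query_vec, vec), page_id) for page_id, vec in page_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [page_id for _, page_id in scored[:k]]


# Toy embeddings standing in for the output of a page-image encoder (assumed values).
page_vecs = {
    "report.pdf#p3": [0.9, 0.1, 0.0],
    "slides.pptx#p7": [0.1, 0.8, 0.3],
    "report.pdf#p8": [0.7, 0.3, 0.1],
}
# Toy embedding standing in for an encoded question (assumed values).
query_vec = [1.0, 0.2, 0.0]

top_pages = retrieve_top_k(query_vec, page_vecs, k=2)
print(top_pages)
```

In a full pipeline, the retrieved page images would then be passed, together with the question, to a generator model that produces the answer. That step is omitted here.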

Ryota Tanaka, Taichi Iki, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, Jun Suzuki • 2025

Related benchmarks

Task	Dataset	Result
Long-context document understanding	MMLongBench-Doc	Accuracy 18.4	58
Document Visual Question Answering	SlideVQA	Accuracy 0.8	53
Multi-page Document Question Answering	MP-DocVQA	ANLS 62.6	38
Slide Question Answering	SlideVQA	Overall Score 65.2	29
End-to-end Question Answering	FinSlides	Overall Score 83.5	25
End-to-end Question Answering	TechSlides	Overall Score 67	25
Document Question Answering	M3DocVQA	Exact Match 24.4	24
Multi-page Document Understanding	DUDE	ANLS 44	21
Document Question Answering	SlideVQA (test)	--	19
Document Understanding	MPDocVQA	ANLS 62.6	15

Showing 10 of 28 rows
