Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval

About

Document retrieval in real-world scenarios faces significant challenges due to diverse document formats and modalities. Traditional text-based approaches rely on tailored parsing techniques that disregard layout information and are prone to errors, while recent parsing-free visual methods often struggle to capture fine-grained textual semantics in text-rich scenarios. To address these limitations, we propose Unveil, a novel visual-textual embedding framework that effectively integrates textual and visual features for robust document representation. Through knowledge distillation, we transfer the semantic understanding capabilities from the visual-textual embedding model to a purely visual model, enabling efficient parsing-free retrieval while preserving semantic fidelity. Experimental results demonstrate that our visual-textual embedding method surpasses existing approaches, while knowledge distillation successfully bridges the performance gap between visual-textual and visual-only methods, improving both retrieval accuracy and efficiency.

Hao Sun, Yingyan Hou, Jiayan Guo, Bo Wang, Chunyu Yang, Jinsong Ni, Yan Zhang • 2026
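
The abstract describes distilling a visual-textual embedding model (the teacher) into a purely visual model (the student) but does not state the training objective. As a rough illustration only, the sketch below assumes one common choice for embedding distillation: minimizing the mean cosine distance between student and teacher embeddings of the same documents. The function names and the loss itself are assumptions made for illustration, not the authors' published method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distillation_loss(student_embs, teacher_embs):
    """Mean cosine distance (1 - cosine similarity) over paired embeddings.

    Hypothetical objective: student_embs[i] would come from the visual-only
    model and teacher_embs[i] from the visual-textual model, both computed
    for the same document i.
    """
    if len(student_embs) != len(teacher_embs) or not student_embs:
        raise ValueError("need equally sized, non-empty embedding lists")
    distances = [1.0 - cosine(s, t) for s, t in zip(student_embs, teacher_embs)]
    return sum(distances) / len(distances)
```

Under this assumed objective the loss is 0 when every student embedding points in the same direction as its teacher embedding, and it grows as the two diverge, so the student is pushed to reproduce the teacher's representation without needing parsed text at inference time.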

Related benchmarks

Task	Dataset	Result
Information Retrieval	HotpotQA	Recall 52.4	31
Visual document retrieval	DocVQA	Recall@10 96.06	13
Visual document retrieval	PlotQA	Recall@10 64.82	13
Visual document retrieval	SlideVQA	Recall@10 97.87	13
Web-Page Retrieval	NQ	Recall 75.8	13
Web-Page Retrieval	TriviaQA	Recall 78.75	13
Web-Page Retrieval	WebQ	Recall 75.81	13
Web-Page Retrieval	2WikiMultihopQA	Recall 40.57	13
Web-Page Retrieval	ASQA	Recall 84.13	13
Web-Page Retrieval	Web-Page Retrieval Average	Recall 67.91	13

Showing 10 of 12 rows
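
Several results in the table are reported as Recall@10. The page does not say exactly how the metric was computed for each benchmark, so the sketch below assumes the standard definition: for each query, the fraction of its relevant documents that appear among the top k retrieved results, averaged over queries. The function names and input layout are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents found in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("query has no relevant documents")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        raise ValueError("no queries to evaluate")
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)
```

When each query has exactly one relevant document, as is typical for page-level retrieval from question-answering datasets, this reduces to the share of queries whose target document is ranked in the top k.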
