
Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval

About

Document retrieval in real-world scenarios faces significant challenges due to diverse document formats and modalities. Traditional text-based approaches rely on tailored parsing techniques that disregard layout information and are prone to errors, while recent parsing-free visual methods often struggle to capture fine-grained textual semantics in text-rich scenarios. To address these limitations, we propose Unveil, a novel visual-textual embedding framework that effectively integrates textual and visual features for robust document representation. Through knowledge distillation, we transfer the semantic understanding capabilities from the visual-textual embedding model to a purely visual model, enabling efficient parsing-free retrieval while preserving semantic fidelity. Experimental results demonstrate that our visual-textual embedding method surpasses existing approaches, while knowledge distillation successfully bridges the performance gap between visual-textual and visual-only methods, improving both retrieval accuracy and efficiency.
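
The abstract does not state the distillation objective, so the following is only a minimal sketch of one common way to distill embeddings: penalise the cosine distance between the visual-only (student) embedding and the visual-textual (teacher) embedding of the same document. The function names `cosine_similarity` and `distillation_loss`, and the choice of cosine distance itself, are assumptions made for illustration and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def distillation_loss(
    student_embeddings: Sequence[Sequence[float]],
    teacher_embeddings: Sequence[Sequence[float]],
) -> float:
    """Mean cosine distance (1 - cosine similarity) over a batch.

    Hypothetical objective: the student (visual-only) embedding of each
    document is pulled towards the teacher (visual-textual) embedding.
    """
    if len(student_embeddings) != len(teacher_embeddings):
        raise ValueError("batches must have the same size")
    if not student_embeddings:
        raise ValueError("batch must not be empty")
    total = sum(
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_embeddings, teacher_embeddings)
    )
    return total / len(student_embeddings)
```

Under this assumed objective the loss is 0 when the student reproduces the teacher's direction exactly and grows to 2 when the two embeddings point in opposite directions. In practice the loss would be minimised with an autodiff framework; plain Python is used here only to keep the sketch self-contained.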

Hao Sun, Yingyan Hou, Jiayan Guo, Bo Wang, Chunyu Yang, Jinsong Ni, Yan Zhang • 2026

Related benchmarks

Task                       Dataset                     Metric     Result  Rank
Information Retrieval      HotpotQA                    Recall     52.4    31
Visual document retrieval  DocVQA                      Recall@10  96.06   13
Visual document retrieval  PlotQA                      Recall@10  64.82   13
Visual document retrieval  SlideVQA                    Recall@10  97.87   13
Web-Page Retrieval         NQ                          Recall     75.8    13
Web-Page Retrieval         TriviaQA                    Recall     78.75   13
Web-Page Retrieval         WebQ                        Recall     75.81   13
Web-Page Retrieval         2WikiMultihopQA             Recall     40.57   13
Web-Page Retrieval         ASQA                        Recall     84.13   13
Web-Page Retrieval         Web-Page Retrieval Average  Recall     67.91   13
Showing 10 of 12 rows
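
For readers unfamiliar with the Recall@10 metric reported above, here is a minimal sketch of how Recall@k is commonly computed. The evaluation protocol used for these benchmarks is not given on this page, so the definition below (per-query fraction of relevant documents found in the top k, averaged over queries) and the name `recall_at_k` are assumptions for illustration.

```python
from typing import Hashable, Sequence, Set


def recall_at_k(
    ranked_ids: Sequence[Sequence[Hashable]],
    relevant_ids: Sequence[Set[Hashable]],
    k: int,
) -> float:
    """Mean Recall@k over queries.

    ranked_ids[i]   : document ids returned for query i, best first.
    relevant_ids[i] : set of ground-truth relevant ids for query i.
    Queries with no relevant documents are skipped (an assumption).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(ranked_ids) != len(relevant_ids):
        raise ValueError("need one ranking and one relevance set per query")
    scores = []
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        if not relevant:
            continue
        hits = len(set(ranking[:k]) & relevant)
        scores.append(hits / len(relevant))
    if not scores:
        raise ValueError("no query has relevant documents")
    return sum(scores) / len(scores)


# Toy data: two queries with hypothetical document ids.
rankings = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
relevant = [{"d1"}, {"d4", "d5"}]
print(recall_at_k(rankings, relevant, k=2))  # 0.5
print(recall_at_k(rankings, relevant, k=3))  # 0.75
```

When every query has exactly one relevant document, this reduces to the fraction of queries whose target appears in the top k. The function returns a value in [0, 1]; the table values look like the same quantity expressed on a 0 to 100 scale, though the page does not say so explicitly.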
