Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval
About
Document retrieval in real-world scenarios faces significant challenges due to diverse document formats and modalities. Traditional text-based approaches rely on tailored parsing techniques that disregard layout information and are prone to errors, while recent parsing-free visual methods often struggle to capture fine-grained textual semantics in text-rich scenarios. To address these limitations, we propose Unveil, a novel visual-textual embedding framework that effectively integrates textual and visual features for robust document representation. Through knowledge distillation, we transfer the semantic understanding capabilities from the visual-textual embedding model to a purely visual model, enabling efficient parsing-free retrieval while preserving semantic fidelity. Experimental results demonstrate that our visual-textual embedding method surpasses existing approaches, while knowledge distillation successfully bridges the performance gap between visual-textual and visual-only methods, improving both retrieval accuracy and efficiency.
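
The abstract does not state which distillation objective is used, so the following is only a minimal sketch of one common choice: pulling each visual-only (student) embedding toward the visual-textual (teacher) embedding of the same document with a cosine-distance loss. The function names `cosine` and `distillation_loss`, and the loss itself, are illustrative assumptions, not the paper's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distillation_loss(teacher_embs, student_embs):
    """Mean (1 - cosine similarity) over paired teacher/student embeddings.

    Assumed objective for illustration only: teacher_embs[i] and
    student_embs[i] must describe the same document.
    """
    if len(teacher_embs) != len(student_embs):
        raise ValueError("teacher and student batches must have equal size")
    total = sum(1.0 - cosine(t, s) for t, s in zip(teacher_embs, student_embs))
    return total / len(teacher_embs)


# Toy batch: the first pair is perfectly aligned, the second is orthogonal.
teacher = [[1.0, 0.0], [0.0, 2.0]]
student = [[1.0, 0.0], [3.0, 0.0]]
loss = distillation_loss(teacher, student)  # (0 + 1) / 2 = 0.5
```

In a real training loop the teacher embeddings would be held fixed and only the student would be updated to reduce this value.
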
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Information Retrieval | HotpotQA | Recall | 52.4 | 31 |
| Visual document retrieval | DocVQA | Recall@10 | 96.06 | 13 |
| Visual document retrieval | PlotQA | Recall@10 | 64.82 | 13 |
| Visual document retrieval | SlideVQA | Recall@10 | 97.87 | 13 |
| Web-Page Retrieval | NQ | Recall | 75.8 | 13 |
| Web-Page Retrieval | TriviaQA | Recall | 78.75 | 13 |
| Web-Page Retrieval | WebQ | Recall | 75.81 | 13 |
| Web-Page Retrieval | 2WikiMultihopQA | Recall | 40.57 | 13 |
| Web-Page Retrieval | ASQA | Recall | 84.13 | 13 |
| Web-Page Retrieval | Web-Page Retrieval Average | Recall | 67.91 | 13 |
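
The table reports Recall and Recall@10 without defining them. As a rough guide, the sketch below uses one common definition: the fraction of a query's relevant documents that appear among the top k retrieved results, averaged over queries. The benchmarks listed may define the metric differently (for example as a per-query hit rate), and the function names and document IDs here are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents found in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Toy example with made-up document IDs.
runs = [
    (["d3", "d1", "d7"], {"d1", "d2"}),  # one of two relevant docs in top 2
    (["d5", "d4", "d9"], {"d5"}),        # the only relevant doc is ranked first
]
score = mean_recall_at_k(runs, k=2)  # (0.5 + 1.0) / 2 = 0.75
```

Multiplying the result by 100 gives a percentage on the same scale as the values in the table, assuming those are reported as percentages.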