Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval

About

This paper presents Universal Vision-Language Dense Retrieval (UniVL-DR), which builds a unified model for multi-modal retrieval. UniVL-DR encodes queries and multi-modality resources in an embedding space for searching candidates from different modalities. To learn a unified embedding space for multi-modal retrieval, UniVL-DR proposes two techniques: 1) a universal embedding optimization strategy, which contrastively optimizes the embedding space using modality-balanced hard negatives; 2) an image verbalization method, which bridges the modality gap between images and texts in the raw data space. UniVL-DR achieves state-of-the-art results on the multi-modal open-domain question answering benchmark WebQA and outperforms all retrieval models on its two subtasks, text-text retrieval and text-image retrieval. This demonstrates that universal multi-modal search can feasibly replace the divide-and-conquer pipeline with a unified model and also benefits single- and cross-modality tasks. All source code for this work is available at https://github.com/OpenMatch/UniVL-DR.
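To make the first technique concrete, here is a minimal sketch of contrastive training with modality-balanced hard negatives. It is not the authors' implementation: the function names, the toy 2-dimensional embeddings, the dot-product scoring, and the temperature value are all assumptions chosen for illustration. The idea shown is only that each query is contrasted against the same number of text and image negatives, so neither modality dominates the loss.

```python
import math
import random


def dot(a, b):
    """Dot-product similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def sample_balanced_negatives(text_negs, image_negs, k_per_modality, rng):
    """Draw the same number of hard negatives from each modality (hypothetical helper)."""
    k = min(k_per_modality, len(text_negs), len(image_negs))
    return rng.sample(text_negs, k) + rng.sample(image_negs, k)


def contrastive_loss(query, positive, negatives, temperature=1.0):
    """Softmax cross-entropy over [positive] + negatives, with the positive at index 0."""
    scores = [dot(query, positive) / temperature]
    scores += [dot(query, n) / temperature for n in negatives]
    m = max(scores)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denom - scores[0]


# Toy embeddings (assumed values, for illustration only).
rng = random.Random(0)
query = [1.0, 0.0]
positive = [0.9, 0.1]
text_negs = [[0.5, 0.5], [0.2, 0.8], [0.4, 0.6]]
image_negs = [[0.3, 0.7], [0.1, 0.9], [0.6, 0.4]]

negatives = sample_balanced_negatives(text_negs, image_negs, 2, rng)
loss = contrastive_loss(query, positive, negatives)
print(len(negatives), round(loss, 4))
```

In a real system the embeddings would come from trained text and image encoders and the loss would be minimized by gradient descent; the sketch only shows how the negative set is balanced and scored.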

Zhenghao Liu, Chenyan Xiong, Yuanhuiyi Lv, Zhiyuan Liu, Ge Yu • 2022

Related benchmarks

Task	Dataset	Result
Text-to-Chart Retrieval	CRBench Precise Query	R@1 9.74	12
Text-to-Chart Retrieval	VisText L1 Caption	R@5 88.1	12
Text-to-Chart Retrieval	VisText L2+L3 Caption	R@5 0.6474	12
Text-to-Chart Retrieval	CRBench Fuzzy Query	R@1 5.34	12
Text-to-Chart Retrieval	Chart-To-Text (test)	R@5 82.56	12
Multi-modal Retrieval (ALL->T)	EVQA+	R@1 30.48	7
Multi-modal Retrieval (T->All)	WebQA+	Recall@1 36.81	7
Multimodal Evidence Retrieval	Mocheg 2023 (test)	Recall@20 53.74	7
Multimodal Evidence Retrieval	MMCV 2025 (test)	Recall@20 55.24	7
Retrieval	WebQA (test)	Recall@5 64.5	5
