
Unified Multimodal and Multilingual Retrieval via Multi-Task Learning with NLU Integration

About

Multimodal retrieval systems typically employ Vision Language Models (VLMs) that encode images and text independently into vectors in a shared embedding space. Despite incorporating text encoders, VLMs consistently underperform specialized text models on text-only retrieval tasks. Moreover, adding a separate text encoder increases storage and inference overhead and worsens retrieval inefficiency, especially in multilingual settings. To address these limitations, we propose a multi-task learning framework that unifies feature representations across images, long and short texts, and intent-rich queries. To our knowledge, this is the first work to jointly optimize multilingual image retrieval, text retrieval, and natural language understanding (NLU) tasks within a single framework. Our approach integrates image and text retrieval through a shared text encoder enhanced with NLU features, improving both intent understanding and retrieval accuracy.
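For concreteness, the following is a minimal PyTorch sketch of the kind of joint objective the abstract describes: one shared text encoder serving both retrieval tasks, contrastive losses for text-to-image and text-to-text retrieval, and an intent-classification head supplying the NLU signal. The module sizes, loss weights, placeholder encoders, and CLIP-style InfoNCE objective are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of multi-task training with a shared text encoder (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnifiedRetrievalModel(nn.Module):
    def __init__(self, embed_dim=512, num_intents=32):
        super().__init__()
        # Stand-ins for a vision backbone and a shared multilingual text
        # encoder; a real system would use pretrained transformers here.
        self.image_encoder = nn.Linear(2048, embed_dim)
        self.text_encoder = nn.Linear(768, embed_dim)
        # NLU head on top of the shared text features (assumed placement).
        self.intent_head = nn.Linear(embed_dim, num_intents)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ln(1/0.07)

    def encode_text(self, text_feats):
        return F.normalize(self.text_encoder(text_feats), dim=-1)

    def encode_image(self, image_feats):
        return F.normalize(self.image_encoder(image_feats), dim=-1)


def contrastive_loss(a, b, scale):
    # Symmetric InfoNCE over in-batch negatives (CLIP-style assumption).
    logits = scale.exp() * a @ b.t()
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


model = UnifiedRetrievalModel()
imgs = torch.randn(8, 2048)      # placeholder precomputed vision features
queries = torch.randn(8, 768)    # placeholder short-query features
docs = torch.randn(8, 768)       # placeholder long-document features
intents = torch.randint(0, 32, (8,))

q = model.encode_text(queries)   # shared encoder is reused for all text
d = model.encode_text(docs)
v = model.encode_image(imgs)

# Joint objective: image retrieval + text retrieval + NLU,
# with an illustrative weight on the intent term.
loss = (contrastive_loss(q, v, model.logit_scale)   # text-to-image
        + contrastive_loss(q, d, model.logit_scale) # text-to-text
        + 0.5 * F.cross_entropy(
            model.intent_head(model.text_encoder(queries)), intents))
loss.backward()
```

Because all three losses backpropagate through the same text encoder, the query representation is shaped jointly by retrieval and intent supervision, which is the core idea of the unified framework.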

Xinyuan Zhang, Lina Zhang, Lisung Chen, Guangyao Liu, Shuai Nie, Jiaming Xu, Runyu Shi, Ying Huang, Guoquan Zhang • 2026

Related benchmarks

Task                       Dataset           Result      Rank
Multimodal Retrieval       Multi30K (test)   -           35
Text-to-Image Retrieval    XTD10 (test)      R@10 93.3   7
Text-to-Text Retrieval     COCO-QLTI         R@10 83.4   6
