jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images

About

Contrastive Language-Image Pretraining (CLIP) has been widely used for crossmodal information retrieval and multimodal understanding tasks. However, CLIP models are mainly optimized for crossmodal vision-language tasks and underperform in single-mode text tasks. Moreover, these models are often trained on English datasets and therefore lack multilingual understanding. Additionally, from a visual understanding perspective, previous CLIP-based models exhibit insufficient understanding of visually rich documents. In this work, we propose jina-clip-v2, a contrastive vision-language model trained on text pairs, triplets and image-text pairs via a multi-task and multi-stage contrastive learning paradigm in order to support both text-only and crossmodal tasks. We employ a multilingual text encoder and expand the training dataset to include multilingual texts from 29 non-English languages, including Hindi, Chinese, German, French, and others, as well as images of visually rich documents. We evaluate the model's performance and show that jina-clip-v2 achieves notable improvements over state-of-the-art CLIP-based models in zero-shot text-only retrieval, semantic textual similarity, and crossmodal retrieval tasks in both English and multilingual settings. jina-clip-v2 also provides for flexibility in embedding dimensionality, enabling users to select the granularity of the representations. jina-clip-v2 is publicly available at https://huggingface.co/jinaai/jina-clip-v2.
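The abstract mentions that users can choose the embedding dimensionality, but it does not say how. The sketch below assumes the common approach of keeping a prefix of the full vector and re-normalizing it before computing cosine similarity. The helper names `truncate_and_normalize` and `cosine` and the toy 4-dimensional vectors are hypothetical; they are not taken from the model's API or outputs.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale them to unit length.

    Assumption: shorter embeddings are obtained by prefix truncation.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for a text embedding and an image embedding.
text_emb = [0.6, 0.8, 0.0, 0.0]
image_emb = [0.8, 0.6, 0.1, 0.0]

# Similarity at full size and at a truncated size.
full = cosine(truncate_and_normalize(text_emb, 4),
              truncate_and_normalize(image_emb, 4))
short = cosine(truncate_and_normalize(text_emb, 2),
               truncate_and_normalize(image_emb, 2))
```

Smaller dimensions reduce storage and search cost, usually at some cost in retrieval quality. Re-normalizing after truncation keeps the dot product equal to the cosine similarity.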

Andreas Koukounas, Georgios Mastrapas, Sedigheh Eslami, Bo Wang, Mohammad Kalim Akram, Michael Günther, Isabelle Mohr, Saba Sturua, Nan Wang, Han Xiao • 2024

Related benchmarks

Task	Dataset	Result
Visual document retrieval	ViDoRe V2	Avg nDCG@5 28.5	39
Multimodal Retrieval	Multi30K (test)	--	35
Multimodal-to-text retrieval	MM-BRIGHT	Acad Score 22.3	24
Visual document retrieval	ViDoRe V3	--	23
Visual document retrieval	IRPAPERS	NDCG@5 26.6	22
Visual document retrieval	VisDoc OOD	NDCG@5 47.2	22
Text-to-Chart Retrieval	VisText L1 Caption	R@5 90.93	12
Text-to-Chart Retrieval	VisText L2+L3 Caption	R@5 0.7449	12
Text-to-Chart Retrieval	Chart-To-Text (test)	R@5 83.78	12
Text-to-Chart Retrieval	CRBench Precise Query	R@1 4.1	12

Showing 10 of 16 rows
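The results above are reported as nDCG@k and recall@k (R@k). The sketch below shows one standard way to compute both for a single query, assuming linear gain and a log2 rank discount. The individual benchmarks may use other variants (for example exponential gain, or an ideal ranking built from the full judged set), and the function names here are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """nDCG@k for one query.

    `relevances` lists the graded relevance of each returned result in
    ranked order. Assumption: every relevant item appears somewhere in
    the list, so the ideal ranking is just the list sorted by relevance.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


# One relevant document, ranked first versus ranked second.
best_case = ndcg_at_k([1, 0, 0, 0, 0], 5)
second_place = ndcg_at_k([0, 1, 0, 0, 0], 5)

# Two relevant documents, one of them retrieved in the top 2.
partial_recall = recall_at_k(["a", "b", "c"], {"a", "z"}, 2)
```

Benchmark scores are typically the mean of these per-query values over all queries, often scaled to 0-100.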
