
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images

About

Contrastive Language-Image Pretraining (CLIP) has been widely used for crossmodal information retrieval and multimodal understanding tasks. However, CLIP models are mainly optimized for crossmodal vision-language tasks and underperform in text-only tasks. Moreover, these models are often trained on English datasets and therefore lack multilingual understanding. Additionally, from a visual understanding perspective, previous CLIP-based models exhibit insufficient understanding of visually rich documents. In this work, we propose jina-clip-v2, a contrastive vision-language model trained on text pairs, triplets and image-text pairs via a multi-task and multi-stage contrastive learning paradigm in order to support both text-only and crossmodal tasks. We employ a multilingual text encoder and expand the training dataset to include multilingual texts from 29 non-English languages, including Hindi, Chinese, German, French, and others, as well as images of visually rich documents. We evaluate the model's performance and show that jina-clip-v2 achieves notable improvements over state-of-the-art CLIP-based models in zero-shot text-only retrieval, semantic textual similarity, and crossmodal retrieval tasks in both English and multilingual settings. jina-clip-v2 also offers flexibility in embedding dimensionality, enabling users to select the granularity of the representations. jina-clip-v2 is publicly available at https://huggingface.co/jinaai/jina-clip-v2.

Andreas Koukounas, Georgios Mastrapas, Sedigheh Eslami, Bo Wang, Mohammad Kalim Akram, Michael Günther, Isabelle Mohr, Saba Sturua, Nan Wang, Han Xiao • 2024
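The abstract mentions that users can select the embedding dimensionality. A minimal sketch of one way this could work, assuming the reduced representation is obtained by keeping the leading components of a full embedding and rescaling to unit length (Matryoshka-style truncation; this page does not spell out the mechanism). The vectors below are toy values, not model outputs, and the helper names are hypothetical.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 8-dimensional stand-ins for a text embedding and an image embedding
# (assumption: real embeddings would come from the model, not from literals).
text_emb = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, 0.1, -0.1]
image_emb = [0.8, 0.2, 0.25, -0.1, 0.0, 0.1, -0.05, 0.05]

for dim in (8, 4, 2):
    t = truncate_and_normalize(text_emb, dim)
    i = truncate_and_normalize(image_emb, dim)
    print(dim, round(cosine(t, i), 4))
```

Smaller dimensions trade some retrieval quality for lower storage and faster similarity search; how much quality is lost depends on how the model was trained.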

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual document retrieval | ViDoRe V2 | Avg nDCG@5 | 28.5 | 39 |
| Multimodal Retrieval | Multi30K (test) | -- | -- | 35 |
| Multimodal-to-text retrieval | MM-BRIGHT | Acad Score | 22.3 | 24 |
| Visual document retrieval | ViDoRe V3 | -- | -- | 23 |
| Visual document retrieval | IRPAPERS | NDCG@5 | 26.6 | 22 |
| Visual document retrieval | VisDoc OOD | NDCG@5 | 47.2 | 22 |
| Text-to-Chart Retrieval | VisText L1 Caption | R@5 | 90.93 | 12 |
| Text-to-Chart Retrieval | VisText L2+L3 Caption | R@5 | 0.7449 | 12 |
| Text-to-Chart Retrieval | Chart-To-Text (test) | R@5 | 83.78 | 12 |
| Text-to-Chart Retrieval | CRBench Precise Query | R@1 | 4.1 | 12 |
Showing 10 of 16 rows
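
The results above are reported as recall (R@k) and nDCG@k. As a rough illustration of what these metrics measure, here is a small sketch using common textbook definitions: recall as the fraction of relevant items found in the top k, and nDCG with linear gain and a log2 discount. The individual benchmarks may use different variants (for example exponential gain, or percentages instead of fractions), so this is not a reproduction of their scoring. The document ids are made up for the example.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids)

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG@k, where `relevance` maps doc id -> graded gain (missing ids count as 0)."""
    gains = [relevance.get(doc, 0.0) for doc in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical ranking returned for one query, and its relevance judgments.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevance = {"d1": 1.0, "d2": 1.0}

print("R@1   ", recall_at_k(ranked, set(relevance), 1))
print("R@5   ", recall_at_k(ranked, set(relevance), 5))
print("nDCG@5", round(ndcg_at_k(ranked, relevance, 5), 4))
```

A benchmark score is then typically the mean of the per-query value over all queries in the test set.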
