Jina CLIP: Your CLIP Model Is Also Your Text Retriever

About

Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-size vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve state-of-the-art performance on both text-image and text-text retrieval tasks.
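The core idea is a joint objective: the same encoders are trained with a contrastive loss over text-image pairs and a contrastive loss over text-text pairs, so one embedding space serves both tasks. The PyTorch sketch below illustrates such a multi-task InfoNCE objective; the random stand-in embeddings, the 0.07 temperature, and the `w_pair` weighting are illustrative assumptions, not the paper's exact training recipe.

```python
# Minimal sketch of a multi-task contrastive objective in the spirit of
# jina-clip-v1: symmetric InfoNCE over text-image pairs combined with
# InfoNCE over text-text pairs. Encoders, weighting, and batch
# construction here are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matching rows of `a` and `b` are positives,
    all other in-batch rows serve as negatives."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)    # positives sit on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def multi_task_loss(caption_emb, image_emb, query_emb, doc_emb, w_pair: float = 0.5):
    """Joint objective: align captions with images (text-image task) and
    queries with documents (text-text task) in one embedding space.
    `w_pair` is an assumed weighting hyperparameter."""
    return (w_pair * info_nce(caption_emb, image_emb)
            + (1 - w_pair) * info_nce(query_emb, doc_emb))

# Toy usage with random tensors standing in for encoder outputs.
batch, dim = 8, 768
loss = multi_task_loss(torch.randn(batch, dim), torch.randn(batch, dim),
                       torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```

Because both tasks share one embedding space, the trained text encoder can be used as a standalone text retriever while remaining aligned with the image encoder.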

Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao · 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual document retrieval | ViDoRe Avg. across 4 datasets v2 | -- | -- | 45 |
| Multimodal-to-text retrieval | MM-BRIGHT | Acad Score | 22.3 | 24 |
| Visual document retrieval | ViDoRe 8 datasets v3 | NDCG@5 | 20.7 | 14 |
| Visual document retrieval | ViDoRe 10 datasets v1 | NDCG@5 | 53.7 | 14 |
| Document retrieval | DocHaystack-100 | Recall@1 | 16.51 | 7 |
| Document retrieval | DocHaystack-1000 | Recall@1 | 3.67 | 7 |
| Document retrieval | InfoHaystack-100 | Recall@1 | 43.23 | 7 |
| Document retrieval | InfoHaystack-200 | Recall@1 | 36.77 | 7 |
| Document retrieval | InfoHaystack-1000 | Recall@1 | 23.87 | 7 |
| Document retrieval | DocHaystack-200 | Recall@1 | 9.17 | 7 |
