jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
About
Contrastive Language-Image Pretraining (CLIP) has been widely used for crossmodal information retrieval and multimodal understanding tasks. However, CLIP models are mainly optimized for crossmodal vision-language tasks and underperform on text-only tasks. Moreover, these models are often trained on English datasets and therefore lack multilingual understanding. From a visual understanding perspective, previous CLIP-based models also handle visually rich documents poorly. In this work, we propose jina-clip-v2, a contrastive vision-language model trained on text pairs, triplets, and image-text pairs via a multi-task, multi-stage contrastive learning paradigm in order to support both text-only and crossmodal tasks. We employ a multilingual text encoder and expand the training dataset to include multilingual texts from 29 non-English languages, including Hindi, Chinese, German, French, and others, as well as images of visually rich documents. We evaluate the model's performance and show that jina-clip-v2 achieves notable improvements over state-of-the-art CLIP-based models in zero-shot text-only retrieval, semantic textual similarity, and crossmodal retrieval tasks in both English and multilingual settings. jina-clip-v2 also offers flexible embedding dimensionality, letting users select the granularity of the representations. jina-clip-v2 is publicly available at https://huggingface.co/jinaai/jina-clip-v2.
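The abstract does not describe how a smaller embedding dimensionality is selected. One common approach, assumed here for illustration, is to keep the leading components of the full embedding and rescale the result to unit length before comparing vectors. The sketch below uses toy 8-dimensional vectors in place of real model outputs; the function names `truncate_and_normalize` and `cosine_similarity` are hypothetical and are not part of any jina-clip-v2 API.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale to unit L2 norm.

    Assumption: the leading components carry the coarsest information,
    as in prefix-truncation schemes. The source does not confirm this.
    """
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """For unit-norm vectors, cosine similarity is the dot product."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for a text embedding and an image embedding.
text_vec = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, -0.1, 0.02]
image_vec = [0.8, 0.2, 0.25, -0.1, -0.05, 0.1, 0.0, 0.03]

# Compare the same pair at several dimensionalities.
for dim in (8, 4, 2):
    t = truncate_and_normalize(text_vec, dim)
    i = truncate_and_normalize(image_vec, dim)
    print(dim, round(cosine_similarity(t, i), 4))
```

Smaller dimensions reduce storage and comparison cost, usually at some cost in retrieval quality; how large that cost is for jina-clip-v2 is an empirical question this sketch does not address.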
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Retrieval | Multi30K (test) | -- | -- | 35 |
| Multimodal-to-text retrieval | MM-BRIGHT | Acad Score | 22.3 | 24 |
| Text-to-Chart Retrieval | VisText L1 Caption | R@5 | 90.93 | 12 |
| Text-to-Chart Retrieval | VisText L2+L3 Caption | R@5 | 0.7449 | 12 |
| Text-to-Chart Retrieval | Chart-To-Text (test) | R@5 | 83.78 | 12 |
| Text-to-Chart Retrieval | CRBench Precise Query | R@1 | 4.1 | 12 |
| Text-to-Chart Retrieval | CRBench Fuzzy Query | R@1 | 3.05 | 12 |
| Text-to-Image Retrieval | XTD10 (test) | R@10 | 92.2 | 7 |
| Text-to-Text Retrieval | COCO-QLTI | R@10 | 82.8 | 6 |
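Most results in the table are reported as recall at a cutoff k (R@k). As a rough illustration, the sketch below computes R@k as the fraction of queries that have at least one relevant item among the top k retrieved results. This is one common definition and an assumption here: each benchmark may define or scale the metric differently. The query IDs, document IDs, and the `recall_at_k` helper are all hypothetical.

```python
def recall_at_k(rankings, relevant, k):
    """Fraction of queries with a relevant item in the top-k results.

    rankings: dict mapping query ID to a ranked list of retrieved IDs
    relevant: dict mapping query ID to a set of relevant IDs
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not rankings:
        raise ValueError("rankings must not be empty")
    hits = 0
    for query_id, ranked in rankings.items():
        top_k = ranked[:k]
        # A query counts as a hit if any relevant item is in the top k.
        if any(doc_id in relevant[query_id] for doc_id in top_k):
            hits += 1
    return hits / len(rankings)


# Toy data: three queries, each with one relevant document.
rankings = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d5", "d6", "d8"],
}
relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d0"}}

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(rankings, relevant, k):.4f}")
```

The function returns a fraction between 0 and 1; multiply by 100 to express it as a percentage.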