LLM2CLIP: Powerful Language Model Unlocks Richer Cross-Modality Representation

About

CLIP is a seminal multimodal model that maps images and text into a shared representation space through contrastive learning on billions of image-caption pairs. Inspired by the rapid progress of large language models (LLMs), we investigate how the superior linguistic understanding and broad world knowledge of LLMs can further strengthen CLIP, particularly in handling long and complex captions. We introduce an efficient fine-tuning framework that embeds an LLM into a pretrained CLIP while incurring nearly the same training cost as standard CLIP fine-tuning. Our method first converts the LLM into an embedding-compatible form for the CLIP setting, and then couples it with the pretrained CLIP vision encoder through a lightweight adaptor trained on only a few million image-caption pairs. With this strategy, we achieve large performance gains without large-scale retraining, outperforming state-of-the-art CLIP variants such as EVA02 and SigLIP-2. The LLM-enhanced CLIP delivers consistent improvements across a wide range of downstream tasks, including linear-probe classification, zero-shot image-text retrieval with both short and long captions (in English and other languages), zero-shot and supervised image segmentation, object detection, and serving as a tokenizer backbone for multimodal large-model benchmarks. Code and models are available at: https://aka.ms/llm2clip
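The coupling step described above can be illustrated with a small sketch. It is not the authors' implementation. It assumes that caption features from the LLM are precomputed and frozen, that the adaptor is a single linear projection, and that training uses the standard symmetric CLIP-style contrastive loss; none of these details are confirmed by the abstract. The names `apply_adaptor` and `contrastive_loss` are hypothetical, and the vectors are toy values.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def apply_adaptor(weights, text_feature):
    """Hypothetical linear adaptor: project an LLM caption feature into the vision space."""
    return [sum(w * x for w, x in zip(row, text_feature)) for row in weights]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    lse = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair i of images and texts is the positive match."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity between every image and every caption, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (image_to_text + text_to_image) / 2


# Toy data: 2-dim image embeddings and 3-dim LLM caption features (assumed values).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
llm_caption_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
adaptor = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 2x3 projection into the vision space

aligned = [apply_adaptor(adaptor, t) for t in llm_caption_embs]
loss_aligned = contrastive_loss(image_embs, aligned)
loss_swapped = contrastive_loss(image_embs, aligned[::-1])  # captions deliberately mismatched
```

With correctly paired captions the loss is close to zero, while swapping the captions makes it large. In this simplified setting that gap is the signal that would drive updates to the adaptor weights.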

Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Usman Naseem, Chunyu Wang, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu, Liang Hu • 2024

Related benchmarks

Task                      Dataset               Result          Rank
Image-to-Text Retrieval   Flickr30K 1K (test)   --              439
Text-to-Image Retrieval   Flickr30K 1K (test)   --              375
Image-to-Text Retrieval   DOCCI (test)          Recall@1 87.8   22
Text-to-Image Retrieval   Urban1k (test)        R@1 91.1        10
Image-to-Text Retrieval   Urban1k (test)        --              6
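
For readers unfamiliar with the Recall@1 (R@1) metric reported for the retrieval benchmarks, the sketch below shows one common way to compute Recall@K from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, located at the same index as the query. Actual benchmark protocols can differ (some datasets pair several captions with each image), so treat this as an illustration, not the evaluation code behind the listed numbers. The function name `recall_at_k` and the similarity values are made up for the example.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose matching candidate (same index) ranks in the top k."""
    hits = 0
    for query_idx, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar to this query.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if query_idx in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy similarity matrix: rows are queries, columns are candidates (assumed values).
toy_similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.3, 0.5],  # the correct candidate (index 2) is only ranked second
]

r1 = recall_at_k(toy_similarity, k=1)
r2 = recall_at_k(toy_similarity, k=2)
```

Here two of the three queries retrieve their match at rank 1, and the third succeeds only once the top two candidates are considered.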
