LLM2CLIP: Powerful Language Model Unlocks Richer Cross-Modality Representation

About

CLIP is a seminal multimodal model that maps images and text into a shared representation space through contrastive learning on billions of image-caption pairs. Inspired by the rapid progress of large language models (LLMs), we investigate how the superior linguistic understanding and broad world knowledge of LLMs can further strengthen CLIP, particularly in handling long and complex captions. We introduce an efficient fine-tuning framework that embeds an LLM into a pretrained CLIP while incurring nearly the same training cost as standard CLIP fine-tuning. Our method first converts the LLM into an embedding-compatible form for the CLIP setting, and then couples it with the pretrained CLIP vision encoder through a lightweight adaptor trained on only a few million image-caption pairs. With this strategy, we achieve large performance gains without large-scale retraining, outperforming state-of-the-art CLIP variants such as EVA02 and SigLIP-2. The LLM-enhanced CLIP delivers consistent improvements across a wide range of downstream tasks, including linear-probe classification, zero-shot image-text retrieval with both short and long captions (in English and other languages), zero-shot and supervised image segmentation, object detection, and serving as a tokenizer backbone for multimodal large-model benchmarks. Code and models are available at: https://aka.ms/llm2clip
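The coupling step described in the abstract (LLM caption embeddings passed through a lightweight adaptor and aligned with the CLIP vision encoder's outputs) can be illustrated with a minimal sketch. This is not the authors' implementation: the single linear `linear_adaptor`, the `temperature` value, and the function names are all assumptions made for illustration, and the loss is the standard symmetric contrastive (InfoNCE-style) objective commonly used for CLIP training.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def linear_adaptor(text_emb, weight):
    """Hypothetical adaptor: project an LLM caption embedding into the
    image embedding space. `weight` has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, text_emb)) for row in weight]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch where image i matches text i.
    The temperature of 0.07 is an assumed default, not taken from the paper."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity of image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    def cross_entropy(rows):
        # Mean cross-entropy with the diagonal entry as the target class.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[k]
        return total / len(rows)

    text_to_image = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(text_to_image))


# Toy batch: 2 images (dim 2) and 2 LLM caption embeddings (dim 3).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
llm_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
adaptor_weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # illustrative values only
projected = [linear_adaptor(e, adaptor_weight) for e in llm_embs]
loss = clip_contrastive_loss(image_embs, projected)
```

In this toy batch the projected captions line up with their images, so the loss is close to zero; swapping the two captions makes it large. In a real setup only the adaptor (and possibly parts of the vision encoder) would receive gradients from this loss.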

Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Usman Naseem, Chunyu Wang, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu, Liang Hu • 2024

Related benchmarks

Task	Dataset	Result
Image-to-Text Retrieval	Flickr30K 1K (test)	--	491
Text-to-Image Retrieval	Flickr30K 1K (test)	--	432
Image-to-Text Retrieval	DOCCI (test)	Recall@1 87.8	43
Multimodal Retrieval (image query to multimodal content)	M5Product (test)	Recall@1 30.1	23
Text-to-Image Retrieval	Urban1k (test)	R@1 91.1	10
Coarse-grained Product Retrieval	M5Product (test)	mAP@1 68.3	10
Text-to-Image Retrieval	M5Product (test)	Recall@1 27.7	10
Coarse-grained Product Retrieval	EIPM 200,000 products (test)	mAP@1 67.4	7
Image-to-Text Retrieval	EIPM 200,000 products (test)	Recall@1 27.5	7
Text-to-Image Retrieval	EIPM 200,000 products (test)	Recall@1 27.6	7
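For readers unfamiliar with the Recall@1 metric reported in the table, the following is a minimal sketch of how Recall@K is typically computed from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, at the same index as the query; real benchmarks such as Flickr30K pair each image with several captions, so their evaluation protocols differ in detail. The function name is hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose ground-truth candidate ranks in the top k.

    similarity[i][j] is the score of query i against candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of the k highest-scoring candidates for this query.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: 3 text queries scored against 3 images.
scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(scores, k=1)
r2 = recall_at_k(scores, k=2)
```

Here the second query ranks a wrong image first, so it counts as a miss at K=1 but as a hit at K=2.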

