SmartCLIP: Modular Vision-language Alignment with Identification Guarantees

About

Contrastive Language-Image Pre-training (CLIP)~\citep{radford2021learning} has emerged as a pivotal model in computer vision and multimodal learning, achieving state-of-the-art performance in aligning visual and textual representations through contrastive learning. However, CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representations. On the one hand, the short captions attached to a single image in datasets like MSCOCO may describe disjoint regions of that image, leaving the model uncertain about which visual features to retain and which to disregard. On the other hand, directly aligning long captions with images can lead to the retention of entangled details, which prevents the model from learning disentangled, atomic concepts and ultimately limits its generalization on downstream tasks that involve short prompts. In this paper, we establish theoretical conditions that enable flexible alignment between textual and visual representations across varying levels of granularity. Specifically, our framework ensures that a model can not only \emph{preserve} cross-modal semantic information in its entirety but also \emph{disentangle} visual representations to capture fine-grained textual concepts. Building on this foundation, we introduce SmartCLIP, a novel approach that identifies and aligns the most relevant visual and textual representations in a modular manner. Superior performance across various tasks demonstrates its capability to handle information misalignment and supports our identification theory. The code is available at https://github.com/Mid-Push/SmartCLIP.
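
For readers unfamiliar with the contrastive alignment mentioned in the abstract, the following is a minimal pure-Python sketch of the standard symmetric contrastive (InfoNCE-style) objective associated with CLIP. It is not SmartCLIP's modular objective, which this page does not describe. The function names, the toy embeddings, and the fixed temperature of 0.07 are assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) pairs.

    Pair i is assumed to be (image_embs[i], text_embs[i]); every other
    combination in the batch is treated as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: cosine similarity of image i and text j, divided by temperature.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: row i should assign high probability to column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: column j should assign high probability to row j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


# Toy example: two pairs whose embeddings agree, then the same images with swapped texts.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is close to zero when each image matches its own caption and large when the captions are swapped. Because the objective compares whole-image and whole-caption embeddings, a caption that covers only part of an image is still pulled toward the full image embedding, which is the misalignment the abstract refers to.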

Shaoan Xie, Lingjing Kong, Yujia Zheng, Yu Yao, Zeyu Tang, Eric P. Xing, Guangyi Chen, Kun Zhang • 2025

Related benchmarks

Task	Dataset	Result
Image Classification	ImageNet V2	--	767
Text-to-Image Retrieval	Flickr30K	R@1 43.8	607
Image-to-Text Retrieval	Flickr30K	R@1 63.9	451
Image Classification	SUN397	Accuracy 72.1	450
Image Classification	FGVC Aircraft	Accuracy 30.4	223
Text-to-Image Retrieval	COCO	Recall@1 48.5	161
Image-to-Text Retrieval	COCO	R@1 66	152
Text-to-Image Retrieval	DCI	R@1 69.88	117
Image-to-Text Retrieval	DCI	R@1 70.94	111
Image Classification	FER 2013	Top-1 Acc 0.586	107

Showing 10 of 48 rows
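
The retrieval rows above report Recall@1 (R@1). As a rough illustration of what that metric measures, here is a minimal sketch that assumes a square similarity matrix in which query i has exactly one correct candidate, at index i. Benchmarks such as Flickr30K and COCO pair each image with several captions, so actual evaluation code must handle multiple ground-truth matches. The function name and the toy scores are hypothetical.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose correct candidate appears in the top-k results.

    similarity[i][j] is the score between query i and candidate j.
    Simplifying assumption: the only correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score, truncated to k.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if i in top_k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy scores for three queries against three candidates.
toy_similarity = [
    [0.9, 0.1, 0.0],  # query 0 ranks its own candidate first
    [0.2, 0.8, 0.1],  # query 1 ranks its own candidate first
    [0.7, 0.2, 0.3],  # query 2 ranks candidate 0 first, its own second
]
r_at_1 = recall_at_k(toy_similarity, k=1)
r_at_2 = recall_at_k(toy_similarity, k=2)
```

With these toy scores, two of the three queries retrieve their match at rank 1, and all three do within the top 2.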
