CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally

About

CLIP (Contrastive Language-Image Pretraining) has become a popular choice for various downstream tasks. However, recent studies have questioned its ability to represent compositional concepts effectively. These works suggest that CLIP often acts like a bag-of-words (BoW) model, interpreting images and text as sets of individual concepts without grasping the structural relationships. In particular, CLIP struggles to correctly bind attributes to their corresponding objects when multiple objects are present in an image or text. In this work, we investigate why CLIP exhibits this BoW-like behavior. Our key finding is that CLIP does not lack binding information. Through linear probing, robustness tests with increasing object counts, and conjunctive search experiments, we show that attribute-object bindings are already encoded within CLIP's text and image embeddings. The weakness lies in the cross-modal alignment, which fails to preserve this information. We show that the binding information can be accessed cross-modally by applying a simple linear transformation to the text embeddings. This improves CLIP's attribute-object binding performance and confirms that the information was already encoded unimodally. In practice, this means CLIP-based systems can be enhanced with a lightweight linear layer trained on existing embeddings, avoiding costly encoder retraining. The code is available at https://github.com/kdariina/CLIP-not-BoW-unimodally.

Darina Koishigarina, Arnas Uselis, Seong Joon Oh • 2025
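
The idea in the abstract of training a lightweight linear layer on frozen embeddings can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings below are synthetic stand-ins for CLIP outputs, the map is fitted with ordinary least squares (an assumption; the paper's training objective is not stated here), and the names `fit_linear_map` and `retrieval_accuracy` are hypothetical.

```python
import numpy as np

def fit_linear_map(text_emb, image_emb):
    """Least-squares W such that text_emb @ W approximates image_emb."""
    W, *_ = np.linalg.lstsq(text_emb, image_emb, rcond=None)
    return W

def retrieval_accuracy(text_emb, image_emb):
    """Fraction of texts whose most cosine-similar image is the paired one."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    pred = (t @ v.T).argmax(axis=1)
    return float((pred == np.arange(len(t))).mean())

# Synthetic stand-ins (assumption): both modalities encode the same latent
# content, but through different linear mixings, so the information is
# present in each modality while the two spaces are misaligned.
rng = np.random.default_rng(0)
n, d = 400, 32
latent = rng.normal(size=(n, d))
image_emb = latent @ rng.normal(size=(d, d))
text_emb = latent @ rng.normal(size=(d, d))

# Fit the map on a training split; the "encoders" themselves stay untouched.
train, test = slice(0, 300), slice(300, 400)
W = fit_linear_map(text_emb[train], image_emb[train])

acc_before = retrieval_accuracy(text_emb[test], image_emb[test])
acc_after = retrieval_accuracy(text_emb[test] @ W, image_emb[test])
print(f"text-to-image accuracy before: {acc_before:.2f}, after: {acc_after:.2f}")
```

In this toy setup only the text side is transformed, mirroring the abstract's description; with real CLIP embeddings the attainable gain depends on the data and the training objective.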

Related benchmarks

Task	Dataset	Result
Text-to-Image Retrieval	Flickr30K 1K (test)	R@1 63.36	432
Text-to-Image Retrieval	MSCOCO 5K (test)	R@1 39.78	312
Image-Text Retrieval	COCO (test)	Recall@1 41	41
Compositional Evaluation	SugarCrepe swap att (test)	Accuracy 74.62	27
Compositional Evaluation	SugarCrepe (test)	--	20
Spatial Reasoning	What’sUp	Accuracy 54	13
Compositional Evaluation	ARO-A (test)	Accuracy 68.49	13
Compositional Evaluation	ABC-6K (test)	Accuracy 0.6891	12
Cross-modal binding	PUG:SPAR (train)	Accuracy 100	8
Cross-modal binding	CLEVR (train)	Accuracy 100	4

Showing 10 of 16 rows
