
CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally

About

CLIP (Contrastive Language-Image Pretraining) has become a popular choice for various downstream tasks. However, recent studies have questioned its ability to represent compositional concepts effectively. These works suggest that CLIP often acts like a bag-of-words (BoW) model, interpreting images and text as sets of individual concepts without grasping the structural relationships. In particular, CLIP struggles to correctly bind attributes to their corresponding objects when multiple objects are present in an image or text. In this work, we investigate why CLIP exhibits this BoW-like behavior. Our key finding is that CLIP does not lack binding information. Through linear probing, robustness tests with increasing object counts, and conjunctive search experiments, we show that attribute-object bindings are already encoded within CLIP's text and image embeddings. The weakness lies in the cross-modal alignment, which fails to preserve this information. We show that this binding information can be accessed cross-modally by applying a simple linear transformation to the text embeddings. This improves CLIP's attribute-object binding performance and confirms that the information was already encoded unimodally. In practice, this means CLIP-based systems can be enhanced with a lightweight linear layer trained on existing embeddings, avoiding costly encoder retraining. The code is available at https://github.com/kdariina/CLIP-not-BoW-unimodally.
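The fix described above amounts to fitting a linear map on top of frozen embeddings. A minimal sketch of the idea, using synthetic stand-ins for CLIP embeddings (in practice one would extract embeddings from a frozen CLIP model; the dimensions and the least-squares objective here are illustrative assumptions, not the paper's exact training procedure):

```python
import numpy as np

# Hypothetical sketch: learn a linear map W that transforms text embeddings
# so they align better with paired image embeddings. Synthetic data stands in
# for real CLIP embeddings extracted from a frozen encoder.
rng = np.random.default_rng(0)

d = 64    # embedding dimension (illustrative; CLIP ViT-B/32 uses 512)
n = 1000  # number of paired (text, image) embeddings

# Synthetic image embeddings for n matched pairs.
image_emb = rng.normal(size=(n, d))
# Pretend the text embeddings are a rotated, slightly noisy view of the
# image embeddings -- binding information is present but misaligned.
hidden_map = rng.normal(size=(d, d))
text_emb = image_emb @ hidden_map + 0.01 * rng.normal(size=(n, d))

# Closed-form least squares: W minimizing ||text_emb @ W - image_emb||.
W, *_ = np.linalg.lstsq(text_emb, image_emb, rcond=None)

# After the transform, text embeddings match their images far better.
aligned = text_emb @ W
err_before = np.linalg.norm(text_emb - image_emb)
err_after = np.linalg.norm(aligned - image_emb)
print(err_after < err_before)  # the linear map recovers the alignment
```

Because the encoders stay frozen, only the `d × d` matrix `W` is trained, which is what makes the approach lightweight compared to retraining CLIP.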

Darina Koishigarina, Arnas Uselis, Seong Joon Oh • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Retrieval | Flickr30K 1K (test) | R@1 | 63.36 | 432 |
| Text-to-Image Retrieval | MSCOCO 5K (test) | R@1 | 39.78 | 308 |
| Image-Text Retrieval | COCO (test) | Recall@1 | 41 | 41 |
| Compositional Evaluation | SugarCrepe swap att (test) | Accuracy | 74.62 | 27 |
| Compositional Evaluation | SugarCrepe (test) | -- | -- | 20 |
| Spatial Reasoning | What’sUp | Accuracy | 54 | 13 |
| Compositional Evaluation | ARO-A (test) | Accuracy | 68.49 | 13 |
| Compositional Evaluation | ABC-6K (test) | Accuracy | 0.6891 | 12 |
| Cross-modal binding | PUG:SPAR (train) | Accuracy | 100 | 8 |
| Cross-modal binding | CLEVR (train) | Accuracy | 100 | 4 |

Showing 10 of 16 rows.
