Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

About

Vision-Language Models (VLMs), such as CLIP, exhibit strong image-text comprehension abilities, facilitating advances in several downstream tasks such as zero-shot image classification, image-text retrieval, and text-to-image generation. However, the compositional reasoning abilities of existing VLMs remain subpar. The root of this limitation lies in the inadequate alignment between the images and captions in the pretraining datasets. Additionally, the current contrastive learning objective fails to focus on fine-grained grounding components like relations, actions, and attributes, resulting in "bag-of-words" representations. We introduce a simple and effective method to improve compositional reasoning in VLMs. Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework. Our approach does not require specific annotations and does not incur extra parameters. When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines across five vision-language compositional benchmarks. We open-source our code at https://github.com/lezhang7/Enhance-FineGrained.
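
To make the idea of "expanding the standard image-text contrastive learning framework" concrete, here is a minimal sketch of a symmetric image-text contrastive (InfoNCE) loss in which hard-negative captions are appended as extra candidates on the image-to-text side. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, and the choice to use hard negatives only in the image-to-text direction are all hypothetical, and the sketch does not reproduce the paper's specific intra-modal contrastive or cross-modal ranking objectives. For the actual method, see the repository linked above.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_softmax_at(logits, idx):
    """Numerically stable log-softmax evaluated at position idx."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[idx] - lse


def contrastive_loss(img, txt, hard_neg_txt=None, temperature=0.07):
    """Symmetric image-text InfoNCE loss over a batch of paired embeddings.

    img[i] and txt[i] form a matched pair. hard_neg_txt (hypothetical
    argument) holds embeddings of perturbed captions; they are added only
    as extra negatives in the image-to-text direction.
    """
    img = [normalize(v) for v in img]
    txt = [normalize(v) for v in txt]
    neg = [normalize(v) for v in (hard_neg_txt or [])]
    n = len(img)

    i2t, t2i = 0.0, 0.0
    for i in range(n):
        # Image i against all in-batch captions plus the hard negatives.
        row = [dot(img[i], t) / temperature for t in txt + neg]
        i2t -= log_softmax_at(row, i)
        # Caption i against all in-batch images (no hard negatives here).
        col = [dot(txt[i], v) / temperature for v in img]
        t2i -= log_softmax_at(col, i)
    return (i2t + t2i) / (2 * n)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    captions = [[0.9, 0.1], [0.1, 0.9]]
    hard_negatives = [[0.8, 0.3], [0.3, 0.8]]  # toy "perturbed caption" embeddings
    print(contrastive_loss(images, captions))
    print(contrastive_loss(images, captions, hard_negatives))
```

Because each hard negative adds a positive term to the softmax denominator, the loss with hard negatives is never lower than the loss without them; the extra gradient pushes the image embedding away from captions that reuse the same words in a different composition.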

Le Zhang, Rabiul Awal, Aishwarya Agrawal • 2023

Related benchmarks

Task	Dataset	Result
Image Classification	CIFAR-100	Accuracy 78.1	691
Text-to-Image Retrieval	Flickr30k (test)	Recall@1 68.3	525
Image-to-Text Retrieval	Flickr30k (test)	R@1 74.9	472
Image Classification	CIFAR100	Accuracy 60.2	301
Image Classification	CIFAR10	Accuracy (%) 85.9	282
Image Classification	ImageNet-1K	Accuracy 94.2	199
Text-to-Image Retrieval	MS-COCO	--	187
Image-to-Text Retrieval	MS-COCO	--	168
Aggregate Model Performance	Combined Benchmark Suite	Average Score 60.5	57
Compositional Reasoning	SugarCrepe	Overall Accuracy 87.5	50

Showing 10 of 45 rows
