COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training

About

Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of the contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks. Code is available at https://github.com/ExplainableML/cosmos.
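The "global" contrastive loss the abstract refers to can be illustrated with a minimal sketch. This is a generic CLIP-style symmetric InfoNCE loss, not the COSMOS implementation; the function names and the temperature value of 0.07 are assumptions chosen for illustration. Each image and each caption is reduced to a single embedding, which is why a loss of this kind has no direct signal for local image regions or for individual parts of a caption.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (image, text) embeddings.

    Hypothetical helper for illustration: pair i is (image_embs[i], text_embs[i]),
    and every other item in the batch acts as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j]: temperature-scaled cosine similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Image-to-text direction: image i should select text i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: text j should select image j.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (i2t + t2i)


# Toy batch: two orthogonal image embeddings with matching or swapped captions.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_loss = clip_style_contrastive_loss(images, images)
swapped_loss = clip_style_contrastive_loss(images, images[::-1])
print(matched_loss, swapped_loss)
```

With correctly paired embeddings the loss is close to zero, and with the captions swapped it is large. Under the abstract's description, COSMOS adds local image and text views plus a cross-attention module on top of an objective of this kind, and trains them with a self-distillation loss.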

Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, Zeynep Akata • 2024

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 83.2	2019
Visual Question Answering	GQA	Accuracy 60.4	1425
Semantic segmentation	ADE20K	mIoU 17.7	1028
Text-based Visual Question Answering	TextVQA	Accuracy 55.3	962
Semantic segmentation	Cityscapes	mIoU 34.7	668
Image Classification	Stanford Cars	--	660
Image Classification	DTD	--	599
Image Classification	Food-101	--	570
Image Classification	CIFAR-10	--	564
Text-to-Image Retrieval	Flickr30K	R@1 76.1	559

Showing 10 of 57 rows
