MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining

About

This paper presents MaskCLIP, a simple yet effective framework that incorporates a newly proposed masked self-distillation into contrastive language-image pretraining. The core idea of masked self-distillation is to distill the representation of a full image into the representation predicted from a masked image. This incorporation has two key benefits. First, masked self-distillation targets local patch representation learning, which is complementary to the vision-language contrastive objective's focus on text-related representations. Second, masked self-distillation is consistent with the vision-language contrastive objective from the perspective of the training objective, since both use the visual encoder for feature alignment; it can therefore learn local semantics with indirect supervision from language. We provide specially designed experiments with a comprehensive analysis to validate the two benefits. Symmetrically, we also introduce local semantic supervision into the text branch, which further improves pretraining performance. Extensive experiments show that MaskCLIP, when applied to various challenging downstream tasks, achieves superior results in linear probing, finetuning, and, with the guidance of the language encoder, zero-shot performance. Code will be released at https://github.com/LightDXY/MaskCLIP.
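
The masked self-distillation idea described above can be illustrated with a small, dependency-free sketch. Everything in it is an assumption made for illustration and is not taken from the MaskCLIP code: `toy_encoder` is a hypothetical stand-in for a visual encoder, the teacher is assumed to be an exponential moving average (EMA) of the student, and a cosine-distance loss on masked positions stands in for whatever distillation loss the authors use. The vision-language contrastive term is omitted.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def toy_encoder(patches, weights, mask=None, mask_token=None):
    """Hypothetical stand-in for a visual encoder (not the paper's model).

    Masked patches are replaced by `mask_token`; every patch then receives the
    mean of the visible patches as context and is scaled element-wise by `weights`.
    """
    n, dim = len(patches), len(patches[0])
    mask = mask or [False] * n
    mask_token = mask_token or [0.0] * dim
    visible = [p for p, m in zip(patches, mask) if not m]
    context = [sum(p[d] for p in visible) / len(visible) for d in range(dim)]
    inputs = [mask_token if m else p for p, m in zip(patches, mask)]
    return [[w * (x + c) for w, x, c in zip(weights, p, context)] for p in inputs]

def random_mask(num_patches, ratio, rng):
    """Mask int(ratio * num_patches) patches, always leaving at least one visible."""
    k = min(int(ratio * num_patches), num_patches - 1)
    masked = set(rng.sample(range(num_patches), k))
    return [i in masked for i in range(num_patches)]

def distill_loss(patches, mask, student_w, teacher_w, mask_token):
    """Mean (1 - cosine) between student and teacher features at masked positions."""
    target = toy_encoder(patches, teacher_w)                  # teacher sees the full image
    pred = toy_encoder(patches, student_w, mask, mask_token)  # student sees the masked image
    terms = [1.0 - cosine(p, t) for p, t, m in zip(pred, target, mask) if m]
    return sum(terms) / len(terms) if terms else 0.0

def ema_update(teacher_w, student_w, momentum=0.999):
    """Assumed EMA teacher update: teacher <- m * teacher + (1 - m) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher_w, student_w)]

# Toy usage: 8 patches of dimension 4, 75% of them masked (all values are made up).
rng = random.Random(0)
patches = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
student_w = [1.0, 0.5, -0.3, 2.0]
teacher_w = list(student_w)
mask = random_mask(len(patches), ratio=0.75, rng=rng)
loss = distill_loss(patches, mask, student_w, teacher_w, mask_token=[0.0] * 4)
teacher_w = ema_update(teacher_w, student_w)
```

In a real training loop the student would be updated by gradient descent on this loss together with the contrastive loss, and the teacher would only ever change through the EMA step; the sketch computes the loss once and performs no optimization.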

Xiaoyi Dong, Jianmin Bao, Yinglin Zheng, Ting Zhang, Dongdong Chen, Hao Yang, Ming Zeng, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu • 2022

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 50.5	3069
Instance Segmentation	COCO 2017 (val)	APm 0.409	1275
Semantic segmentation	ADE20K	mIoU 11.9	1028
Image Classification	ImageNet-1k (val)	Top-1 Accuracy 83.6	708
Semantic segmentation	Cityscapes	mIoU 24.9	668
Semantic segmentation	ADE20K	mIoU 50.5	559
Text-to-Image Retrieval	Flickr30K	R@1 45.6	559
Semantic segmentation	COCO Stuff	mIoU 16.7	399
Semantic segmentation	PASCAL Context (val)	mIoU 16.8	360
Object Detection	MS-COCO 2017 (val)	--	264

Showing 10 of 62 rows
