CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

About

Open-vocabulary dense prediction tasks including object detection and image segmentation have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf empowers ViTs to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf.

Size Wu, Wenwei Zhang, Lumin Xu, Sheng Jin, Xiangtai Li, Wentao Liu, Chen Change Loy• 2023

Related benchmarks

Task	Dataset	Result
Object Detection	COCO (val)	--	637
Text-to-Image Retrieval	Flickr30K	R@135	559
Object Detection	LVIS v1.0 (val)	--	542
Semantic segmentation	PASCAL Context (val)	mIoU62.3	360
Semantic segmentation	PC-59	mIoU62.3	174
Object Detection	OV-COCO	AP50 (Novel)44.3	168
Object Detection	Objects365 (val)	mAP23.7	102
Instance Segmentation	LVIS	mAP (Mask)28.7	81
Open-vocabulary object detection	OV-LVIS	AP Novel34.9	71
Semantic segmentation	ADE20K A-847 (val)	mIoU12.4	70

Showing 10 of 48 rows

Other info

Code

Follow for update

@wizwand_team Discord