CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
About
Open-vocabulary dense prediction tasks including object detection and image segmentation have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf empowers a ViT to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf.
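The alignment described in the abstract can be illustrated with a small sketch. It is not the released implementation: the abstract does not say how regions are pooled or which loss is used, so the sketch assumes plain average pooling over grid cells and a `1 - cosine similarity` objective. The function names (`pool_region`, `self_distillation_loss`), the box format, and the toy data are all hypothetical, and plain Python lists stand in for tensors so the example stays self-contained.

```python
import math

def pool_region(feature_map, box):
    """Average the feature vectors inside box = (row0, col0, row1, col1).

    Assumption: simple average pooling with end-exclusive grid indices;
    the pooling operator is not specified in the abstract.
    """
    r0, c0, r1, c1 = box
    cells = [feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    dim = len(cells[0])
    return [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def self_distillation_loss(feature_map, boxes, crop_embeddings):
    """Mean (1 - cosine) between pooled region features and crop embeddings.

    feature_map: dense features of the model being adapted, shape H x W x C.
    crop_embeddings: image-level embeddings of the matching image crops,
    treated here as fixed targets (an assumption of this sketch).
    """
    losses = [
        1.0 - cosine_similarity(pool_region(feature_map, box), target)
        for box, target in zip(boxes, crop_embeddings)
    ]
    return sum(losses) / len(losses)

# Toy 2 x 2 feature map with 2-dimensional features.
feature_map = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
boxes = [(0, 0, 1, 2), (1, 0, 2, 2)]        # top row, bottom row
crop_embeddings = [[2.0, 0.0], [1.0, 0.0]]  # aligned target, orthogonal target
loss = self_distillation_loss(feature_map, boxes, crop_embeddings)
```

In this toy case the first region already points in the same direction as its crop embedding and contributes no loss, while the second is orthogonal to its target and contributes the maximum for non-negative features. Minimizing such a loss pulls each region feature toward the image-level embedding of its crop, which is why no region-text pairs are required.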
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | COCO (val) | -- | 633 |
| Text-to-Image Retrieval | Flickr30K | R@1 35 | 531 |
| Object Detection | LVIS v1.0 (val) | -- | 529 |
| Semantic Segmentation | PASCAL Context (val) | mIoU 62.3 | 360 |
| Semantic Segmentation | PC-59 | mIoU 62.3 | 148 |
| Object Detection | OV-COCO | AP50 (Novel) 44.3 | 130 |
| Instance Segmentation | LVIS | mAP (Mask) 28.7 | 81 |
| Semantic Segmentation | ADE20K A-847 (val) | mIoU 12.4 | 70 |
| Object Detection | Objects365 (val) | mAP 23.7 | 60 |
| Open-Vocabulary Object Detection | OV-LVIS v1.0 (test) | APr 35 | 50 |