
ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation

About

Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme. The general idea is to first generate class-agnostic region proposals and then feed the cropped proposal regions to CLIP to utilize its image-level zero-shot classification capability. While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost. In this work, we pursue a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from the image level to the pixel level. Our investigation starts with a straightforward extension as our baseline, which generates semantic masks by comparing the similarity between text and patch embeddings extracted from CLIP. However, such a paradigm can heavily overfit the seen classes and fail to generalize to unseen classes. To handle this issue, we propose three simple but effective designs and find that they significantly retain the inherent zero-shot capacity of CLIP and improve pixel-level generalization ability. Incorporating those modifications leads to an efficient zero-shot semantic segmentation system called ZegCLIP. Through extensive experiments on three public benchmarks, ZegCLIP demonstrates superior performance, outperforming state-of-the-art methods by a large margin under both "inductive" and "transductive" zero-shot settings. In addition, compared with the two-stage method, our one-stage ZegCLIP is about 5 times faster during inference. We release the code at https://github.com/ZiqinZhou66/ZegCLIP.git.

Ziqin Zhou, Bowen Zhang, Yinjie Lei, Lingqiao Liu, Yifan Liu • 2022
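
The baseline mentioned in the abstract, which builds semantic masks by comparing text embeddings with patch embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `patch_text_masks`, the array shapes, and the random arrays standing in for CLIP encoder outputs are all assumptions made for illustration, and none of the three ZegCLIP designs is included.

```python
import numpy as np

def patch_text_masks(patch_emb, text_emb, grid_hw):
    """Label each patch with the class whose text embedding is most similar.

    patch_emb: (num_patches, dim) array, a stand-in for CLIP patch embeddings
    text_emb:  (num_classes, dim) array, a stand-in for CLIP text embeddings
    grid_hw:   (rows, cols) of the patch grid; rows * cols == num_patches
    """
    # L2-normalise both sides so the dot product is the cosine similarity.
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = p @ t.T                    # (num_patches, num_classes)
    labels = sim.argmax(axis=1)      # most similar class for each patch
    rows, cols = grid_hw
    return sim.reshape(rows, cols, -1), labels.reshape(rows, cols)

# Hypothetical data: patches are slightly noisy copies of known class embeddings.
rng = np.random.default_rng(0)
text = rng.normal(size=(3, 8))                 # 3 classes, 8-dim embeddings
true_labels = np.array([0, 1, 2, 1])           # a 2 x 2 patch grid, flattened
patches = text[true_labels] + 0.01 * rng.normal(size=(4, 8))

sim_map, label_map = patch_text_masks(patches, text, (2, 2))
print(label_map)
```

In a real pipeline the coarse patch-level map would still need to be upsampled to the image resolution; that step is omitted here.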

Related benchmarks

| Task | Dataset | Result | Rank |
| Semantic segmentation | ADE20K (val) | mIoU 21.1 | 2731 |
| Semantic segmentation | PASCAL VOC 2012 (val) | Mean IoU 94.2 | 2040 |
| Semantic segmentation | PASCAL VOC 2012 (test) | mIoU 94.1 | 1342 |
| Semantic segmentation | PASCAL Context (val) | mIoU 45.8 | 323 |
| Semantic segmentation | Pascal Context (test) | mIoU 45.8 | 176 |
| Semantic segmentation | PASCAL-Context 59 class (val) | mIoU 41.2 | 125 |
| Semantic segmentation | Pascal VOC 20 | mIoU 93.6 | 105 |
| Medical Image Segmentation | Medical Image Segmentation Aggregate (Average of BUSI, BTMRI, ISIC, Kvasir-SEG, QaTa-COV19, and EUS) (test) | DSC 78.98 | 80 |
| Medical Image Segmentation | CVC-ClinicDB | Dice Score 69.75 | 68 |
| Medical Image Segmentation | ISIC | DICE 81.45 | 64 |
Showing 10 of 37 rows
