CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free

About

The emergence of CLIP has opened the way to open-world image perception. The model's zero-shot classification capabilities are impressive, but they are harder to exploit for dense tasks such as image segmentation. Several methods have proposed modifications and learning schemes to produce dense output. Instead, we propose an open-vocabulary semantic segmentation method, dubbed CLIP-DIY, which requires no additional training or annotations and instead leverages existing unsupervised object localization approaches. In particular, CLIP-DIY is a multi-scale approach that directly exploits CLIP's classification abilities on patches of different sizes and aggregates the decisions into a single map. We further guide the segmentation using foreground/background scores obtained with unsupervised object localization methods. Our method achieves state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and performs on par with the best methods on COCO. The code is available at http://github.com/wysoczanska/clip-diy
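The multi-scale aggregation described above can be illustrated with a minimal numpy sketch. This is not the authors' implementation: `classify_patch` is a dummy stand-in for CLIP zero-shot classification (in CLIP-DIY, patch image embeddings would be compared against text embeddings of the class names), and `fg_mask` stands in for the foreground map an unsupervised object localization method would produce. The grid sizes and the background handling are illustrative assumptions.

```python
import numpy as np

def classify_patch(patch, num_classes):
    # Stand-in for CLIP zero-shot classification of one image patch.
    # Returns a probability distribution over the class vocabulary.
    # (Dummy logits derived from patch statistics, NOT real CLIP scores.)
    logits = float(patch.mean()) * np.arange(1, num_classes + 1, dtype=float)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def dense_scores(image, num_classes, grid_sizes=(1, 2, 4)):
    # Classify patches at several grid scales and aggregate the
    # per-patch class scores into a single dense (C, H, W) map
    # by broadcasting each patch decision over its pixels and averaging.
    H, W, _ = image.shape
    out = np.zeros((num_classes, H, W))
    for g in grid_sizes:
        for i in range(g):
            for j in range(g):
                y0, y1 = i * H // g, (i + 1) * H // g
                x0, x1 = j * W // g, (j + 1) * W // g
                scores = classify_patch(image[y0:y1, x0:x1], num_classes)
                out[:, y0:y1, x0:x1] += scores[:, None, None]
    return out / len(grid_sizes)

def segment(image, num_classes, fg_mask, background_id=0):
    # Guide the per-pixel argmax with a foreground/background mask
    # (in the paper, obtained from unsupervised object localization).
    labels = dense_scores(image, num_classes).argmax(axis=0)
    return np.where(fg_mask, labels, background_id)
```

Because every scale contributes a probability distribution per pixel, the aggregated map still sums to one over classes at each location, so the final prediction is a simple argmax masked by the foreground scores.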

Monika Wysoczańska, Michaël Ramamonjisoa, Tomasz Trzciński, Oriane Siméoni • 2023

Related benchmarks

Task | Dataset | Metric | Result | Rank
Semantic segmentation | PASCAL VOC 2012 (val) | Mean IoU | 59.9 | 2142
Semantic segmentation | PASCAL Context (val) | mIoU | 19.7 | 360
Semantic segmentation | Pascal VOC (test) | mIoU | 59.9 | 236
Semantic segmentation | Pascal Context | mIoU | 19.7 | 217
Semantic segmentation | COCO Object | mIoU | 31 | 129
Open Vocabulary Semantic Segmentation | Pascal VOC 20 | mIoU | 79.7 | 104
Semantic segmentation | COCO Object (val) | mIoU | 0.31 | 97
Open Vocabulary Semantic Segmentation | ADE20K without background | mIoU | 9.9 | 72
Open Vocabulary Semantic Segmentation | COCO Stuff without background | mIoU | 13.3 | 71
Open Vocabulary Semantic Segmentation | PASCAL Context (Context60, with background) | mIoU | 19.7 | 69

Showing 10 of 22 rows
