CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation

About

The popular CLIP model displays impressive zero-shot capabilities thanks to its seamless interaction with arbitrary text prompts. However, its lack of spatial awareness makes it unsuitable for dense computer vision tasks, e.g., semantic segmentation, without an additional fine-tuning step that often uses annotations and can suppress its original open-vocabulary properties. Meanwhile, self-supervised representation methods have demonstrated good localization properties without human-made annotations or explicit supervision. In this work, we take the best of both worlds and propose an open-vocabulary semantic segmentation method that does not require any annotations. We propose to locally improve dense MaskCLIP features, which are computed with a simple modification of CLIP's last pooling layer, by integrating localization priors extracted from self-supervised features. By doing so, we greatly improve the performance of MaskCLIP and produce smooth outputs. Moreover, we show that the self-supervised feature properties we use can be learnt directly from CLIP features. Our method, CLIP-DINOiser, needs only a single forward pass of CLIP and two light convolutional layers at inference, with no extra supervision or extra memory, and reaches state-of-the-art results on challenging and fine-grained benchmarks such as COCO, Pascal Context, Cityscapes, and ADE20k. The code to reproduce our results is available at https://github.com/wysoczanska/clip_dinoiser.

Monika Wysoczańska, Oriane Siméoni, Michaël Ramamonjisoa, Andrei Bursuc, Tomasz Trzciński, Patrick Pérez • 2023
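
As a rough illustration of the idea in the abstract, the sketch below smooths per-patch features with an affinity-weighted average and then labels each patch with the most similar text embedding. It is a toy pure-Python sketch and not the authors' implementation: the function names, the affinity-weighted pooling, and the threshold `gamma` are assumptions made for illustration, and plain lists stand in for real MaskCLIP, self-supervised, and text features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def refine_features(dense_feats, guide_feats, gamma=0.2):
    """Replace each patch feature by an affinity-weighted average over all patches.

    dense_feats: per-patch features to refine (stand-in for MaskCLIP features).
    guide_feats: per-patch features supplying the localization prior
                 (stand-in for self-supervised features).
    gamma: hypothetical threshold; affinities at or below it get zero weight.
    """
    refined = []
    dim = len(dense_feats[0])
    for i, g_i in enumerate(guide_feats):
        # Affinity of patch i to every patch, computed on the guide features.
        weights = [cosine(g_i, g_j) for g_j in guide_feats]
        weights = [w if w > gamma else 0.0 for w in weights]
        total = sum(weights)
        if total == 0.0:
            # No usable affinity (e.g. a zero guide vector): keep the feature as is.
            refined.append(list(dense_feats[i]))
            continue
        refined.append([
            sum(w * f[d] for w, f in zip(weights, dense_feats)) / total
            for d in range(dim)
        ])
    return refined


def label_patches(patch_feats, text_embeds):
    """Assign each patch the index of its most cosine-similar text embedding."""
    return [
        max(range(len(text_embeds)), key=lambda k: cosine(p, text_embeds[k]))
        for p in patch_feats
    ]
```

A call such as `label_patches(refine_features(dense, guide), text_embeds)` returns one class index per patch. The point of the design is that a noisy patch feature is pulled toward the features of patches the guide considers similar before any text prompt is compared.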

Related benchmarks

Task                   Dataset                        Result      Rank
Semantic segmentation  ADE20K (val)                   mIoU 20     2731
Semantic segmentation  ADE20K                         mIoU 20     936
Semantic segmentation  Cityscapes                     mIoU 31.7   578
Semantic segmentation  Cityscapes (val)               mIoU 31.7   332
Semantic segmentation  COCO Stuff                     mIoU 24.6   195
Semantic segmentation  Pascal VOC                     mIoU 0.622  172
Semantic segmentation  COCO Stuff (val)               mIoU 24.6   126
Semantic segmentation  PASCAL-Context 59 class (val)  mIoU 35.9   125
Semantic segmentation  Pascal VOC 20                  mIoU 80.2   105
Semantic segmentation  COCO Object (val)              mIoU 0.348  77
Showing 10 of 48 rows
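
The Result column reports mIoU (mean intersection over union). For reference, here is a minimal pure-Python sketch of the metric over flat lists of class labels. It assumes classes that appear in neither the prediction nor the ground truth are skipped, and it is not the evaluation code behind the numbers above, since benchmark protocols differ in details such as ignored labels and dataset-level accumulation. The function name and its parameters are hypothetical.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean intersection over union across classes present in pred or target.

    pred, target: equal-length sequences of integer class labels in [0, num_classes).
    ignore_index: optional ground-truth label whose pixels are skipped.
    Returns a fraction in [0, 1]; multiply by 100 for a percentage.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # Pixel lies in both the predicted and the true region of class t.
            inter[t] += 1
            union[t] += 1
        else:
            # Pixel lies in the predicted region of p and the true region of t.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

Because the function returns a fraction, a value such as 0.622 and a value such as 62.2 can describe the same score on different scales; the rows above appear to mix both conventions.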
