Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation
About
CLIP, as a vision-language model, has significantly advanced Open-Vocabulary Semantic Segmentation (OVSS) with its zero-shot capabilities. Despite this success, applying CLIP to OVSS remains challenging: it was trained for image-level alignment, which limits its performance on tasks that require detailed local context. Our study examines the impact of CLIP's [CLS] token on patch feature correlations and reveals a dominance of "global" patches that hinders local feature discrimination. To overcome this, we propose CLIPtrase, a novel training-free semantic segmentation strategy that enhances local feature awareness through recalibrated self-correlation among patches. This approach yields notable improvements in segmentation accuracy and in the ability to maintain semantic coherence across objects. Experiments show that CLIPtrase outperforms CLIP by 22.3% on average across 9 segmentation benchmarks and surpasses existing state-of-the-art training-free methods. The code is publicly available at: https://github.com/leaves162/CLIPtrase.
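To make the idea of patch self-correlation more concrete, the following is a minimal generic sketch, not the CLIPtrase implementation. It assumes random arrays standing in for CLIP patch and text embeddings, and the function names (`l2_normalize`, `softmax`, `segment_patches`) and the `temperature` parameter are hypothetical choices made for illustration. It computes cosine self-correlation among patches, uses it to aggregate each patch's features from similar patches, and then assigns every patch to the closest class embedding without any training.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def segment_patches(patch_feats, text_feats, temperature=0.1):
    """Assign each patch to a class using patch self-correlation.

    patch_feats: (N, D) array, one feature vector per image patch (assumed input).
    text_feats:  (C, D) array, one embedding per class name (assumed input).
    """
    patches = l2_normalize(patch_feats)
    texts = l2_normalize(text_feats)
    # Cosine self-correlation between every pair of patches: (N, N).
    self_corr = patches @ patches.T
    # Each patch gathers features from the patches it correlates with most.
    weights = softmax(self_corr / temperature, axis=-1)
    refined = l2_normalize(weights @ patches)
    # Training-free classification: nearest class embedding per patch.
    logits = refined @ texts.T
    return logits.argmax(axis=-1), self_corr

# Stand-in data: a 4x4 grid of patches, 32-dim features, 3 class names.
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(16, 32))
text_feats = rng.normal(size=(3, 32))
labels, self_corr = segment_patches(patch_feats, text_feats)
label_map = labels.reshape(4, 4)  # coarse per-patch segmentation map
```

In a real pipeline the patch features would come from a CLIP image encoder and the class embeddings from its text encoder; how the self-correlation is recalibrated in CLIPtrase is specified in the paper and the linked repository, not in this sketch.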
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU 16.4 | 2731 |
| Semantic segmentation | ADE20K | mIoU 17 | 936 |
| Semantic segmentation | Cityscapes | -- | 578 |
| Semantic segmentation | Cityscapes (val) | mIoU 21.3 | 332 |
| Semantic segmentation | COCO Stuff | mIoU 0.241 | 195 |
| Semantic segmentation | Pascal Context 59 | mIoU 34.9 | 164 |
| Semantic segmentation | COCO Stuff (val) | mIoU 22.8 | 126 |
| Semantic segmentation | PASCAL-Context 59 class (val) | mIoU 33.8 | 125 |
| Semantic segmentation | Pascal VOC 20 | mIoU 81.2 | 105 |
| Semantic segmentation | Pascal VOC 21 classes (val) | mIoU 53 | 103 |
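
The Result column reports mean Intersection over Union (mIoU). As a reference for how this metric is commonly computed, here is a minimal sketch for a single pair of label maps. The function name `mean_iou` and the `ignore_index=255` default are assumptions for illustration; benchmark protocols usually accumulate the confusion matrix over the whole dataset before averaging, and the exact evaluation settings behind the numbers above are not stated here.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that appear in pred or target.

    pred, target: integer label maps of the same shape, values in [0, num_classes).
    Pixels whose target equals ignore_index are excluded (assumed convention).
    """
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    keep = target != ignore_index
    pred, target = pred[keep], target[keep]
    # confusion[i, j] counts pixels with ground truth i that were predicted as j.
    confusion = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    present = union > 0  # skip classes absent from both maps
    return float((intersection[present] / union[present]).mean())

# Tiny example: class 0 has IoU 1/2, class 1 has IoU 2/3, class 2 is absent.
pred = [[0, 0], [1, 1]]
target = [[0, 1], [1, 1]]
score = mean_iou(pred, target, num_classes=3)
```

The function returns a fraction in [0, 1]; multiply by 100 for percentage-style reporting.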