CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

About

Contrastive Language-Image Pre-training (CLIP) exhibits strong zero-shot classification ability on various image-level tasks, which has motivated research on adapting CLIP to pixel-level open-vocabulary semantic segmentation without additional training. The key is to improve the spatial representation of image-level CLIP, for example by replacing the self-attention map at the last layer with a self-self attention map or an attention map derived from a vision foundation model. In this paper, we present a novel framework, named CLIPer, that hierarchically improves the spatial representation of CLIP. CLIPer includes an early-layer fusion module and a fine-grained compensation module. We observe that the embeddings and attention maps at early layers preserve spatial structural information. Inspired by this, we design the early-layer fusion module to generate a segmentation map with better spatial coherence. Afterwards, we employ a fine-grained compensation module to compensate for local details using the self-attention maps of a diffusion model. We conduct experiments on seven segmentation datasets, and CLIPer achieves state-of-the-art performance on them. For instance, using ViT-L, CLIPer achieves mIoU of 69.8% and 43.3% on VOC and COCO Object, outperforming ProxyCLIP by 9.2% and 4.1%, respectively.

Lin Sun, Jiale Cao, Jin Xie, Xiaoheng Jiang, Yanwei Pang • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Semantic segmentation | ADE20K | mIoU 24.4 | 936 |
| Semantic segmentation | COCO Stuff | mIoU 0.287 | 195 |
| Semantic segmentation | ADE20K A-150 | mIoU 24.4 | 188 |
| Semantic segmentation | Pascal Context 59 | mIoU 43.6 | 164 |
| Semantic segmentation | LoveDA | mIoU 43.84 | 142 |
| Semantic segmentation | Pascal VOC 20 | mIoU 90 | 105 |
| Semantic segmentation | Pascal VOC 21 classes (val) | mIoU 69.8 | 103 |
| Semantic segmentation | Vaihingen | mIoU 43.97 | 95 |
| Semantic segmentation | Pascal Context 60 | mIoU 3.80e+3 | 81 |
| Semantic segmentation | COCO Object | mIoU 43.3 | 73 |
Showing 10 of 17 rows
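
Every result above is reported as mIoU (mean intersection-over-union). As a reading aid, here is a minimal sketch of how that metric is commonly computed from predicted and ground-truth label maps. It is not taken from the CLIPer code. The function name `mean_iou`, the flat-list inputs, and the `ignore_index=255` convention are assumptions made for illustration. The function returns a fraction in [0, 1]; multiply by 100 for a percentage. Most rows above appear to use the percentage scale, and the source does not state the scale for each row.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that occur in pred or target.

    pred, target: flat sequences of integer class ids, one per pixel.
    ignore_index: label skipped in the ground truth (assumed convention).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixel: contributes to no class
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            union[t] += 1  # missed pixel of the true class
            if 0 <= p < num_classes:
                union[p] += 1  # false positive for the predicted class
    # Average only over classes that actually occur (union > 0).
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f} ({100 * score:.1f}%)")  # mIoU = 0.500 (50.0%)
```

In the toy example the per-class IoUs are 1/3, 2/3 and 1/2, which average to 0.5. Benchmark toolkits typically accumulate intersections and unions over the whole dataset before dividing, rather than averaging per-image scores.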
