Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

About

Contrastive Language-Image Pre-training (CLIP) exhibits strong zero-shot classification ability on various image-level tasks, leading to the research to adapt CLIP for pixel-level open-vocabulary semantic segmentation without additional training. The key is to improve spatial representation of image-level CLIP, such as replacing self-attention map at last layer with self-self attention map or vision foundation model based attention map. In this paper, we present a novel hierarchical framework, named CLIPer, that hierarchically improves spatial representation of CLIP. The proposed CLIPer includes an early-layer fusion module and a fine-grained compensation module. We observe that, the embeddings and attention maps at early layers can preserve spatial structural information. Inspired by this, we design the early-layer fusion module to generate segmentation map with better spatial coherence. Afterwards, we employ a fine-grained compensation module to compensate the local details using the self-attention maps of diffusion model. We conduct the experiments on seven segmentation datasets. Our proposed CLIPer achieves the state-of-the-art performance on these datasets. For instance, using ViT-L, CLIPer has the mIoU of 69.8% and 43.3% on VOC and COCO Object, outperforming ProxyCLIP by 9.2% and 4.1% respectively.

Lin Sun, Jiale Cao, Jin Xie, Xiaoheng Jiang, Yanwei Pang• 2024

Related benchmarks

TaskDatasetResultRank
Semantic segmentationADE20K
mIoU24.4
1028
Semantic segmentationADE20K
mIoU24.4
559
Semantic segmentationCityscapes
mIoU41.6
494
Semantic segmentationCOCO Stuff
mIoU0.287
399
Semantic segmentationADE20K A-150
mIoU24.4
224
Semantic segmentationPascal Context 59
mIoU43.6
204
Semantic segmentationLoveDA
mIoU43.84
192
Semantic segmentationVaihingen
mIoU43.97
156
Semantic segmentationiSAID
mIoU27.36
146
Semantic segmentationPascal Context 60
mIoU3.80e+3
139
Showing 10 of 33 rows

Other info

Follow for update