Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

About

CLIP, as a vision-language model, has significantly advanced Open-Vocabulary Semantic Segmentation (OVSS) with its zero-shot capabilities. Despite this success, applying CLIP to OVSS remains challenging because it was originally trained for image-level alignment, which limits its performance on tasks that require detailed local context. Our study examines the impact of CLIP's [CLS] token on patch feature correlations and reveals a dominance of "global" patches that hinders local feature discrimination. To overcome this, we propose CLIPtrase, a novel training-free semantic segmentation strategy that enhances local feature awareness through recalibrated self-correlation among patches. This approach yields notable improvements in segmentation accuracy and in the ability to maintain semantic coherence across objects. Experiments show that CLIPtrase is 22.3% ahead of CLIP on average across 9 segmentation benchmarks and outperforms existing state-of-the-art training-free methods. The code is publicly available at: https://github.com/leaves162/CLIPtrase.
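
The general idea of patch self-correlation mentioned in the abstract can be illustrated with a minimal sketch. This is not the CLIPtrase implementation (see the linked repository for that); it only assumes that each patch has a feature vector, that patch-to-patch cosine similarity is turned into attention weights with a softmax, and that each refined patch is then assigned the class whose text embedding it matches best. All function names, the `temperature` value, and the toy 2-D features are hypothetical, and pure Python is used to keep the example dependency-free.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guard against the zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row, temperature=1.0):
    # Numerically stable softmax with a temperature (assumed hyperparameter).
    m = max(row)
    exps = [math.exp((x - m) / temperature) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_correlation(patches, temperature=0.1):
    # Patch-to-patch attention built from cosine similarity; each row sums to 1.
    unit = [l2_normalize(p) for p in patches]
    sims = [[dot(p, q) for q in unit] for p in unit]
    return [softmax(row, temperature) for row in sims]

def aggregate(patches, attn):
    # Each refined patch is an attention-weighted average of all patches.
    dim = len(patches[0])
    return [[sum(w * p[d] for w, p in zip(row, patches)) for d in range(dim)]
            for row in attn]

def label_patches(patches, text_embeddings, temperature=0.1):
    # Assign each refined patch to the most similar text embedding.
    attn = self_correlation(patches, temperature)
    refined = [l2_normalize(p) for p in aggregate(patches, attn)]
    texts = [l2_normalize(t) for t in text_embeddings]
    labels = []
    for p in refined:
        scores = [dot(p, t) for t in texts]
        labels.append(scores.index(max(scores)))
    return labels

# Toy data: two patches near class 0, two patches near class 1.
patches = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
labels = label_patches(patches, text_embeddings)
```

In this toy setup, similar patches reinforce each other before classification, which is the intuition behind using local self-correlation rather than relying on a single global token.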

Tong Shao, Zhuotao Tian, Hang Zhao, Jingyong Su • 2024

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 16.4	3069
Semantic segmentation	ADE20K	mIoU 17	1028
Semantic segmentation	Cityscapes	--	668
Semantic segmentation	ADE20K	mIoU 16.4	559
Semantic segmentation	Cityscapes (val)	mIoU 21.3	527
Semantic segmentation	Cityscapes	mIoU 21.3	494
Semantic segmentation	COCO Stuff	mIoU 22.8	399
Semantic segmentation	Pascal Context 59	mIoU 34.9	204
Semantic segmentation	PC-59	mIoU 35	174
Semantic segmentation	COCO Stuff (val)	mIoU 22.8	167

Showing 10 of 71 rows
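
For readers unfamiliar with the metric in the table, mIoU (mean Intersection over Union) averages the per-class ratio of correctly predicted pixels to the union of predicted and ground-truth pixels. The sketch below is a simplified illustration on flattened label lists; the function name and toy labels are hypothetical, and skipping classes absent from both prediction and ground truth is an assumption. Benchmark toolkits typically accumulate intersections and unions over the whole dataset and often report the value as a percentage.

```python
def mean_iou(pred, target, num_classes):
    # pred and target are flat lists of per-pixel class indices.
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: ignore classes that never appear
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example with 6 pixels and 3 classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)  # per-class IoU: 1/3, 2/3, 1/2
```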
