Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation

About

Despite the significant progress in deep learning for dense visual recognition problems, such as semantic segmentation, traditional methods are constrained by fixed class sets. Meanwhile, vision-language foundation models, such as CLIP, have showcased remarkable effectiveness in numerous zero-shot image-level tasks, owing to their robust generalizability. Recently, a body of work has investigated utilizing these models in open-vocabulary semantic segmentation (OVSS). However, existing approaches often rely on impractical supervised pre-training or access to additional pre-trained networks. In this work, we propose a strong baseline for training-free OVSS, termed Neighbour-Aware CLIP (NACLIP), representing a straightforward adaptation of CLIP tailored for this scenario. Our method enforces localization of patches in the self-attention of CLIP's vision transformer which, despite being crucial for dense prediction tasks, has been overlooked in the OVSS literature. By incorporating design choices favouring segmentation, our approach significantly improves performance without requiring additional data, auxiliary pre-trained networks, or extensive hyperparameter tuning, making it highly practical for real-world applications. Experiments are performed on 8 popular semantic segmentation benchmarks, yielding state-of-the-art performance in most scenarios. Our code is publicly available at https://github.com/sinahmr/NACLIP.

Sina Hajimiri, Ismail Ben Ayed, Jose Dolz • 2024
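
The abstract's idea of enforcing patch localization in self-attention can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that). It assumes one plausible formulation: a Gaussian-shaped bias, centred on each patch, is added to the attention logits before the softmax. The function names and the `sigma` and `weight` parameters are hypothetical and chosen only for illustration.

```python
import math


def neighbour_bias(grid_h, grid_w, sigma=1.0):
    """Gaussian bias between every pair of patches on a grid_h x grid_w grid.

    Entry [i][j] is largest when patches i and j coincide and decays with
    their squared spatial distance. (Assumed form, for illustration only.)
    """
    coords = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    bias = []
    for r1, c1 in coords:
        row = []
        for r2, c2 in coords:
            dist_sq = (r1 - r2) ** 2 + (c1 - c2) ** 2
            row.append(math.exp(-dist_sq / (2.0 * sigma ** 2)))
        bias.append(row)
    return bias


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def neighbour_aware_attention(queries, keys, grid_h, grid_w, sigma=1.0, weight=1.0):
    """Attention weights with a spatial bias added to scaled dot-product logits.

    queries, keys: one feature vector (list of floats) per patch, in
    row-major order over the patch grid.
    """
    dim = len(queries[0])
    bias = neighbour_bias(grid_h, grid_w, sigma)
    attention = []
    for i, q in enumerate(queries):
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) + weight * bias[i][j]
            for j, k in enumerate(keys)
        ]
        attention.append(softmax(logits))
    return attention


# With identical features the content term ties across all patches, so the
# weights are driven by the spatial bias alone.
features = [[0.0, 0.0] for _ in range(9)]
attention = neighbour_aware_attention(features, features, grid_h=3, grid_w=3)
centre_row = attention[4]  # weights for the centre patch of the 3x3 grid
```

In this toy setting the centre patch puts the most weight on itself, then on its four direct neighbours, then on the corners, which is the locality behaviour the abstract describes as important for dense prediction.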

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 17.4	3069
Semantic segmentation	ADE20K	mIoU 19.1	1028
Semantic segmentation	Cityscapes	mIoU 3.23e+3	668
Semantic segmentation	ADE20K	mIoU 17.8	559
Semantic segmentation	Cityscapes (val)	mIoU 35.5	527
Semantic segmentation	Cityscapes	mIoU 36.74	494
Semantic segmentation	COCO Stuff	mIoU 23.64	399
Semantic segmentation	ADE20K A-150	mIoU 19.1	224
Semantic segmentation	Pascal Context	mIoU 35.17	217
Semantic segmentation	Pascal Context 59	mIoU 38.4	204

Showing 10 of 96 rows
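
For readers unfamiliar with the metric in the table, mean intersection-over-union (mIoU) averages the per-class ratio of correctly labelled pixels to the union of predicted and ground-truth pixels. The sketch below is a generic illustration over flat label lists, not the evaluation code used for these benchmarks. The function name and the `ignore_index=255` default are assumptions, and real benchmarks typically accumulate counts over a whole dataset and report the result as a percentage.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in pred or target.

    pred, target: flat sequences of integer class labels, one per pixel,
    with predictions assumed to lie in [0, num_classes).
    Pixels whose target equals ignore_index are skipped.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # Correct pixel: counts once in both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # Wrong pixel: enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```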
