Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

About

Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks. Source code and models are publicly available at: https://lorebianchi98.github.io/Talk2DINO/.
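As a rough illustration of the idea in the abstract, the sketch below projects text embeddings into the patch feature space and labels each patch with its most similar concept. It is a minimal sketch under stated assumptions, not the authors' implementation: random arrays stand in for the CLIP text embeddings and the DINOv2 patch features, a single affine layer with random weights stands in for the learned mapping, and cosine similarity followed by an argmax is assumed as the assignment rule. The names `project_text` and `segment_patches` are hypothetical, and the attention-guided alignment used at training time is not shown.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def project_text(text_emb, weight, bias):
    """Map text embeddings (C, d_text) into patch space (C, d_patch).

    Assumption: one affine layer stands in for the learned mapping.
    """
    return text_emb @ weight + bias


def segment_patches(patch_feats, text_emb, weight, bias):
    """Assign each patch to its most similar textual concept.

    patch_feats: (H, W, d_patch) patch-level visual features.
    text_emb:    (C, d_text) one embedding per concept.
    Returns an (H, W) label map and the (H, W, C) cosine similarities.
    """
    projected = l2_normalize(project_text(text_emb, weight, bias))
    patches = l2_normalize(patch_feats)
    sim = patches @ projected.T  # cosine similarity per patch and concept
    return sim.argmax(axis=-1), sim


# Toy data: every array here is synthetic and purely illustrative.
rng = np.random.default_rng(0)
d_text, d_patch, n_concepts, grid = 512, 768, 3, 4

text_emb = rng.standard_normal((n_concepts, d_text))
weight = rng.standard_normal((d_text, d_patch)) / np.sqrt(d_text)
bias = np.zeros(d_patch)

# Build patch features that lie close to a known concept, so that the
# expected label map is known in advance.
true_labels = rng.integers(0, n_concepts, size=(grid, grid))
patch_feats = project_text(text_emb, weight, bias)[true_labels]
patch_feats = patch_feats + 0.01 * rng.standard_normal(patch_feats.shape)

labels, sim = segment_patches(patch_feats, text_emb, weight, bias)
print(labels)
```

In a real pipeline the patch-level label map would still need to be upsampled to the image resolution; that step is omitted here.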

Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Rita Cucchiara • 2024

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K	mIoU 22.5	559
Semantic segmentation	Cityscapes	mIoU 41.15	494
Semantic segmentation	COCO Stuff	mIoU 27.89	399
Semantic segmentation	Pascal Context	mIoU 40.31	217
Semantic segmentation	Pascal VOC	mIoU 85.68	159
Open Vocabulary Semantic Segmentation	Pascal VOC 20	mIoU 87.1	113
Open Vocabulary Semantic Segmentation	Cityscapes	mIoU 36.6	81
Open Vocabulary Semantic Segmentation	ADE20K	mIoU 21.1	80
Open Vocabulary Semantic Segmentation	COCOStuff (val)	mIoU 30.2	60
Semantic segmentation	PASCAL VOC with background category VOC21 2012	mIoU 65.8	51

Showing 10 of 34 rows
