
Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion

About

Producing quality segmentation masks for images is a fundamental problem in computer vision. Recent research has explored large-scale supervised training to enable zero-shot segmentation on virtually any image style, and unsupervised training to enable segmentation without dense annotations. However, constructing a model capable of segmenting anything in a zero-shot manner without any annotations is still challenging. In this paper, we propose to utilize the self-attention layers in stable diffusion models to achieve this goal, because a pre-trained stable diffusion model has learned inherent concepts of objects within its attention layers. Specifically, we introduce a simple yet effective iterative merging process based on measuring KL divergence among attention maps to merge them into valid segmentation masks. The proposed method requires neither training nor language dependency to extract quality segmentation for any image. On COCO-Stuff-27, our method surpasses the prior unsupervised zero-shot SOTA method by an absolute 26% in pixel accuracy and 17% in mean IoU. The project page is at https://sites.google.com/view/diffseg/home.

Junjiao Tian, Lavisha Aggarwal, Andrea Colaco, Zsolt Kira, Mar Gonzalez-Franco• 2023
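The iterative KL-based merging described in the abstract can be sketched roughly as follows. This is a minimal hypothetical illustration only: the function names, the symmetric-KL criterion, and the merging threshold are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    # Treat two attention maps as probability distributions over pixels
    # and compute KL(p || q). eps avoids log(0).
    p = p.ravel().astype(np.float64) + eps
    q = q.ravel().astype(np.float64) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def merge_attention_maps(maps, threshold=1.0):
    # Hypothetical sketch of KL-based iterative merging: repeatedly find a
    # pair of maps whose symmetric KL divergence is below `threshold`,
    # average them, and continue until no pair is similar enough.
    maps = [m.astype(np.float64) for m in maps]
    merged = True
    while merged and len(maps) > 1:
        merged = False
        for i in range(len(maps)):
            for j in range(i + 1, len(maps)):
                d = 0.5 * (kl_divergence(maps[i], maps[j]) +
                           kl_divergence(maps[j], maps[i]))
                if d < threshold:
                    maps[i] = 0.5 * (maps[i] + maps[j])
                    del maps[j]
                    merged = True
                    break
            if merged:
                break
    return maps  # each surviving map corresponds to one candidate mask
```

In the paper's setting, the inputs would be self-attention maps extracted from a pre-trained Stable Diffusion U-Net; here any list of 2-D non-negative arrays stands in for them.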

Related benchmarks

Task                   Dataset                  Result      Rank
Semantic segmentation  ADE20K (val)             37.7 mIoU   2888
Semantic segmentation  PASCAL VOC 2012 (val)    39.4 mIoU   2142
Semantic segmentation  ADE20K                   37.7 mIoU   1024
Semantic segmentation  Cityscapes               16.8 mIoU   658
Semantic segmentation  Cityscapes (val)         21.2 mIoU   572
Semantic segmentation  PASCAL VOC (val)         49.8 mIoU   362
Semantic segmentation  PASCAL Context (val)     48.8 mIoU   360
Semantic segmentation  Pascal VOC               49.8 mIoU   180
Semantic segmentation  COCO Object              23.2 mIoU   129
Semantic segmentation  COCO Object (val)        23.2 mIoU   97

Showing 10 of 26 rows.
