What the DAAM: Interpreting Stable Diffusion Using Cross Attention

About

Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses. In this paper, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced model. To produce pixel-level attribution maps, we upscale and aggregate cross-attention word-pixel scores in the denoising subnetwork, naming our method DAAM. We evaluate its correctness by testing its semantic segmentation ability on nouns, as well as its generalized attribution quality on all parts of speech, rated by humans. We then apply DAAM to study the role of syntax in the pixel space, characterizing head–dependent heat map interaction patterns for ten common dependency relations. Finally, we study several semantic phenomena using DAAM, with a focus on feature entanglement, where we find that cohyponyms worsen generation quality and descriptive adjectives attend too broadly. To our knowledge, we are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future lines of research. Our code is at https://github.com/castorini/daam.

Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, Ferhan Ture • 2022
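
The core of the method is simple to sketch: for a given word, collect the cross-attention scores that the denoising U-Net assigns to that word's token at every cross-attention layer, head, and timestep, upscale each latent-resolution map to the image size, and sum. Below is a minimal PyTorch sketch of that aggregation; the function name and tensor layout are illustrative assumptions rather than the repository's actual hooks, and bicubic interpolation is one reasonable choice of deterministic upscaling.

```python
import torch
import torch.nn.functional as F

def daam_heat_map(attention_maps, token_idx, out_size=512):
    """Aggregate cross-attention scores for one token into a pixel-level map.

    attention_maps: iterable of tensors shaped (heads, h, w, seq_len), one per
        cross-attention layer and denoising timestep (hypothetical layout; the
        real hooks depend on the Stable Diffusion implementation).
    token_idx: index of the word's token in the prompt sequence.
    """
    total = torch.zeros(out_size, out_size)
    for attn in attention_maps:
        # Take the attention column for the token of interest: (heads, h, w).
        token_map = attn[..., token_idx]
        # Upscale each head's latent-resolution map to the image resolution.
        token_map = F.interpolate(
            token_map.unsqueeze(1),  # (heads, 1, h, w)
            size=(out_size, out_size),
            mode='bicubic',
            align_corners=False,
        )
        # Sum over heads; the loop sums over layers and timesteps.
        total += token_map.squeeze(1).sum(dim=0)
    return total
```

In practice, the released package at the repository above wraps this aggregation behind a tracing hook around a diffusers Stable Diffusion pipeline, so heat maps can be computed per word while generating an image.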

Related benchmarks

Task | Dataset | Result | Rank
Motion localization | MeViS | SL Score: 36 | 15
Object Semantic Segmentation | Animals category, 100 images generated from Stable Diffusion v1.4 (test) | mIoU: 75.4 | 6
Mask Generation | VOC sim | mIoU: 66.2 | 6
Mask Generation | COCO-cap | mIoU: 48.4 | 6
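
The mIoU entries above come from treating DAAM heat maps as segmentation proposals: a word's aggregated map is binarized and compared against a ground-truth mask. A hedged sketch of that scoring step follows, with hypothetical helper names and a simple fixed threshold; the paper's exact thresholding procedure may differ.

```python
import torch

def heat_map_to_mask(heat_map: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # Min-max normalize the aggregated heat map to [0, 1], then binarize.
    hm = (heat_map - heat_map.min()) / (heat_map.max() - heat_map.min() + 1e-8)
    return hm >= threshold

def iou(pred: torch.Tensor, target: torch.Tensor) -> float:
    # Intersection-over-union between two boolean masks.
    inter = (pred & target).sum().item()
    union = (pred | target).sum().item()
    return inter / union if union > 0 else 1.0

# mIoU is then the mean IoU over all word-mask pairs in the test set.
```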
