The Journey, Not the Destination: How Data Guides Diffusion Models
About
Diffusion models trained on large datasets can synthesize photo-realistic images of remarkable quality and diversity. However, attributing these images back to the training data, that is, identifying the specific training examples that caused an image to be generated, remains a challenge. In this paper, we propose a framework that (i) provides a formal notion of data attribution in the context of diffusion models, and (ii) allows us to counterfactually validate such attributions. We then provide a method for computing these attributions efficiently. Finally, we apply our method to find (and evaluate) such attributions for denoising diffusion probabilistic models trained on CIFAR-10 and latent diffusion models trained on MS COCO. Code is available at https://github.com/MadryLab/journey-TRAK.
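The abstract's notion of counterfactually validating attributions can be illustrated schematically: rank training examples by their attribution scores, remove the top-scoring ones, retrain, and measure how much the model's output changes. The sketch below is a generic illustration of this idea, not the paper's actual pipeline; the function names (`counterfactual_validation`, `retrain_fn`, `eval_fn`) and the toy "model" are assumptions made for demonstration only.

```python
import numpy as np

def counterfactual_validation(scores, train_data, retrain_fn, eval_fn, k=10):
    """Schematic counterfactual check: drop the k most-attributed
    training examples, retrain, and measure the change in the output.
    A large change suggests the attribution scores were meaningful."""
    top_k = np.argsort(scores)[::-1][:k]          # indices of most influential examples
    mask = np.ones(len(train_data), dtype=bool)
    mask[top_k] = False                            # ablate the top-k examples
    model_full = retrain_fn(train_data)            # model trained on all data
    model_ablated = retrain_fn(train_data[mask])   # model trained without top-k
    return eval_fn(model_full) - eval_fn(model_ablated)

# Toy demonstration: the "model" is just the mean of the data, and the
# attribution score of each point is its distance from zero.
rng = np.random.default_rng(0)
data = rng.normal(size=100)
scores = np.abs(data)
delta = counterfactual_validation(
    scores, data,
    retrain_fn=lambda d: d.mean(),
    eval_fn=lambda m: m,
    k=10,
)
print(delta)
```

In the real setting, `retrain_fn` would retrain the diffusion model from scratch on the ablated dataset and `eval_fn` would score the generated image of interest, which is what makes efficient attribution methods necessary in practice.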
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Contributor Attribution | Fashion Product | Diversity | 13.58 | 48 |
| Contributor Attribution | ArtBench Post-Impressionism | Aesthetic Score | -11.94 | 36 |
| Contributor Attribution | CIFAR-20 | Inception Score | 10.8 | 32 |
| Contributor Attribution | ArtBench Post-Impressionism (test) | Aesthetic Score | -4.81 | 18 |
| Contributor Attribution | CIFAR-20 (test) | Inception Score | -1.67 | 16 |