
How much is a noisy image worth? Data Scaling Laws for Ambient Diffusion

About

The quality of generative models depends on the quality of the data they are trained on. Creating large-scale, high-quality datasets is often expensive and sometimes impossible, e.g. in certain scientific applications where physical or instrumentation constraints preclude access to clean data. Ambient Diffusion and related frameworks train diffusion models solely on corrupted data (which is usually cheaper to acquire), but ambient models significantly underperform models trained on clean data. We study this phenomenon at scale by training more than $80$ models on data with different corruption levels across three datasets ranging from $30,000$ to $\approx 1.3$M samples. We show that it is impossible, at these sample sizes, to match the performance of models trained on clean data when training only on noisy data. Yet, a combination of a small set of clean data (e.g. $10\%$ of the total dataset) and a large set of highly noisy data suffices to reach the performance of models trained solely on similar-size datasets of clean data, and in particular to achieve near state-of-the-art performance. We provide theoretical evidence for our findings by developing novel sample complexity bounds for learning from Gaussian Mixtures with heterogeneous variances. Our theoretical model suggests that, for large enough datasets, the effective marginal utility of a noisy sample is exponentially worse than that of a clean sample. Providing a small set of clean samples can significantly reduce the sample size requirements for noisy data, as we also observe in our experiments.
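The data mixture described above (a small clean subset plus a large, heavily noised remainder) can be sketched as follows. This is an illustrative toy, not the paper's training pipeline: the function name, the $10\%$ clean fraction, and the Gaussian noise level `noise_sigma` are placeholders standing in for the corruption levels studied in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mixed_dataset(images, clean_fraction=0.1, noise_sigma=0.5, rng=rng):
    """Split a dataset into a small clean subset and a large noisy subset.

    Models the setup in the abstract: keep `clean_fraction` of the samples
    clean and corrupt the rest with additive Gaussian noise of scale
    `noise_sigma` (a stand-in for the paper's corruption levels).
    """
    n = len(images)
    n_clean = int(clean_fraction * n)
    idx = rng.permutation(n)  # random split into clean / to-be-corrupted
    clean_part = images[idx[:n_clean]]
    rest = images[idx[n_clean:]]
    noisy_part = rest + noise_sigma * rng.standard_normal(rest.shape)
    return clean_part, noisy_part

# Toy example: 1,000 "images" of 8x8 pixels
data = rng.standard_normal((1000, 8, 8))
clean, noisy = make_mixed_dataset(data)  # 100 clean, 900 noisy samples
```

Training would then draw from both subsets, with the small clean set anchoring the model and the noisy set supplying scale, per the paper's findings.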

Giannis Daras, Yeshwanth Cherapanamjeri, Constantinos Daskalakis• 2024

Related benchmarks

Task             | Dataset          | Result     | Rank
Image Generation | CIFAR-10 32x32   | FID 2.81   | 147
Image Generation | CelebA-64        | FID 6.81   | 75
Denoising        | CIFAR-10 32x32   | FID 11.93  | 13
Denoising        | CelebA-HQ 64x64  | FID 12.97  | 9
