DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents

About

Diffusion models (DMs) have revolutionized generative learning. They utilize a diffusion process to encode data into a simple Gaussian distribution. However, encoding a complex, potentially multimodal data distribution into a single continuous Gaussian distribution arguably represents an unnecessarily challenging learning problem. We propose Discrete-Continuous Latent Variable Diffusion Models (DisCo-Diff) to simplify this task by introducing complementary discrete latent variables. We augment DMs with learnable discrete latents, inferred with an encoder, and train the DM and encoder end-to-end. DisCo-Diff does not rely on pre-trained networks, making the framework universally applicable. The discrete latents significantly simplify learning the DM's complex noise-to-data mapping by reducing the curvature of the DM's generative ODE. An additional autoregressive transformer models the distribution of the discrete latents, a simple step because DisCo-Diff requires only a few discrete variables with small codebooks. We validate DisCo-Diff on toy data, several image synthesis tasks, and molecular docking, and find that introducing discrete latents consistently improves model performance. For example, DisCo-Diff achieves state-of-the-art FID scores on the class-conditioned ImageNet-64 and ImageNet-128 datasets with an ODE sampler.
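The two-stage sampling idea in the abstract (first draw discrete latents, then run a latent-conditioned generative ODE from Gaussian noise) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper's code: the 1-D Gaussian-mixture data, the names `MODE_MEANS`, `PRIOR`, `denoiser`, and `sample`, the fixed categorical `PRIOR` standing in for the autoregressive transformer, the closed-form denoiser standing in for the trained network, and the noise-level ODE parameterization dx/dσ = (x − D(x, σ, z))/σ integrated with Euler steps. The encoder and training loop are omitted.

```python
import numpy as np

# Hypothetical toy data: a 1-D mixture of narrow Gaussians.
# The discrete latent z simply indexes the mixture mode.
MODE_MEANS = np.array([-4.0, 0.0, 4.0])
MODE_STD = 0.2
# Fixed categorical prior over z (stand-in for a learned autoregressive prior).
PRIOR = np.array([0.5, 0.3, 0.2])


def denoiser(x, sigma, z):
    """Exact posterior mean E[x0 | x, sigma, z] when mode z is Gaussian.

    In a real model this would be a neural network conditioned on z.
    """
    mu = MODE_MEANS[z]
    weight = MODE_STD**2 / (MODE_STD**2 + sigma**2)
    return mu + weight * (x - mu)


def sample(n, sigma_max=80.0, sigma_min=1e-3, steps=200, seed=0):
    """Draw z from the prior, then integrate the z-conditioned ODE with Euler."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(PRIOR), size=n, p=PRIOR)   # stage 1: discrete latents
    x = rng.normal(0.0, sigma_max, size=n)        # stage 2: start from noise
    sigmas = np.geomspace(sigma_max, sigma_min, steps + 1)
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        slope = (x - denoiser(x, s_cur, z)) / s_cur  # dx/dsigma
        x = x + (s_next - s_cur) * slope
    return z, x


if __name__ == "__main__":
    z, x = sample(5)
    for zi, xi in zip(z, x):
        print(f"z={zi}  x={xi:+.3f}  (mode mean {MODE_MEANS[zi]:+.1f})")
```

In this toy, conditioning on `z` makes each target distribution a single Gaussian, so the conditional ODE only has to shrink the noise toward one mode. That is a simplified picture of the abstract's claim that discrete latents ease the noise-to-data mapping, not a reproduction of the paper's experiments.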

Yilun Xu, Gabriele Corso, Tommi Jaakkola, Arash Vahdat, Karsten Kreis • 2024

Related benchmarks

Task	Dataset	Result
Image Generation	ImageNet 64x64 (train val)	FID 1.22	83
Image Generation	ImageNet 128x128	FID 1.73	74
Molecular Docking	PDBBind (unseen receptors)	Top-1 RMSD < 2Å (%) 18.5	17
Image Generation	ImageNet 128x128 (val)	FID 1.73	15
Image Generation	ImageNet 128x128 (train val)	FID 1.73	8
Molecular Docking	PDBBind Full (test)	Top-1 Success Rate (2Å) 35.4	8

