Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

About

This paper presents Diffusion Forcing, a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens without fully diffusing past ones. Our approach is shown to combine the strengths of next-token prediction models, such as variable-length generation, with the strengths of full-sequence diffusion models, such as the ability to guide sampling to desirable trajectories. Our method offers a range of additional capabilities, such as (1) rolling out sequences of continuous tokens, such as video, with lengths past the training horizon, where baselines diverge, and (2) new sampling and guiding schemes that uniquely profit from Diffusion Forcing's variable-horizon and causal architecture, and which lead to marked performance gains in decision-making and planning tasks. In addition to its empirical success, our method is proven to optimize a variational lower bound on the likelihoods of all subsequences of tokens drawn from the true joint distribution. Project website: https://boyuan.space/diffusion-forcing
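
To make the phrase "independent per-token noise levels" concrete, the sketch below noises each token of a sequence with its own randomly drawn level. It is an illustration only, not the authors' implementation: the scalar tokens, the cosine schedule, the standard variance-preserving noising formula, and the names `cosine_alpha_bar` and `noise_sequence` are all assumptions introduced here. A real model would be a causal network trained to predict the noise (or the clean token) from the noisy tokens and their levels.

```python
import math
import random


def cosine_alpha_bar(k, num_levels):
    """Signal fraction kept at noise level k (assumed cosine schedule).

    k = 0 means a clean token; k = num_levels means (almost) pure noise.
    """
    return math.cos(0.5 * math.pi * k / num_levels) ** 2


def noise_sequence(tokens, num_levels, rng):
    """Noise every token with its own independently sampled level.

    Full-sequence diffusion would draw ONE level for the whole sequence;
    here each token gets a separate draw, which is the idea the abstract
    describes. Returns (noisy_tokens, levels, eps), where eps would be the
    regression target for a noise-predicting denoiser.
    """
    noisy, levels, eps = [], [], []
    for x in tokens:
        k = rng.randint(0, num_levels)      # independent level per token
        a = cosine_alpha_bar(k, num_levels)
        e = rng.gauss(0.0, 1.0)             # Gaussian noise sample
        noisy.append(math.sqrt(a) * x + math.sqrt(1.0 - a) * e)
        levels.append(k)
        eps.append(e)
    return noisy, levels, eps


# Example: a toy sequence of four scalar "tokens".
rng = random.Random(0)
noisy, levels, eps = noise_sequence([0.5, -1.0, 2.0, 0.0], num_levels=1000, rng=rng)
```

Under these assumptions, a token with level 0 is left untouched and a token at the maximum level is essentially pure noise, so a single training sequence can mix nearly clean past tokens with heavily noised future ones.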

Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann • 2024

Related benchmarks

Task	Dataset	Result
Video Generation	UCF101	FVD 274.5	68
Unconditional Generation	OpenWebText (OWT) L=1024 (held-out)	MAUVE 0.254	45
Video Generation	SkyTimelapse	FVD 251.9	22
Long-horizon Video Generation	RoboArena	PSNR 15.81	19
Video Generation	FaceForensics	FVD 99.5	15
Navigation	D4RL Maze2d-umaze	Normalized Return 116.7	14
Locomotion	Cheetah-Wind-S c^s	Average Return -102	14
Locomotion	Cheetah-Wind-E (c^s)	Average Return -105.8	14
Locomotion	Cheetah-Vel-E (c^r)	Average Return -85.6	14
Locomotion	Ant-Dir-E c^r	Average Return 195.4	14

Showing 10 of 27 rows
