VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation

About

A diffusion probabilistic model (DPM), which constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples, has been shown to handle complex data distributions. Despite its recent success in image synthesis, applying DPMs to video generation is still challenging because of the high-dimensional data space. Previous methods usually adopt a standard diffusion process, in which frames in the same video clip are corrupted with independent noise, ignoring content redundancy and temporal correlation. This work presents a decomposed diffusion process that resolves the per-frame noise into a base noise shared among all frames and a residual noise that varies along the time axis. The denoising pipeline employs two jointly learned networks to match this noise decomposition. Experiments on various datasets confirm that our approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation. We further show that our decomposed formulation can benefit from pre-trained image diffusion models and supports text-conditioned video creation well.

Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, Tieniu Tan • 2023
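
The noise decomposition described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. The function names, the mixing weight `lam`, the square-root (variance-preserving) mixing rule, and the single `alpha_bar` noise level are all assumptions made for illustration. The sketch only shows how a base noise shared by all frames and a per-frame residual noise can be combined so that the noise in different frames of one clip is correlated rather than independent.

```python
import numpy as np


def decomposed_noise(num_frames, frame_shape, lam=0.5, seed=0):
    """Sample per-frame noise as a mix of shared base noise and residual noise.

    `lam` (hypothetical parameter) is the share of variance carried by the
    base noise: lam=1 gives identical noise in every frame, lam=0 gives
    fully independent per-frame noise.
    """
    rng = np.random.default_rng(seed)
    base = rng.standard_normal(frame_shape)                      # shared among all frames
    residual = rng.standard_normal((num_frames, *frame_shape))   # varies along the time axis
    # Assumed variance-preserving mix: lam + (1 - lam) = 1, so each
    # element of `noise` still has unit variance.
    noise = np.sqrt(lam) * base[None] + np.sqrt(1.0 - lam) * residual
    return noise, base, residual


def forward_diffuse(frames, noise, alpha_bar):
    """Standard DPM forward step at one noise level `alpha_bar` in (0, 1)."""
    return np.sqrt(alpha_bar) * frames + np.sqrt(1.0 - alpha_bar) * noise


if __name__ == "__main__":
    frames = np.zeros((4, 64, 64))  # placeholder clip: 4 frames of 64x64
    noise, base, residual = decomposed_noise(4, (64, 64), lam=0.5)
    noisy = forward_diffuse(frames, noise, alpha_bar=0.3)
    # Noise in two different frames is correlated through the shared base.
    corr = np.corrcoef(noise[0].ravel(), noise[1].ravel())[0, 1]
    print(noisy.shape, round(float(corr), 2))
```

Under these assumptions, the correlation between the noise of any two frames is approximately `lam`, which is the property a standard diffusion process with independent per-frame noise lacks.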

Related benchmarks

Task	Dataset	Result
Text-to-Video Generation	VBench	--	168
Video Generation	UCF-101 (test)	Inception Score: 72.22	105
Text-to-Video Generation	MSR-VTT (test)	CLIP Similarity: 0.2795	85
Video Generation	UCF101	FVD: 173	68
Text-to-Video Generation	UCF-101 zero-shot	FVD: 639.9	59
Video Generation	UCF-101	FVD: 173	30
Text-to-Video Generation	MSR-VTT zero-shot	FVD: 550	26
Class-Conditional Video Generation	UCF-101 v1.0 (train test)	FVD: 173	21
Video Generation	Video Generation	Sampling Time (s): 22	21
Video Prediction	UCF-101 (test)	FVD: 173	19

Showing 10 of 28 rows
