
Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models

About

Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting similar video data of the same scale is still challenging. Also, training a video diffusion model is computationally much more expensive than its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to video noise prior in video diffusion leads to sub-optimal performance. Our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a $10\times$ smaller model using significantly less computation than the prior art.
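The abstract above does not spell out the design of the video noise prior. As an illustration of the general idea of a temporally correlated noise prior, the following is a minimal sketch of a "mixed" noise formulation: each frame's noise is a shared component plus an independent per-frame component, scaled so every frame remains standard Gaussian while frames stay correlated. The function name `mixed_noise` and the correlation-strength parameter `alpha` are illustrative assumptions, not names from the paper.

```python
import numpy as np

def mixed_noise(num_frames, shape, alpha=1.0, rng=None):
    """Sample correlated Gaussian noise for a video clip (illustrative sketch).

    Each frame's noise = shared component + independent component, with
    variances alpha^2/(1+alpha^2) and 1/(1+alpha^2), so per-frame noise is
    N(0, I) while any two frames have correlation alpha^2/(1+alpha^2).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Shared component, common to all frames (carries temporal correlation).
    shared = rng.standard_normal(shape) * np.sqrt(alpha**2 / (1.0 + alpha**2))
    frames = []
    for _ in range(num_frames):
        # Independent per-frame component.
        ind = rng.standard_normal(shape) * np.sqrt(1.0 / (1.0 + alpha**2))
        frames.append(shared + ind)
    return np.stack(frames)
```

With `alpha=1.0`, adjacent frames share half of their noise variance (correlation 0.5); `alpha=0` recovers the naive i.i.d. image noise prior extended to video.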

Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, Yogesh Balaji · 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Video Generation | MSR-VTT (test) | -- | -- | 85 |
| Text-to-Video Generation | UCF-101 | FVD | 355.2 | 61 |
| Video Generation | UCF101 | FVD | 355.2 | 54 |
| Text-to-Video Generation | UCF-101 zero-shot | FVD | 355.2 | 44 |
| Text-to-Video Generation | MSR-VTT | -- | -- | 28 |
| Text-to-Video Generation | UCF-101 (test) | FVD | 355.2 | 25 |
| Text-to-Video Generation | MSR-VTT zero-shot | CLIPSIM | 32.04 | 20 |
| Zero-shot video generation | UCF-101 v1.0 (train test) | FVD | 355.2 | 12 |
| Text-to-Video Generation | MSR-VTT 63 | FID | 9.73 | 7 |
