Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
About
Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting video data of a similar scale remains challenging. Also, training a video diffusion model is computationally much more expensive than its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to a video noise prior in video diffusion leads to sub-optimal performance. Our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own Correlation (PYoCo), attains state-of-the-art (SOTA) zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a $10\times$ smaller model, using significantly less computation than the prior art.
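The core idea of a correlated video noise prior can be illustrated with a short sketch. The example below is not the paper's exact formulation; the function name and the mixing hyperparameter `alpha` are our own labels. It samples each frame's noise as a weighted sum of one clip-level shared component and a per-frame independent component, with weights chosen so every frame keeps unit variance while any two frames are positively correlated.

```python
import numpy as np

def mixed_video_noise(num_frames, frame_shape, alpha=1.0, rng=None):
    """Sample temporally correlated Gaussian noise for a video clip.

    Each frame's noise mixes one clip-level shared component with a
    per-frame independent component. The weights are chosen so that
    each frame has unit variance, while the correlation between any
    two distinct frames is alpha**2 / (1 + alpha**2).
    """
    rng = np.random.default_rng() if rng is None else rng
    shared = rng.standard_normal(frame_shape)                  # one sample shared by the whole clip
    independent = rng.standard_normal((num_frames, *frame_shape))
    # w_shared**2 + w_ind**2 = 1, so per-frame variance stays 1.
    w_shared = alpha / np.sqrt(1.0 + alpha**2)
    w_ind = 1.0 / np.sqrt(1.0 + alpha**2)
    return w_shared * shared[None, ...] + w_ind * independent

# Example: noise for a 16-frame clip of 4-channel 64x64 latents.
eps = mixed_video_noise(16, (4, 64, 64), alpha=1.0)
```

Setting `alpha = 0` recovers the naive i.i.d. per-frame prior that the abstract warns against; larger values share more noise across frames and thus preserve more temporal correlation.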
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Video Generation | MSR-VTT (test) | -- | 85 |
| Text-to-Video Generation | UCF-101 | FVD 355.2 | 61 |
| Video Generation | UCF-101 | FVD 355.2 | 54 |
| Text-to-Video Generation | UCF-101 (zero-shot) | FVD 355.2 | 44 |
| Text-to-Video Generation | MSR-VTT | -- | 28 |
| Text-to-Video Generation | UCF-101 (test) | FVD 355.2 | 25 |
| Text-to-Video Generation | MSR-VTT (zero-shot) | CLIPSIM 32.04 | 20 |
| Zero-shot Video Generation | UCF-101 v1.0 (train/test) | FVD 355.2 | 12 |
| Text-to-Video Generation | MSR-VTT | FID 9.73 | 7 |