Flexible Diffusion Modeling of Long Videos
About
We present a framework for video modeling based on denoising diffusion probabilistic models that produces long-duration video completions in a variety of realistic environments. We introduce a generative model that can, at test time, sample any subset of video frames conditioned on any other subset, and present an architecture adapted for this purpose. This flexibility allows us to efficiently compare and optimize a variety of schedules for the order in which frames of a long video are sampled, and to use selective, sparse, and long-range conditioning on previously sampled frames. We demonstrate improved video modeling over prior work on a number of datasets and sample temporally coherent videos over 25 minutes in length. We additionally release a new video modeling dataset and semantically meaningful metrics based on videos generated in the CARLA autonomous driving simulator.
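To make the idea of a frame-sampling schedule concrete, below is a minimal, hypothetical sketch of one possible schedule of the kind the abstract describes: a long video is generated block by block, and each new block is conditioned on a sparse, long-range subset of previously sampled frames. The function name, block size, and doubling-stride context rule are illustrative assumptions, not the paper's exact algorithm.

```python
def sampling_schedule(num_frames, block_size=4, num_context=3):
    """Yield (frames_to_sample, frames_to_condition_on) index pairs.

    Illustrative schedule: sample frames in contiguous blocks, conditioning
    each block on up to `num_context` earlier frames chosen by walking
    backwards with a doubling stride (recent frames densely, distant frames
    sparsely). This is a sketch, not the paper's implementation.
    """
    sampled = []  # indices of frames already generated
    t = 0
    while t < num_frames:
        block = list(range(t, min(t + block_size, num_frames)))
        # Pick sparse long-range context from already-sampled indices.
        context, stride, idx = [], 1, len(sampled) - 1
        while idx >= 0 and len(context) < num_context:
            context.append(sampled[idx])
            idx -= stride
            stride *= 2
        yield block, sorted(context)
        sampled.extend(block)
        t += block_size


# Example: a 12-frame video sampled in blocks of 4 with 3 context frames.
schedule = list(sampling_schedule(12, block_size=4, num_context=3))
# First block has no context; later blocks condition on sparse earlier frames.
```

A schedule expressed this way can be scored (e.g. by FVD of the resulting completions) and compared against alternatives such as purely autoregressive or hierarchical orderings, which is the kind of comparison the framework enables.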
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long-Context Video Prediction | DMLab 64x64 | FVD | 148 | 12 |
| Video Completion | GQN-Mazes | FVD | 53.1 | 8 |
| Video Completion | MineRL | FVD | 267 | 8 |
| Video Completion | CARLA Town01 | FVD | 117 | 8 |
| Nighttime Video Deraining | SynNightRain (test) | PSNR | 23.49 | 8 |
| Long-Context Video Prediction | Minecraft 128x128 (test) | SSIM | 0.349 | 6 |
| Long Video Generation | FlintstonesHD 16 frames (test) | Avg-FID | 34.47 | 4 |
| Long Video Generation | FlintstonesHD 256 frames (test) | Avg-FID | 38.28 | 4 |
| Long Video Generation | FlintstonesHD 1024 frames (test) | Avg-FID | 43.24 | 4 |