
Photorealistic Video Generation with Diffusion Models

About

We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach rests on two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos of $512 \times 896$ resolution at $8$ frames per second.
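To make the second design decision concrete, the following is a minimal sketch of window-restricted self-attention over a spatiotemporal latent. It is not W.A.L.T's implementation: projections, heads, and normalization are omitted, the window shapes and the NumPy formulation are illustrative assumptions, and only the core idea is shown, namely that each token attends only to tokens inside its own non-overlapping window, so a window spanning one frame gives spatial attention while a window spanning several frames gives spatiotemporal attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(latent, window):
    """Self-attention restricted to non-overlapping spatiotemporal windows.

    latent: array of shape (T, H, W, C); window: (wt, wh, ww), each
    assumed to divide the corresponding latent dimension. Identity
    Q/K/V projections keep the sketch minimal.
    """
    T, H, W, C = latent.shape
    wt, wh, ww = window
    out = np.empty_like(latent)
    for t0 in range(0, T, wt):
        for h0 in range(0, H, wh):
            for w0 in range(0, W, ww):
                blk = latent[t0:t0 + wt, h0:h0 + wh, w0:w0 + ww, :]
                tokens = blk.reshape(-1, C)  # flatten the window to tokens
                # Attention is computed only among tokens of this window,
                # so cost grows with window size, not full sequence length.
                scores = tokens @ tokens.T / np.sqrt(C)
                attn = softmax(scores, axis=-1)
                out[t0:t0 + wt, h0:h0 + wh, w0:w0 + ww, :] = (
                    attn @ tokens
                ).reshape(blk.shape)
    return out

x = np.random.default_rng(0).normal(size=(4, 8, 8, 16))
y_spatial = window_attention(x, (1, 4, 4))  # attends within each frame
y_st = window_attention(x, (2, 4, 4))       # attends across pairs of frames
```

Restricting attention to windows is what makes joint training over images (single-frame latents) and videos tractable: the same block applies to both, and memory scales with window size rather than with the full video length.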

Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Class-conditional Image Generation | ImageNet 256x256 (train/val) | FID | 2.4 | 178 |
| Class-conditional Image Generation | ImageNet-1K 256x256 (test) | FID | 2.56 | 50 |
| Video Prediction | Kinetics-600 (test) | FVD | 3.3 | 46 |
| Text-to-Video Generation | UCF-101 (zero-shot) | FVD | 258.1 | 44 |
| Class-conditioned Video Generation | UCF-101 (test) | FVD | 36 | 19 |
| Conditional Video Generation | Kinetics-600 (test) | FVD (50k) | 3.3 | 10 |
