VideoGPT: Video Generation using VQ-VAE and Transformers

About

We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and to generate high-fidelity natural videos from UCF-101 and the Tumblr GIF Dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models. Samples and code are available at https://wilson1yan.github.io/videogpt/index.html

Wilson Yan, Yunzhi Zhang, Pieter Abbeel, Aravind Srinivas • 2021
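
The two-stage idea in the abstract (quantize latents into discrete codes, then model the codes autoregressively) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the tiny hand-written codebook, and the raster ordering (time, then row, then column) are all assumptions made for illustration, and nothing here is learned.

```python
# Illustrative sketch only: toy sizes, hypothetical names, no learning involved.

def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def encode_to_tokens(latents, codebook):
    """Map a T x H x W grid of latent vectors to a flat list of discrete codes.

    Codes are emitted in raster order (time, then row, then column), which is
    one simple ordering an autoregressive prior could be trained on (assumed).
    """
    return [
        quantize(vec, codebook)
        for frame in latents
        for row in frame
        for vec in row
    ]


# Hypothetical 3-entry codebook of 2-dimensional embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]

# Hypothetical encoder output: 2 frames, each 1 row x 2 columns of latent vectors.
latents = [
    [[[0.1, -0.1], [0.9, 1.2]]],
    [[[-0.8, 0.7], [0.2, 0.1]]],
]

tokens = encode_to_tokens(latents, codebook)
print(tokens)  # [0, 1, 2, 0]
```

In a real system the latent grid would come from a trained encoder and the token sequence would be fed to a transformer prior; the sketch only shows the nearest-neighbor lookup and the flattening step.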

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Generation | UCF-101 (test) | Inception Score | 24.69 | 105 |
| Video Generation | UCF101 | FVD | 2.88e+3 | 68 |
| Video Prediction | BAIR (test) | FVD | 103.3 | 59 |
| Video Prediction | BAIR Robot Pushing | FVD | 103.3 | 38 |
| Video Prediction | Bair | FVD | 103.3 | 34 |
| Cardiac Phenotype Prediction | UKB dataset (test) | LVEDV MAE | 11.13 | 31 |
| Video Prediction | BAIR 64x64 (test) | FVD | 103.3 | 27 |
| Video Generation | SkyTimelapse | FVD | 222.7 | 22 |
| Video Prediction | UCF-101 (test) | -- | -- | 19 |
| Cardiac disease classification | UKB-HF | Accuracy | 80 | 17 |
Showing 10 of 36 rows
