VideoGPT: Video Generation using VQ-VAE and Transformers
About
We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity of its formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and to generate high-fidelity natural videos from UCF-101 and the Tumblr GIF dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models. Samples and code are available at https://wilson1yan.github.io/videogpt/index.html
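The discretization step described above can be illustrated with a minimal sketch: the VQ-VAE encoder produces continuous latent vectors, and each one is snapped to its nearest codebook entry, yielding the discrete tokens that the GPT-like prior later models autoregressively. This is an illustrative numpy sketch, not the paper's implementation; the function name `quantize` and the toy codebook values are assumptions for demonstration only.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each continuous latent vector to its nearest codebook entry
    (the VQ-VAE discretization step; illustrative sketch only)."""
    # latents: (N, D) encoder outputs; codebook: (K, D) learned embeddings.
    # Squared Euclidean distance from every latent to every codebook entry.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # discrete tokens modeled by the prior
    quantized = codebook[indices]    # quantized values passed to the decoder
    return indices, quantized

# Toy example (hypothetical values): 4 latents, 3 codes, 2-D embeddings.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2], [-1.1, 0.8], [0.0, 0.1]])
indices, quantized = quantize(latents, codebook)
print(indices.tolist())  # → [0, 1, 2, 0]
```

In the full model the latents form a downsampled 3-D grid over time, height, and width; flattening that grid into a token sequence is what allows a standard transformer with spatio-temporal position encodings to model it.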
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Generation | UCF-101 (test) | Inception Score | 24.69 | 105 |
| Video Prediction | BAIR (test) | FVD | 103.3 | 59 |
| Video Generation | UCF-101 | FVD | 2.88e+3 | 54 |
| Video Prediction | BAIR Robot Pushing | FVD | 103.3 | 38 |
| Video Prediction | BAIR | FVD | 103.3 | 34 |
| Video Generation | SkyTimelapse | FVD16 | 222.7 | 21 |
| Cardiac disease classification | UKB-HF | Accuracy | 80 | 17 |
| Cardiac disease classification | UKB-CAD | Accuracy | 70.9 | 17 |
| Cardiac disease classification | UKB-CM | Accuracy | 79.8 | 17 |
| Cardiac Phenotype Prediction | UKB dataset (test) | LVEDV MAE | 11.13 | 17 |