MAGVIT: Masked Generative Video Transformer

About

We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu.

Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang • 2022
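
The abstract describes two ideas: quantizing a video into spatial-temporal tokens, and masking those tokens for modeling. The following is a minimal toy sketch of both, not the authors' implementation. It assumes a grayscale video stored as nested lists and a hand-written codebook, and it substitutes a nearest-neighbour lookup on raw 3D patches for the learned 3D tokenizer. The names `tokenize_video`, `mask_tokens`, `MASK`, and the patch size are hypothetical and chosen only for illustration.

```python
import random

MASK = -1  # hypothetical id standing in for a special [MASK] token


def tokenize_video(video, codebook, patch=(2, 2, 2)):
    """Map each non-overlapping 3D patch (time, height, width) to the
    index of its nearest codebook entry (squared Euclidean distance).

    Assumption: raw-pixel patches stand in for learned encoder features.
    """
    n_t, n_h, n_w = len(video), len(video[0]), len(video[0][0])
    p_t, p_h, p_w = patch
    tokens = []
    for t in range(0, n_t, p_t):
        plane = []
        for y in range(0, n_h, p_h):
            row = []
            for x in range(0, n_w, p_w):
                # Flatten the patch into a vector the same length as a code.
                vec = [video[t + dt][y + dy][x + dx]
                       for dt in range(p_t)
                       for dy in range(p_h)
                       for dx in range(p_w)]
                nearest = min(
                    range(len(codebook)),
                    key=lambda k: sum((a - b) ** 2
                                      for a, b in zip(vec, codebook[k])),
                )
                row.append(nearest)
            plane.append(row)
        tokens.append(plane)
    return tokens


def mask_tokens(tokens, ratio, rng):
    """Replace a random fraction `ratio` of the token grid with MASK."""
    coords = [(t, y, x)
              for t, plane in enumerate(tokens)
              for y, row in enumerate(plane)
              for x in range(len(row))]
    chosen = set(rng.sample(coords, round(ratio * len(coords))))
    return [[[MASK if (t, y, x) in chosen else tok
              for x, tok in enumerate(row)]
             for y, row in enumerate(plane)]
            for t, plane in enumerate(tokens)]


# Toy input: a 4-frame 4x4 video, dark for two frames and then bright.
dark = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]
bright = [[[0.9] * 4 for _ in range(4)] for _ in range(2)]
video = dark + bright
codebook = [[0.0] * 8, [1.0] * 8]  # two codes, one per 2x2x2 patch pattern

tokens = tokenize_video(video, codebook)
masked = mask_tokens(tokens, 0.5, random.Random(0))
print(tokens)
print(masked)
```

In this sketch a model would be trained to predict the original ids at the masked positions. How the masks are built per task and how the masked positions are embedded are specific to the paper and are not reproduced here.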

Related benchmarks

Task	Dataset	Result
Class-conditional Image Generation	ImageNet 256x256	Inception Score (IS) 319.4	967
Image Generation	ImageNet 256x256 (val)	FID 3.65	399
Class-conditional Image Generation	ImageNet 512x512	FID 1.91	126
Video Generation	UCF-101 (test)	Inception Score 89.27	105
Video Compression	MCL-JCV	--	79
Video Prediction	Kinetics-600 (test)	FVD 9.9	46
Video Reconstruction	UCF-101	rFVD 25	39
Video Prediction	BAIR Robot Pushing	FVD 31	38
Video Frame Prediction	Kinetics-600	gFVD 9.9	38
Video Prediction	BAIR Push (test)	FVD 62	30

Showing 10 of 29 rows
