MAGVIT: Masked Generative Video Transformer

About

We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu.

Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang • 2022

Related benchmarks

Task                                 Dataset               Result                        Rank
Class-conditional Image Generation   ImageNet 256x256      Inception Score (IS): 319.4   815
Class-conditional Image Generation   ImageNet 512x512      FID: 1.91                     111
Video Generation                     UCF-101 (test)        Inception Score: 89.27        105
Video Compression                    MCL-JCV               --                            79
Video Prediction                     Kinetics-600 (test)   FVD: 9.9                      46
Video Reconstruction                 UCF-101               rFVD: 25                      39
Video Prediction                     BAIR Robot Pushing    FVD: 31                       38
Video Frame Prediction               Kinetics-600          gFVD: 9.9                     38
Video Prediction                     BAIR Push (test)      FVD: 62                       30
Video Generation                     UCF-101               FVD: 76                       30
Showing 10 of 27 rows
