MAGVIT: Masked Generative Video Transformer
About
We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best published FVD on three video generation benchmarks, including the challenging Kinetics-600; (ii) MAGVIT is two orders of magnitude faster at inference than diffusion models and 60x faster than autoregressive models; and (iii) a single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Class-conditional Image Generation | ImageNet 256x256 | Inception Score (IS) | 319.4 | 815 |
| Class-conditional Image Generation | ImageNet 512x512 | FID | 1.91 | 111 |
| Video Generation | UCF-101 (test) | Inception Score | 89.27 | 105 |
| Video Compression | MCL-JCV | -- | -- | 79 |
| Video Prediction | Kinetics-600 (test) | FVD | 9.9 | 46 |
| Video Reconstruction | UCF-101 | rFVD | 25 | 39 |
| Video Prediction | BAIR Robot Pushing | FVD | 31 | 38 |
| Video Frame Prediction | Kinetics-600 | gFVD | 9.9 | 38 |
| Video Prediction | BAIR Push (test) | FVD | 62 | 30 |
| Video Generation | UCF-101 | FVD | 76 | 30 |