Phenaki: Variable Length Video Generation From Open Domain Textual Description
About
We present Phenaki, a model capable of realistic video synthesis given a sequence of textual prompts. Generating videos from text is particularly challenging due to the computational cost, the limited quantity of high-quality text-video data, and the variable length of videos. To address these issues, we introduce a new model for learning video representations that compresses a video into a small representation of discrete tokens. This tokenizer uses causal attention in time, which allows it to work with variable-length videos. To generate video tokens from text, we use a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to create the actual video. To address data issues, we demonstrate how joint training on a large corpus of image-text pairs together with a smaller number of video-text examples can result in generalization beyond what is available in the video datasets. Compared to previous video generation methods, Phenaki can generate arbitrarily long videos conditioned on a sequence of prompts (i.e., time-variable text or a story) in an open domain. To the best of our knowledge, this is the first paper to study generating videos from time-variable prompts. In addition, compared to per-frame baselines, the proposed video encoder-decoder computes fewer tokens per video but achieves better spatio-temporal consistency.
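To illustrate why causal attention in time is compatible with variable-length videos, here is a minimal sketch of a frame-level causal attention mask. It is not the paper's implementation: the function name `temporal_causal_mask`, the frame-by-frame flattening of tokens, and the plain-list mask layout are all assumptions made for illustration, and the actual tokenizer may factorize spatial and temporal attention differently.

```python
def temporal_causal_mask(num_frames, tokens_per_frame):
    """Build a boolean attention mask that is causal across frames.

    Hypothetical helper: tokens are assumed to be flattened frame by frame.
    mask[q][k] is True when query token q may attend to key token k,
    i.e. when k belongs to the same frame as q or to an earlier frame.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        # A key is visible if its frame index does not exceed the query's.
        row = [(k // tokens_per_frame) <= q_frame for k in range(total)]
        mask.append(row)
    return mask


# Masks for a 2-frame and a 3-frame video, with 2 tokens per frame.
mask_short = temporal_causal_mask(num_frames=2, tokens_per_frame=2)
mask_long = temporal_causal_mask(num_frames=3, tokens_per_frame=2)

# The shorter mask is exactly the top-left block of the longer one.
prefix_of_long = [row[:4] for row in mask_long[:4]]
```

Under these assumptions, the mask for the shorter video equals the top-left block of the mask for the longer one, so tokens of earlier frames never depend on how many frames follow. That property is what lets a temporally causal tokenizer process videos of different lengths.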
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Prediction | Kinetics-600 (test) | FVD | 36.4 | 46 |
| Video Prediction | BAIR Robot Pushing | FVD | 97 | 38 |
| Video Prediction | Bair | FVD | 97 | 34 |
| Video Frame Prediction | Kinetics-600 | gFVD | 36.4 | 28 |
| Video Prediction | Kinetics-600 | FVD | 36.4 | 18 |
| Frame prediction | Bair | FVD | 97 | 15 |
| 3D Chest CT Image Generation | Chest CT GenerateCT evaluation framework | FID | 104.3 | 6 |
| Video Generation | Kinetics zero-shot 400 | FID | 37.74 | 6 |
| Text-to-Video Generation | Kinetics 400 (test) | FID (Image) | 37.74 | 5 |
| Long Video Generation | FlintstonesHD 16 frames (test) | Avg-FID | 40.14 | 4 |