CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
About
Large-scale pretrained transformers have created milestones in text generation (GPT-3) and text-to-image generation (DALL-E and CogView). Applying them to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder the model's understanding of complex movement semantics. In this work, we present CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models by a large margin in both machine and human evaluations.
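The core idea behind multi-frame-rate training is to condition generation on the sampling rate of the clip, so the same video sampled at different rates maps to consistent motion semantics. A minimal sketch of such conditioning, assuming a hypothetical special token format and token layout (the actual CogVideo tokenizer and vocabulary are not reproduced here):

```python
# Hypothetical sketch of frame-rate conditioning: a frame-rate token is
# prepended to the text tokens, followed by the flattened frame tokens.
def build_sequence(text_tokens, frames, frame_rate):
    """Concatenate [frame-rate token] + text tokens + frame tokens in time order."""
    rate_token = f"<rate:{frame_rate}fps>"  # hypothetical special token
    sequence = [rate_token] + list(text_tokens)
    for frame in frames:
        sequence.extend(frame)  # each frame is a list of discrete image tokens
    return sequence

# toy example: a 3-token prompt and two 2-token frames sampled at 4 fps
seq = build_sequence(["a", "cat", "runs"], [["f0a", "f0b"], ["f1a", "f1b"]], 4)
```

The frame-rate token lets the model learn that a clip labeled with a low rate spans more real time per frame, which is one plausible way to align sparse text descriptions with motion of varying speed.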
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Video Generation | VBench | Quality Score | 82.75 | 111 |
| Video Generation | UCF-101 (test) | Inception Score | 50.46 | 105 |
| Text-to-Video Generation | MSR-VTT (test) | CLIP Similarity | 0.2631 | 85 |
| Text-to-Video Generation | UCF-101 | FVD | 626 | 61 |
| Video Generation | UCF-101 | FVD | 305 | 54 |
| Text-to-Video Generation | UCF-101 (zero-shot) | FVD | 701.6 | 44 |
| Video Generation | VBench (test) | -- | -- | 35 |
| Text-to-Video Generation | MSR-VTT | CLIPSIM | 0.2631 | 28 |
| Video Frame Prediction | Kinetics-600 | gFVD | 109.2 | 28 |
| Text-to-Video Generation | UCF-101 (test) | FVD | 701.6 | 25 |
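The CLIP Similarity (CLIPSIM) rows above report the average cosine similarity between the prompt's CLIP text embedding and the CLIP image embedding of each generated frame. A minimal sketch of the computation using placeholder embeddings (a real evaluation would use CLIP's actual text and image encoders):

```python
import numpy as np

def clip_sim(text_emb, frame_embs):
    """Average cosine similarity between one text embedding and each frame embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    sims = []
    for f in frame_embs:
        f = f / np.linalg.norm(f)
        sims.append(float(t @ f))
    return sum(sims) / len(sims)

# toy example with 2-d placeholder vectors instead of real CLIP embeddings:
# one frame perfectly aligned with the text, one orthogonal to it
text = np.array([1.0, 0.0])
frames = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
score = clip_sim(text, frames)  # (1.0 + 0.0) / 2 = 0.5
```

Higher is better; a score of 0.2631 on MSR-VTT means the generated frames are, on average, moderately aligned with their prompts under CLIP's embedding space.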