CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
About
Large-scale pretrained transformers have created milestones in text generation (GPT-3) and text-to-image generation (DALL-E and CogView). Applying them to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder the model's understanding of complex movement semantics. In this work, we present CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models by a large margin in both machine and human evaluations.
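The core idea behind multi-frame-rate training is to condition generation on the sampling rate of the clip, so the same video sampled at different rates maps to consistent motion semantics. A minimal sketch of such conditioning, assuming a hypothetical special token format and token layout (the actual CogVideo tokenizer and vocabulary are not reproduced here):

```python
# Hypothetical sketch of frame-rate conditioning: a frame-rate token is
# prepended to the text tokens, followed by the flattened frame tokens.
def build_sequence(text_tokens, frames, frame_rate):
    """Concatenate [frame-rate token] + text tokens + frame tokens in time order."""
    rate_token = f"<rate:{frame_rate}fps>"  # hypothetical special token
    sequence = [rate_token] + list(text_tokens)
    for frame in frames:
        sequence.extend(frame)  # each frame is a list of discrete image tokens
    return sequence

# toy example: a 3-token prompt and two 2-token frames sampled at 4 fps
seq = build_sequence(["a", "cat", "runs"], [["f0a", "f0b"], ["f1a", "f1b"]], 4)
```

The frame-rate token lets the model learn that a clip labeled with a low rate spans more real time per frame, which is one plausible way to align sparse text descriptions with motion of varying speed.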
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Video Generation | VBench | Quality Score | 82.75 | 111 |
| Video Generation | UCF-101 (test) | Inception Score | 50.46 | 105 |
| Text-to-Video Generation | MSR-VTT (test) | CLIP Similarity | 0.2631 | 85 |
| Text-to-Video Generation | UCF-101 | FVD | 626 | 61 |
| Video Generation | UCF-101 | FVD | 305 | 54 |
| Text-to-Video Generation | UCF-101 (zero-shot) | FVD | 701.6 | 44 |
| Video Generation | VBench (test) | -- | -- | 35 |
| Text-to-Video Generation | MSR-VTT | CLIPSIM | 0.2631 | 28 |
| Video Frame Prediction | Kinetics-600 | gFVD | 109.2 | 28 |
| Text-to-Video Generation | UCF-101 (test) | FVD | 701.6 | 25 |
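The CLIP Similarity (CLIPSIM) rows above report the average cosine similarity between the prompt's CLIP text embedding and the CLIP image embedding of each generated frame. A minimal sketch of the computation using placeholder embeddings (a real evaluation would use CLIP's actual text and image encoders):

```python
import numpy as np

def clip_sim(text_emb, frame_embs):
    """Average cosine similarity between one text embedding and each frame embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    sims = []
    for f in frame_embs:
        f = f / np.linalg.norm(f)
        sims.append(float(t @ f))
    return sum(sims) / len(sims)

# toy example with 2-d placeholder vectors instead of real CLIP embeddings:
# one frame perfectly aligned with the text, one orthogonal to it
text = np.array([1.0, 0.0])
frames = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
score = clip_sim(text, frames)  # (1.0 + 0.0) / 2 = 0.5
```

Higher is better; a score of 0.2631 on MSR-VTT means the generated frames are, on average, moderately aligned with their prompts under CLIP's embedding space.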