Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

About

We introduce Vidu, a high-performance text-to-video generator capable of producing 1080p videos of up to 16 seconds in a single generation. Vidu is a diffusion model with U-ViT as its backbone, which provides the scalability and the capability to handle long videos. Vidu exhibits strong coherence and dynamism, can generate both realistic and imaginative videos, and understands some professional photography techniques, on par with Sora, the most powerful reported text-to-video generator. Finally, we perform initial experiments on other controllable video generation tasks, including canny-to-video generation, video prediction, and subject-driven generation, which demonstrate promising results.

Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, Jun Zhu • 2024

Related benchmarks

Task	Dataset	Result
Subject-to-video	OpenS2V Eval	Total Score 51.95	23
Subject-Preserving Video Generation	OpenS2V-Eval Human-Domain	Total Score 51.11	17
subject-to-video generation	OpenS2V-Eval zero-shot (test)	Total Score 51.95	16
Subject-consistent Video Generation	User Study	Subject Consistency 3.4	7
Still-to-Video (S2V) Generation	Diverse S2V (test)	Subject Consistency 0.956	6
Multi-view appearance and expressive identity consistency	Multi-view appearance and expressive identity consistency (evaluation set)	DINO-I Score 66.2	6
Face Similarity	Human (test)	Face Similarity (cur) 0.549	5
Identity-Preserving Video Generation	Actor-Bench Contextual Generalization 1.0 (Setting 2)	Face Identity Score 56.5	5
Multi-Concept Video Customization	Multi-concept video customization (test)	Average Score 3.4	5
Multi-Concept Video Customization	Multi-Concept Video Customization (evaluation set)	CLIP-I 0.696	5

Showing 10 of 11 rows
