Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models

About

We introduce Vidu, a high-performance text-to-video generator that is capable of producing 1080p videos up to 16 seconds in a single generation. Vidu is a diffusion model with U-ViT as its backbone, which unlocks the scalability and the capability for handling long videos. Vidu exhibits strong coherence and dynamism, and is capable of generating both realistic and imaginative videos, as well as understanding some professional photography techniques, on par with Sora -- the most powerful reported text-to-video generator. Finally, we perform initial experiments on other controllable video generation, including canny-to-video generation, video prediction and subject-driven generation, which demonstrate promising results.

Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, Jun Zhu • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| subject-to-video generation | OpenS2V-Eval zero-shot (test) | Total Score: 51.95 | 16 |
| Subject-to-video | OpenS2V Eval | Total Score: 51.95 | 11 |
| Subject-consistent Video Generation | User Study | Subject Consistency: 3.4 | 7 |
| Still-to-Video (S2V) Generation | Diverse S2V (test) | Subject Consistency: 0.956 | 6 |
| Multi-view appearance and expressive identity consistency | Multi-view appearance and expressive identity consistency (evaluation set) | DINO-I Score: 66.2 | 6 |
| Face Similarity | Human (test) | Face Similarity (cur): 0.549 | 5 |
| Identity-Preserving Video Generation | Actor-Bench Contextual Generalization 1.0 (Setting 2) | Face Identity Score: 56.5 | 5 |
| Multi-Concept Video Customization | Multi-concept video customization (test) | Average Score: 3.4 | 5 |
| Multi-Concept Video Customization | Multi-Concept Video Customization (evaluation set) | CLIP-I: 0.696 | 5 |
| Educational Video Generation | Educational Video Generation | -- | 5 |
