Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models
About
We introduce Vidu, a high-performance text-to-video generator that is capable of producing 1080p videos up to 16 seconds in a single generation. Vidu is a diffusion model with U-ViT as its backbone, which unlocks the scalability and the capability for handling long videos. Vidu exhibits strong coherence and dynamism, and is capable of generating both realistic and imaginative videos, as well as understanding some professional photography techniques, on par with Sora -- the most powerful reported text-to-video generator. Finally, we perform initial experiments on other controllable video generation, including canny-to-video generation, video prediction and subject-driven generation, which demonstrate promising results.
Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, Jun Zhu • 2024
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Subject-to-video generation | OpenS2V-Eval zero-shot (test) | Total Score: 51.95 | 16 |
| Subject-to-video | OpenS2V Eval | Total Score: 51.95 | 11 |
| Subject-consistent Video Generation | User Study | Subject Consistency: 3.4 | 7 |
| Still-to-Video (S2V) Generation | Diverse S2V (test) | Subject Consistency: 0.956 | 6 |
| Multi-view appearance and expressive identity consistency | Multi-view appearance and expressive identity consistency (evaluation set) | DINO-I Score: 66.2 | 6 |
| Face Similarity | Human (test) | Face Similarity (cur): 0.549 | 5 |
| Identity-Preserving Video Generation | Actor-Bench Contextual Generalization 1.0 (Setting 2) | Face Identity Score: 56.5 | 5 |
| Multi-Concept Video Customization | Multi-concept video customization (test) | Average Score: 3.4 | 5 |
| Multi-Concept Video Customization | Multi-Concept Video Customization (evaluation set) | CLIP-I: 0.696 | 5 |
| Educational Video Generation | Educational Video Generation | -- | 5 |