ModelScope Text-to-Video Technical Report
About
This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model can adapt to varying frame numbers during training and inference, which makes it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (a VQGAN, a text encoder, and a denoising UNet) with 1.7 billion parameters in total, of which 0.5 billion are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.
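The claim that one model can handle both images and videos is easier to see in a small sketch. The NumPy code below is an illustration under stated assumptions, not the model's actual block: the abstract does not specify the layers, so `spatial_op` and `temporal_op` are hypothetical stand-ins, and the `(batch, frames, channels, height, width)` layout is assumed. It shows only the reshaping pattern that lets spatial layers run per frame and temporal layers run per pixel location, for any frame count.

```python
import numpy as np


def spatial_op(x):
    """Hypothetical stand-in for a per-frame spatial layer.

    x has shape (N, C, H, W); each frame is processed independently.
    Here we simply subtract the per-frame spatial mean.
    """
    return x - x.mean(axis=(2, 3), keepdims=True)


def temporal_op(x):
    """Hypothetical stand-in for a temporal layer.

    x has shape (N, C, F); information is mixed along the frame axis F.
    Here each frame is blended with the mean over all frames.
    """
    return 0.5 * x + 0.5 * x.mean(axis=2, keepdims=True)


def spatio_temporal_block(video):
    """Apply a spatial step, then a temporal step, to a (B, F, C, H, W) array."""
    b, f, c, h, w = video.shape

    # Spatial step: fold frames into the batch axis so every frame is
    # treated like an independent image.
    x = video.reshape(b * f, c, h, w)
    x = spatial_op(x)
    x = x.reshape(b, f, c, h, w)

    # Temporal step: fold spatial positions into the batch axis so each
    # pixel location becomes a sequence of length F.
    x = x.transpose(0, 3, 4, 2, 1).reshape(b * h * w, c, f)
    x = temporal_op(x)

    # Restore the original (B, F, C, H, W) layout.
    return x.reshape(b, h, w, c, f).transpose(0, 4, 3, 1, 2)


rng = np.random.default_rng(0)
image_batch = rng.normal(size=(2, 1, 4, 8, 8))   # images as 1-frame videos
video_batch = rng.normal(size=(2, 16, 4, 8, 8))  # 16-frame videos

print(spatio_temporal_block(image_batch).shape)
print(spatio_temporal_block(video_batch).shape)
```

In this sketch an image is a video with a single frame. With `F = 1` the stand-in temporal step leaves its input unchanged, so the block reduces to the spatial step alone, which is one way the same weights could be trained on image-text and video-text data.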
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Video Generation | VBench | Quality Score | 78.05 | 155 |
| Text-to-Video Generation | MSR-VTT (test) | CLIP Similarity | 0.293 | 85 |
| Text-to-Video Generation | T2V-CompBench | Consistency Attribute Score | 0.5148 | 63 |
| Text-to-Video Generation | UCF-101 zero-shot | FVD | 410 | 59 |
| Video Generation | VBench (test) | -- | -- | 48 |
| Video Generation | EvalCrafter | Visual Quality Score | 14.92 | 28 |
| Text-to-Video Generation | MSR-VTT | -- | -- | 28 |
| Text-to-Video Generation | MSR-VTT zero-shot | FVD | 536 | 26 |
| Cardiac disease classification | UKB-CAD | Accuracy | 71.4 | 17 |
| Cardiac disease classification | UKB-CM | Accuracy | 80.5 | 17 |