ModelScope Text-to-Video Technical Report
About
This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model can adapt to varying frame numbers during training and inference, which makes it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), totaling 1.7 billion parameters, of which 0.5 billion are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.
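The claim that one model can handle both images and videos is easier to see with shapes. The following is a minimal sketch, not the authors' implementation: it assumes a common factorization in which spatial layers fold the frame axis into the batch axis and temporal layers fold the spatial positions into the batch axis. The function names and the example tensor sizes are hypothetical and chosen only for illustration.

```python
from typing import Tuple

# Assumed latent layout: (batch, frames, channels, height, width).
Shape5D = Tuple[int, int, int, int, int]


def spatial_view(shape: Shape5D) -> Tuple[int, int, int, int]:
    """Shape seen by a spatial layer: each frame is treated as an independent image."""
    b, f, c, h, w = shape
    return (b * f, c, h, w)


def temporal_view(shape: Shape5D) -> Tuple[int, int, int]:
    """Shape seen by a temporal layer: each pixel location is a sequence over frames."""
    b, f, c, h, w = shape
    return (b * h * w, c, f)


# Hypothetical sizes: a 16-frame video batch and an image batch (1 frame).
video = (2, 16, 320, 32, 32)
image = (2, 1, 320, 32, 32)

print(spatial_view(video))   # (32, 320, 32, 32)
print(temporal_view(video))  # (2048, 320, 16)
print(spatial_view(image))   # (2, 320, 32, 32)
print(temporal_view(image))  # (2048, 320, 1)
```

Under this assumption, an image is simply a video with one frame: the spatial path is unchanged, and the temporal path operates on sequences of length 1, so the same weights can be trained on image-text and video-text data with different frame counts.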
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Video Generation | VBench | Quality Score: 78.05 | 111 |
| Text-to-Video Generation | MSR-VTT (test) | CLIP Similarity: 0.293 | 85 |
| Text-to-Video Generation | UCF-101 zero-shot | FVD: 410 | 44 |
| Video Generation | VBench (test) | -- | 35 |
| Text-to-Video Generation | MSR-VTT | -- | 28 |
| Text-to-Video Generation | MSR-VTT zero-shot | CLIPSIM: 29.3 | 20 |
| Cardiac disease classification | UKB-CAD | Accuracy: 71.4 | 17 |
| Cardiac disease classification | UKB-CM | Accuracy: 80.5 | 17 |
| Cardiac disease classification | UKB-HF | Accuracy: 80.7 | 17 |
| Cardiac Phenotype Prediction | UKB dataset (test) | LVEDV MAE: 11.18 | 17 |