ModelScope Text-to-Video Technical Report

About

This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model can adapt to varying frame numbers during training and inference, making it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), comprising 1.7 billion parameters in total, of which 0.5 billion are dedicated to temporal capabilities. The model demonstrates superior performance over state-of-the-art methods across three evaluation metrics. The code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.
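
One common way a spatio-temporal block can handle a varying number of frames is to factorize it into per-frame (spatial) layers and per-pixel (temporal) layers. The shape-only sketch below illustrates that idea; the factorization, the function names, and the example tensor sizes are assumptions made for illustration and are not taken from the paper's implementation.

```python
from typing import Tuple

# (batch, frames, channels, height, width); layout is an assumption for this sketch
Shape5D = Tuple[int, int, int, int, int]


def spatial_view(shape: Shape5D) -> Tuple[int, int, int, int]:
    """Shape seen by a per-frame (2D) layer: frames are folded into the batch."""
    b, f, c, h, w = shape
    return (b * f, c, h, w)


def temporal_view(shape: Shape5D) -> Tuple[int, int, int]:
    """Shape seen by a per-pixel layer over time: pixels are folded into the batch."""
    b, f, c, h, w = shape
    return (b * h * w, c, f)


# Hypothetical latent sizes, chosen only to make the shapes concrete.
video_batch: Shape5D = (2, 16, 4, 32, 32)  # 16-frame clips
image_batch: Shape5D = (2, 1, 4, 32, 32)   # images treated as 1-frame videos

print(spatial_view(video_batch), temporal_view(video_batch))
print(spatial_view(image_batch), temporal_view(image_batch))
```

Under this assumption, the spatial layers never see the frame count, and the temporal layers see it only as a sequence length, so an image is simply a clip with one frame.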

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Text-to-Video Generation | VBench | Quality Score 78.05 | 111 |
| Text-to-Video Generation | MSR-VTT (test) | CLIP Similarity 0.293 | 85 |
| Text-to-Video Generation | UCF-101 zero-shot | FVD 410 | 44 |
| Video Generation | VBench (test) | -- | 35 |
| Text-to-Video Generation | MSR-VTT | -- | 28 |
| Text-to-Video Generation | MSR-VTT zero-shot | CLIPSIM 29.3 | 20 |
| Cardiac disease classification | UKB-CAD | Accuracy 71.4 | 17 |
| Cardiac disease classification | UKB-CM | Accuracy 80.5 | 17 |
| Cardiac disease classification | UKB-HF | Accuracy 80.7 | 17 |
| Cardiac Phenotype Prediction | UKB dataset (test) | LVEDV MAE 11.18 | 17 |
Showing 10 of 33 rows
