Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

About

Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. After that, we propose a novel expert translation method that employs the latent-based VDMs to further upsample the low-resolution video to high resolution, which can also remove potential artifacts and corruptions from low-resolution videos. Compared to latent VDMs, Show-1 can produce high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15 GB vs. 72 GB). Furthermore, our Show-1 model can be readily adapted for motion customization and video stylization applications through simple temporal attention layer finetuning. Our model achieves state-of-the-art performance on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1.

David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, Mike Zheng Shou • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Text-to-Video Generation | VBench | Quality Score: 80.42 | 111 |
| Text-to-Video Generation | MSR-VTT (test) | CLIP Similarity: 0.3072 | 85 |
| Text-to-Video Generation | UCF-101 | FVD: 369.3 | 61 |
| Text-to-Video Generation | UCF-101 zero-shot | FVD: 394.5 | 44 |
| Video Generation | VBench (test) | -- | 35 |
| Text-to-Video Generation | UCF-101 (test) | FVD: 394.5 | 25 |
| Text-to-Video Generation | MSR-VTT zero-shot | CLIPSIM: 30.72 | 20 |
| Text-to-Video Generation | VBench 2024 (test) | Total Score: 78.93 | 15 |
| Video Generation | VBench Custom | Subject Consistency: 95.53 | 11 |
| Text-to-Video Generation | MSR-VTT 2016 (test) | CLIPSIM: 0.3104 | 7 |
Showing 10 of 11 rows
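
The rows above report CLIP similarity on two scales (0.3072 and 30.72), which appear to be the same kind of score written as a fraction and as a percentage. This page does not define the metric, so the sketch below assumes the common convention: CLIPSIM is the cosine similarity between a prompt embedding and each frame embedding, averaged over frames. The function names and the toy vectors are hypothetical; in practice the embeddings would come from a CLIP text and image encoder, which is not shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_similarity(text_emb, frame_embs):
    """Average text-to-frame cosine similarity (assumed CLIPSIM definition)."""
    return sum(cosine(text_emb, f) for f in frame_embs) / len(frame_embs)


def to_unit_scale(score):
    """Heuristic (an assumption): scores above 1 are treated as percentages."""
    return score / 100.0 if score > 1.0 else score


# Toy 2-D "embeddings": one frame matches the prompt, one is orthogonal to it.
toy_score = clip_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Under this reading, `to_unit_scale(30.72)` and `to_unit_scale(0.3072)` land on the same scale, so the MSR-VTT rows can be compared side by side. Whether the listed leaderboards actually share a protocol (CLIP variant, frame sampling) is not stated on this page.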
