
Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

About

Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, so our framework can take advantage of them, for example by customizing a specific subject with LoRA or by introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally coherent videos.
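The two-part structure described above (translate a sparse set of key frames, then propagate them to the remaining frames) can be illustrated with a small scheduling sketch. This is not the paper's implementation: the fixed sampling interval, the function names `select_key_frames` and `blend_plan`, and the linear blend weights are all assumptions made for illustration, and the sketch does not perform the temporal-aware patch matching the abstract mentions. It only shows how each frame could be assigned its two neighboring key frames and a mixing weight.

```python
from typing import List, Tuple


def select_key_frames(num_frames: int, interval: int) -> List[int]:
    """Pick every `interval`-th frame as a key frame and always include the last frame.

    Uniform sampling is an assumption for this sketch, not a detail from the source.
    """
    if num_frames <= 0 or interval <= 0:
        raise ValueError("num_frames and interval must be positive")
    keys = list(range(0, num_frames, interval))
    if keys[-1] != num_frames - 1:
        keys.append(num_frames - 1)
    return keys


def blend_plan(num_frames: int, interval: int) -> List[Tuple[int, int, float]]:
    """For each frame index, return (previous key, next key, weight of the next key).

    The weight is a hypothetical linear ramp between the two neighboring key frames.
    """
    keys = select_key_frames(num_frames, interval)
    plan = []
    k = 0  # index of the key frame that starts the current segment
    for t in range(num_frames):
        # Move to the segment that contains frame t.
        while k + 1 < len(keys) - 1 and t >= keys[k + 1]:
            k += 1
        prev_key = keys[k]
        next_key = keys[min(k + 1, len(keys) - 1)]
        span = next_key - prev_key
        weight = 0.0 if span == 0 else (t - prev_key) / span
        plan.append((prev_key, next_key, weight))
    return plan


if __name__ == "__main__":
    for t, (prev_key, next_key, weight) in enumerate(blend_plan(10, 4)):
        print(t, prev_key, next_key, round(weight, 2))
```

In such a plan, only the key frames would go through the diffusion model; every other frame would be synthesized from its two neighbors, which is what keeps the cost low relative to translating every frame independently.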

Shuai Yang, Yifan Zhou, Ziwei Liu, Chen Change Loy • 2023

Related benchmarks

Task | Dataset | Metric | Result | Rank
Video Subject Swapping | Custom Video Subject Swapping dataset human-evaluated (test) | Subject Identity | 19 | 14
Video Enhancement | VC2 | MS | 98.2 | 7
Video Enhancement | AD2 | MS Score | 0.975 | 7
Zero-shot Text-guided Video Editing | Curated dataset 90-frames | CLIP-F | 90.63 | 7
Video Editing | HOSNeRF and NeuMan (test) | CLIPScore | 26.11 | 6
Video Stylization | TVSBench | CLIP-T | 20.62 | 6
Zero-shot Text-guided Video Editing | Curated dataset 8-frames | CLIP-F | 92.87 | 6
Video Subject Swapping | Shutterstock and DAVIS predefined concepts (test) | Text Alignment | 24.99 | 5
Zero-shot Text-guided Video Editing | Curated dataset 36-frames | CLIP-F | 8.97e+3 | 5
Video-to-Video Translation | 23 videos (test) | Frame Accuracy | 95.5 | 4
Showing 10 of 12 rows
