Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation
About
Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves temporal consistency in both global style and local texture at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, for example customizing a specific subject with LoRA or introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally coherent videos.
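The two-part structure described above (translate a sparse set of key frames, then propagate them to the remaining frames) can be illustrated with a toy sketch. This is not the paper's implementation: `translate_key_frame` is a hypothetical placeholder for the diffusion-based key frame translation, the fixed sampling `interval` is an assumption, and plain linear blending between the two nearest key frames stands in for temporal-aware patch matching and frame blending.

```python
import numpy as np


def select_key_frames(num_frames, interval):
    """Return key frame indices: every `interval`-th frame plus the last frame."""
    if num_frames < 1 or interval < 1:
        raise ValueError("num_frames and interval must be positive")
    keys = list(range(0, num_frames, interval))
    if keys[-1] != num_frames - 1:
        keys.append(num_frames - 1)
    return keys


def translate_key_frame(frame):
    """Hypothetical placeholder for diffusion-based translation of one key frame.

    Inverting intensities is only a dummy 'style' so the sketch runs end to end.
    """
    return 1.0 - frame


def propagate(num_frames, keys, translated):
    """Fill non-key frames by linearly blending the two nearest translated key frames.

    Assumption: this is a crude stand-in for patch matching plus frame blending.
    """
    out = [None] * num_frames
    for k in keys:
        out[k] = translated[k]
    for a, b in zip(keys[:-1], keys[1:]):
        for t in range(a + 1, b):
            w = (t - a) / (b - a)  # 0 near key frame a, 1 near key frame b
            out[t] = (1.0 - w) * translated[a] + w * translated[b]
    return out


def rerender(frames, interval=4):
    """Run the toy two-stage pipeline on a list of float image arrays."""
    keys = select_key_frames(len(frames), interval)
    translated = {k: translate_key_frame(frames[k]) for k in keys}
    return keys, propagate(len(frames), keys, translated)


# Usage: ten constant 2x2 "frames" with intensities 0.0, 0.1, ..., 0.9.
frames = [np.full((2, 2), i / 10) for i in range(10)]
keys, out = rerender(frames, interval=4)
```

Only the key frames pass through the (expensive) translation step; every other frame is derived from its neighbouring key frames, which is what keeps the cost low in a scheme of this shape.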
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Subject Swapping | Custom Video Subject Swapping dataset human-evaluated (test) | Subject Identity | 19 | 14 |
| Video Enhancement | VC2 | MS | 98.2 | 7 |
| Video Enhancement | AD2 | MS Score | 0.975 | 7 |
| Zero-shot Text-guided Video Editing | Curated dataset 90-frames | CLIP-F | 90.63 | 7 |
| Video Editing | HOSNeRF and NeuMan (test) | CLIPScore | 26.11 | 6 |
| Video Stylization | TVSBench | CLIP-T | 20.62 | 6 |
| Zero-shot Text-guided Video Editing | Curated dataset 8-frames | CLIP-F | 92.87 | 6 |
| Video Subject Swapping | Shutterstock and DAVIS predefined concepts (test) | Text Alignment | 24.99 | 5 |
| Zero-shot Text-guided Video Editing | Curated dataset 36-frames | CLIP-F | 8.97e+3 | 5 |
| Video-to-Video Translation | 23 videos (test) | Frame Accuracy | 95.5 | 4 |