FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation

About

The remarkable efficacy of text-to-image diffusion models has motivated extensive exploration of their potential application in video domains. Zero-shot methods seek to extend image diffusion models to videos without requiring model training. Recent methods mainly focus on incorporating inter-frame correspondence into attention mechanisms. However, the soft constraint imposed on determining where to attend to valid features can sometimes be insufficient, resulting in temporal inconsistency. In this paper, we introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint. This enhancement ensures a more consistent transformation of semantically similar content across frames. Beyond mere attention guidance, our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video, significantly improving the visual coherence of the resulting translated videos. Extensive experiments demonstrate the effectiveness of our proposed framework in producing high-quality, coherent videos, marking a notable improvement over existing zero-shot methods.

Shuai Yang, Yifan Zhou, Ziwei Liu, Chen Change Loy • 2024

Related benchmarks

Task	Dataset	Result
Sim-to-Real Video Translation	nuPlan and CARLA	CLIP-R 109.9	11
Video Enhancement	VC2	MS 97.45	7
Video Enhancement	AD2	MS Score 0.9667	7
Video Stylization	TVSBench	CLIP-T 23.87	6
Multi-weather editing	Waymo Open Dataset	CLIP-S 0.72	5
Multi-weather editing	nuScenes	CLIP-S 0.71	5
Zero-shot Video Translation	23 videos (test)	Frame Accuracy 97.8	4
Multi-weather editing	Waymo Open Dataset and nuScenes Dataset	Inference Speed (FPS) 0.142	4
Text-to-Video Stylization	Pexels 50 videos (TV2V)	CLIP-T 0.197	4
