Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

About

To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, in which only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure guidance for sampling. Extensive qualitative and numerical experiments demonstrate the remarkable ability of our method across various applications.
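The DDIM inversion step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses the standard deterministic DDIM update (eta = 0) on a plain list of floats, and the names `ddim_step`, `ddim_invert`, `ddim_sample`, and the `predict_noise` callback are hypothetical stand-ins for a real diffusion model and scheduler. The idea is that running the update from low noise to high noise maps a source latent to a noisy latent that still encodes its structure, which can then seed sampling.

```python
import math

def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update (eta = 0) between two cumulative-alpha levels."""
    # Clean latent implied by the current latent and the noise estimate.
    x0 = [(xi - math.sqrt(1.0 - alpha_from) * ei) / math.sqrt(alpha_from)
          for xi, ei in zip(x, eps)]
    # Re-noise that clean latent to the target level with the same noise estimate.
    return [math.sqrt(alpha_to) * x0i + math.sqrt(1.0 - alpha_to) * ei
            for x0i, ei in zip(x0, eps)]

def ddim_invert(latent, alphas, predict_noise):
    """Walk from clean to noisy. `alphas` is ordered from least to most noisy."""
    x = list(latent)
    for a_from, a_to in zip(alphas[:-1], alphas[1:]):
        x = ddim_step(x, predict_noise(x, a_from), a_from, a_to)
    return x

def ddim_sample(noisy, alphas, predict_noise):
    """Walk back from noisy to clean over the same schedule, reversed."""
    x = list(noisy)
    rev = alphas[::-1]
    for a_from, a_to in zip(rev[:-1], rev[1:]):
        x = ddim_step(x, predict_noise(x, a_from), a_from, a_to)
    return x

# Toy usage. The constant noise predictor is an assumption made only so the
# round trip is exact; a real model's prediction depends on the latent and
# timestep, so inversion is then only approximate.
toy_predictor = lambda x, alpha: [0.1] * len(x)
alphas = [1.0, 0.9, 0.6, 0.3]          # hypothetical cumulative alpha schedule
latent = [0.5, -1.0, 2.0]
noisy = ddim_invert(latent, alphas, toy_predictor)
restored = ddim_sample(noisy, alphas, toy_predictor)
```

In practice the noise predictor would be the text-conditioned diffusion network, and sampling from the inverted latent with an edited prompt is what lets the output keep the layout and motion of the source video.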

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou • 2022

Related benchmarks

Task	Dataset	Metric	Result
Video Editing	TGVE benchmark	ViCLIPdir	16.2	20
Video Editing	V2VBench	Frames Quality	5.001	17
Motion Customization	TGVE 76 videos (full)	Temporal Consistency	92.42	16
Video Subject Swapping	Custom Video Subject Swapping dataset human-evaluated (test)	Subject Identity	32	14
Text Alignment	User Study	Average Ranking	2.28	12
Class-conditioned Image-to-Video Generation	Something-Something v2	FVD	291.4	9
Text-conditioned Video Prediction	Something-Something v2	FVD	291.4	8
Text-conditioned Video Prediction	Epic Kitchens 100	FVD	365	8
Video Editing	20 in-the-wild cases	CLIP score	27.71	8
Text-conditioned Video Prediction	BridgeData	FVD	515.7	8

Showing 10 of 32 rows
