Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
About
To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure guidance for sampling. Extensive qualitative and numerical experiments demonstrate the remarkable ability of our method across various applications.
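The abstract names DDIM inversion as the source of structure guidance at inference. The following is a minimal sketch of that general idea, assuming deterministic DDIM (eta = 0) and a cumulative noise schedule `alphas` that runs from 1.0 (clean) towards 0 (noisy). It is not the authors' implementation: the function names `ddim_step`, `ddim_invert`, `ddim_sample`, and the constant noise predictor `toy_eps` are hypothetical stand-ins for a real text-conditioned denoising network operating on video latents.

```python
import math

def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update (eta = 0) from noise level alpha_from to alpha_to."""
    out = []
    for xi, ei in zip(x, eps):
        # Estimate the clean sample implied by the current point and predicted noise.
        x0_pred = (xi - math.sqrt(1.0 - alpha_from) * ei) / math.sqrt(alpha_from)
        # Re-noise that estimate to the target noise level with the same noise.
        out.append(math.sqrt(alpha_to) * x0_pred + math.sqrt(1.0 - alpha_to) * ei)
    return out

def ddim_invert(x0, eps_model, alphas):
    """Inversion: walk a clean latent towards noise, alphas[0] -> alphas[-1]."""
    x = list(x0)
    for t in range(len(alphas) - 1):
        x = ddim_step(x, eps_model(x, t), alphas[t], alphas[t + 1])
    return x

def ddim_sample(x_noisy, eps_model, alphas):
    """Sampling: walk a noisy latent back to the clean end, alphas[-1] -> alphas[0]."""
    x = list(x_noisy)
    for t in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, eps_model(x, t), alphas[t], alphas[t - 1])
    return x

def toy_eps(x, t):
    """Hypothetical noise predictor: a constant, standing in for a trained network."""
    return [0.5 for _ in x]

if __name__ == "__main__":
    alphas = [1.0, 0.9, 0.6, 0.3, 0.1]   # assumed cumulative schedule, clean -> noisy
    latent = [1.0, -2.0, 0.25]           # assumed toy "source video" latent
    inverted = ddim_invert(latent, toy_eps, alphas)
    restored = ddim_sample(inverted, toy_eps, alphas)
    print(inverted, restored)
```

With the constant toy predictor the round trip reproduces the starting latent up to floating-point error. With a real network the inversion is only approximate, since the noise prediction changes between adjacent steps. In an editing setting, sampling would start from the inverted latent but condition on the edited prompt, so the output inherits the layout and motion of the source video.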
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Editing | TGVE benchmark | ViCLIPdir | 16.2 | 20 |
| Video Editing | V2VBench | Frames Quality | 5.001 | 17 |
| Video Subject Swapping | Custom Video Subject Swapping dataset human-evaluated (test) | Subject Identity | 32 | 14 |
| Text Alignment | User Study | Average Ranking | 2.28 | 12 |
| Motion Customization | TGVE 76 videos (full) | Text Alignment | 25.64 | 12 |
| Class-conditioned Image-to-Video Generation | Something-Something v2 | FVD | 291.4 | 9 |
| Text-conditioned Video Prediction | Something-Something v2 | FVD | 291.4 | 8 |
| Text-conditioned Video Prediction | Epic Kitchens 100 | FVD | 365 | 8 |
| Video Editing | 20 in-the-wild cases | CLIP score | 27.71 | 8 |
| Text-conditioned Video Prediction | BridgeData | FVD | 515.7 | 8 |