Unified Text-Image-to-Video Generation: A Training-Free Approach to Flexible Visual Conditioning
About
Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods add visual conditions to text-to-video (T2V) foundation models by finetuning, which is costly in resources and limited to a few pre-defined conditioning settings. To tackle these constraints, we introduce a unified formulation for TI2V generation with flexible visual conditioning. Furthermore, we propose an innovative training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary number of images at arbitrary positions. Specifically, we first invert the condition images to noisy representations in a latent space. Then, in the denoising process of T2V models, our method uses a novel random patch swapping strategy to incorporate visual features into video representations through local image patches. To balance creativity and fidelity, we use a dynamic control mechanism to adjust the strength of visual conditioning for each video frame. Extensive experiments validate that our method surpasses previous training-free image conditioning methods by a notable margin. Our method also generalizes to both UNet-based and transformer-based architectures.
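The sketch below illustrates the overall idea described above (inversion of condition images, random patch swapping during denoising, and a per-frame conditioning strength) under simplifying assumptions. All function names (`invert_image`, `swap_random_patches`, `denoise_with_image_conditions`, the `denoise_step` callback) and the specific swap-ratio schedule are hypothetical placeholders, not the paper's actual implementation or APIs.

```python
# Minimal sketch of training-free image conditioning via random patch swapping.
# Assumes a frozen T2V model exposed as a single reverse-diffusion step
# (`denoise_step`); shapes and schedules are illustrative only.
import torch


def invert_image(image_latent: torch.Tensor, num_steps: int) -> list:
    """Placeholder for inversion: return one noisy latent per noise level.
    Here we simply add Gaussian noise of increasing scale as a stand-in."""
    trajectory = []
    for t in range(num_steps):
        noise_scale = (t + 1) / num_steps
        trajectory.append(image_latent + noise_scale * torch.randn_like(image_latent))
    return trajectory


def swap_random_patches(frame_latent: torch.Tensor,
                        cond_latent: torch.Tensor,
                        swap_ratio: float,
                        patch: int = 4) -> torch.Tensor:
    """Copy a random subset of non-overlapping spatial patches from the
    inverted condition latent into the video-frame latent."""
    _, h, w = frame_latent.shape
    out = frame_latent.clone()
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            if torch.rand(()) < swap_ratio:
                out[:, i:i + patch, j:j + patch] = cond_latent[:, i:i + patch, j:j + patch]
    return out


def denoise_with_image_conditions(video_latents: torch.Tensor,
                                  cond_latents: dict,
                                  num_steps: int,
                                  denoise_step) -> torch.Tensor:
    """video_latents: (F, C, H, W) initial noise for F frames.
    cond_latents: {frame_index: clean condition-image latent} at arbitrary positions.
    denoise_step: one reverse-diffusion step of the frozen T2V model (placeholder)."""
    # Pre-compute an inversion trajectory for every condition image.
    trajectories = {k: invert_image(v, num_steps) for k, v in cond_latents.items()}
    positions = list(cond_latents.keys())
    num_frames = video_latents.shape[0]
    x = video_latents
    for step in range(num_steps):
        t = num_steps - 1 - step  # walk the noise level from high to low
        for f in range(num_frames):
            # Dynamic control (assumed schedule): frames closer to a condition
            # position are swapped more aggressively, and the ratio decays as
            # denoising progresses so late steps preserve the model's own output.
            dist = min(abs(f - p) for p in positions)
            swap_ratio = max(0.0, 0.5 * (1 - dist / num_frames) * (t / num_steps))
            for p in positions:
                x[f] = swap_random_patches(x[f], trajectories[p][t], swap_ratio)
        x = denoise_step(x, t)
    return x
```

A usage example would pass, e.g., latents for the first and last frames as `cond_latents={0: z_first, 15: z_last}` to obtain interpolation, or a single `{0: z_first}` entry for animation, without any finetuning of the underlying T2V model.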
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Animation | Synthetic Dataset (test) | CLIP-T | 26.83 | 3 |
| Image Animation | UCF-101 | User Preference Score | 45.4 | 3 |
| Interpolation | Synthetic Dataset (test) | CLIP-T | 27.29 | 3 |
| Video Frame Interpolation | UCF-101 | User Score | 51.6 | 3 |
| Image Animation | WAN 2.1 | FVD | 101.7 | 2 |
| Outpainting | UCF-101 | User Preference Score | 61.4 | 2 |
| Rewinding | UCF-101 | User Preference | 60.2 | 2 |
| Video Frame Interpolation | WAN 2.1 | FVD | 90.82 | 2 |