
Unified Text-Image-to-Video Generation: A Training-Free Approach to Flexible Visual Conditioning

About

Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods add visual conditions to text-to-video (T2V) foundation models by finetuning, which is costly in resources and limited to a few pre-defined conditioning settings. To tackle these constraints, we introduce a unified formulation for TI2V generation with flexible visual conditioning. Furthermore, we propose an innovative training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary number of images at arbitrary positions. Specifically, we first invert the condition images into noisy representations in a latent space. Then, in the denoising process of T2V models, our method uses a novel random patch swapping strategy to incorporate visual features into video representations through local image patches. To balance creativity and fidelity, we use a dynamic control mechanism to adjust the strength of visual conditioning for each video frame. Extensive experiments validate that our method surpasses previous training-free image conditioning methods by a notable margin. Our method also generalizes to both UNet-based and transformer-based architectures.
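The abstract does not give implementation details, but the random patch swapping idea can be illustrated with a minimal sketch: replace a random fraction of spatial patches in a video frame's latent with the corresponding patches from the inverted condition-image latent, where the swap fraction stands in for per-frame conditioning strength. All names here (`random_patch_swap`, `swap_ratio`, the latent shape) are our assumptions for illustration, not details from the paper.

```python
import numpy as np

def random_patch_swap(frame_latent, image_latent, swap_ratio, patch=2, rng=None):
    """Copy a random fraction of spatial patches from a noisy condition-image
    latent into a video-frame latent.

    frame_latent, image_latent: arrays of shape (C, H, W), H and W divisible by `patch`.
    swap_ratio: fraction of patches taken from the image latent, in [0, 1];
                a dynamic control mechanism would vary this per frame.
    """
    rng = rng or np.random.default_rng()
    c, h, w = frame_latent.shape
    gh, gw = h // patch, w // patch          # patch grid dimensions
    n = gh * gw
    k = int(round(swap_ratio * n))           # number of patches to swap
    idx = rng.choice(n, size=k, replace=False)
    out = frame_latent.copy()
    for i in idx:
        r, col = divmod(i, gw)
        rs, cs = r * patch, col * patch
        out[:, rs:rs + patch, cs:cs + patch] = image_latent[:, rs:rs + patch, cs:cs + patch]
    return out
```

In a full pipeline this step would be applied inside the denoising loop, with `swap_ratio` highest for frames at or near the condition positions and decaying for distant frames, which matches the paper's stated goal of balancing fidelity to the condition images against creative motion.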

Bolin Lai, Sangmin Lee, Xu Cao, Xiang Li, James M. Rehg • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Animation | Synthetic Dataset (test) | CLIP-T | 26.83 | 3
Image Animation | UCF-101 | User Preference Score | 45.4 | 3
Interpolation | Synthetic Dataset (test) | CLIP-T | 27.29 | 3
Video Frame Interpolation | UCF-101 | User Score | 51.6 | 3
Image Animation | WAN 2.1 | FVD | 101.7 | 2
Outpainting | UCF-101 | User Preference Score | 61.4 | 2
Rewinding | UCF-101 | User Preference | 60.2 | 2
Video Frame Interpolation | WAN 2.1 | FVD | 90.82 | 2
