ControlVideo: Training-free Controllable Text-to-Video Generation

About

Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterparts still lag behind due to the excessive training cost of temporal modeling. Besides the training burden, the generated videos also suffer from appearance inconsistency and structural flicker, especially in long video synthesis. To address these challenges, we design a training-free framework called ControlVideo to enable natural and efficient text-to-video generation. ControlVideo, adapted from ControlNet, leverages the coarse structural consistency of input motion sequences and introduces three modules to improve video generation. Firstly, to ensure appearance coherence between frames, ControlVideo adds fully cross-frame interaction in self-attention modules. Secondly, to mitigate the flicker effect, it introduces an interleaved-frame smoother that applies frame interpolation to alternating frames. Finally, to produce long videos efficiently, it utilizes a hierarchical sampler that synthesizes each short clip separately while preserving holistic coherence. Empowered with these modules, ControlVideo outperforms state-of-the-art methods on extensive motion-prompt pairs, both quantitatively and qualitatively. Notably, thanks to the efficient designs, it generates both short and long videos within several minutes using one NVIDIA 2080Ti. Code is available at https://github.com/YBYBZhang/ControlVideo.
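As a rough illustration of the first module, the sketch below contrasts ordinary per-frame self-attention with a fully cross-frame variant in which every frame's queries attend to the keys and values of all frames. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the query/key/value projections and multi-head splitting of a real diffusion U-Net are omitted, and the inputs are assumed to be already-projected arrays of shape (frames, tokens, dim).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def per_frame_attention(q, k, v):
    # Baseline: each frame attends only to its own tokens.
    # q, k, v have shape (frames, tokens, dim).
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (F, N, N)
    return softmax(scores) @ v                       # (F, N, d)

def full_cross_frame_attention(q, k, v):
    # Hypothetical sketch: keys and values from all frames are pooled,
    # so every query token can attend to every token in the video.
    f, n, d = q.shape
    k_all = k.reshape(f * n, d)                      # (F*N, d)
    v_all = v.reshape(f * n, d)                      # (F*N, d)
    scores = q @ k_all.T / np.sqrt(d)                # (F, N, F*N)
    return softmax(scores) @ v_all                   # (F, N, d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(4, 6, 8)) for _ in range(3))
    out = full_cross_frame_attention(q, k, v)
    print(out.shape)
```

Because all frames share one pool of keys and values, identical queries in different frames yield identical outputs, which gives some intuition for why this kind of interaction encourages a consistent appearance across frames. The attention matrix grows with the square of the frame count, so this naive form is only practical for short clips.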

Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, Qi Tian • 2023

Related benchmarks

Task	Dataset	Result
Semantic segmentation	Cityscapes (test)	mIoU 25.8	1252
Video Editing	20 in-the-wild cases	CLIP score 26.87	8
Video Motion Editing	User Study 20 video cases	M-A Score 94.1	7
Sim-to-Real Video Generation	Cityscapes (intra-domain)	Knowledge Checklist 62.22	6
Video Editing	DAVIS (40 selected object-centric videos)	Prompt Consistency (P.C.) 31.4	6
Video Editing	ShutterStock (30 unseen videos)	Prompt Consistency (P.C.) 30.3	6
Video Editing	User Study	--	6
Depth-conditioned Video Generation	UVCBench	Aesthetic Quality 64.5	5
Sim-to-Real Video Generation	Waymo intra-domain	Knowledge Checklist 33.55	5
Accident Video Generation	MM-AU 1.0 (test)	CLIP S 22.51	5

Showing 10 of 17 rows
