VACE: All-in-One Video Creation and Editing

About

Diffusion Transformer has demonstrated powerful capability and scalability in generating high-quality images and videos. Further pursuing the unification of generation and editing tasks has yielded significant progress in the domain of image content creation. However, due to the intrinsic demands for consistency across both temporal and spatial dynamics, achieving a unified approach for video synthesis remains challenging. We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing. These tasks include reference-to-video generation, video-to-video editing, and masked video-to-video editing. Specifically, we effectively integrate the requirements of various tasks by organizing video task inputs, such as editing, reference, and masking, into a unified interface referred to as the Video Condition Unit (VCU). Furthermore, by utilizing a Context Adapter structure, we inject different task concepts into the model using formalized representations of temporal and spatial dimensions, allowing it to handle arbitrary video synthesis tasks flexibly. Extensive experiments demonstrate that the unified model of VACE achieves performance on par with task-specific models across various subtasks. Simultaneously, it enables diverse applications through versatile task combinations. Project page: https://ali-vilab.github.io/VACE-Page/.
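
To make the idea of a unified input interface concrete, here is a minimal sketch of how the three listed tasks could be packed into one structure. The abstract does not specify the VCU's layout, so everything here is an assumption for illustration: the class `VideoConditionUnit`, its fields (`prompt`, `frames`, `masks`), the three builder functions, and the simplification that a mask is one flag per frame rather than a spatial map. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# Hypothetical stand-in for a frame: a label string, or None for an
# empty frame that the model is expected to fill in.
Frame = Optional[str]


@dataclass
class VideoConditionUnit:
    """Illustrative container, not the VACE code: prompt + frames + masks."""
    prompt: str
    frames: List[Frame]
    masks: List[int]  # assumption: 1 = generate or modify, 0 = keep as given

    def __post_init__(self) -> None:
        # Every frame needs exactly one mask entry.
        if len(self.frames) != len(self.masks):
            raise ValueError("frames and masks must have the same length")


def reference_to_video(prompt: str, references: List[str],
                       num_frames: int) -> VideoConditionUnit:
    # Reference images are kept (mask 0); the new frames are empty (mask 1).
    frames: List[Frame] = list(references) + [None] * num_frames
    masks = [0] * len(references) + [1] * num_frames
    return VideoConditionUnit(prompt, frames, masks)


def video_to_video(prompt: str, video: List[str]) -> VideoConditionUnit:
    # The whole input video is context, and every frame may be rewritten.
    return VideoConditionUnit(prompt, list(video), [1] * len(video))


def masked_video_to_video(prompt: str, video: List[str],
                          edit_indices: Set[int]) -> VideoConditionUnit:
    # Only the selected frames are marked for modification.
    masks = [1 if i in edit_indices else 0 for i in range(len(video))]
    return VideoConditionUnit(prompt, list(video), masks)


if __name__ == "__main__":
    print(reference_to_video("a cat on a beach", ["ref0"], 3))
    print(masked_video_to_video("replace the car", ["f0", "f1", "f2"], {1}))
```

Under these assumptions, each task differs only in how `frames` and `masks` are filled, which is the sense in which a single model could consume all of them through one interface.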

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, Yu Liu • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Image-to-Video Generation	VBench	Motion Smoothness	0.97	46
Video Editing	OpenVE-Bench	Overall Score	1.57	39
Video Generation	VBench	Motion Smoothness	99	37
Subject-to-Video Generation	OpenS2V	Total	58.16	32
Instruction-Guided Video Editing	OpenVE-Bench	Overall Score	3.01	29
Video Editing	OpenVE-Bench (test)	Overall Score	3.01	28
Human Image Animation	RealisDance (val)	Subject Consistency	93.56	27
Video Object Removal	Real-World Videos	Temporal Consistency Score	0.9795	25
Image-to-Video Generation	VBench I2V	Background Consistency	91.11	24
Subject-to-Video	OpenS2V Eval	Total Score	57.55	23

Showing 10 of 313 rows

