
UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

About

Recent diffusion-based human image animation techniques have demonstrated impressive success in synthesizing videos that faithfully follow a given reference identity and a sequence of desired movement poses. Despite this, two limitations remain: i) an extra reference model is required to align the identity image with the main video branch, which significantly increases the optimization burden and the number of model parameters; ii) the generated video is usually short (e.g., 24 frames), hampering practical applications. To address these shortcomings, we present the UniAnimate framework to enable efficient and long-term human video generation. First, to reduce the optimization difficulty and ensure temporal coherence, we map the reference image, together with the posture guidance and the noised video, into a common feature space by incorporating a unified video diffusion model. Second, we propose a unified noise input that supports randomly noised input as well as first-frame-conditioned input, which enhances the ability to generate long-term video. Finally, to handle long sequences more efficiently, we explore an alternative temporal modeling architecture based on a state space model to replace the original computation-intensive temporal Transformer. Extensive experimental results indicate that UniAnimate achieves superior synthesis results over existing state-of-the-art counterparts in both quantitative and qualitative evaluations. Notably, UniAnimate can even generate highly consistent one-minute videos by iteratively employing the first-frame conditioning strategy. Code and models will be publicly available. Project page: https://unianimate.github.io/.
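The iterative first-frame conditioning mentioned in the abstract can be pictured as a simple rollout loop. The sketch below is only an illustration under stated assumptions, not the authors' implementation: `generate_long_video`, `generate_segment`, and `toy_generator` are hypothetical names, and the toy generator stands in for a real video diffusion model. Each segment after the first is assumed to be conditioned on the last frame generated so far, and the duplicated conditioning frame is dropped when segments are joined.

```python
from typing import Callable, List, Optional, Sequence

Frame = str  # placeholder type; a real system would use image tensors


def generate_long_video(
    generate_segment: Callable[[Sequence[int], Optional[Frame]], List[Frame]],
    poses: Sequence[int],
    segment_len: int,
) -> List[Frame]:
    """Roll out a long video segment by segment (hypothetical helper).

    `generate_segment(pose_chunk, first_frame)` is assumed to return one
    frame per pose; when `first_frame` is given, the first returned frame
    is assumed to reproduce it.
    """
    if segment_len < 2:
        raise ValueError("segment_len must be at least 2")
    frames: List[Frame] = []
    start = 0
    while len(frames) < len(poses):
        # The first segment starts from pure noise (no conditioning frame);
        # later segments are conditioned on the last frame produced so far.
        first_frame = frames[-1] if frames else None
        chunk = poses[start:start + segment_len]
        segment = generate_segment(chunk, first_frame)
        if first_frame is not None:
            segment = segment[1:]  # drop the duplicated conditioning frame
        frames.extend(segment)
        start = len(frames) - 1  # next chunk overlaps by exactly one pose
    return frames


def toy_generator(chunk: Sequence[int], first_frame: Optional[Frame]) -> List[Frame]:
    """Stand-in for a diffusion model: one labelled frame per pose."""
    out = [f"frame({p})" for p in chunk]
    if first_frame is not None:
        out[0] = first_frame  # conditioned segments reproduce the given frame
    return out


video = generate_long_video(toy_generator, list(range(10)), segment_len=4)
print(len(video))
```

With a one-frame overlap, every segment after the first contributes `segment_len - 1` new frames, so the output always has exactly one frame per input pose.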

Xiang Wang, Shiwei Zhang, Changxin Gao, Jiayu Wang, Xiaoqiang Zhou, Yingya Zhang, Luxin Yan, Nong Sang • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Human Dance Generation | Tiktok (test) | SSIM | 0.811 | 17
Human Image Animation | TikTok | FVD | 148.1 | 15
Character Image Animation | Follow-Your-Pose V2 | LPIPS | 0.183 | 15
Human Image Animation | Unseen100 | L1 Loss | 2.82e+4 | 9
Character Animation | User Study 20 identities and 20 driving videos (test) | Video Quality | 0.6 | 9
Character Image Animation | CoDanceBench (test) | LPIPS | 0.582 | 9
Multi-character video animation | ICE-bench | SSIM | 62.3 | 7
Human-Object Interaction Video Generation | Mani4D (test) | Obj-IoU | 46.44 | 7
motion-conditioned image-to-video animation | self-collected hair motion (test) | SSIM (Hair) | 0.9761 | 6
Human Video Animation | Self-collected hair motion CG (test) | Average Vote Percentage | 9.5 | 6
Showing 10 of 13 rows
