UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

About

Recent diffusion-based human image animation techniques have demonstrated impressive success in synthesizing videos that faithfully follow a given reference identity and a sequence of desired movement poses. Despite this, there are still two limitations: i) an extra reference model is required to align the identity image with the main video branch, which significantly increases the optimization burden and model parameters; ii) the generated video is usually short in time (e.g., 24 frames), hampering practical applications. To address these shortcomings, we present the UniAnimate framework to enable efficient and long-term human video generation. First, to reduce the optimization difficulty and ensure temporal coherence, we map the reference image along with the posture guidance and noise video into a common feature space by incorporating a unified video diffusion model. Second, we propose a unified noise input that supports random noised input as well as first frame conditioned input, which enhances the ability to generate long-term video. Finally, to handle long sequences more efficiently, we explore an alternative temporal modeling architecture based on a state space model to replace the original computation-consuming temporal Transformer. Extensive experimental results indicate that UniAnimate achieves superior synthesis results over existing state-of-the-art counterparts in both quantitative and qualitative evaluations. Notably, UniAnimate can even generate highly consistent one-minute videos by iteratively employing the first frame conditioning strategy. Code and models will be publicly available. Project page: https://unianimate.github.io/.
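
The iterative first frame conditioning mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `animate_long`, `generate_segment`, and `toy_generator` are hypothetical names, frames are represented as plain strings, and the diffusion model (with its reference image) is assumed to be hidden behind the `generate_segment` callable. The sketch only shows the chaining logic: each segment after the first is conditioned on the last frame of the previous one, and the duplicated frame is dropped.

```python
from typing import Callable, List, Optional, Sequence

Frame = str  # stand-in type; a real frame would be an image tensor


def animate_long(
    poses: Sequence[str],
    segment_len: int,
    generate_segment: Callable[[Sequence[str], Optional[Frame]], List[Frame]],
) -> List[Frame]:
    """Chain short segments into one long video.

    generate_segment(chunk, first_frame) is a hypothetical callable that
    returns one frame per pose in chunk. When first_frame is given, the
    first returned frame is assumed to reproduce it.
    """
    if segment_len < 2:
        raise ValueError("segment_len must be at least 2")

    frames: List[Frame] = []
    start = 0
    while start < len(poses):
        if not frames:
            # First segment: no conditioning frame (pure noise input).
            chunk = poses[start:start + segment_len]
            frames.extend(generate_segment(chunk, None))
            start += len(chunk)
        else:
            # Later segments: restart from the last generated frame and its
            # pose, then drop the duplicated first frame of the new segment.
            chunk = poses[start - 1:start - 1 + segment_len]
            segment = generate_segment(chunk, frames[-1])
            frames.extend(segment[1:])
            start += len(chunk) - 1
    return frames


def toy_generator(chunk: Sequence[str], first_frame: Optional[Frame]) -> List[Frame]:
    """Toy stand-in for a video model: labels each pose as a frame."""
    out = [f"frame<{p}>" for p in chunk]
    if first_frame is not None:
        out[0] = first_frame  # conditioned segments start from the given frame
    return out


poses = [f"p{i}" for i in range(10)]
frames = animate_long(poses, segment_len=4, generate_segment=toy_generator)
```

With a 4-frame segment length, each conditioned segment contributes 3 new frames, so the number of model calls grows linearly with the target video length.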

Xiang Wang, Shiwei Zhang, Changxin Gao, Jiayu Wang, Xiaoqiang Zhou, Yingya Zhang, Luxin Yan, Nong Sang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image-to-Video Generation | VBench I2V | Background Consistency | 91.17 | 24 |
| Human Dance Generation | Tiktok (test) | SSIM | 0.811 | 17 |
| Human Image Animation | TikTok | FVD | 148.1 | 15 |
| Character Image Animation | Follow-Your-Pose V2 | LPIPS | 0.183 | 15 |
| Human Image Animation | Unseen100 | L1 Loss | 2.82e+4 | 9 |
| Character Animation | User Study 20 identities and 20 driving videos (test) | Video Quality | 0.6 | 9 |
| Character Image Animation | CoDanceBench (test) | LPIPS | 0.582 | 9 |
| Multi-character video animation | ICE-bench | SSIM | 62.3 | 7 |
| Human-Object Interaction Video Generation | Mani4D (test) | Obj-IoU | 46.44 | 7 |
| Human Image Animation | HyperMotionX Bench | PSNR | 20.9 | 6 |
Showing 10 of 15 rows
