OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation

About

Significant progress has been made in audio-driven human animation, while most existing methods focus mainly on facial movements, limiting their ability to create full-body animations with natural synchronization and fluidity. They also struggle with precise prompt control for fine-grained generation. To tackle these challenges, we introduce OmniAvatar, an innovative audio-driven full-body video generation model that enhances human animation with improved lip-sync accuracy and natural movements. OmniAvatar introduces a pixel-wise multi-hierarchical audio embedding strategy to better capture audio features in the latent space, enhancing lip-syncing across diverse scenes. To preserve the capability for prompt-driven control of foundation models while effectively incorporating audio features, we employ a LoRA-based training approach. Extensive experiments show that OmniAvatar surpasses existing models in both facial and semi-body video generation, offering precise text-based control for creating videos in various domains, such as podcasts, human interactions, dynamic scenes, and singing. Our project page is https://omni-avatar.github.io/.
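The abstract names a LoRA-based training approach but does not give layer choices, ranks, or how the audio embeddings are injected. As a rough illustration of the general LoRA idea only (not OmniAvatar's actual implementation), the sketch below keeps a weight matrix frozen and adds a trainable low-rank update to it. All names, shapes, and the `alpha` value here are assumptions chosen for the example.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_linear(x, w_frozen, lora_a, lora_b, alpha):
    """Compute x @ W + (alpha / rank) * (x @ A) @ B.

    w_frozen: (d_in, d_out) pretrained weight, never updated.
    lora_a:   (d_in, rank) trainable down-projection.
    lora_b:   (rank, d_out) trainable up-projection.
    """
    rank = len(lora_b)
    scale = alpha / rank
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

# Hypothetical toy dimensions, chosen only for illustration.
random.seed(0)
d_in, d_out, rank = 8, 8, 2

w_frozen = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
lora_a = [[random.gauss(0, 0.01) for _ in range(rank)] for _ in range(d_in)]
# Zero-initialising B means the adapted layer starts out identical to the base layer.
lora_b = [[0.0] * d_out for _ in range(rank)]

x = [[float(i + 1) for i in range(d_in)]]

base_only = matmul(x, w_frozen)
out_initial = lora_linear(x, w_frozen, lora_a, lora_b, alpha=2.0)

# Only A and B would be trained; W stays fixed.
frozen_params = d_in * d_out
trainable_params = rank * (d_in + d_out)
```

Because the base weights stay untouched, behaviour learned by the foundation model (such as following text prompts) is preserved at initialisation, and the low-rank matrices carry only the new adaptation. This is the general motivation for LoRA; how it plays out in OmniAvatar specifically is described in the paper rather than here.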

Qijun Gan, Ruizi Yang, Jianke Zhu, Shaofei Xue, Steven Hoi • 2025

Related benchmarks

Task | Dataset | Result | Rank
Talking Head Generation | HDTF (test) | FVD 374.6 | 33
Talking avatar video generation | Long dataset 25 synthesized avatar images, 20s audio clips 1.0 | ASE 4.66 | 10
Talking avatar video generation | EMTD (test) | FID 75.2 | 10
Talking avatar video generation | Short dataset real avatar images, 5s audio 1.0 | FID 87.24 | 10
Talking head video generation | Action Bench (test) | Sync-C 6.765 | 9
Audio-driven video generation | Custom evaluation dataset | Sync-C 3.85 | 9
Audio-driven Avatar Generation | GenBench-ShortVideo (test) | ASE 3.53 | 7
Audio-driven Avatar Generation | GenBench ShortVideo (user study) | Naturalness 71.1 | 7
Audio-driven Generation | TalkBench Short (10 s) 1.0 (test) | ASE 3.06 | 7
Audio-guided human animation | Soul-Bench | Video-Text Consistency 4.77 | 6
Showing 10 of 13 rows
