OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation
About
Audio-driven human animation has progressed significantly, but most existing methods focus mainly on facial movements, which limits their ability to create full-body animations with natural synchronization and fluidity. They also struggle with precise prompt control for fine-grained generation. To tackle these challenges, we introduce OmniAvatar, an audio-driven full-body video generation model that improves lip-sync accuracy and produces more natural movements. OmniAvatar uses a pixel-wise multi-hierarchical audio embedding strategy to better capture audio features in the latent space, enhancing lip-syncing across diverse scenes. To preserve the foundation model's capability for prompt-driven control while effectively incorporating audio features, we employ a LoRA-based training approach. Extensive experiments show that OmniAvatar surpasses existing models in both facial and semi-body video generation, offering precise text-based control for creating videos in various domains, such as podcasts, human interactions, dynamic scenes, and singing. Our project page is https://omni-avatar.github.io/.
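To make the LoRA idea concrete, here is a minimal sketch of a single linear layer with a low-rank update, written in plain Python. It is not OmniAvatar's implementation: the class name `LoRALinear`, the helper `matmul`, the rank, the `alpha` value, and the toy shapes are all assumptions chosen for illustration. It only shows the general mechanism, in which the base weight stays frozen and a small trainable product `B @ A` is added on top.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical layer: frozen weight W (out x in) plus (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight          # frozen base weight, never updated
        self.scale = alpha / rank     # scaling used in the standard LoRA formulation
        # A starts as small random values and B as zeros, so the low-rank
        # update is exactly zero before any training step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the layer to a batch x (batch x in_dim); returns batch x out_dim."""
        w_t = [list(col) for col in zip(*self.effective_weight())]  # transpose of W
        return matmul(x, w_t)


layer = LoRALinear([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(layer.forward([[1.0, 2.0, 3.0]]))  # same as the frozen layer before training
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen model, which fits the stated goal of keeping the foundation model's prompt-driven behavior while new audio conditioning is learned.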
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Talking Head Generation | HDTF (test) | FVD | 374.6 | 33 |
| Talking avatar video generation | Long dataset 25 synthesized avatar images, 20s audio clips 1.0 | ASE | 4.66 | 10 |
| Talking avatar video generation | EMTD (test) | FID | 75.2 | 10 |
| Talking avatar video generation | Short dataset real avatar images, 5s audio 1.0 | FID | 87.24 | 10 |
| Talking head video generation | Action Bench (test) | Sync-C | 6.765 | 9 |
| Audio-driven video generation | Custom evaluation dataset | Sync-C | 3.85 | 9 |
| Audio-driven Avatar Generation | GenBench-ShortVideo (test) | ASE | 3.53 | 7 |
| Audio-driven Avatar Generation | GenBench ShortVideo (user study) | Naturalness | 71.1 | 7 |
| Audio-driven Generation | TalkBench Short (10 s) 1.0 (test) | ASE | 3.06 | 7 |
| Audio-guided human animation | Soul-Bench | Video-Text Consistency | 4.77 | 6 |