
InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions

About

End-to-end human animation with rich multi-modal conditions, e.g., text, image, and audio, has achieved remarkable advancements in recent years. However, most existing methods can only animate a single subject and inject conditions in a global manner, ignoring scenarios where multiple concepts appear in the same video with rich human-human and human-object interactions. Such a global assumption prevents precise, per-identity control of multiple concepts, including humans and objects, and therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from each modality to each identity's spatiotemporal footprint. Given reference images of multiple concepts, our method automatically infers layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject each local audio condition into its corresponding region, in an iterative manner, to ensure layout-aligned modality matching. This design enables high-quality generation of human dialogue videos between two to three people, as well as video customization from multiple reference images. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods. Video demos are available at https://zhenzhiwang.github.io/interacthuman/

Zhenzhi Wang, Jiaqi Yang, Jianwen Jiang, Chao Liang, Gaojie Lin, Zerong Zheng, Ceyuan Yang, Yuan Zhang, Mingyuan Gao, Dahua Lin • 2025
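
The abstract contrasts injecting audio globally with injecting each audio stream only into its own identity's region. The toy sketch below illustrates that contrast and nothing more: it is not the authors' implementation. The function names, the scalar "audio features", and the hand-written masks are all hypothetical simplifications; in the paper the masks come from a learned mask predictor and the injection happens inside a video generation model.

```python
from typing import List

Grid = List[List[float]]  # H x W toy feature map (stand-in for one latent frame)


def inject_local_audio(latent: Grid, masks: List[Grid], audio_feats: List[float]) -> Grid:
    """Add each identity's audio feature only where that identity's mask is active.

    masks: one H x W soft mask per identity, values in [0, 1] (assumed given here).
    audio_feats: one scalar per identity (hypothetical simplification).
    """
    if len(masks) != len(audio_feats):
        raise ValueError("need exactly one audio feature per mask")
    out = [row[:] for row in latent]  # copy so the input stays untouched
    for mask, feat in zip(masks, audio_feats):
        for y, mask_row in enumerate(mask):
            for x, weight in enumerate(mask_row):
                out[y][x] += weight * feat
    return out


def inject_global_audio(latent: Grid, audio_feats: List[float]) -> Grid:
    """Baseline for comparison: every position receives the same averaged audio feature."""
    mixed = sum(audio_feats) / len(audio_feats)
    return [[value + mixed for value in row] for row in latent]


# Two speakers: A occupies the left half of the frame, B the right half.
latent = [[0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0]]
mask_a = [[1.0, 1.0, 0.0, 0.0],
          [1.0, 1.0, 0.0, 0.0]]
mask_b = [[0.0, 0.0, 1.0, 1.0],
          [0.0, 0.0, 1.0, 1.0]]
audio = [1.0, -2.0]  # toy per-speaker audio features

local_result = inject_local_audio(latent, [mask_a, mask_b], audio)
global_result = inject_global_audio(latent, audio)
print(local_result[0])   # left half carries speaker A's feature, right half speaker B's
print(global_result[0])  # every position carries the same mixed feature
```

With local injection each region keeps its own speaker's signal, whereas the global baseline blends both speakers everywhere, which is the per-identity control the abstract says a global assumption prevents.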

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Talking Head Generation | RAVDESS | IQA Score 4.602 | 8 |
| Single-person talking-head generation | CelebV-HQ | IQA 3.834 | 8 |
| Audio-conditioned full-body animation | Two-person audio conditioned human animation (test) | Sync-D 6.67 | 6 |
| Audio-conditioned full-body animation | OmniHuman Single-person audio conditioned human animation (test) | Sync-C 7.272 | 6 |
| Multi-Concept Video Customization | Multi-Concept Video Customization (evaluation set) | CLIP-I 0.744 | 5 |
| Multi-Concept Video Customization | Multi-concept video customization (test) | Average Score 4.01 | 5 |
| Audio-driven video generation | Audio-conditioned human animation (test) | Average Score 2.48 | 3 |
