HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

About

Human-Centric Video Generation (HCVG) methods seek to synthesize human videos from multimodal inputs, including text, image, and audio. Existing methods struggle to coordinate these heterogeneous modalities because of two challenges: the scarcity of training data with paired triplet conditions, and the difficulty of jointly handling the sub-tasks of subject preservation and audio-visual sync with multimodal inputs. In this work, we present HuMo, a unified HCVG framework for collaborative multimodal control. For the first challenge, we construct a high-quality dataset with diverse and paired text, reference images, and audio. For the second challenge, we propose a two-stage progressive multimodal training paradigm with task-specific strategies. For the subject preservation task, to maintain the prompt-following and visual generation abilities of the foundation model, we adopt a minimally invasive image injection strategy. For the audio-visual sync task, in addition to the commonly adopted audio cross-attention layer, we propose a focus-by-predicting strategy that implicitly guides the model to associate audio with facial regions. For joint learning of controllability across multimodal inputs, we build on the previously acquired capabilities and progressively incorporate the audio-visual sync task. During inference, for flexible and fine-grained multimodal control, we design a time-adaptive Classifier-Free Guidance strategy that dynamically adjusts guidance weights across denoising steps. Extensive experimental results demonstrate that HuMo surpasses specialized state-of-the-art methods in sub-tasks, establishing a unified framework for collaborative multimodal-conditioned HCVG. Project Page: https://phantom-video.github.io/HuMo.
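
To make the time-adaptive Classifier-Free Guidance idea concrete, here is a minimal sketch, not HuMo's actual implementation. It assumes a nested multi-condition guidance formula (text, then image, then audio) and a simple two-phase schedule. The names `GuidanceWeights`, `time_adaptive_weights`, `guided_prediction`, the `switch_frac` parameter, and all numeric weights are hypothetical and chosen for illustration only; the abstract states only that guidance weights are adjusted dynamically across denoising steps.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GuidanceWeights:
    """Per-modality guidance scales (hypothetical container)."""
    text: float
    image: float
    audio: float


def time_adaptive_weights(step: int, num_steps: int,
                          switch_frac: float = 0.5) -> GuidanceWeights:
    """Return guidance scales for a given denoising step.

    Assumed schedule (illustrative, not taken from the paper):
    a stronger text scale early, a stronger audio scale late.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    # Fraction of the denoising trajectory completed, in [0, 1].
    progress = step / max(num_steps - 1, 1)
    if progress < switch_frac:
        return GuidanceWeights(text=7.5, image=3.0, audio=2.0)
    return GuidanceWeights(text=5.0, image=3.0, audio=4.0)


def guided_prediction(eps_uncond: List[float],
                      eps_text: List[float],
                      eps_text_image: List[float],
                      eps_all: List[float],
                      w: GuidanceWeights) -> List[float]:
    """Combine four noise predictions with nested classifier-free guidance.

    Each modality contributes the difference between the prediction that
    includes it and the prediction that omits it, scaled by its own weight.
    """
    return [
        u + w.text * (t - u) + w.image * (ti - t) + w.audio * (a - ti)
        for u, t, ti, a in zip(eps_uncond, eps_text, eps_text_image, eps_all)
    ]
```

In a sampling loop, one would call `time_adaptive_weights(step, num_steps)` once per step and pass the result to `guided_prediction` along with the model's four conditional and unconditional outputs. With all three weights set to 1, the formula reduces to the fully conditioned prediction.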

Liyang Chen, Tianxiang Ma, Jiawei Liu, Bingchuan Li, Zhuowei Chen, Lijie Liu, Xu He, Gen Li, Qian He, Zhiyong Wu • 2025

Related benchmarks

Task	Dataset	Result
Video Generation	User Study	Interaction Plausibility Score: 3.92	16
Multimodal Customization	OC-Bench (test)	Face-Sim: 0.708	12
Video Personalization	OpenS2V-Eval & Self-Constructed Cross-Domain (test)	NANO-CLIP Score: 0.609	11
Video Personalization	OpenS2V-Eval & Self-Constructed (In-Domain test)	DINO-I Score: 0.317	11
Video Personalization	OpenS2V-Eval & Self-Constructed (test)	AES: 0.479	11
Video Customization	70-example benchmark 1.0 (test)	FaceSim Arc: 0.49	9
Identity-consistent video generation	HarmoView-Bench	Total Score: 73.11	8
HOI Video Generation	HOI video generation (test)	AES Score: 56.5	7
Text+Reference-to-Video (R2V) Generation	HOIVG-Bench 1.0 (test)	TA: 7.949	7
Multi-view appearance and expressive identity consistency	Multi-view appearance and expressive identity consistency (evaluation set)	DINO-I Score: 81	6

Showing 10 of 20 rows
