Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer

About

Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds. In this paper, we introduce the first application of a pretrained transformer-based video generative model that demonstrates strong generalization capabilities and generates highly dynamic, realistic videos for portrait animation, effectively addressing these challenges. The adoption of a new video backbone model makes previous U-Net-based methods for identity maintenance, audio conditioning, and video extrapolation inapplicable. To address this limitation, we design an identity reference network consisting of a causal 3D VAE combined with a stacked series of transformer layers, ensuring consistent facial identity across video sequences. Additionally, we investigate various speech audio conditioning and motion frame mechanisms to enable the generation of continuous video driven by speech audio. Our method is validated through experiments on benchmark and newly proposed wild datasets, demonstrating substantial improvements over prior methods in generating realistic portraits characterized by diverse orientations within dynamic and immersive scenes. Further visualizations and the source code are available at: https://fudan-generative-vision.github.io/hallo3/.

Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, Siyu Zhu • 2024

Related benchmarks

Task	Dataset	Result
Text-to-Video Generation	VBench	--	209
Talking Head Generation	HDTF (test)	FID 14.75	73
Talking Head Generation	HDTF	FID 15.95	48
Portrait Image Animation	HDTF (test)	FID 42.156	23
Talking Head Generation	VFHQ (test)	FID 23.45	16
Talking Head Generation	HDTF-100 (test)	Sync-C Score 6.814	15
Audio-driven portrait video generation	HDTF (test)	SC 7.256	13
Image-to-Video lip-syncing	TalkVid	FID 68.11	12
Talking avatar video generation	Long dataset 25 synthesized avatar images, 20s audio clips 1.0	ASE 4.68	10
Keypoint-based Portrait Animation	Portrait Animation	CPBD 0.4045	10

Showing 10 of 41 rows
