X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention

About

We propose X-Portrait, an innovative conditional diffusion model tailored for generating expressive and temporally coherent portrait animation. Specifically, given a single portrait as appearance reference, we aim to animate it with motion derived from a driving video, capturing both highly dynamic and subtle facial expressions along with wide-range head movements. At its core, we leverage the generative prior of a pre-trained diffusion model as the rendering backbone, while achieving fine-grained head pose and expression control with novel controlling signals within the framework of ControlNet. In contrast to conventional coarse explicit controls such as facial landmarks, our motion control module learns to interpret the dynamics directly from the original driving RGB inputs. Motion accuracy is further improved by a patch-based local control module that strengthens motion attention to small-scale nuances such as eyeball positions. Notably, to mitigate identity leakage from the driving signals, we train our motion control modules with scaling-augmented cross-identity images, maximizing disentanglement from the appearance reference modules. Experimental results demonstrate the effectiveness of X-Portrait across a diverse range of facial portraits and expressive driving sequences, and showcase its proficiency in generating captivating portrait animations with consistently maintained identity characteristics.
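The scaling augmentation mentioned in the abstract can be pictured with a small sketch. This is an illustration only, not the authors' implementation: the function name `random_scale_about_center`, the scale range, the nearest-neighbor sampling, and the zero padding are all assumptions made for the example. The idea shown is that a driving image is rescaled about its center by a random factor, so that facial proportions in the driving signal vary independently of the motion it carries.

```python
import numpy as np

def random_scale_about_center(image, rng, min_scale=0.8, max_scale=1.25):
    """Rescale an (H, W, C) image about its center by a random factor.

    Hypothetical helper: the scale range and the sampling scheme are
    assumptions for illustration, not values taken from the paper.
    The output keeps the input size; uncovered pixels are left at zero.
    """
    h, w = image.shape[:2]
    scale = rng.uniform(min_scale, max_scale)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0

    # Inverse mapping: for every output row/column, find the source index.
    src_y = np.rint((np.arange(h) - cy) / scale + cy).astype(int)
    src_x = np.rint((np.arange(w) - cx) / scale + cx).astype(int)

    # Output pixels whose source falls outside the image stay zero.
    mask = ((src_y >= 0) & (src_y < h))[:, None] & ((src_x >= 0) & (src_x < w))[None, :]

    # Nearest-neighbor gather with clipped indices, then keep only valid pixels.
    sy = np.clip(src_y, 0, h - 1)
    sx = np.clip(src_x, 0, w - 1)
    sampled = image[sy[:, None], sx[None, :]]

    out = np.zeros_like(image)
    out[mask] = sampled[mask]
    return out, scale

rng = np.random.default_rng(0)
driving = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
augmented, used_scale = random_scale_about_center(driving, rng)
```

In a training loop, an augmentation of this kind would be applied to the driving image only, while the appearance reference is left untouched.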

You Xie, Hongyi Xu, Guoxian Song, Chao Wang, Yichun Shi, Linjie Luo • 2024

Related benchmarks

Task	Dataset	Result
Portrait Animation (Self-reenactment)	VFHQ (test)	FVD 575.3	23
Talking head video generation	HDTF	FID 15.66	14
Talking head video generation	Talkinghead1kh	FID 19.86	8
Self-Reenactment	HDTF (test)	LPIPS 0.2118	8
Image-to-Image Video Generation	256x256 25 FPS	Inference Time (s) 36.973	8
Cross-Reenactment	TalkingHead-1KH and LV100 (test)	ID-SIM 0.678	7
Self-Reenactment	TalkingHead-1KH and LV100 (test)	L1 Loss 0.049	7
Talking head synthesis	VFHQ (first 100 frames)	FID 26.22	6
Talking head synthesis	Self-Collected Dataset 50 identities	FID 32.77	6
Portrait Animation	HDTF cross ID reenactment	HPF 2.945	6

Showing 10 of 17 rows
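
For readers unfamiliar with two of the simpler metrics in the table, the sketch below shows how an L1 loss and an identity-similarity score are commonly computed. It rests on assumptions: L1 loss is taken as the mean absolute pixel difference on images scaled to [0, 1], and ID-SIM is taken as the cosine similarity between face-recognition embeddings, which is the usual convention. The exact evaluation protocol behind the numbers above is not given on this page, and the embeddings here are random placeholders rather than the output of a real face-recognition model.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute difference between two arrays of the same shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.mean(np.abs(pred - target)))

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Placeholder frames in [0, 1]; a real evaluation would use generated and ground-truth video frames.
generated = rng.random((4, 64, 64, 3))
ground_truth = rng.random((4, 64, 64, 3))
frame_l1 = l1_loss(generated, ground_truth)

# Placeholder embeddings; a real evaluation would obtain them from a face-recognition network.
reference_embedding = rng.standard_normal(512)
generated_embedding = rng.standard_normal(512)
id_sim = cosine_similarity(reference_embedding, generated_embedding)
```

Lower is better for L1 loss, while higher is better for an identity-similarity score of this form.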
