One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing
About
We propose a neural talking-head video synthesis model and demonstrate its application to video conferencing. Our model learns to synthesize a talking-head video from a source image containing the target person's appearance and a driving video that dictates the motion in the output. The motion is encoded using a novel keypoint representation, in which identity-specific and motion-related information is decomposed in an unsupervised manner. Extensive experimental validation shows that our model outperforms competing methods on benchmark datasets. Moreover, our compact keypoint representation enables a video conferencing system that achieves the same visual quality as the commercial H.264 standard while using only one-tenth of the bandwidth. In addition, we show that the keypoint representation allows the user to rotate the head during synthesis, which is useful for simulating a face-to-face video conferencing experience.
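To make the conferencing pipeline concrete, below is a minimal sketch of the sender/receiver loop it implies: the source image is transmitted once, each driving frame is reduced to a compact keypoint set on the sender side, and the receiver synthesizes the output frame from the source appearance plus the received keypoints. The function names (`extract_keypoints`, `synthesize`), the keypoint count, frame rate, and bitrates are illustrative assumptions, not the released model's API; the stub bodies stand in for the learned networks.

```python
import numpy as np

NUM_KEYPOINTS = 20   # assumed size of the compact keypoint set
FPS = 30             # assumed conferencing frame rate
BYTES_PER_FLOAT = 4  # float32 coordinates

def extract_keypoints(frame: np.ndarray) -> np.ndarray:
    """Stand-in for the learned keypoint detector: returns a
    (NUM_KEYPOINTS, 3) array of 3D keypoints encoding head pose
    and expression. Here it just returns fixed random values."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((NUM_KEYPOINTS, 3)).astype(np.float32)

def synthesize(source_image, source_kp, driving_kp):
    """Stand-in for the generator: warps the source appearance from
    source_kp toward driving_kp. Here it just echoes the source."""
    return source_image

# Sender side: send the source image once, then only keypoints.
source_image = np.zeros((256, 256, 3), dtype=np.uint8)
source_kp = extract_keypoints(source_image)

driving_frame = np.zeros((256, 256, 3), dtype=np.uint8)  # webcam frame
payload = extract_keypoints(driving_frame)               # sent per frame

# Receiver side: reconstruct the frame from the keypoints alone.
output_frame = synthesize(source_image, source_kp, payload)

# Back-of-envelope bandwidth of the raw keypoint stream:
kp_bits_per_sec = NUM_KEYPOINTS * 3 * BYTES_PER_FLOAT * 8 * FPS
print(f"keypoint stream: {kp_bits_per_sec / 1e3:.1f} kbps")  # ~57.6 kbps
# A typical H.264 conferencing stream runs at hundreds of kbps, which is
# where an order-of-magnitude bandwidth saving can come from.
```

Under these illustrative numbers, the per-frame payload is a few hundred bytes regardless of output resolution, since the receiver regenerates pixels from the source image rather than decoding them from the stream.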
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Face Reenactment | VoxCeleb1 (test) | SSIM | 0.761 | 16 |
| Video-driven Talking Head Generation (Self-Reenactment) | HDTF | FID | 22.27 | 12 |
| Talking head video generation | HDTF | FID | 20.57 | 8 |
| Talking head video generation | TalkingHead-1KH | FID | 30.52 | 8 |
| Self-Reenactment | HDTF (test) | LPIPS | 0.2771 | 8 |
| Cross-identity reenactment | CelebV 30 | CSIM | 79.1 | 7 |
| Same-identity reconstruction | VoxCeleb1 (test) | L1 | 0.0445 | 7 |
| Cross-identity reenactment | HDTF | FVD | 134.9 | 6 |
| Talking head synthesis | Self-Collected Dataset (50 identities) | FID | 47.13 | 6 |
| Talking head synthesis | VFHQ (first 100 frames) | FID | 71.58 | 6 |