Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

About

We propose an audio-driven talking-head method to generate photo-realistic talking-head videos from a single reference image. In this work, we tackle two key challenges: (i) producing natural head motions that match speech prosody, and (ii) maintaining the appearance of the speaker under large head motions while stabilizing the non-face regions. We first design a head pose predictor by modeling rigid 6D head movements with a motion-aware recurrent neural network (RNN). The predicted head poses act as the low-frequency holistic movements of a talking head, allowing the subsequent network to focus on generating detailed facial movements. To depict the motion of the entire image arising from audio, we exploit a keypoint-based dense motion field representation. We then develop a motion field generator to produce the dense motion fields from the input audio, head poses, and a reference image. Because this keypoint-based representation models the motions of the facial regions, head, and background jointly, our method can better constrain the spatial and temporal consistency of the generated videos. Finally, an image generation network renders photo-realistic talking-head videos from the estimated keypoint-based motion fields and the input reference image. Extensive experiments demonstrate that our method produces videos with plausible head motions, synchronized facial expressions, and stable backgrounds, and that it outperforms the state of the art.
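The idea of a keypoint-based dense motion field can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a simplified scheme in which each keypoint contributes a pure translation, candidates are blended by a softmax over Gaussian-style weights, and an extra identity candidate stands in for the static background. The function names and the `sigma` and `bg_logit` parameters are hypothetical.

```python
import numpy as np

def make_grid(h, w):
    """Return an (h, w, 2) grid of (x, y) coordinates normalised to [-1, 1]."""
    xs = np.linspace(-1.0, 1.0, w)
    ys = np.linspace(-1.0, 1.0, h)
    gx, gy = np.meshgrid(xs, ys)  # each has shape (h, w)
    return np.stack([gx, gy], axis=-1)

def dense_motion_field(kp_ref, kp_drv, h, w, sigma=0.1, bg_logit=-5.0):
    """Blend per-keypoint translations into one dense sampling field.

    kp_ref, kp_drv: (K, 2) keypoints in the reference and target frames,
    in the same normalised coordinates as the grid.
    Returns an (h, w, 2) array: for every target pixel, the location in
    the reference image to sample from.
    """
    grid = make_grid(h, w)

    # Candidate k: shift the grid by the displacement of keypoint k.
    shift = (kp_ref - kp_drv)[:, None, None, :]           # (K, 1, 1, 2)
    kp_candidates = grid[None] + shift                    # (K, h, w, 2)
    # Extra candidate (assumption): identity motion for the background.
    candidates = np.concatenate([kp_candidates, grid[None]], axis=0)

    # A keypoint's influence decays with distance from its target position.
    diff = grid[None] - kp_drv[:, None, None, :]          # (K, h, w, 2)
    kp_logits = -np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2)
    bg = np.full((1, h, w), bg_logit)
    logits = np.concatenate([kp_logits, bg], axis=0)      # (K+1, h, w)

    # Numerically stable softmax over the candidate axis.
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)

    return np.sum(weights[..., None] * candidates, axis=0)

# One keypoint moves from x = 0.5 in the reference to x = 0 in the target.
kp_ref = np.array([[0.5, 0.0]])
kp_drv = np.array([[0.0, 0.0]])
field = dense_motion_field(kp_ref, kp_drv, h=5, w=5)
```

Pixels near the moved keypoint sample from the shifted location, while distant pixels fall back to the identity candidate. In this toy setting, that fallback is what keeps the background still.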

Suzhen Wang, Lincheng Li, Yu Ding, Changjie Fan, Xin Yu • 2021

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| Audio-driven facial animation | MEAD 41 (test) | PSNR | 26.529 | 26 |
| Audio-driven facial animation | RAVDESS 42 (test) | PSNR | 25.639 | 24 |
| Audio Driven Talking Head Generation | CREMA | Sync | 5.7673 | 14 |
| Audio Driven Talking Head Generation | Mead | Sync | 6.7809 | 14 |
| Talking Head Reenactment | General Inference (test) | FPS | 13.817 | 13 |
| Talking Head Reenactment | General Inference | Inference Speed (FPS) | 13.817 | 13 |
| Talking Face Generation | CREMA-D | FID | 72.81 | 9 |
| Talking Face Generation | LRS2 | ID-SIM | 0.225 | 8 |
| Audio-visual synchronisation | Audio-visual synchronization benchmark | LSE-C | 2.51 | 7 |
| Audio-driven talking face generation | VoxCeleb2 | Sc | 6.172 | 6 |
Showing 10 of 17 rows
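
For readers unfamiliar with the PSNR metric reported for MEAD and RAVDESS, the sketch below computes it using the standard definition, 10 · log10(MAX² / MSE), in decibels. It assumes 8-bit images (`max_value=255`) and averages over all pixels and channels. The exact evaluation protocol behind the numbers above is not stated here, so this is illustrative only, and the function name `psnr` is hypothetical.

```python
import math
import numpy as np

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    ref = np.asarray(reference, dtype=np.float64)
    gen = np.asarray(generated, dtype=np.float64)
    if ref.shape != gen.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((ref - gen) ** 2)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Two 8-bit frames that differ by exactly one intensity level everywhere.
frame_a = np.full((4, 4, 3), 100, dtype=np.uint8)
frame_b = np.full((4, 4, 3), 101, dtype=np.uint8)
score = psnr(frame_a, frame_b)
```

Higher values mean the generated frame is closer to the reference. The inputs are cast to float before subtraction so that unsigned 8-bit values do not wrap around.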
