One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing
About
We propose a neural talking-head video synthesis model and demonstrate its application to video conferencing. Our model learns to synthesize a talking-head video from a source image containing the target person's appearance and a driving video that dictates the motion in the output. The motion is encoded using a novel keypoint representation, in which identity-specific and motion-related information is decomposed in an unsupervised manner. Extensive experimental validation shows that our model outperforms competing methods on benchmark datasets. Moreover, our compact keypoint representation enables a video conferencing system that achieves the same visual quality as the commercial H.264 standard while using only one-tenth of the bandwidth. In addition, we show that the keypoint representation allows the user to rotate the head during synthesis, which is useful for simulating a face-to-face video conferencing experience.
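To make the conferencing pipeline concrete, below is a minimal sketch of the sender/receiver loop it implies: the source image is transmitted once, each driving frame is reduced to a compact keypoint set on the sender side, and the receiver synthesizes the output frame from the source appearance plus the received keypoints. The function names (`extract_keypoints`, `synthesize`), the keypoint count, frame rate, and bitrates are illustrative assumptions, not the released model's API; the stub bodies stand in for the learned networks.

```python
import numpy as np

NUM_KEYPOINTS = 20   # assumed size of the compact keypoint set
FPS = 30             # assumed conferencing frame rate
BYTES_PER_FLOAT = 4  # float32 coordinates

def extract_keypoints(frame: np.ndarray) -> np.ndarray:
    """Stand-in for the learned keypoint detector: returns a
    (NUM_KEYPOINTS, 3) array of 3D keypoints encoding head pose
    and expression. Here it just returns fixed random values."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((NUM_KEYPOINTS, 3)).astype(np.float32)

def synthesize(source_image, source_kp, driving_kp):
    """Stand-in for the generator: warps the source appearance from
    source_kp toward driving_kp. Here it just echoes the source."""
    return source_image

# Sender side: send the source image once, then only keypoints.
source_image = np.zeros((256, 256, 3), dtype=np.uint8)
source_kp = extract_keypoints(source_image)

driving_frame = np.zeros((256, 256, 3), dtype=np.uint8)  # webcam frame
payload = extract_keypoints(driving_frame)               # sent per frame

# Receiver side: reconstruct the frame from the keypoints alone.
output_frame = synthesize(source_image, source_kp, payload)

# Back-of-envelope bandwidth of the raw keypoint stream:
kp_bits_per_sec = NUM_KEYPOINTS * 3 * BYTES_PER_FLOAT * 8 * FPS
print(f"keypoint stream: {kp_bits_per_sec / 1e3:.1f} kbps")  # ~57.6 kbps
# A typical H.264 conferencing stream runs at hundreds of kbps, which is
# where an order-of-magnitude bandwidth saving can come from.
```

Under these illustrative numbers, the per-frame payload is a few hundred bytes regardless of output resolution, since the receiver regenerates pixels from the source image rather than decoding them from the stream.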
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Face Reenactment | VoxCeleb1 (test) | SSIM | 0.761 | 16 |
| Video-driven Talking Head Generation (Self-Reenactment) | HDTF | FID | 22.27 | 12 |
| Talking head video generation | HDTF | FID | 20.57 | 8 |
| Talking head video generation | TalkingHead-1KH | FID | 30.52 | 8 |
| Self-Reenactment | HDTF (test) | LPIPS | 0.2771 | 8 |
| Cross-identity reenactment | CelebV 30 | CSIM | 79.1 | 7 |
| Same-identity reconstruction | VoxCeleb1 (test) | L1 | 0.0445 | 7 |
| Cross-identity reenactment | HDTF | FVD | 134.9 | 6 |
| Talking head synthesis | Self-Collected Dataset (50 identities) | FID | 47.13 | 6 |
| Talking head synthesis | VFHQ (first 100 frames) | FID | 71.58 | 6 |