FaceFormer: Speech-Driven 3D Facial Animation with Transformers
About
Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data. Prior works typically focus on learning phoneme-level features of short audio windows with limited context, occasionally resulting in inaccurate lip movements. To tackle this limitation, we propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes. To cope with the data scarcity issue, we integrate self-supervised pre-trained speech representations. We also devise two biased attention mechanisms well suited to this task: a biased cross-modal multi-head (MH) attention and a biased causal MH self-attention with a periodic positional encoding strategy. The former effectively aligns the audio and motion modalities, whereas the latter enables generalization to longer audio sequences. Extensive experiments and a perceptual user study show that our approach outperforms existing state-of-the-art methods. The code will be made available.
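The two ingredients named above can be illustrated with a minimal NumPy sketch. This is a hypothetical reconstruction for intuition only, not the paper's implementation: the period value, the exact bias formula, and the function names are assumptions. The idea is that the positional encoding repeats with a fixed period (so the model sees the same positional pattern however long the sequence grows), and the causal self-attention adds a penalty that grows with the number of whole periods separating two frames.

```python
import numpy as np

def periodic_positional_encoding(seq_len, d_model, period=25):
    """Standard sinusoidal encoding computed over one period, then tiled
    along the time axis (sketch; period=25 is an assumed hyperparameter)."""
    pos = np.arange(period)[:, None]
    div = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe = np.zeros((period, d_model))
    pe[:, 0::2] = np.sin(pos * div)
    pe[:, 1::2] = np.cos(pos * div)
    reps = -(-seq_len // period)  # ceiling division
    return np.tile(pe, (reps, 1))[:seq_len]

def biased_causal_mask(seq_len, period=25):
    """Additive attention bias (sketch): frames in the same period get no
    penalty, and each extra period into the past subtracts one more unit;
    future frames are masked out entirely (causality)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    bias = -np.floor((i - j) / period)
    bias[j > i] = -np.inf  # no attention to future frames
    return bias
```

The periodicity is what lets the model run on audio clips longer than any training sequence: position `t` and position `t + period` receive identical encodings, so attention scores stay in a range the model has seen during training.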
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D talking head generation | DualTalk (test) | FD (Expression) | 34.9 | 34 |
| Talking Face Generation | LRW (test) | SSIM | 0.856 | 28 |
| Co-speech 3D Gesture Synthesis | BEAT2 (test) | -- | -- | 27 |
| 3D talking head generation | DualTalk OOD set | FD (EXP) | 35.92 | 26 |
| Talking Face Generation | LRS2 (test) | SSIM | 0.84 | 18 |
| 3D Talking Face Generation | BIWI A (test) | LVE | 5.3077 | 16 |
| Speech-driven gesture generation | BEAT-X | -- | -- | 11 |
| 3D facial animation generation | BIWI (test) | Mean Vertex Error | 5.95 | 10 |
| 3D talking head animation | VOCASET (test) | LVE (x10^-5 mm) | 4.109 | 10 |
| Speech-Driven Facial Animation | BIWI B (test) | Lip Sync | 34.4 | 10 |