CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior
About
Speech-driven 3D facial animation has been widely studied, yet a gap to realism and vividness remains due to the highly ill-posed nature of the task and the scarcity of audio-visual data. Existing works typically formulate the cross-modal mapping as a regression task, which suffers from the regression-to-mean problem and leads to over-smoothed facial motions. In this paper, we propose to cast speech-driven facial animation as a code query task in a finite proxy space of a learned codebook, which effectively promotes the vividness of the generated motions by reducing the cross-modal mapping uncertainty. The codebook is learned by self-reconstruction over real facial motions and is thus embedded with realistic facial motion priors. Over this discrete motion space, a temporal autoregressive model sequentially synthesizes facial motions from the input speech signal, which guarantees lip-sync as well as plausible facial expressions. We demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. A user study further confirms the superiority of our method in perceptual quality.
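To illustrate the core "code query" idea, here is a minimal NumPy sketch of quantizing continuous per-frame motion features to their nearest entries in a discrete codebook. All names, shapes, and values are illustrative assumptions, not taken from the paper's actual code; the real model learns the codebook via self-reconstruction (VQ-style) and predicts code indices autoregressively from speech.

```python
import numpy as np

def quantize(features, codebook):
    """Map each continuous motion feature to its nearest codebook entry.

    features: (T, D) array of per-frame motion features (hypothetical shapes).
    codebook: (K, D) array of discrete motion codes.
    Returns the per-frame code indices and the quantized features.
    """
    # Squared Euclidean distance between every frame and every code: (T, K).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # The "code query": each frame picks its nearest discrete code.
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

# Toy example: 4 frames of 2-D features, a 3-entry codebook.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
features = np.array([[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.1, 0.1]])
indices, quantized = quantize(features, codebook)
print(indices.tolist())  # → [0, 1, 2, 0]
```

Restricting outputs to this finite set of codes is what counteracts the regression-to-mean problem: the decoder can only emit motions that lie on the learned prior, never an averaged in-between.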
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| 3D talking head generation | DualTalk (test) | FD (Expression): 48.57 | 34 |
| Co-speech 3D Gesture Synthesis | BEAT2 (test) | -- | 27 |
| 3D talking head generation | DualTalk OOD set | FD (EXP): 50.05 | 26 |
| 3D Talking Face Generation | BIWI A (test) | LVE: 4.7914 | 16 |
| Speech-driven gesture generation | BEAT-X | -- | 11 |
| Speech-Driven Facial Animation | BIWI B (test) | Lip Sync: 92.47 | 10 |
| Speech-Driven Facial Animation | VOCA (test) | Lip Sync: 95.7 | 10 |
| 3D talking head animation | VOCASET (test) | LVE (x10^-5 mm): 3.9445 | 10 |
| Talking head synthesis | Conver-3D YouTube (test) | FDD: 17.72 | 9 |
| Speech-Driven Facial Animation | Hybrid Audio Vocaset, LJSpeech, and FaceTalk 1.0 (test) | LSE-D: 11.8054 | 8 |