Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion

About

We present a framework for modeling interactional communication in dyadic conversations: given multimodal inputs of a speaker, we autoregressively output multiple possibilities of corresponding listener motion. We combine the motion and speech audio of the speaker using a motion-audio cross attention transformer. Furthermore, we enable non-deterministic prediction by learning a discrete latent representation of realistic listener motion with a novel motion-encoding VQ-VAE. Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions. Moreover, it produces realistic 3D listener facial motion synchronous with the speaker (see video). We demonstrate that our method outperforms baselines qualitatively and quantitatively via a rich suite of experiments. To facilitate this line of research, we introduce a novel and large in-the-wild dataset of dyadic conversations. Code, data, and videos available at https://evonneng.github.io/learning2listen/.
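The abstract names two core components: a cross-attention transformer that fuses speaker motion with speech audio, and a VQ-VAE whose discrete codebook makes autoregressive listener-motion prediction non-deterministic. The sketch below illustrates both ideas in PyTorch; the class names, layer sizes, single-layer design, and omission of the VQ-VAE training losses (codebook and commitment terms) are illustrative assumptions, not the authors' implementation, which is available at the project page above.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuses speaker motion features with speech-audio features via
    cross attention: motion queries attend over the audio sequence."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, motion_feats, audio_feats):
        # motion_feats: (batch, T_motion, dim); audio_feats: (batch, T_audio, dim)
        fused, _ = self.attn(motion_feats, audio_feats, audio_feats)
        return self.norm(motion_feats + fused)

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with straight-through gradients,
    the core discretization step of a VQ-VAE."""

    def __init__(self, num_codes=256, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (batch, time, dim) continuous motion encodings.
        dists = torch.cdist(z, self.codebook.weight[None].expand(z.size(0), -1, -1))
        idx = dists.argmin(-1)        # discrete code indices per timestep
        z_q = self.codebook(idx)      # quantized vectors
        z_q = z + (z_q - z).detach()  # straight-through gradient estimator
        return z_q, idx

if __name__ == "__main__":
    fusion, vq = CrossModalFusion(), VectorQuantizer()
    motion = torch.randn(2, 64, 128)   # stand-in speaker 3D motion features
    audio = torch.randn(2, 256, 128)   # stand-in speaker audio features
    z_q, idx = vq(fusion(motion, audio))
    print(z_q.shape, idx.shape)        # (2, 64, 128), (2, 64)

Sampling from a distribution over the discrete code indices, rather than regressing a single continuous trajectory, is what lets a model of this kind emit multiple plausible listener responses to the same speaker input.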

Evonne Ng, Hanbyul Joo, Liwen Hu, Hao Li, Trevor Darrell, Angjoo Kanazawa, Shiry Ginosar • 2022

Related benchmarks

Task                               | Dataset                             | Metric                        | Result | Rank
3D talking head generation         | DualTalk (test)                     | FD (Expression)               | 24.61  | 34
3D talking head generation         | DualTalk OOD set                    | FD (Expression)               | 30.49  | 26
Speaking facial motion generation  | Seamless Interaction (test)         | LVE                           | 3.4    | 13
Speech-driven 3D facial animation  | 3D Face-to-Face Interaction Dataset | Facial Dynamics Distance (FD) | 38.92  | 11
Listening facial motion generation | Seamless Interaction (test)         | FDD                           | 44.77  | 9
Listening head generation          | ViCo                                | FD (Expression)               | 33.93  | 8
Listener facial motion generation  | ViCo (test)                         | FD (Expression)               | 33.93  | 7
Listening head generation          | ViCo (test)                         | FD (Expression)               | 33.93  | 6
Speaking head motion generation    | Seamless Interaction Dataset        | LVE                           | 3.1    | 6
Audio-driven facial animation      | ViCo                                | Lip Sync Acc                  | 3.872  | 5

Showing 10 of 21 rows.
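Several entries above report Fréchet-style distances (FD, FDD) between the distributions of generated and ground-truth expression features, where lower is better. As a reference, here is a minimal NumPy/SciPy sketch of the standard Fréchet distance between Gaussian fits of two feature sets; the exact feature space, dimensionality, and preprocessing are defined per benchmark and are assumed to have happened upstream.

import numpy as np
from scipy import linalg

def frechet_distance(real_feats, gen_feats):
    """Fréchet distance between Gaussian fits of two feature sets.

    real_feats, gen_feats: (num_frames, num_coeffs) arrays, e.g.
    expression coefficients; feature extraction is benchmark-specific.
    """
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the covariance product; tiny imaginary
    # parts arising from numerical error are discarded.
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 50))
    gen = rng.normal(loc=0.1, size=(500, 50))
    print(frechet_distance(real, gen))  # small but non-zero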
