
Learning Individual Styles of Conversational Gesture

About

Human speech is often accompanied by hand and arm gestures. Given audio speech input, we generate plausible gestures to go along with the sound. Specifically, we perform cross-modal translation from "in-the-wild" monologue speech of a single speaker to their hand and arm motion. We train on unlabeled videos for which we only have noisy pseudo ground truth from an automatic pose detection system. Our proposed model significantly outperforms baseline methods in a quantitative comparison. To support research toward obtaining a computational understanding of the relationship between gesture and speech, we release a large video dataset of person-specific gestures. The project website with video, code, and data can be found at http://people.eecs.berkeley.edu/~shiry/speech2gesture.
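The training setup described above, in brief: run an automatic pose detector over the training videos to obtain noisy pseudo ground-truth keypoints, then train a network to regress those keypoints from the speech audio. The sketch below illustrates that setup; it is not the released code, and the architecture, layer sizes, keypoint count, and loss are illustrative assumptions (the paper also trains with an adversarial loss, omitted here for brevity).

```python
# Minimal sketch of audio-to-gesture translation trained on pseudo ground
# truth from a pose detector. All names and sizes here are assumptions,
# not the authors' released implementation.
import torch
import torch.nn as nn

class AudioToGesture(nn.Module):
    """Translate a log-mel spectrogram (n_mels x T) to T pose vectors."""
    def __init__(self, n_mels=64, n_keypoints=49, hidden=256):
        super().__init__()
        # Temporal conv stack over audio frames (an assumed stand-in for
        # the paper's translation network).
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Regress (x, y) for each keypoint at every time step.
        self.head = nn.Conv1d(hidden, n_keypoints * 2, kernel_size=1)

    def forward(self, mel):               # mel: (B, n_mels, T)
        h = self.encoder(mel)             # (B, hidden, T)
        out = self.head(h)                # (B, 2*K, T)
        B, _, T = out.shape
        return out.view(B, -1, 2, T)      # (B, K, 2, T)

model = AudioToGesture()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Stand-in batch: in practice `pseudo_gt` would come from running an
# automatic pose detector on the training video frames.
mel = torch.randn(8, 64, 128)             # (batch, mel bins, audio frames)
pseudo_gt = torch.randn(8, 49, 2, 128)    # noisy detected keypoints

opt.zero_grad()
pred = model(mel)
loss = nn.functional.l1_loss(pred, pseudo_gt)  # L1 regression to noisy labels
loss.backward()
opt.step()
```

An L1 regression loss is a common choice when the targets are noisy detector output, since it penalizes outlier keypoints less harshly than L2.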

Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, Jitendra Malik · 2019

Related benchmarks

Task | Dataset | Result | Rank
Co-speech 3D Gesture Synthesis | BEAT2 (test) | FGD: 28.15 | 27
Gesture Generation | BEAT-2 (test) | BC: 4.683 | 22
Co-Speech Gesture Video Generation | PATS (test) | Diversity: 2.49 | 22
Gesture Generation | BEAT2 | FGD: 28.15 | 17
Co-speech motion generation | BEATX (test) | FGD: 25.129 | 16
3D co-speech gesture generation | BEAT-ETrans (test) | FGD (h+t): 25.56 | 14
3D co-speech gesture generation | TED-ETrans (test) | FGD (h+t): 18.16 | 14
Speech to gesture translation | Speech2Gesture 1.0 (test) | Fooled Rate (%): 19.8 | 12
Co-speech gesture generation | BEATX Standard (test) | FGD: 25.129 | 11
Speech-driven gesture generation | BEAT-X | FGD: 28.15 | 11
Showing 10 of 34 rows
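Most rows above report FGD (Fréchet Gesture Distance), which, like FID for images, is the Fréchet distance between Gaussians fit to embeddings of real and generated motion. Below is a minimal sketch of that computation, assuming the embeddings come from some pretrained gesture feature extractor; the extractor itself differs per benchmark and is not shown here.

```python
# Frechet distance between Gaussians fit to feature sets, as used by
# FGD-style metrics. The random inputs below are placeholders for real
# gesture embeddings from a pretrained feature extractor (an assumption).
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """feats_*: (N, D) arrays of gesture embeddings."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; drop the tiny
    # imaginary component that numerical error can introduce.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

# Toy usage with random embeddings; a real evaluation would first embed
# the pose sequences with the benchmark's feature extractor.
real = np.random.randn(500, 32)
gen = np.random.randn(500, 32) + 0.1
print(frechet_distance(real, gen))
```

Lower FGD means the generated motion's feature distribution is closer to the real one, which is why the benchmark ranks in the table favor smaller values.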

Other info

Code: released together with the dataset via the project website, http://people.eecs.berkeley.edu/~shiry/speech2gesture
