Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots
About
Co-speech gestures enhance interaction experiences between humans as well as between humans and robots. Existing robots use rule-based speech-gesture associations, but implementing these requires human labor and expert prior knowledge. We present a learning-based co-speech gesture generation model trained on 52 hours of TED talks. The proposed end-to-end neural network consists of an encoder for speech text understanding and a decoder that generates a sequence of gestures. The model successfully produces various gestures, including iconic, metaphoric, deictic, and beat gestures. In a subjective evaluation, participants reported that the generated gestures were human-like and matched the speech content. We also demonstrate co-speech gesture generation running in real time on a NAO robot.
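To make the encoder-decoder idea concrete, here is a minimal PyTorch sketch of a text-to-gesture seq2seq model: a recurrent encoder reads the word sequence and a recurrent decoder autoregressively emits pose frames. The layer sizes, the GRU choice, and the 20-dimensional pose vector are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TextToGesture(nn.Module):
    """Minimal seq2seq sketch: encode a word sequence, decode a pose sequence.

    Sizes and pose representation are hypothetical; the paper's exact
    architecture may differ.
    """

    def __init__(self, vocab_size, emb_dim=300, hidden=200, pose_dim=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional recurrent encoder over the speech text.
        self.encoder = nn.GRU(emb_dim, hidden, bidirectional=True, batch_first=True)
        # Recurrent decoder emits one pose frame per step.
        self.decoder = nn.GRU(pose_dim, 2 * hidden, batch_first=True)
        self.to_pose = nn.Linear(2 * hidden, pose_dim)

    def forward(self, words, n_frames):
        # words: (batch, n_words) token ids
        _, h = self.encoder(self.embed(words))               # h: (2, batch, hidden)
        h = h.transpose(0, 1).reshape(1, words.size(0), -1)  # merge both directions
        frame = torch.zeros(words.size(0), 1, self.to_pose.out_features)
        poses = []
        for _ in range(n_frames):
            out, h = self.decoder(frame, h)
            frame = self.to_pose(out)   # autoregressive: feed the pose back in
            poses.append(frame)
        return torch.cat(poses, dim=1)  # (batch, n_frames, pose_dim)

# Usage: generate 30 pose frames for a 12-word utterance (random ids for illustration).
model = TextToGesture(vocab_size=20000)
poses = model(torch.randint(0, 20000, (1, 12)), n_frames=30)
print(poses.shape)  # torch.Size([1, 30, 20])
```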
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| 3D co-speech gesture generation | BEAT-ETrans (test) | FGD (h+t) | 40.95 | 14 |
| 3D co-speech gesture generation | TED-ETrans (test) | FGD (h+t) | 29.6 | 14 |
| Co-speech gesture synthesis | TED (test) | FGD | 18.154 | 9 |
| Gesture Generation | BEAT official recomputed (test) | Hellinger Distance (Avg) | 0.146 | 7 |
| Co-speech gesture generation | TED Gesture | FGD | 18.154 | 7 |
| Co-speech gesture generation | TED Gesture & TED Expressive User Study (test) | Naturalness | 1.22 | 7 |
| Co-speech gesture generation | TED Expressive | FGD | 54.92 | 7 |
| Gesture Synthesis | TED Gesture (test) | MAJE | 45.62 | 7 |
| Gesture Synthesis | BEAT (Body-Expression-Audio-Text) 1.0 (test) | FGD | 261.3 | 7 |
| Speech-driven gesture generation | BEAT (test) | Global CCA | 42.9 | 7 |
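Most rows above report FGD (Fréchet Gesture Distance, lower is better): the Fréchet distance between Gaussians fitted to latent features of real and generated motion, where the features come from a pretrained motion feature extractor. Below is a minimal sketch of the Gaussian-Fréchet computation itself, assuming the feature matrices have already been extracted; the feature extractor is not shown, and the random inputs are purely illustrative.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats, gen_feats):
    """Frechet distance between Gaussians fitted to two feature sets.

    real_feats, gen_feats: (n_samples, feat_dim) latent codes of real and
    generated gesture clips, assumed to come from a pretrained motion
    feature extractor (not shown here).
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the covariance product; take the real part to
    # drop tiny imaginary components introduced by numerical error.
    covmean = linalg.sqrtm(cov_r @ cov_g).real
    diff = mu_r - mu_g
    return diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean)

# Illustration with random features; real evaluations use learned motion codes.
rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(500, 32)),
                       rng.normal(0.5, 1.0, size=(500, 32))))
```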