DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation
About
We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation of arbitrary length. While previous works focused on co-speech gesture or expression generation individually, the joint generation of synchronized expressions and gestures remains barely explored. To address this, our diffusion-based co-speech motion generation transformer enables uni-directional information flow from expression to gesture, which helps the model better match the joint expression-gesture distribution. Furthermore, we introduce an outpainting-based sampling strategy for arbitrarily long sequence generation in diffusion models, offering flexibility and computational efficiency. Our method provides a practical solution that produces high-quality, synchronized expressions and gestures driven by speech. Evaluated on two public datasets, our approach achieves state-of-the-art performance both quantitatively and qualitatively. Additionally, a user study confirms the superiority of DiffSHEG over prior approaches. By enabling real-time generation of expressive and synchronized motions, DiffSHEG shows potential for applications in the development of digital humans and embodied agents.
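The abstract does not spell out how the outpainting-based sampling works, so the following is only a rough sketch of one common way to stitch fixed-length diffusion clips into a longer sequence: each new clip overlaps the previous one by a few frames, and those overlapping frames are held fixed while the rest of the clip is denoised. Everything here is an assumption made for illustration, not the authors' implementation: the names `denoise_step`, `sample_clip`, and `generate_long`, the clip and overlap sizes, and the toy denoiser that stands in for a trained model.

```python
import numpy as np


def denoise_step(x, t, audio):
    """Hypothetical stand-in for a trained denoiser.

    It only shrinks the sample toward zero. A real model would attend
    across frames and to the audio, which is what makes newly generated
    frames consistent with the frames held fixed below.
    """
    return x * 0.9


def sample_clip(audio, known, num_steps, rng):
    """Sample one clip whose first len(known) frames are held fixed."""
    n_frames, dim = audio.shape[0], known.shape[1]
    k = known.shape[0]
    x = rng.standard_normal((n_frames, dim))  # start from pure noise
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t, audio)
        if k > 0:
            # Simplification: inpainting-style samplers usually noise the
            # known frames to level t; here they are pasted in as-is.
            x[:k] = known
    return x


def generate_long(audio, clip_len, overlap, dim, num_steps=10, seed=0):
    """Generate motion for the whole audio by outpainting clip by clip."""
    assert 0 <= overlap < clip_len
    rng = np.random.default_rng(seed)
    total = audio.shape[0]
    motion = np.zeros((0, dim))
    while motion.shape[0] < total:
        k = min(overlap, motion.shape[0])    # no overlap for the first clip
        start = motion.shape[0] - k
        end = min(start + clip_len, total)   # last clip may be shorter
        known = motion[start:]               # last k frames already generated
        clip = sample_clip(audio[start:end], known, num_steps, rng)
        motion = np.concatenate([motion, clip[k:]], axis=0)  # keep new frames only
    return motion


if __name__ == "__main__":
    audio_features = np.zeros((100, 8))  # placeholder per-frame audio features
    motion = generate_long(audio_features, clip_len=34, overlap=4, dim=6)
    print(motion.shape)
```

In this sketch each clip after the first contributes `clip_len - overlap` new frames, so the number of sampling passes grows linearly with sequence length while the model only ever sees windows of at most `clip_len` frames.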
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Co-speech 3D Gesture Synthesis | BEAT2 (test) | FGD 8.986 | 27 |
| Gesture Generation | BEAT-2 (test) | BC 0.743 | 22 |
| Gesture Generation | BEAT2 | FGD 8.986 | 17 |
| Gesture Generation | BEAT (test) | BC 74.3 | 12 |
| 3D Gesture Motion Generation | BEAT-X | BC 0.743 | 10 |
| Speech-driven Holistic Expression and Gesture Generation | BEAT 2022 (test) | FMD 324.7 | 9 |
| Speech-driven Holistic Expression and Gesture Generation | SHOW 2023 (test) | FMD 0.0018 | 8 |
| Co-speech motion generation | SHOW v1 (test) | FGD 24.87 | 8 |
| Co-speech gesture generation | BEAT 2 | Naturalness 8.33 | 8 |
| Speech-driven gesture generation | BEAT (test) | -- | 7 |