
GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling

About

Generating full-body human gestures from speech signals remains challenging in both quality and speed. Existing approaches model different body regions, such as the body, legs, and hands, separately, which fails to capture the spatial interactions between them and results in unnatural, disjointed movements. In addition, their autoregressive or diffusion-based pipelines generate slowly because they require dozens of inference steps. To address these two challenges, we propose GestureLSM, a flow-matching-based approach for co-speech gesture generation with spatial-temporal modeling. Our method i) explicitly models the interaction of tokenized body regions through spatial and temporal attention to generate coherent full-body gestures, and ii) introduces flow matching to enable more efficient sampling by explicitly modeling the latent velocity space. To overcome the suboptimal performance of the flow matching baseline, we propose latent shortcut learning and beta-distribution timestamp sampling during training to enhance gesture synthesis quality and accelerate inference. Combining spatial-temporal modeling with the improved flow-matching-based framework, GestureLSM achieves state-of-the-art performance on BEAT2 while significantly reducing inference time compared to existing methods, highlighting its potential for enhancing digital humans and embodied agents in real-world applications. Project Page: https://andypinxinliu.github.io/GestureLSM
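
The flow matching idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the Beta parameters (2.0 and 1.2), and the straight-line noise-to-data path are assumptions chosen for illustration, and the latent shortcut objective is not covered. The sketch shows how a training time can be drawn from a Beta distribution, how the velocity regression target is formed, and how a few Euler steps integrate a velocity field at inference.

```python
import random


def sample_timestep(alpha=2.0, beta=1.2, rng=random):
    """Draw a training time t in [0, 1] from Beta(alpha, beta).

    The parameter values are illustrative assumptions, not the paper's.
    """
    return rng.betavariate(alpha, beta)


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Constant velocity of the straight path, used as the regression target."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted_velocity, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(x0, x1)
    squared = [(p - v) ** 2 for p, v in zip(predicted_velocity, target)]
    return sum(squared) / len(squared)


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise = [0.0, 1.0]
    latent = [2.0, -1.0]
    t = sample_timestep()
    x_t = interpolate(noise, latent, t)
    # A perfect model would predict the target velocity, giving zero loss.
    print(t, x_t, flow_matching_loss(target_velocity(noise, latent), noise, latent))
    # With an oracle velocity field, a handful of Euler steps reaches the data.
    print(euler_sample(lambda x, t: target_velocity(noise, latent), noise, 4))
```

In a real system the velocity would come from a learned network conditioned on speech, and the number of Euler steps controls the trade-off between inference speed and sample quality.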

Pinxin Liu, Luchuan Song, Junhua Huang, Haiyang Liu, Chenliang Xu • 2025

Related benchmarks

Task                           Dataset                       Result               Rank
Gesture Generation             BEAT-2 (test)                 BC 0.714             22
Co-speech gesture generation   BEAT-2 (1 Speaker)            BC 0.714             17
Co-speech gesture generation   BEAT All Speakers 2           BC 0.525             16
Gesture Generation             BEAT (test)                   BC 72.9              12
Co-speech gesture generation   BEAT 2 (Novel 5 speakers)     BC 62.1              6
Gesture Generation             Audio2PhotoReal               Diversity 2.34       5
Gesture Generation             BEAT 2                        MOS (Realness) 3.43  4
