SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis

About

Good co-speech motion generation requires carefully integrating common rhythmic motion with rare yet essential semantic motion. In this work, we propose SemTalk for holistic co-speech motion generation with frame-level semantic emphasis. Our key insight is to learn base motions and sparse motions separately, and then fuse them adaptively. In particular, a coarse2fine cross-attention module and rhythmic consistency learning are explored to establish rhythm-related base motion, ensuring a coherent foundation that synchronizes gestures with the speech rhythm. Subsequently, semantic emphasis learning is designed to generate semantic-aware sparse motion, focusing on frame-level semantic cues. Finally, to integrate sparse motion into the base motion and generate semantic-emphasized co-speech gestures, we further leverage a learned semantic score for adaptive synthesis. Qualitative and quantitative comparisons on two public datasets demonstrate that our method outperforms the state of the art, delivering high-quality co-speech motion with enhanced semantic richness over a stable base motion.
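To make the fusion idea concrete, the sketch below blends a sparse motion sequence into a base motion sequence using one score per frame. It is only an illustration under stated assumptions, not the SemTalk implementation: the abstract does not specify the fusion rule, so the convex blend, the assumption that scores lie in [0, 1], the pose-vector layout, and the name `fuse_motion` are all hypothetical.

```python
from typing import List, Sequence

Frame = List[float]  # one pose vector per frame (hypothetical layout)


def fuse_motion(
    base: Sequence[Sequence[float]],
    sparse: Sequence[Sequence[float]],
    scores: Sequence[float],
) -> List[Frame]:
    """Blend sparse motion into base motion, frame by frame.

    Assumption: each score acts as a per-frame blend weight in [0, 1].
    A score of 0 keeps the base pose; a score of 1 uses the sparse pose.
    """
    if not (len(base) == len(sparse) == len(scores)):
        raise ValueError("base, sparse and scores must cover the same frames")

    fused: List[Frame] = []
    for base_pose, sparse_pose, weight in zip(base, sparse, scores):
        if len(base_pose) != len(sparse_pose):
            raise ValueError("pose vectors must have the same dimension")
        # Clamp so that out-of-range scores cannot extrapolate the pose.
        w = min(1.0, max(0.0, weight))
        fused.append(
            [(1.0 - w) * b + w * s for b, s in zip(base_pose, sparse_pose)]
        )
    return fused


# Toy example: three frames, two-dimensional poses.
base = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
sparse = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
scores = [0.0, 0.5, 1.0]
fused = fuse_motion(base, sparse, scores)
```

In this toy example the first frame stays on the base pose, the last frame takes the sparse pose entirely, and the middle frame lands halfway between the two. A hard threshold on the score would be another plausible choice; the soft blend is used here only because it is the simplest rule to state.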

Xiangyue Zhang, Jianfang Li, Jiaxu Zhang, Ziqiang Dang, Jianqiang Ren, Liefeng Bo, Zhigang Tu • 2024

Related benchmarks

Task	Dataset	Result
Co-speech gesture generation	BEAT All Speakers 2	BC 0.727	31
Co-speech 3D Gesture Synthesis	BEAT2 (test)	FGD 4.278	27
Speech-driven gesture generation	BEAT2 (Seen Speakers)	FGD 0.428	18
Gesture Generation	BEAT2	FGD 4.278	17
Co-speech gesture generation	BEAT One Speaker v2 (Speaker 2)	FGD 4.278	12
Co-speech motion generation	SHOW v1 (test)	FGD 20.18	8
Co-speech gesture generation	SHOW 4 speakers	FGD 20.17	6
Co-speech gesture generation	BEAT2 All 25 Speakers	FGD 5.214	5
Speech-driven gesture generation	BEAT2 (Unseen Speakers)	FGD 5.687	4

