Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction

About

Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, weakening synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences in a discrete autoregressive backbone, with modality-specific decoders. Gelina supports multi-speaker and multi-style cloning and enables gesture-only synthesis from speech inputs. Subjective and objective evaluations demonstrate competitive speech quality and improved gesture generation over unimodal baselines.
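
The abstract does not specify how the speech and gesture tokens are interleaved, so the following is only a minimal sketch of the general idea, not Gelina's actual scheme. It assumes fixed-size alternating blocks and a shared vocabulary in which gesture token ids are shifted by a constant offset; the function names and the parameters `speech_per_block`, `gesture_per_block`, and `gesture_offset` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def interleave_tokens(
    speech: List[int],
    gesture: List[int],
    speech_per_block: int,
    gesture_per_block: int,
    gesture_offset: int,
) -> List[int]:
    """Merge two discrete token streams into one sequence of alternating blocks.

    Assumption (not from the paper): every speech token id is smaller than
    gesture_offset, so shifting gesture ids by gesture_offset keeps the two
    modalities distinguishable inside a single shared vocabulary.
    """
    merged: List[int] = []
    s = g = 0
    while s < len(speech) or g < len(gesture):
        # Emit one block of speech tokens, then one block of gesture tokens.
        merged.extend(speech[s:s + speech_per_block])
        s += speech_per_block
        merged.extend(t + gesture_offset for t in gesture[g:g + gesture_per_block])
        g += gesture_per_block
    return merged


def split_tokens(merged: List[int], gesture_offset: int) -> Tuple[List[int], List[int]]:
    """Undo the interleaving so each stream can go to its own decoder."""
    speech = [t for t in merged if t < gesture_offset]
    gesture = [t - gesture_offset for t in merged if t >= gesture_offset]
    return speech, gesture


if __name__ == "__main__":
    sequence = interleave_tokens([1, 2, 3, 4], [0, 1], 2, 1, gesture_offset=1000)
    print(sequence)
    print(split_tokens(sequence, gesture_offset=1000))
```

In a setup like this, a single autoregressive model would predict the merged sequence token by token, and the two recovered streams would then be passed to separate speech and gesture decoders, in line with the "modality-specific decoders" the abstract mentions.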

Téo Guichoux, Théodor Lemerle, Shivam Mehta, Jonas Beskow, Gustav Eje Henter, Laure Soulier, Catherine Pelachaud, Nicolas Obin • 2025

Related benchmarks

Task | Dataset | Result | Rank
Co-speech gesture generation | BEAT All Speakers 2 | BC 0.824 | 31
