ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model

About

Speech-driven 3D facial animation aims to generate realistic lip movements and facial expressions for 3D head models from arbitrary audio clips. Although existing diffusion-based methods are capable of producing natural motions, their slow generation speed limits their application potential. In this paper, we introduce a novel autoregressive model that achieves real-time generation of highly synchronized lip movements and realistic head poses and eye blinks by learning a mapping from speech to a multi-scale motion codebook. Furthermore, our model can adapt to unseen speaking styles, enabling the creation of 3D talking avatars with unique personal styles beyond the identities seen during training. Extensive evaluations and user studies demonstrate that our method outperforms existing approaches in lip synchronization accuracy and perceived quality.

Xuangeng Chu, Nabarun Goswami, Ziteng Cui, Hanqin Wang, Tatsuya Harada • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Speaking facial motion generation | Seamless Interaction (test) | LVE 6.79 | 13 |
| Listening facial motion generation | Seamless Interaction (test) | FDD 30.62 | 9 |
| Talking head synthesis | Conver-3D YouTube (test) | FDD 11.6 | 9 |
| 3D Head Animation | CapTalkingHead (test) | LVE 7.71 | 8 |
| 3D mesh modeling | MANGO-Dialog (test) | LVE 2.452 | 6 |
| 3D mesh modeling | DualTalk (test) | LVE (Error) 2.368 | 6 |
| Speaking Head Motion Generation | Seamless Interaction Dataset | LVE 6.79 | 6 |
| Speech-driven 3D Facial Animation | VOCASET | LVE 9.78 | 4 |
| 2D Image Generation | MANGO-Dialog (test) | PSNR 25.43 | 4 |
| Listening Head Motion Generation | Seamless Interaction Dataset | FDD 30.62 | 4 |
Showing 10 of 12 rows
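
This page does not define LVE. In the speech-driven 3D facial animation literature it usually stands for lip vertex error: for each frame, take the largest distance between predicted and ground-truth lip vertices, then average over frames. The sketch below illustrates that common definition only. The function name, the data layout, and `lip_indices` are assumptions made for the example, and individual benchmarks may use squared distances or different units and scaling, so the values in the table above should not be expected to match this exact formula.

```python
import math

def lip_vertex_error(pred, target, lip_indices):
    """Mean over frames of the maximum lip-vertex distance (illustrative only).

    pred, target: sequences of frames; each frame is a sequence of (x, y, z)
                  vertex positions with identical vertex ordering.
    lip_indices:  hypothetical list of vertex indices belonging to the lips.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and the same length")

    per_frame_max = []
    for pred_frame, target_frame in zip(pred, target):
        # Euclidean distance for every lip vertex in this frame; keep the worst one.
        worst = max(math.dist(pred_frame[i], target_frame[i]) for i in lip_indices)
        per_frame_max.append(worst)

    # Average the per-frame maxima over the whole sequence.
    return sum(per_frame_max) / len(per_frame_max)


# Toy data: 2 frames, 3 vertices; vertices 0 and 1 are treated as lip vertices.
target_seq = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 5.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 5.0, 0.0)],
]
pred_seq = [
    [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0), (0.0, 105.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (0.0, 5.0, 0.0)],
]
toy_lve = lip_vertex_error(pred_seq, target_seq, lip_indices=[0, 1])
```

Vertex 2 is deliberately far off in the first frame to show that non-lip vertices do not affect the score. Because the per-frame maximum is used, a single badly placed lip vertex dominates that frame's contribution.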
