MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset

About

Recent studies in speech-driven 3D talking head generation have achieved convincing results in verbal articulation. However, lip-sync accuracy degrades when these models are applied to input speech in other languages, possibly due to the lack of datasets covering a broad spectrum of facial movements across languages. In this work, we introduce a novel task: generating 3D talking heads from speech in diverse languages. We collect a new multilingual 2D video dataset comprising over 420 hours of talking videos in 20 languages. Using this dataset, we present a multilingually enhanced model that incorporates language-specific style embeddings, enabling it to capture the unique mouth movements associated with each language. Additionally, we present a metric for assessing lip-sync accuracy in multilingual settings. We demonstrate that training a 3D talking head model with our proposed dataset significantly enhances its multilingual performance. Code and datasets are available at https://multi-talk.github.io/.

Kim Sung-Bin, Lee Chae-Yeon, Gihun Son, Oh Hyun-Bin, Janghoon Ju, Suekyeong Nam, Tae-Hyun Oh • 2024

Related benchmarks

Task	Dataset	Result
Talking head synthesis	Conver-3D YouTube (test)	FDD 24.42	9
3D Head Animation	CapTalkingHead (test)	LVE 12.13	8
3D talking head generation	Polyset Portuguese	LVE 0.317	7
3D talking head generation	Polyset Italian	LVE 0.312	7
3D talking head generation	Polyset Spanish	LVE 0.42	7
3D talking head generation	Polyset 20 Languages (multilingual)	LVE 0.286	7
3D talking head generation	Polyset English	LVE 0.439	7
3D talking head generation	Polyset Japanese	LVE 0.237	7
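
Most rows above report LVE (lip vertex error). This page does not define the metric, so the following is only a minimal sketch of how LVE is commonly computed in the 3D talking head literature: for each frame, take the largest error among the lip vertices, then average over frames. The function name, the `lip_idx` argument, and the `squared` switch are illustrative assumptions, not taken from the MultiTalk code; benchmarks also differ in units and scaling, so values from this sketch are not directly comparable to the numbers in the table.

```python
import numpy as np


def lip_vertex_error(pred, gt, lip_idx, squared=True):
    """Sketch of lip vertex error (LVE); hypothetical helper, not the authors' code.

    pred, gt : arrays of shape (frames, vertices, 3) holding mesh vertex positions.
    lip_idx  : indices of the vertices that belong to the lip region (assumed given).
    squared  : if True, use squared L2 distance per vertex (a common convention);
               if False, use plain L2 distance.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")

    # Per-vertex error restricted to the lip region: shape (frames, lip_vertices).
    diff = pred[:, lip_idx, :] - gt[:, lip_idx, :]
    err = np.sum(diff ** 2, axis=-1)
    if not squared:
        err = np.sqrt(err)

    # Maximum over lip vertices in each frame, then mean over frames.
    return float(err.max(axis=1).mean())


# Toy usage: 2 frames, 4 vertices, lip region = vertices 1 and 2.
gt = np.zeros((2, 4, 3))
pred = np.zeros((2, 4, 3))
pred[0, 1] = [3.0, 4.0, 0.0]    # lip vertex, distance 5 in frame 0
pred[1, 2] = [0.0, 0.0, 1.0]    # lip vertex, distance 1 in frame 1
pred[0, 0] = [100.0, 0.0, 0.0]  # non-lip vertex, ignored by the metric
lve_squared = lip_vertex_error(pred, gt, [1, 2])
lve_plain = lip_vertex_error(pred, gt, [1, 2], squared=False)
```

Taking the per-frame maximum rather than the mean makes the metric sensitive to the single worst lip vertex, which is why it is often used as a proxy for lip-sync quality.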

