MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling

About

Real-time video dubbing that preserves identity consistency while achieving accurate lip synchronization remains a critical challenge. Existing approaches face a trilemma: diffusion-based methods achieve high visual fidelity but suffer from prohibitive computational costs, while GAN-based solutions sacrifice lip-sync accuracy or dental details for real-time performance. We present MuseTalk, a novel two-stage training framework that resolves this trade-off through latent space optimization and a spatio-temporal data sampling strategy. Our key innovations include: (1) During the Facial Abstract Pretraining stage, we propose Informative Frame Sampling to temporally align reference-source pose pairs, eliminating redundant feature interference while preserving identity cues. (2) In the Lip-Sync Adversarial Finetuning stage, we employ Dynamic Margin Sampling to spatially select the most suitable lip-movement-promoting regions, balancing audio-visual synchronization and dental clarity. (3) MuseTalk establishes an effective audio-visual feature fusion framework in the latent space, delivering 30 FPS output at 256×256 resolution on an NVIDIA V100 GPU. Extensive experiments demonstrate that MuseTalk outperforms state-of-the-art methods in visual fidelity while achieving comparable lip-sync accuracy. The code is available at https://github.com/TMElyralab/MuseTalk.
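
To make the idea of pairing a reference frame with a source frame by pose more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how Informative Frame Sampling scores frames. The function name `pick_reference_frame`, the (yaw, pitch, roll) pose tuples, the Euclidean distance, and the `min_gap` parameter are all assumptions made for illustration only.

```python
import math


def pick_reference_frame(poses, target_idx, min_gap=1):
    """Return the index of the frame whose head pose is closest to the target's.

    Hypothetical helper, not taken from the MuseTalk code base.
    poses: list of (yaw, pitch, roll) tuples, one per video frame (assumed input).
    min_gap: frames fewer than this many steps from the target are skipped,
             so the target itself (and, for larger values, its near-duplicate
             neighbours) cannot be chosen as the reference.
    Returns None when no frame satisfies the gap constraint.
    """
    target = poses[target_idx]
    best_idx, best_dist = None, math.inf
    for i, pose in enumerate(poses):
        if abs(i - target_idx) < min_gap:
            continue  # too close in time to the target frame
        d = math.dist(pose, target)  # Euclidean distance between pose vectors
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx


# Toy example with four frames; the target is frame 0.
poses = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (1.0, 0.0, 0.0), (30.0, 5.0, 0.0)]
reference = pick_reference_frame(poses, target_idx=0)
```

In this toy example, frame 2 has the pose nearest to frame 0, so it is selected. A real pipeline would presumably also consider lip or identity cues, which the abstract mentions but does not detail.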

Yue Zhang, Zhizhou Zhong, Minhao Liu, Zhaokang Chen, Bin Wu, Yubin Zeng, Chao Zhan, Yingjie He, Junxin Huang, Wenjiang Zhou • 2024

Related benchmarks

Task                        Dataset                   Result                  Rank
Video-to-Video lip-syncing  TalkVid Self-Reenactment  FID 47.78               9
Lip synchronization         HDTF                      FID 8.759               8
Lip synchronization         AIGC-LipSync              FID 17.668              8
Talking Head Generation     RealWorld-LipSync         FID 16.894              7
Talking Head Generation     Talk9                     Sync-C 5.586            7
Video-to-Video lip-syncing  TalkVid Novel Audio       FID 49.59               4
Video Dubbing               RealWorld-LipSync         Lip Sync Accuracy 2.44  4