HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis

About

Audio-driven talking head generation faces a fundamental trade-off between personalization and generalization, which limits its practical application. Implicit models often achieve generalization at the cost of structural incoherence, resulting in unstable head motion and inaccurate lip synchronization. Explicit methods incorporate geometric and anatomical priors such as 3D Morphable Models (3DMMs), which parameterize facial geometry, or Action Units (AUs), which code facial muscle movements, but they tend to produce overly neutral expressions or suffer from limited generalization. To resolve this conflict, we present HM-Talker, an audio-driven talking head framework that integrates explicit articulatory cues with implicit prosodic features to characterize identity-specific dynamics while enabling audio-driven generalization. Its distinctive features are: i) the Cross-Modal Mapping Module (CMMM), which extracts a comprehensive vocabulary of motion cues from audio and video, and ii) the Hybrid Motion Modeling Module (HMMM), which employs a Stochastic Feature Pairing (SFP) strategy to dynamically merge paired implicit and explicit features for motion synthesis. This design facilitates an iterative optimization of lower-face motion, alternating between identity-specific and identity-agnostic (audio-only) objectives. Extensive experiments demonstrate that HM-Talker outperforms state-of-the-art methods in both visual realism and lip-sync accuracy across diverse settings.
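
The abstract does not specify how Stochastic Feature Pairing works, so the following is only one possible reading, not the authors' implementation. It assumes that at each training step the implicit (audio-derived) feature is either merged with its paired explicit feature or used alone, chosen at random. The function name `pair_features`, the parameter `pair_prob`, and the element-wise sum used for merging are all hypothetical.

```python
import random


def pair_features(implicit, explicit, pair_prob=0.5, rng=random):
    """Hypothetical stochastic pairing of two equal-length feature vectors.

    With probability `pair_prob`, the explicit feature is merged into the
    implicit one (identity-specific branch). Otherwise only the implicit
    feature is returned (identity-agnostic, audio-only branch).
    Returns (feature, used_explicit).
    """
    if len(implicit) != len(explicit):
        raise ValueError("implicit and explicit features must have equal length")
    if rng.random() < pair_prob:
        # Assumed merge rule: element-wise sum. The paper may use another rule.
        return [i + e for i, e in zip(implicit, explicit)], True
    return list(implicit), False


# Toy vectors standing in for per-frame motion features.
implicit_feat = [0.2, -0.1, 0.4]
explicit_feat = [0.5, 0.3, -0.2]
feature, used_explicit = pair_features(implicit_feat, explicit_feat, pair_prob=0.5)
print(used_explicit, feature)
```

Under this reading, sampling the branch per step is what lets training alternate between the identity-specific and audio-only objectives mentioned in the abstract.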

Shiyu Liu, Kui Jiang, Junjun Jiang, Xianming Liu, Xiaocheng Feng, Hongxun Yao, Qi Tian • 2025

Related benchmarks

Task	Dataset	Result
Talking head synthesis	May avatar Lieu audio	Sync-D 7.292	10
Talking head synthesis	May avatar Shaheen audio	Sync-D 7.59	10
Talking head synthesis	Portrait Video Self-reconstruction (test)	PSNR 35.15	8
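
For readers unfamiliar with the PSNR figure in the last row, the sketch below computes the standard peak signal-to-noise ratio, 10 * log10(MAX^2 / MSE), on a toy pixel list. It assumes 8-bit pixel values (a peak of 255); the evaluation protocol behind the reported 35.15 is not given here, and the function name `psnr` and the sample values are illustrative only.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(reconstructed) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel example; real evaluations average over full video frames.
reference = [52, 55, 61, 59]
reconstructed = [54, 55, 60, 57]
print(round(psnr(reference, reconstructed), 2))
```

Higher PSNR means the reconstructed frame is closer to the reference.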
