EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation

About

Recent work on human animation usually incorporates large-scale video models, thereby achieving more vivid performance. However, the practical use of such methods is hindered by the slow inference speed and high computational demands. Moreover, traditional work typically employs separate models for each animation task, increasing costs in multi-task scenarios and worsening the dilemma. To address these limitations, we introduce EchoMimicV3, an efficient framework that unifies multi-task and multi-modal human animation. At the core of EchoMimicV3 lies a threefold design: a Soup-of-Tasks paradigm, a Soup-of-Modals paradigm, and a novel training and inference strategy. The Soup-of-Tasks leverages multi-task mask inputs and a counter-intuitive task allocation strategy to achieve multi-task gains without multi-model pains. Meanwhile, the Soup-of-Modals introduces a Coupled-Decoupled Multi-Modal Cross Attention module to inject multi-modal conditions, complemented by a Multi-Modal Timestep Phase-aware Dynamical Allocation mechanism to modulate multi-modal mixtures. Besides, we propose Negative Direct Preference Optimization, Phase-aware Negative Classifier-Free Guidance (CFG), and Long Video CFG, which ensure stable training and inference. Extensive experiments and analyses demonstrate that EchoMimicV3, with a minimal model size of 1.3 billion parameters, achieves competitive performance in both quantitative and qualitative evaluations.
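The abstract names Phase-aware Negative CFG and Long Video CFG but does not spell them out. As background only, the sketch below shows the standard classifier-free guidance combination that such variants build on. It is not the authors' method. The function name `cfg_combine` and its parameters are hypothetical, and the phase-aware and long-video extensions are not modeled.

```python
from typing import List, Sequence


def cfg_combine(cond: Sequence[float],
                baseline: Sequence[float],
                scale: float) -> List[float]:
    """Standard classifier-free guidance on two model predictions.

    `cond` is the prediction given the conditions (e.g. audio, text).
    `baseline` is the prediction with the conditions dropped, or with a
    negative prompt in the negative-prompt variant.
    Returns baseline + scale * (cond - baseline), element-wise.
    NOTE: hypothetical helper for illustration, not EchoMimicV3 code.
    """
    if len(cond) != len(baseline):
        raise ValueError("predictions must have the same length")
    return [b + scale * (c - b) for c, b in zip(cond, baseline)]


if __name__ == "__main__":
    cond_pred = [1.0, 2.0]
    base_pred = [0.0, 1.0]
    # scale = 0 ignores the condition, scale = 1 returns the conditional
    # prediction, scale > 1 extrapolates away from the baseline.
    for w in (0.0, 1.0, 3.0):
        print(w, cfg_combine(cond_pred, base_pred, w))
```

A guidance scale above 1 pushes the output further from the baseline prediction, which is why the choice of baseline (unconditional versus negative prompt) matters for the result.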

Rang Meng, Yan Wang, Weipeng Wu, Ruobing Zheng, Yuming Li, Chenguang Ma • 2025

Related benchmarks

Task	Dataset	Result
Talking Head Generation	HDTF (test)	FID 43.544	73
Talking Head Generation	HDTF	FID 21.054	48
Talking avatar video generation	Short dataset real avatar images, 5s audio 1.0	FID 78.65	10
Talking avatar video generation	EMTD (test)	FID 67.35	10
Talking avatar video generation	Long dataset 25 synthesized avatar images, 20s audio clips 1.0	ASE 4.87	10
Social Interaction Video Generation	Social Interaction Benchmark	Action Accuracy 60.1	10
Audio-driven video generation	Custom evaluation dataset	Sync-C 2.49	9
Talking head video generation	Action Bench (test)	Sync-C 3.199	9
Interactive Avatar Generation	500 videos (test)	IQA 3.96	8
Audio-driven video generation	Mead	FID 25.43	8

Showing 10 of 18 rows
