EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation

About

Recent work on human animation usually incorporates large-scale video models, thereby achieving more vivid performance. However, the practical use of such methods is hindered by the slow inference speed and high computational demands. Moreover, traditional work typically employs separate models for each animation task, increasing costs in multi-task scenarios and worsening the dilemma. To address these limitations, we introduce EchoMimicV3, an efficient framework that unifies multi-task and multi-modal human animation. At the core of EchoMimicV3 lies a threefold design: a Soup-of-Tasks paradigm, a Soup-of-Modals paradigm, and a novel training and inference strategy. The Soup-of-Tasks leverages multi-task mask inputs and a counter-intuitive task allocation strategy to achieve multi-task gains without multi-model pains. Meanwhile, the Soup-of-Modals introduces a Coupled-Decoupled Multi-Modal Cross Attention module to inject multi-modal conditions, complemented by a Multi-Modal Timestep Phase-aware Dynamical Allocation mechanism to modulate multi-modal mixtures. Besides, we propose Negative Direct Preference Optimization, Phase-aware Negative Classifier-Free Guidance (CFG), and Long Video CFG, which ensure stable training and inference. Extensive experiments and analyses demonstrate that EchoMimicV3, with a minimal model size of 1.3 billion parameters, achieves competitive performance in both quantitative and qualitative evaluations.
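For readers unfamiliar with classifier-free guidance (CFG), which the Phase-aware Negative CFG and Long Video CFG named in the abstract extend, the following is a minimal sketch of the standard CFG combination step only. It is not EchoMimicV3's variant, whose details the abstract does not give. The function name `cfg_combine` and the use of plain Python lists in place of model tensors are assumptions made for illustration.

```python
from typing import List


def cfg_combine(uncond: List[float], cond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond).

    `uncond` and `cond` stand in for a denoiser's predictions without and
    with conditioning (hypothetical flat lists rather than real tensors).
    scale = 0 returns the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 extrapolates further toward the condition.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    # Toy two-element predictions, chosen only to make the arithmetic visible.
    print(cfg_combine([0.0, 1.0], [1.0, 3.0], 2.0))
```

A larger guidance scale pushes the output further along the direction from the unconditional to the conditional prediction, which is why guidance strength is usually treated as a tunable inference-time parameter.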

Rang Meng, Yan Wang, Weipeng Wu, Ruobing Zheng, Yuming Li, Chenguang Ma • 2025

Related benchmarks

Task | Dataset | Result | Rank
Talking Head Generation | HDTF (test) | FVD 595.8 | 33
Talking avatar video generation | Short dataset real avatar images, 5s audio 1.0 | FID 78.65 | 10
Talking avatar video generation | EMTD (test) | FID 67.35 | 10
Talking avatar video generation | Long dataset 25 synthesized avatar images, 20s audio clips 1.0 | ASE 4.87 | 10
Audio-driven video generation | Custom evaluation dataset | Sync-C 2.49 | 9
Talking head video generation | Action Bench (test) | Sync-C 3.199 | 9
Audio-driven Generation | TalkBench Short (10 s) 1.0 (test) | ASE 3.45 | 7
Audio-driven Digital Human Generation | Short Sequence | Sync-C 6.13 | 6