Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

About

Audio-driven avatar interaction demands real-time, streaming, and infinite-length generation -- capabilities fundamentally at odds with the sequential denoising and long-horizon drift of current diffusion models. We present Live Avatar, an algorithm-system co-designed framework that addresses both challenges for a 14-billion-parameter diffusion model. On the algorithm side, a two-stage pipeline distills a pretrained bidirectional model into a causal, few-step streaming one, while a set of complementary long-horizon strategies eliminate identity drift and visual artifacts, enabling stable autoregressive generation exceeding 10,000 seconds. On the system side, Timestep-forcing Pipeline Parallelism (TPP) assigns each GPU a fixed denoising timestep, converting the sequential diffusion chain into an asynchronous spatial pipeline that simultaneously boosts throughput and improves temporal consistency. Live Avatar achieves 45 FPS with a time to first frame (TTFF) of 1.21 s on 5 H800 GPUs, and to our knowledge is the first to enable practical real-time streaming of a 14B diffusion model for infinite-length avatar generation. We further introduce GenBench, a standardized long-form benchmark, to facilitate reproducible evaluation. Our project page is at https://liveavatar.github.io/.
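
The pipelining idea behind TPP can be illustrated with a toy schedule. The sketch below is not the authors' implementation: it only models the general pattern in which each stage (standing in for a GPU pinned to one denoising timestep) works on a different block of frames at every tick. The function names, the tick abstraction, and the stage count of 4 are all hypothetical choices made for illustration.

```python
from typing import List, Optional


def pipeline_schedule(num_stages: int, num_blocks: int) -> List[List[Optional[int]]]:
    """Toy schedule: for each tick, the block index each stage works on.

    Assumption (not from the paper): every stage takes exactly one tick
    per block, and stage s can start block b one tick after stage s - 1
    finished it. None marks an idle stage.
    """
    total_ticks = num_blocks + num_stages - 1
    schedule = []
    for tick in range(total_ticks):
        row: List[Optional[int]] = []
        for stage in range(num_stages):
            block = tick - stage  # block that reaches this stage at this tick
            row.append(block if 0 <= block < num_blocks else None)
        schedule.append(row)
    return schedule


def sequential_ticks(num_stages: int, num_blocks: int) -> int:
    """Ticks needed when one device runs every denoising step in order."""
    return num_stages * num_blocks


def pipelined_ticks(num_stages: int, num_blocks: int) -> int:
    """Ticks needed when each stage is pinned to one step (fill + drain)."""
    return num_blocks + num_stages - 1


if __name__ == "__main__":
    # Hypothetical sizes: 4 denoising stages, 6 blocks of frames.
    for tick, row in enumerate(pipeline_schedule(4, 6)):
        print(tick, row)
    print("sequential:", sequential_ticks(4, 6), "pipelined:", pipelined_ticks(4, 6))
```

Under these assumptions, once the pipeline is full one block completes per tick, whereas a single device running all steps in sequence needs one tick per step for every block. The fill time of the pipeline is the toy analogue of a first-frame latency.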

Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, Steven Hoi • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Talking Head Generation | HDTF | FID 15.85 | 33 |
| Talking Avatar Generation | CelebV-HQ (clips) | FID 37.63 | 10 |
| Talking Avatar Generation | long-form videos (test) | FID 57.81 | 10 |
| Audio-driven video generation | Custom evaluation dataset | Sync-C 3.89 | 9 |
| Audio-driven Avatar Generation | GenBench ShortVideo (user study) | Naturalness 86.3 | 7 |
| Audio-driven Avatar Generation | GenBench-ShortVideo (test) | ASE 3.44 | 7 |
| Audio-driven Generation | TalkBench Short (10 s) 1.0 (test) | ASE 3.1 | 7 |
| Audio-driven Avatar Generation | GenBench-LongVideo (test) | ASE 3.38 | 6 |
| Audio-driven Generation | TalkBench Long (> 5 min) 1.0 (test) | ASE 3.15 | 6 |
| Talking Head Generation | EMTD | Sync-C 6.93 | 4 |
Showing 10 of 12 rows
