Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
About
Audio-driven avatar interaction demands real-time, streaming, and infinite-length generation, capabilities fundamentally at odds with the sequential denoising and long-horizon drift of current diffusion models. We present Live Avatar, an algorithm-system co-designed framework that addresses both challenges for a 14-billion-parameter diffusion model. On the algorithm side, a two-stage pipeline distills a pretrained bidirectional model into a causal, few-step streaming one, while a set of complementary long-horizon strategies eliminates identity drift and visual artifacts, enabling stable autoregressive generation exceeding 10,000 seconds. On the system side, Timestep-forcing Pipeline Parallelism (TPP) assigns each GPU a fixed denoising timestep, converting the sequential diffusion chain into an asynchronous spatial pipeline that simultaneously boosts throughput and improves temporal consistency. Live Avatar achieves 45 FPS with a time-to-first-frame (TTFF) of 1.21 s on 5 H800 GPUs, and to our knowledge is the first to enable practical real-time streaming of a 14B diffusion model for infinite-length avatar generation. We further introduce GenBench, a standardized long-form benchmark, to facilitate reproducible evaluation. Our project page is at https://liveavatar.github.io/.
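To make the pipelining idea behind TPP concrete, the following is a minimal scheduling sketch, not the authors' implementation. It assumes each pipeline stage (standing in for one GPU) always applies the same denoising step, and that a new video block enters the pipeline on every tick. The function names, the choice of 4 stages, and the 6-block workload are hypothetical, and the sketch models only the schedule, not the diffusion model or any dependencies between blocks.

```python
from typing import List, Optional


def pipeline_schedule(num_blocks: int, num_stages: int) -> List[List[Optional[int]]]:
    """Return schedule[tick][stage]: the block handled by that stage, or None if idle.

    Stage s always runs denoising step s (a fixed timestep per stage),
    so block b reaches stage s at tick b + s.
    """
    total_ticks = num_blocks + num_stages - 1
    schedule: List[List[Optional[int]]] = []
    for tick in range(total_ticks):
        row: List[Optional[int]] = []
        for stage in range(num_stages):
            block = tick - stage
            row.append(block if 0 <= block < num_blocks else None)
        schedule.append(row)
    return schedule


def sequential_ticks(num_blocks: int, num_stages: int) -> int:
    """Ticks needed when one device runs every denoising step of every block in turn."""
    return num_blocks * num_stages


if __name__ == "__main__":
    # Hypothetical workload: 6 blocks, 4 denoising steps (one per stage).
    blocks, stages = 6, 4
    for tick, row in enumerate(pipeline_schedule(blocks, stages)):
        print(f"tick {tick}: {row}")
    print("pipelined ticks: ", blocks + stages - 1)
    print("sequential ticks:", sequential_ticks(blocks, stages))
```

Once the pipeline is full, every stage is busy on each tick and one finished block leaves the last stage per tick, which is the throughput benefit of turning the sequential chain into a spatial pipeline. The first block still has to pass through every stage, so the fill time is one contribution to first-frame latency in this toy model.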
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Talking Head Generation | HDTF | FID 15.85 | 33 |
| Talking Avatar Generation | CelebV-HQ (clips) | FID 37.63 | 10 |
| Talking Avatar Generation | long-form videos (test) | FID 57.81 | 10 |
| Audio-driven video generation | Custom evaluation dataset | Sync-C 3.89 | 9 |
| Audio-driven Avatar Generation | GenBench ShortVideo (user study) | Naturalness 86.3 | 7 |
| Audio-driven Avatar Generation | GenBench-ShortVideo (test) | ASE 3.44 | 7 |
| Audio-driven Generation | TalkBench Short (10 s) 1.0 (test) | ASE 3.1 | 7 |
| Audio-driven Avatar Generation | GenBench-LongVideo (test) | ASE 3.38 | 6 |
| Audio-driven Generation | TalkBench Long (> 5 min) 1.0 (test) | ASE 3.15 | 6 |
| Talking Head Generation | EMTD | Sync-C 6.93 | 4 |