Towards Interactive Intelligence for Digital Humans
About
We introduce Interactive Intelligence, a new paradigm for digital humans capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this paradigm, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture couples cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. We further establish a new benchmark for rigorously evaluating the capabilities of interactive intelligence. Extensive experiments show that our framework outperforms state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward genuinely intelligent interaction.
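The five-module flow described above can be sketched as a simple pipeline: the Thinker decides what to say, the Talker synthesizes speech, the two Animators produce facial and body motion, and the Renderer composes a frame. This is a minimal illustrative sketch only; every function signature, data type, and parameter count below is a hypothetical assumption, not Mio's actual interface.

```python
from dataclasses import dataclass

# Hypothetical data container for one frame of animation parameters.
@dataclass
class AvatarFrame:
    face_params: list[float]  # e.g. blendshape weights (assumed count)
    body_pose: list[float]    # e.g. joint rotations (assumed count)

def thinker(user_input: str) -> str:
    """Cognitive reasoning: decide what the avatar should say."""
    return f"response to: {user_input}"  # placeholder for an LLM call

def talker(response_text: str) -> bytes:
    """Speech synthesis: turn the response into audio."""
    return response_text.encode()  # placeholder for a TTS waveform

def face_animator(audio: bytes) -> list[float]:
    """Audio-driven facial motion (placeholder values)."""
    return [0.0] * 52

def body_animator(response_text: str) -> list[float]:
    """Semantics-driven body motion (placeholder values)."""
    return [0.0] * 24

def renderer(frame: AvatarFrame) -> str:
    """Render one video frame from animation parameters."""
    return f"frame(face={len(frame.face_params)}, body={len(frame.body_pose)})"

def mio_step(user_input: str) -> str:
    """One end-to-end interaction step through all five modules."""
    text = thinker(user_input)
    audio = talker(text)
    frame = AvatarFrame(face_animator(audio), body_animator(text))
    return renderer(frame)
```

The key design point the sketch captures is that reasoning (Thinker) and embodiment (the animators and Renderer) are separate modules connected by explicit intermediate representations, so each can be swapped or evaluated independently.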
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Motion Generation | HumanML3D (test) | FID | 0.057 | 331 |
| Speaking Facial Motion Generation | Seamless Interaction (test) | LVE | 5.83 | 10 |
| Listening Facial Motion Generation | Seamless Interaction (test) | FDD | 17.12 | 6 |
| Human Animation | AvatarDiT (val) | FID | 68.72 | 5 |
| Facial Animation | User study (25 participants) | Lip Sync Score | 86.06 | 4 |
| Human Animation | Multi-view consistency evaluation set | SSIM | 0.8134 | 4 |
| Text-to-Motion Synthesis | BABEL (test) | PJ Score | 0.713 | 4 |
| Text-to-Speech | Seed-TTS-Eval ZH | UTMOS | 2.68 | 3 |
| Text-to-Speech | CommonVoice JA | UTMOS | 2.65 | 3 |
| Text-to-Speech | Seed-TTS-Eval EN | UTMOS | 3.56 | 3 |