Towards Interactive Intelligence for Digital Humans
About
We introduce Interactive Intelligence, a new paradigm for digital humans capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this paradigm, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture couples cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. We further establish a new benchmark for rigorously evaluating the capabilities of interactive intelligence. Extensive experiments show that our framework outperforms state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward genuinely intelligent interaction.
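The five-module flow described above can be sketched as a simple pipeline: the Thinker decides what to say, the Talker synthesizes speech, the two Animators produce facial and body motion, and the Renderer composes a frame. This is a minimal illustrative sketch only; every function signature, data type, and parameter count below is a hypothetical assumption, not Mio's actual interface.

```python
from dataclasses import dataclass

# Hypothetical data container for one frame of animation parameters.
@dataclass
class AvatarFrame:
    face_params: list[float]  # e.g. blendshape weights (assumed count)
    body_pose: list[float]    # e.g. joint rotations (assumed count)

def thinker(user_input: str) -> str:
    """Cognitive reasoning: decide what the avatar should say."""
    return f"response to: {user_input}"  # placeholder for an LLM call

def talker(response_text: str) -> bytes:
    """Speech synthesis: turn the response into audio."""
    return response_text.encode()  # placeholder for a TTS waveform

def face_animator(audio: bytes) -> list[float]:
    """Audio-driven facial motion (placeholder values)."""
    return [0.0] * 52

def body_animator(response_text: str) -> list[float]:
    """Semantics-driven body motion (placeholder values)."""
    return [0.0] * 24

def renderer(frame: AvatarFrame) -> str:
    """Render one video frame from animation parameters."""
    return f"frame(face={len(frame.face_params)}, body={len(frame.body_pose)})"

def mio_step(user_input: str) -> str:
    """One end-to-end interaction step through all five modules."""
    text = thinker(user_input)
    audio = talker(text)
    frame = AvatarFrame(face_animator(audio), body_animator(text))
    return renderer(frame)
```

The key design point the sketch captures is that reasoning (Thinker) and embodiment (the animators and Renderer) are separate modules connected by explicit intermediate representations, so each can be swapped or evaluated independently.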
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Motion Generation | HumanML3D (test) | FID | 0.057 | 331 |
| Speaking Facial Motion Generation | Seamless Interaction (test) | LVE | 5.83 | 10 |
| Listening Facial Motion Generation | Seamless Interaction (test) | FDD | 17.12 | 6 |
| Human Animation | AvatarDiT (val) | FID | 68.72 | 5 |
| Facial Animation | User study (25 participants) | Lip Sync Score | 86.06 | 4 |
| Human Animation | Multi-view consistency evaluation set | SSIM | 0.8134 | 4 |
| Text-to-Motion Synthesis | BABEL (test) | PJ Score | 0.713 | 4 |
| Text-to-Speech | Seed-TTS-Eval ZH | UTMOS | 2.68 | 3 |
| Text-to-Speech | CommonVoice JA | UTMOS | 2.65 | 3 |
| Text-to-Speech | Seed-TTS-Eval EN | UTMOS | 3.56 | 3 |