
Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

About

Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet incorporating speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. A key challenge arises from the representation mismatch between discrete, token-level semantic reasoning in LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion, which makes direct modeling difficult to optimize under limited data. We propose Expressive Omni (Ex-Omni), an open-source omni-modal framework that augments OLLMs with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation, leveraging speech units as temporal scaffolding and a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. We further introduce InstructEx, a dataset designed to facilitate augmenting OLLMs with speech-accompanied 3D facial animation. Extensive experiments demonstrate that Ex-Omni performs competitively against existing open-source OLLMs while enabling stable, aligned speech and facial-animation generation.
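The abstract does not give the TQGF equations, but the name suggests cross-attention in which speech-unit features act as queries over the LLM's semantic tokens, with a learned gate controlling how much semantic content is injected into the temporal stream. The NumPy sketch below illustrates one plausible reading under that assumption; the function name, weight shapes, and random projections standing in for learned weights are all hypothetical, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_as_query_gated_fusion(speech_units, semantic_tokens, rng):
    """Hypothetical TQGF sketch: speech-unit features (the temporal
    scaffold) serve as queries attending over LLM semantic tokens;
    a sigmoid gate then controls the amount of semantic injection."""
    T, d = speech_units.shape      # scaffold length x feature dim
    S, _ = semantic_tokens.shape   # number of semantic tokens
    # Random projections stand in for learned weights (illustration only).
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Wg = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)

    Q = speech_units @ Wq                            # (T, d) queries
    K = semantic_tokens @ Wk                         # (S, d) keys
    V = semantic_tokens @ Wv                         # (S, d) values
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)    # (T, S) attention
    injected = attn @ V                              # semantics per time step

    # Gate conditioned on both streams decides how much semantics to inject.
    gate = 1 / (1 + np.exp(-(np.concatenate([speech_units, injected],
                                            axis=-1) @ Wg)))
    return speech_units + gate * injected            # gated fusion output

rng = np.random.default_rng(0)
out = token_as_query_gated_fusion(rng.standard_normal((8, 16)),
                                  rng.standard_normal((4, 16)), rng)
print(out.shape)  # the fused sequence keeps the temporal scaffold's shape
```

Keeping the output on the scaffold's time axis is what lets the dense motion decoder run at its own frame rate, decoupled from the LLM's token-level reasoning as the abstract describes.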

Haoyu Zhang, Zhipeng Li, Yiwen Guo, Tianshu Yu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Speech | Seed-TTS en (test) | WER | 2.67 | 50 |
| Speech-to-Text | VoiceBench | AlpacaEval Score | 2.57 | 15 |
| Text-to-Speech | Seed-TTS-Eval zh (test) | CER | 3.37 | 8 |
| Text-to-Speech | Seed-TTS-Eval hard (test) | WER | 13.67 | 6 |
| 3D facial animation generation | Human A/B preference study | Win Rate | 0.8 | 3 |
