EmoOmni: Bridging Emotional Understanding and Expression in Omni-Modal LLMs
About
The evolution of Omni-Modal Large Language Models (Omni-LLMs) has revolutionized human-computer interaction, enabling unified audio-visual perception and speech response. However, existing Omni-LLMs struggle with complex real-world scenarios, often producing superficial understanding and contextually mismatched emotional responses. The problem is compounded by the Thinker-Talker architecture common to Omni-LLMs, in which the two components are connected only implicitly through hidden states, so emotional details are lost. In this work, we present EmoOmni, a unified framework for accurate emotional understanding and expression in multimodal emotional dialogue. At its core, we introduce the emotional Chain-of-Thought (E-CoT), which enforces a reasoning path from fine-grained multimodal perception to the textual response. Moreover, we explicitly treat the E-CoT as a high-level emotional instruction that guides the talker, enabling accurate emotional expression. Complementing the model, we construct EmoOmniPipe to collect annotated real-world dialogue data, and establish a benchmark, EmoOmniEval, to facilitate systematic assessment of the multimodal emotional dialogue task. Experiments show that EmoOmni-7B achieves performance comparable to Qwen3-Omni-30B-A3B-Thinking when paired with the same talker.
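
To make the Thinker-Talker coupling concrete, the sketch below illustrates the dataflow the abstract describes: the thinker emits a structured E-CoT (perception, then emotion, then response), and that reasoning is passed to the talker as an explicit textual instruction rather than only through hidden states. This is a minimal illustration under stated assumptions, not EmoOmni's actual API; all names (`ECoTOutput`, `thinker`, `talker`) and the stubbed outputs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ECoTOutput:
    """Hypothetical container for the thinker's structured E-CoT output."""
    perception: str  # fine-grained multimodal cues (e.g. prosody, facial expression)
    emotion: str     # inferred emotional state of the speaker
    response: str    # textual reply conditioned on the reasoning above


def thinker(audio_visual_context: str) -> ECoTOutput:
    """Stand-in for the Thinker: reasons from multimodal perception
    to a textual response instead of emitting only a final reply."""
    # A real system would run an Omni-LLM here; this stub only
    # illustrates the structured reasoning the framework enforces.
    return ECoTOutput(
        perception="trembling voice, downcast gaze",
        emotion="sadness (seeking comfort)",
        response="I'm really sorry to hear that. Do you want to talk about it?",
    )


def talker(response_text: str, emotional_instruction: str) -> bytes:
    """Stand-in for the Talker (speech generator). It receives the E-CoT
    as an explicit high-level instruction controlling expressive style,
    rather than being coupled to the thinker only via hidden states."""
    styled_prompt = f"[style: {emotional_instruction}] {response_text}"
    return styled_prompt.encode("utf-8")  # placeholder for synthesized audio


if __name__ == "__main__":
    cot = thinker("user video + speech clip")
    instruction = f"perceived: {cot.perception}; target emotion: {cot.emotion}"
    audio = talker(cot.response, emotional_instruction=instruction)
    print(audio.decode("utf-8"))
```

Because the emotional instruction is plain text, it survives the hand-off to the talker intact; in a hidden-state-only design, that information would have to be recovered implicitly from intermediate activations.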
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Emotional Dialogue | MELD EmoOmniEval (test) | VS-RES | 1.36 | 7 |
| Emotional Dialogue Generation | CH-SIMS v2 | Response MOS | 1.56 | 7 |
| Multimodal Emotional Dialogue | CH-SIMS EmoOmniEval v2 (test) | VS-RES | 1.67 | 7 |
| Speech Generation | CH-SIMS v2 | WER | 4.72 | 4 |