
EmoOmni: Bridging Emotional Understanding and Expression in Omni-Modal LLMs

About

The evolution of Omni-Modal Large Language Models (Omni-LLMs) has revolutionized human-computer interaction, enabling unified audio-visual perception and speech response. However, existing Omni-LLMs struggle with complex real-world scenarios, often producing superficial understanding and contextually mismatched emotional responses. This issue is further intensified by the Thinker-Talker architectures of Omni-LLMs, whose components are connected only implicitly through hidden states, causing emotional details to be lost. In this work, we present EmoOmni, a unified framework for accurate understanding and expression in multimodal emotional dialogue. At its core, we introduce the emotional Chain-of-Thought (E-CoT), which enforces reasoning from fine-grained multimodal perception to the textual response. Moreover, we explicitly treat the E-CoT as high-level emotional instructions that guide the talker, enabling accurate emotional expression. Complementing the model, we construct EmoOmniPipe to obtain annotated real-world dialogue data and establish a benchmark, EmoOmniEval, to facilitate systematic assessment of the multimodal emotional dialogue task. Experiments show that EmoOmni-7B achieves performance comparable to Qwen3Omni-30B-A3B-Thinking under the same talker.
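The Thinker-Talker flow described above can be sketched in code. This is a minimal illustrative sketch, not the authors' implementation: all class and function names (`EmotionalCoT`, `thinker`, `talker`) and the string-based conditioning are assumptions. The abstract only states that the E-CoT reasons from multimodal perception to the textual response and is then passed to the talker as explicit high-level emotional instructions rather than implicit hidden states.

```python
from dataclasses import dataclass


@dataclass
class EmotionalCoT:
    """Hypothetical structure of an emotional Chain-of-Thought (E-CoT)."""
    perception: str  # fine-grained multimodal cues (e.g. prosody, facial expression)
    reasoning: str   # inferred emotional state and conversational intent
    response: str    # textual reply grounded in the reasoning above


def thinker(audio_visual_input: str) -> EmotionalCoT:
    """Stand-in thinker: a real Omni-LLM would run multimodal
    perception and reasoning here; this just returns a fixed example."""
    return EmotionalCoT(
        perception=f"cues extracted from: {audio_visual_input}",
        reasoning="user sounds frustrated but is seeking reassurance",
        response="I hear you. Let's work through this together.",
    )


def talker(cot: EmotionalCoT) -> str:
    """Stand-in talker: speech generation conditioned on the explicit
    E-CoT instructions instead of opaque hidden states."""
    instruction = f"[emotion-instruction: {cot.reasoning}]"
    return f"{instruction} {cot.response}"


cot = thinker("video+audio of a stressed user")
speech_plan = talker(cot)
print(speech_plan)
```

The key design point this sketch mirrors is the explicit interface: the talker receives the reasoning as readable instructions, so emotional details cannot silently vanish in a hidden-state handoff.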

Wenjie Tian, Zhixian Zhao, Jingbin Hu, Huakang Chen, Haohe Liu, Binshen Mu, Lei Xie • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Multimodal Emotional Dialogue | MELD EmoOmniEval (test) | VS-RES | 1.36 | 7
Emotional Dialogue Generation | ch2-sims v2 | Response MOS | 1.56 | 7
Multimodal Emotional Dialogue | ch-sims EmoOmniEval v2 (test) | VS-RES | 1.67 | 7
Speech Generation | ch2-sims v2 | WER | 4.72 | 4
