EmoOmni: Bridging Emotional Understanding and Expression in Omni-Modal LLMs
About
The evolution of Omni-Modal Large Language Models (Omni-LLMs) has revolutionized human-computer interaction, enabling unified audio-visual perception and speech response. However, existing Omni-LLMs struggle with complex real-world scenarios, often producing superficial understanding and contextually mismatched emotional responses. The problem is compounded by the Thinker-Talker architecture common to Omni-LLMs, in which the two components are connected only implicitly through hidden states, so emotional details are lost. In this work, we present EmoOmni, a unified framework for accurate emotional understanding and expression in multimodal emotional dialogue. At its core, we introduce the emotional Chain-of-Thought (E-CoT), which enforces a reasoning path from fine-grained multimodal perception to the textual response. Moreover, we explicitly treat the E-CoT as a high-level emotional instruction that guides the talker, enabling accurate emotional expression. Complementing the model, we construct EmoOmniPipe to collect annotated real-world dialogue data, and establish a benchmark, EmoOmniEval, to facilitate systematic assessment of the multimodal emotional dialogue task. Experiments show that EmoOmni-7B achieves performance comparable to Qwen3-Omni-30B-A3B-Thinking when paired with the same talker.
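
To make the Thinker-Talker coupling concrete, the sketch below illustrates the dataflow the abstract describes: the thinker emits a structured E-CoT (perception, then emotion, then response), and that reasoning is passed to the talker as an explicit textual instruction rather than only through hidden states. This is a minimal illustration under stated assumptions, not EmoOmni's actual API; all names (`ECoTOutput`, `thinker`, `talker`) and the stubbed outputs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ECoTOutput:
    """Hypothetical container for the thinker's structured E-CoT output."""
    perception: str  # fine-grained multimodal cues (e.g. prosody, facial expression)
    emotion: str     # inferred emotional state of the speaker
    response: str    # textual reply conditioned on the reasoning above


def thinker(audio_visual_context: str) -> ECoTOutput:
    """Stand-in for the Thinker: reasons from multimodal perception
    to a textual response instead of emitting only a final reply."""
    # A real system would run an Omni-LLM here; this stub only
    # illustrates the structured reasoning the framework enforces.
    return ECoTOutput(
        perception="trembling voice, downcast gaze",
        emotion="sadness (seeking comfort)",
        response="I'm really sorry to hear that. Do you want to talk about it?",
    )


def talker(response_text: str, emotional_instruction: str) -> bytes:
    """Stand-in for the Talker (speech generator). It receives the E-CoT
    as an explicit high-level instruction controlling expressive style,
    rather than being coupled to the thinker only via hidden states."""
    styled_prompt = f"[style: {emotional_instruction}] {response_text}"
    return styled_prompt.encode("utf-8")  # placeholder for synthesized audio


if __name__ == "__main__":
    cot = thinker("user video + speech clip")
    instruction = f"perceived: {cot.perception}; target emotion: {cot.emotion}"
    audio = talker(cot.response, emotional_instruction=instruction)
    print(audio.decode("utf-8"))
```

Because the emotional instruction is plain text, it survives the hand-off to the talker intact; in a hidden-state-only design, that information would have to be recovered implicitly from intermediate activations.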
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Emotional Dialogue | MELD EmoOmniEval (test) | VS-RES | 1.36 | 7 |
| Emotional Dialogue Generation | CH-SIMS v2 | Response MOS | 1.56 | 7 |
| Multimodal Emotional Dialogue | CH-SIMS EmoOmniEval v2 (test) | VS-RES | 1.67 | 7 |
| Speech Generation | CH-SIMS v2 | WER | 4.72 | 4 |