
EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning

About

Emotional information in speech plays a unique role in multimodal perception. However, current Speech Large Language Models (SpeechLLMs), like conventional speech emotion recognition (SER) systems, still treat emotion understanding as a simple classification problem. This offers limited interpretability of predictions and leaves the LLMs' expressive and reasoning capabilities underutilized. In this work, we take the first step toward reformulating SER as a deep reasoning problem through reinforcement learning (RL). We propose EmotionThinker, which is designed to generate accurate emotion predictions with interpretable explanations grounded in fine-grained acoustic cues. To achieve this, we first construct EmotionCoT-35K, an emotional reasoning dataset with Chain-of-Thought annotations and detailed captions. Second, we observe that current SpeechLLMs exhibit weak prosody perception, even though prosodic cues are fundamental signals for interpreting emotions. To address this, we develop the prosody-enhanced foundation model EmotionThinker-Base and demonstrate that prosody enhancement improves emotion understanding. Third, we introduce Group-Relative-Policy-Optimization with Progressive-Trust-aware-Reasoning-Reward (GRPO-PTR) for RL. Unlike standard GRPO, which relies only on rule-based outcome rewards, GRPO-PTR progressively introduces a reasoning reward, dynamically adjusts it with a trustworthiness weight reflecting the alignment between reasoning and outcome, and evaluates overall reasoning quality with a reward model based on multi-dimensional criteria. EmotionThinker outperforms previous state-of-the-art models in both emotion accuracy and explanation quality, advancing SER toward interpretable multimodal reasoning. Project page: https://github.com/dingdongwang/EmotionThinker
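The GRPO-PTR reward shaping described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the warm-up schedule, the gating choice for the trustworthiness weight, and the function and parameter names (`grpo_ptr_reward`, `warmup_steps`) are all assumptions made here for clarity.

```python
def grpo_ptr_reward(outcome_correct: bool,
                    reasoning_score: float,
                    step: int,
                    warmup_steps: int = 1000) -> float:
    """Illustrative sketch of GRPO-PTR reward shaping (hypothetical form).

    outcome_correct : rule-based check of the predicted emotion label
    reasoning_score : reward-model score of the reasoning trace, in [0, 1],
                      aggregated from multi-dimensional criteria
    step            : current RL training step
    """
    # Rule-based outcome reward, as in standard GRPO.
    outcome_reward = 1.0 if outcome_correct else 0.0

    # Progressive schedule: the reasoning reward is phased in over warm-up,
    # so early training is driven by the outcome reward alone.
    progress = min(1.0, step / warmup_steps)

    # Trustworthiness weight: down-weight the reasoning reward when the
    # reasoning disagrees with the outcome (one possible gating choice).
    trust = reasoning_score if outcome_correct else 1.0 - reasoning_score

    return outcome_reward + progress * trust * reasoning_score
```

At step 0 this reduces to the plain outcome reward; as training proceeds, well-aligned reasoning traces earn an additional bonus while reasoning that contradicts the outcome is discounted.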

Dingdong Wang, Shujie Liu, Tianhua Zhang, Youjun Chen, Jinyu Li, Helen Meng • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Emotion Recognition | IEMOCAP | Accuracy: 77.68 | 71 |
| Emotion Recognition | RAVDESS | Accuracy: 71.56 | 19 |
| Speech Emotion Recognition | MELD | Accuracy: 59.71 | 19 |
| Emotion Reasoning | Overall (test) | Factual Alignment (FA): 3.54 | 17 |
| Emotion Recognition | SAVEE | Accuracy: 73.96 | 17 |
| Emotion Recognition | IEMOCAP, MELD, RAVDESS, SAVEE Average | Average Accuracy: 68.89 | 17 |
| Emotion Reasoning | Human evaluation 100-sample set | Factual Alignment: 3.7 | 5 |
