
VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation

About

Emotion recognition in speech presents a complex multimodal challenge, requiring comprehension of both linguistic content and vocal expressivity, particularly prosodic features such as fundamental frequency, intensity, and temporal dynamics. Although large language models (LLMs) have shown promise in reasoning over textual transcriptions for emotion recognition, they typically neglect fine-grained prosodic information, limiting their effectiveness and interpretability. In this work, we propose VowelPrompt, a linguistically grounded framework that augments LLM-based emotion recognition with interpretable, fine-grained vowel-level prosodic cues. Drawing on phonetic evidence that vowels serve as primary carriers of affective prosody, VowelPrompt extracts pitch-, energy-, and duration-based descriptors from time-aligned vowel segments, and converts these features into natural language descriptions for better interpretability. Such a design enables LLMs to jointly reason over lexical semantics and fine-grained prosodic variation. Moreover, we adopt a two-stage adaptation procedure comprising supervised fine-tuning (SFT) followed by Reinforcement Learning with Verifiable Reward (RLVR), implemented via Group Relative Policy Optimization (GRPO), to enhance reasoning capability, enforce structured output adherence, and improve generalization across domains and speaker variations. Extensive evaluations across diverse benchmark datasets demonstrate that VowelPrompt consistently outperforms state-of-the-art emotion recognition methods under zero-shot, fine-tuned, cross-domain, and cross-linguistic conditions, while enabling the generation of interpretable explanations that are jointly grounded in contextual semantics and fine-grained prosodic structure.
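To make the described pipeline concrete, the sketch below shows one plausible way to extract vowel-level pitch, energy, and duration descriptors and verbalize them as prompt cues, assuming time-aligned vowel segments are already available (e.g., from a forced aligner). This is an illustrative approximation using librosa, not the authors' released implementation; the function names and the exact wording of the cues are assumptions.

```python
import numpy as np
import librosa

def vowel_prosody_prompt(wav_path, vowel_segments, sr=16000, hop_length=512):
    """vowel_segments: list of (vowel_label, start_sec, end_sec) from a forced aligner."""
    y, sr = librosa.load(wav_path, sr=sr)
    # Frame-level fundamental frequency (F0) and RMS energy over the whole utterance.
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr, hop_length=hop_length)
    rms = librosa.feature.rms(y=y, hop_length=hop_length)[0]
    cues = []
    for vowel, start, end in vowel_segments:
        i0 = int(start * sr / hop_length)
        i1 = max(int(end * sr / hop_length), i0 + 1)
        seg_f0 = f0[i0:i1]
        seg_f0 = seg_f0[~np.isnan(seg_f0)]  # pyin marks unvoiced frames as NaN
        pitch_hz = float(np.mean(seg_f0)) if seg_f0.size else 0.0
        energy = float(np.mean(rms[i0:i1])) if rms[i0:i1].size else 0.0
        duration_ms = (end - start) * 1000.0
        # Turn the numeric descriptors into a natural-language cue the LLM can reason over.
        cues.append(
            f"Vowel /{vowel}/: duration {duration_ms:.0f} ms, "
            f"mean pitch {pitch_hz:.0f} Hz, mean RMS energy {energy:.3f}."
        )
    return "\n".join(cues)

# Hypothetical usage; in practice the segments come from forced alignment of the transcript.
print(vowel_prosody_prompt("utterance.wav", [("AA", 0.42, 0.58), ("IY", 0.91, 1.02)]))
```

In the same spirit, a minimal, assumed form of the verifiable reward used during the RLVR/GRPO stage could check both structured-output adherence and label correctness; the `<answer>` tag format below is an assumption, not the paper's exact output schema.

```python
import re

def verifiable_reward(model_output: str, gold_label: str) -> float:
    """1.0 only if the output follows the assumed <answer>...</answer> format
    and the predicted label matches the reference emotion; otherwise 0.0."""
    match = re.search(r"<answer>\s*([A-Za-z]+)\s*</answer>", model_output, re.IGNORECASE)
    if match is None:
        return 0.0  # structured-output adherence failed
    return 1.0 if match.group(1).lower() == gold_label.lower() else 0.0

print(verifiable_reward("Rising pitch on stressed vowels ... <answer>angry</answer>", "angry"))  # 1.0
```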

Yancheng Wang, Osama Hanna, Ruiming Xie, Xianfeng Rui, Maohao Shen, Xuedong Zhang, Christian Fuegen, Jilong Wu, Debjyoti Paul, Arthur Guo, Zhihong Lei, Ozlem Kalinli, Qing He, Yingzhen Yang • 2026

Related benchmarks

Task | Dataset | Result | Rank
Emotion Recognition | IEMOCAP | Accuracy: 62.26 | 71
Emotion Classification | IEMOCAP (test) | Weighted-F1: 74.02 | 36
Emotion Recognition | MELD (test) | -- | 26
Speech Emotion Recognition | MELD | -- | 19
Speech Emotion Recognition | IEMOCAP → MELD Cross-Domain | Weighted F1: 60.28 | 14
Speech Emotion Recognition | MELD → IEMOCAP Cross-Domain | Weighted F1: 51.75 | 14
Emotion Recognition | MELD | UACC: 64.34 | 12
Speech Emotion Recognition | ASVP-ESD Mixlingual | Weighted F1: 0.7136 | 8
Speech Emotion Recognition | IEMOCAP In-Domain | Weighted F1: 73.4 | 3
Human evaluation of prosodic grounding and reasoning quality | IEMOCAP and MELD (test) | Evaluator 1 Score: 4.05 | 2
