Acoustic and Semantic Modeling of Emotion in Spoken Language

About

Emotions play a central role in human communication, shaping trust, engagement, and social interaction. As artificial intelligence systems powered by large language models become increasingly integrated into everyday life, enabling them to reliably understand and generate human emotions remains an important challenge. While emotional expression is inherently multimodal, this thesis focuses on emotions conveyed through spoken language and investigates how acoustic and semantic information can be jointly modeled to advance both emotion understanding and emotion synthesis from speech. The first part of the thesis studies emotion-aware representation learning through pre-training. We propose strategies that incorporate acoustic and semantic supervision to learn representations that better capture affective cues in speech. A speech-driven supervised pre-training framework is also introduced to enable large-scale emotion-aware text modeling without requiring manually annotated text corpora. The second part addresses emotion recognition in conversational settings. Hierarchical architectures combining cross-modal attention and mixture-of-experts fusion are developed to integrate acoustic and semantic information across conversational turns. Finally, the thesis introduces a textless and non-parallel speech-to-speech framework for emotion style transfer that enables controllable emotional transformations while preserving speaker identity and linguistic content. The results demonstrate improved emotion transfer and show that style-transferred speech can be used for data augmentation to improve emotion recognition.

Soumya Dutta • 2026

Related benchmarks

Task | Dataset | Result | Rank
Multimodal Sentiment Analysis | CMU-MOSI | -- | 144
Emotion Recognition in Conversation | MELD (test) | Weighted F1: 69.5 | 143
Multimodal Emotion Recognition in Conversation | IEMOCAP 6-class (test) | Weighted F1 Score (WF1): 70.9 | 44
Speech Emotion Recognition | RAVDESS | Unweighted Accuracy: 62 | 43
Emotion Transfer | ESD, TIMIT, and CREMA-D Evaluation Suite (test) | SSST: 0.69 | 20
Speech Emotion Recognition | MELD | -- | 19
Rhythm Transfer | ESD, TIMIT, and CREMA-D Evaluation Suite (test) | SSST: 68 | 10
Depression Detection | DAIC-WOZ | Weighted F1-score: 68.5 | 8
Speech Emotion Recognition | IEMOCAP 4 | Weighted F1-score: 69.4 | 8
Speech Emotion Recognition | IEMOCAP-6 | Weighted F1: 55 | 8
Showing 10 of 16 rows
