
A Transformer-Based Model With Self-Distillation for Multimodal Emotion Recognition in Conversations

About

Emotion recognition in conversations (ERC), the task of recognizing the emotion of each utterance in a conversation, is crucial for building empathetic machines. Existing studies focus mainly on capturing context- and speaker-sensitive dependencies in the textual modality but ignore the significance of multimodal information. Unlike emotion recognition in textual conversations, multimodal ERC depends on capturing intra- and inter-modal interactions between utterances, learning weights between different modalities, and enhancing modal representations. In this paper, we propose a transformer-based model with self-distillation (SDT) for the task. The model captures intra- and inter-modal interactions with intra- and inter-modal transformers and learns weights between modalities dynamically through a hierarchical gated fusion strategy. Furthermore, to learn more expressive modal representations, we treat the soft labels of the proposed model as extra training supervision. Specifically, we introduce self-distillation to transfer knowledge of hard and soft labels from the proposed model to each modality. Experiments on the IEMOCAP and MELD datasets demonstrate that SDT outperforms previous state-of-the-art baselines.
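The following is a minimal, hypothetical PyTorch sketch of two ideas mentioned in the abstract: a gated fusion that learns dynamic weights over the text, audio, and visual representations, and a self-distillation loss that transfers the fused model's hard- and soft-label knowledge back to each modality. Module and parameter names (GatedFusion, self_distillation_loss, temperature, alpha) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch; not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedFusion(nn.Module):
    """Learn dynamic per-sample weights over modality features before fusing them."""

    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        self.gate = nn.Linear(dim * num_modalities, num_modalities)

    def forward(self, feats):
        # feats: list of (batch, dim) tensors, e.g. [text, audio, visual]
        stacked = torch.stack(feats, dim=1)                      # (batch, M, dim)
        weights = torch.softmax(self.gate(torch.cat(feats, -1)), dim=-1)  # (batch, M)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)      # (batch, dim)


def self_distillation_loss(modal_logits, fused_logits, labels,
                           temperature: float = 2.0, alpha: float = 0.5):
    """Hard-label cross-entropy per modality plus KL to the fused model's soft labels."""
    soft_targets = F.softmax(fused_logits.detach() / temperature, dim=-1)
    loss = F.cross_entropy(fused_logits, labels)                 # supervise the fused model
    for logits in modal_logits:
        ce = F.cross_entropy(logits, labels)                     # hard-label supervision
        kd = F.kl_div(F.log_softmax(logits / temperature, dim=-1),
                      soft_targets, reduction="batchmean") * temperature ** 2
        loss = loss + alpha * ce + (1 - alpha) * kd              # soft-label distillation
    return loss


# Toy usage: three 128-d modality features for 4 utterances, 6 emotion classes.
if __name__ == "__main__":
    dim, num_classes = 128, 6
    feats = [torch.randn(4, dim) for _ in range(3)]
    fusion, clf = GatedFusion(dim), nn.Linear(dim, num_classes)
    fused_logits = clf(fusion(feats))
    modal_logits = [clf(f) for f in feats]
    labels = torch.randint(0, num_classes, (4,))
    print(self_distillation_loss(modal_logits, fused_logits, labels))
```

In this reading, each modality receives both the ground-truth (hard) labels and the fused model's temperature-softened predictions, so the unimodal branches are pushed toward representations consistent with the multimodal decision.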

Hui Ma, Jian Wang, Hongfei Lin, Bo Zhang, Yijia Zhang, Bo Xu • 2023

Related benchmarks

Task | Dataset | Result | Rank
Emotion Recognition | IEMOCAP | - | 115
Multimodal Emotion Recognition in Conversation | MELD, standard (test) | WF1: 66.6 | 53
Multimodal Emotion Recognition in Conversation | MELD | Weighted Avg F1 Score: 66.6 | 36
Multimodal Emotion Recognition | IEMOCAP | Accuracy: 73.95 | 24
Multimodal Emotion Recognition in Conversations | MELD → IEMOCAP (target) | Joy Accuracy: 50.18 | 15
Cross-scenario Multimodal Emotion Recognition | MELD → IEMOCAP, 20% noise (test) | Joy Accuracy: 44.1 | 15
Cross-scenario Multimodal Emotion Recognition in Conversations | MELD → IEMOCAP, noise rate 40% (test) | Joy Accuracy: 38.01 | 15
Cross-scenario Multimodal Emotion Recognition | IEMOCAP → MELD, 20% noise (test) | Joy Score: 8.75 | 15
Multimodal Emotion Recognition in Conversations | IEMOCAP → MELD (target) | Joy Score: 8.74 | 15
Cross-scenario Multimodal Emotion Recognition in Conversations | IEMOCAP → MELD, noise rate 40% (test) | Joy Accuracy: 7.34 | 15

(Showing 10 of 13 benchmark rows.)

Other info

Code
