
A Transformer-Based Model With Self-Distillation for Multimodal Emotion Recognition in Conversations

About

Emotion recognition in conversations (ERC), the task of recognizing the emotion of each utterance in a conversation, is crucial for building empathetic machines. Existing studies focus mainly on capturing context- and speaker-sensitive dependencies in the textual modality but ignore the significance of multimodal information. Unlike emotion recognition in textual conversations, multimodal ERC depends on capturing intra- and inter-modal interactions between utterances, learning weights between different modalities, and enhancing modal representations. In this paper, we propose a transformer-based model with self-distillation (SDT) for the task. The model captures intra- and inter-modal interactions with intra- and inter-modal transformers and learns weights between modalities dynamically through a hierarchical gated fusion strategy. Furthermore, to learn more expressive modal representations, we treat the soft labels of the proposed model as extra training supervision. Specifically, we introduce self-distillation to transfer knowledge of hard and soft labels from the proposed model to each modality. Experiments on the IEMOCAP and MELD datasets demonstrate that SDT outperforms previous state-of-the-art baselines.
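The following is a minimal, hypothetical PyTorch sketch of two ideas mentioned in the abstract: a gated fusion that learns dynamic weights over the text, audio, and visual representations, and a self-distillation loss that transfers the fused model's hard- and soft-label knowledge back to each modality. Module and parameter names (GatedFusion, self_distillation_loss, temperature, alpha) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch; not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedFusion(nn.Module):
    """Learn dynamic per-sample weights over modality features before fusing them."""

    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        self.gate = nn.Linear(dim * num_modalities, num_modalities)

    def forward(self, feats):
        # feats: list of (batch, dim) tensors, e.g. [text, audio, visual]
        stacked = torch.stack(feats, dim=1)                      # (batch, M, dim)
        weights = torch.softmax(self.gate(torch.cat(feats, -1)), dim=-1)  # (batch, M)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)      # (batch, dim)


def self_distillation_loss(modal_logits, fused_logits, labels,
                           temperature: float = 2.0, alpha: float = 0.5):
    """Hard-label cross-entropy per modality plus KL to the fused model's soft labels."""
    soft_targets = F.softmax(fused_logits.detach() / temperature, dim=-1)
    loss = F.cross_entropy(fused_logits, labels)                 # supervise the fused model
    for logits in modal_logits:
        ce = F.cross_entropy(logits, labels)                     # hard-label supervision
        kd = F.kl_div(F.log_softmax(logits / temperature, dim=-1),
                      soft_targets, reduction="batchmean") * temperature ** 2
        loss = loss + alpha * ce + (1 - alpha) * kd              # soft-label distillation
    return loss


# Toy usage: three 128-d modality features for 4 utterances, 6 emotion classes.
if __name__ == "__main__":
    dim, num_classes = 128, 6
    feats = [torch.randn(4, dim) for _ in range(3)]
    fusion, clf = GatedFusion(dim), nn.Linear(dim, num_classes)
    fused_logits = clf(fusion(feats))
    modal_logits = [clf(f) for f in feats]
    labels = torch.randint(0, num_classes, (4,))
    print(self_distillation_loss(modal_logits, fused_logits, labels))
```

In this reading, each modality receives both the ground-truth (hard) labels and the fused model's temperature-softened predictions, so the unimodal branches are pushed toward representations consistent with the multimodal decision.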

Hui Ma, Jian Wang, Hongfei Lin, Bo Zhang, Yijia Zhang, Bo Xu • 2023

Related benchmarks

Task | Dataset | Result | Rank
Emotion Recognition | IEMOCAP | - | 115
Multimodal Emotion Recognition in Conversation | MELD, standard (test) | WF1: 66.6 | 53
Multimodal Emotion Recognition in Conversation | MELD | Weighted Avg F1 Score: 66.6 | 36
Multimodal Emotion Recognition | IEMOCAP | Accuracy: 73.95 | 24
Multimodal Emotion Recognition in Conversations | MELD → IEMOCAP (target) | Joy Accuracy: 50.18 | 15
Cross-scenario Multimodal Emotion Recognition | MELD → IEMOCAP, 20% noise (test) | Joy Accuracy: 44.1 | 15
Cross-scenario Multimodal Emotion Recognition in Conversations | MELD → IEMOCAP, noise rate 40% (test) | Joy Accuracy: 38.01 | 15
Cross-scenario Multimodal Emotion Recognition | IEMOCAP → MELD, 20% noise (test) | Joy Score: 8.75 | 15
Multimodal Emotion Recognition in Conversations | IEMOCAP → MELD (target) | Joy Score: 8.74 | 15
Cross-scenario Multimodal Emotion Recognition in Conversations | IEMOCAP → MELD, noise rate 40% (test) | Joy Accuracy: 7.34 | 15

(Showing 10 of 13 benchmark rows.)

Other info

Code
