emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation

About

We propose emotion2vec, a universal speech emotion representation model. emotion2vec is pre-trained on open-source unlabeled emotion data through self-supervised online distillation, combining an utterance-level loss and a frame-level loss during pre-training. With only linear layers trained for the speech emotion recognition task, emotion2vec outperforms state-of-the-art pre-trained universal models and emotion specialist models on the mainstream IEMOCAP dataset. In addition, emotion2vec shows consistent improvements across speech emotion recognition datasets in 10 different languages. emotion2vec also shows excellent results on other emotion tasks, such as song emotion recognition, emotion prediction in conversation, and sentiment analysis. Comparison experiments, ablation experiments, and visualization comprehensively demonstrate the universal capability of the proposed emotion2vec. To the best of our knowledge, emotion2vec is the first universal representation model for various emotion-related tasks, filling a gap in the field.

Ziyang Ma, Zhisheng Zheng, Jiaxin Ye, Jinchao Li, Zhifu Gao, Shiliang Zhang, Xie Chen • 2023
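
The abstract states that pre-training combines an utterance-level loss with a frame-level loss under online distillation, but gives no formulas. The following is a minimal sketch of how such an objective could be assembled, not the authors' implementation. It assumes mean squared error for both terms, mean pooling over time for the utterance vector, a hypothetical weight `alpha` between the two terms, and an exponential-moving-average teacher with a hypothetical decay `tau`; none of these details come from the text above.

```python
from typing import List

Frames = List[List[float]]  # one utterance: T frames, each a D-dim vector


def mean_pool(frames: Frames) -> List[float]:
    """Average over time to obtain a single utterance-level vector (assumed pooling)."""
    n_frames = len(frames)
    return [sum(column) / n_frames for column in zip(*frames)]


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def frame_loss(student: Frames, teacher: Frames, masked: List[int]) -> float:
    """Frame-level term: MSE averaged over the masked frame positions only."""
    return sum(mse(student[i], teacher[i]) for i in masked) / len(masked)


def utterance_loss(student: Frames, teacher: Frames) -> float:
    """Utterance-level term: MSE between the pooled student and teacher vectors."""
    return mse(mean_pool(student), mean_pool(teacher))


def combined_loss(student: Frames, teacher: Frames, masked: List[int],
                  alpha: float = 1.0) -> float:
    """Sum of both terms; `alpha` is a hypothetical weighting hyperparameter."""
    return frame_loss(student, teacher, masked) + alpha * utterance_loss(student, teacher)


def ema_update(teacher_w: List[float], student_w: List[float],
               tau: float = 0.999) -> List[float]:
    """Online distillation step (assumed): teacher weights track the student by EMA."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher_w, student_w)]


if __name__ == "__main__":
    student = [[1.0, 0.0], [0.0, 1.0]]
    teacher = [[1.0, 0.0], [0.0, 0.0]]
    print(combined_loss(student, teacher, masked=[1]))
```

In this toy setup the student sees a masked input and is scored against the teacher's outputs for the same utterance; only the student would receive gradients, while the teacher is refreshed with `ema_update` after each step.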

Related benchmarks

Task	Dataset	Result	(unlabeled)
Multimodal Sentiment Analysis	CMU-MOSI	--	166
Emotion Recognition	MELD (test)	--	89
Sentiment Analysis	CMU-MOSEI	WF1 0.7656	60
Speech Emotion Recognition	RAVDESS	Unweighted Accuracy 82.86	43
Speech Emotion Recognition	IEMOCAP (five-fold/ten-fold cross-validation)	WA 77.64	25
Speech Emotion Recognition	MELD	--	24
Speech Emotion Recognition	SUBESCO Bengali (Bn)	Weighted Accuracy 90.91	17
Speech Emotion Recognition	MELD In-Domain v1 (test)	Accuracy 45.04	14
Speech Emotion Recognition	Emo-Emilia Zero-Shot v1 (test)	Accuracy (ACC) 52.79	13
Speech Emotion Recognition	EMOVO Zero-Shot v1 (test)	Accuracy 33.53	13

Showing 10 of 44 rows
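
The table reports several metrics without defining them. As a rough guide, the sketch below computes them under the conventions common in speech emotion recognition: weighted accuracy (WA) as overall accuracy, unweighted accuracy (UA) as the mean of per-class recalls, and weighted F1 (WF1) as per-class F1 averaged by class support. These definitions are an assumption; individual benchmarks listed above may compute their scores differently.

```python
from collections import Counter
from typing import Sequence


def weighted_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Overall accuracy: every sample counts equally, so frequent classes dominate."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Mean of per-class recalls: every class counts equally."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / support[c] for c in support) / len(support)


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    score = 0.0
    for label, n in support.items():
        precision = hits[label] / predicted[label] if predicted[label] else 0.0
        recall = hits[label] / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n / len(y_true)
    return score


if __name__ == "__main__":
    # Toy labels (made up for illustration): a classifier that always predicts "neu".
    y_true = ["neu", "neu", "neu", "ang"]
    y_pred = ["neu", "neu", "neu", "neu"]
    print(weighted_accuracy(y_true, y_pred))    # 0.75
    print(unweighted_accuracy(y_true, y_pred))  # 0.5
    print(weighted_f1(y_true, y_pred))
```

The toy example shows why both accuracies are often reported together: on imbalanced data a majority-class predictor keeps a high WA while its UA drops.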
