Attention Based Fully Convolutional Network for Speech Emotion Recognition

About

Speech emotion recognition is challenging for three main reasons: 1) human emotion is abstract and therefore hard to distinguish; 2) in general, emotion can be detected only at specific moments within a long utterance; 3) speech data with emotion labels is usually limited. In this paper, we present a novel attention-based fully convolutional network for speech emotion recognition. We employ a fully convolutional network because it can handle variable-length speech without segmentation, so that critical information is not lost. The proposed attention mechanism makes the model aware of which time-frequency regions of the speech spectrogram are more emotion-relevant. Given the limited data, transfer learning is also adopted to improve accuracy. Notably, a model pre-trained on natural scene images yields a clear improvement. Validated on the publicly available IEMOCAP corpus, the proposed model outperforms state-of-the-art methods, with a weighted accuracy of 70.4% and an unweighted accuracy of 63.9%.
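The abstract reports two metrics, weighted and unweighted accuracy, without defining them. The sketch below follows the convention commonly used in the speech emotion recognition literature, which is an assumption here rather than something stated on this page: weighted accuracy is the overall fraction of correctly classified utterances, and unweighted accuracy is the mean of the per-class recalls. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct utterances divided by all utterances."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every emotion class counts equally."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    totals = defaultdict(int)  # utterances per true class
    hits = defaultdict(int)    # correctly classified utterances per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, imbalanced example (made-up labels, not IEMOCAP data).
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"] + ["neutral", "neutral"]

wa = weighted_accuracy(y_true, y_pred)    # 7 of 10 utterances correct
ua = unweighted_accuracy(y_true, y_pred)  # recalls 1.0, 0.5, 0.0 averaged
print(f"WA = {wa:.3f}, UA = {ua:.3f}")
```

On imbalanced data the two numbers diverge, as in the toy example, where the majority class dominates the weighted figure. This is one reason results on corpora such as IEMOCAP are often reported with both.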

Yuanyuan Zhang, Jun Du, Zirui Wang, Jianshu Zhang • 2018

Related benchmarks

Task	Dataset	Result	Rank
Speech Emotion Recognition	LSSED 1.0 (test)	WA: 57	21

