
A vector quantized masked autoencoder for speech emotion recognition

About

Recent years have seen remarkable progress in speech emotion recognition (SER), thanks to advances in deep learning techniques. However, the limited availability of labeled data remains a significant challenge in the field. Self-supervised learning has recently emerged as a promising solution to address this challenge. In this paper, we propose the vector quantized masked autoencoder for speech (VQ-MAE-S), a self-supervised model that is fine-tuned to recognize emotions from speech signals. The VQ-MAE-S model is based on a masked autoencoder (MAE) that operates in the discrete latent space of a vector-quantized variational autoencoder. Experimental results show that the proposed VQ-MAE-S model, pre-trained on the VoxCeleb2 dataset and fine-tuned on emotional speech data, outperforms an MAE working on the raw spectrogram representation and other state-of-the-art methods in SER.
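To make the core idea concrete, here is a minimal numpy sketch of the two steps the abstract describes: vector-quantizing continuous speech features into discrete codebook indices, then randomly masking a fraction of those tokens so that an MAE-style model can be trained to recover them. All sizes, names, and the random codebook below are illustrative assumptions, not the authors' actual VQ-MAE-S implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's hyperparameters)
num_frames, feat_dim = 50, 16   # spectrogram frames and per-frame feature size
codebook_size = 64              # number of VQ codebook vectors
mask_ratio = 0.5                # fraction of tokens to mask
MASK_ID = codebook_size         # reserved id for the mask token

codebook = rng.normal(size=(codebook_size, feat_dim))  # stand-in VQ-VAE codebook
frames = rng.normal(size=(num_frames, feat_dim))       # stand-in speech features

# 1) Vector quantization: map each frame to its nearest codebook entry,
#    giving a sequence of discrete token ids.
dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)   # shape (num_frames,), values in [0, codebook_size)

# 2) Random masking in the discrete latent space: the masked autoencoder
#    is trained to predict the original token ids at the masked positions.
num_masked = int(mask_ratio * num_frames)
masked_pos = rng.choice(num_frames, size=num_masked, replace=False)
inputs = tokens.copy()
inputs[masked_pos] = MASK_ID    # model sees `inputs`, must recover `tokens`
```

Operating on discrete token ids rather than the raw spectrogram turns reconstruction into a classification problem over codebook indices, which is the property the paper contrasts against an MAE on raw spectrograms.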

Samir Sadok, Simon Leglaive, Renaud Séguier • 2023

Related benchmarks

Task | Dataset | Metric | Result | Rank
Speech Emotion Recognition | IEMOCAP (test) | Accuracy | 66.4 | 20
Emotion Recognition | RAVDESS 7-class | WAR | 83.2 | 19
Emotion Recognition | RAVDESS (test) | Accuracy | 0.841 | 17
Emotion Recognition | CREMA-D 6-class | WAR | 78.4 | 17
Song Emotion Recognition | RAVDESS Song | Weighted Accuracy | 85.8 | 11
Speech Emotion Recognition | RAVDESS (6-fold subject-independent cross-validation) | Weighted Accuracy (WA) | 84.8 | 8
Speech Emotion Recognition | RAVDESS-Song (test) | Accuracy | 85.8 | 5
Speech Emotion Recognition | EMODB (test) | Accuracy | 90.2 | 5
