
MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition

About

This paper presents a novel approach to processing multimodal data for dynamic emotion recognition, named the Multimodal Masked Autoencoder for Dynamic Emotion Recognition (MultiMAE-DER). MultiMAE-DER leverages the closely correlated representational information within spatiotemporal sequences across the visual and audio modalities. By building on a pre-trained masked autoencoder model, MultiMAE-DER is obtained through simple, straightforward fine-tuning. Its performance is further enhanced by optimizing over six fusion strategies for the multimodal input sequence. These strategies address dynamic feature correlations within cross-domain data across spatial, temporal, and spatiotemporal sequences. Compared with state-of-the-art multimodal supervised learning models for dynamic emotion recognition, MultiMAE-DER improves the weighted average recall (WAR) by 4.41% on the RAVDESS dataset and by 2.06% on CREMA-D. Furthermore, compared with the state-of-the-art multimodal self-supervised learning model, MultiMAE-DER achieves a 1.86% higher WAR on the IEMOCAP dataset.
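To make the idea of a fusion strategy concrete, the sketch below shows one plausible option: rendering the audio as a spectrogram image and concatenating it spatially onto every video frame, so a single masked autoencoder can consume both modalities as one spatiotemporal sequence. The function name, shapes, and the specific concatenation axis are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def fuse_spatial(video, audio_spec):
    """Hypothetical spatial fusion of visual and audio modalities.

    video:      (T, H, W, C) RGB frame sequence
    audio_spec: (H, W_a, C)  log-mel spectrogram rendered as an image
    returns:    (T, H, W + W_a, C) fused spatiotemporal sequence
    """
    T = video.shape[0]
    # Tile the spectrogram along the time axis so every frame carries
    # the same audio context, then concatenate along the width axis.
    audio_tiled = np.repeat(audio_spec[None, ...], T, axis=0)
    return np.concatenate([video, audio_tiled], axis=2)

# Toy example: 16 frames of 224x224 RGB plus a 224x64 spectrogram image.
video = np.zeros((16, 224, 224, 3), dtype=np.float32)
audio = np.zeros((224, 64, 3), dtype=np.float32)
fused = fuse_spatial(video, audio)
print(fused.shape)  # (16, 224, 288, 3)
```

Other strategies in the same spirit could instead concatenate along the temporal axis (audio frames appended after video frames) or interleave the two modalities, which is how the paper's spatial, temporal, and spatiotemporal variants would differ.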

Peihao Xiang, Chaohao Lin, Kaida Wu, Ou Bai • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Emotion Recognition | IEMOCAP 4-class (test) | WAR | 63.73 | 46 |
| Emotion Recognition | RAVDESS 7-class | WAR | 83.61 | 19 |
| Emotion Recognition | CREMA-D 6-class | WAR | 79.36 | 17 |
| Categorical Emotion Recognition | CREMA-D | UAR | 79.12 | 14 |
| Facial Emotion Recognition | RAVDESS | WAR | 83.61 | 8 |
