
MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions

About

Speech emotion recognition (SER) is a challenging task due to the subjective nature of human emotions and their uneven representation under naturalistic conditions. We propose MEDUSA, a multimodal framework with a four-stage training pipeline, which effectively handles class imbalance and emotion ambiguity. The first two stages train an ensemble of classifiers that utilize DeepSER, a novel extension of a deep cross-modal transformer fusion mechanism from pretrained self-supervised acoustic and linguistic representations. Manifold MixUp is employed for further regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. Our training approach incorporates human annotation scores as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA ranked 1st in Task 1: Categorical Emotion Recognition in the Interspeech 2025: Speech Emotion Recognition in Naturalistic Conditions Challenge.
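
Two of the ideas mentioned above, annotation scores used as soft targets and MixUp-style interpolation of hidden representations, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the vote-counting scheme, the choice of cross-entropy as the loss, and the Beta(alpha, alpha) mixing weight are all assumptions made for illustration, since the listing does not specify these details.

```python
import math
import random


def soft_targets(votes, num_classes):
    """Turn per-annotator class votes into a probability distribution.

    Assumption: each annotator casts one vote for a class index.
    """
    counts = [0.0] * num_classes
    for v in votes:
        counts[v] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target distribution and model logits."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def mixup(h_a, t_a, h_b, t_b, lam):
    """Interpolate two hidden vectors and their soft targets with weight lam.

    In Manifold MixUp the vectors are intermediate-layer activations;
    which layer is mixed is left open here (hypothetical).
    """
    h = [lam * a + (1 - lam) * b for a, b in zip(h_a, h_b)]
    t = [lam * a + (1 - lam) * b for a, b in zip(t_a, t_b)]
    return h, t


if __name__ == "__main__":
    # Three annotators: two vote class 0, one votes class 1.
    target_a = soft_targets([0, 0, 1], num_classes=3)
    target_b = soft_targets([2, 2, 2], num_classes=3)
    # Mixing weight drawn from Beta(alpha, alpha); alpha is an assumed value.
    lam = random.betavariate(0.4, 0.4)
    hidden, target = mixup([1.0, 3.0], target_a, [3.0, 1.0], target_b, lam)
    print(hidden, target, soft_cross_entropy([0.2, 0.1, -0.3], target))
```

Keeping the targets as distributions rather than hard labels lets ambiguous utterances contribute to several classes at once, and interpolated targets remain valid distributions because both inputs sum to one.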

Georgios Chatzichristodoulou, Despoina Kosmopoulou, Antonios Kritikos, Anastasia Poulopoulou, Efthymios Georgiou, Athanasios Katsamanis, Vassilis Katsouros, Alexandros Potamianos • 2025

Related benchmarks

Task                        Dataset                    Result                              Rank
Speech Emotion Recognition  IEMOCAP                    UA 67.78                            22
Speech Emotion Recognition  MELD                       --                                  19
Speech Emotion Recognition  MSP-Podcast 2.0 (test 3)   WAR 33.91                           8
Speech Emotion Recognition  MSP-Podcast 2.0 (test 1)   Weighted Accuracy Rate (WAR) 41.57  5
Speech Emotion Recognition  MSP-Podcast 2.0 (test)     Weighted Accuracy (WAR) 42.26       5