
MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions

About

Speech emotion recognition (SER) is a challenging task due to the subjective nature of human emotions and their uneven representation under naturalistic conditions. We propose MEDUSA, a multimodal framework with a four-stage training pipeline, which effectively handles class imbalance and emotion ambiguity. The first two stages train an ensemble of classifiers that utilize DeepSER, a novel extension of a deep cross-modal transformer fusion mechanism that operates on pretrained self-supervised acoustic and linguistic representations. Manifold MixUp is employed for further regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. Our training approach incorporates human annotation scores as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA ranked 1st in Task 1 (Categorical Emotion Recognition) of the Interspeech 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge.
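Two of the ingredients named above, soft targets built from human annotation scores and MixUp-style interpolation, can be illustrated with a small sketch. This is not the authors' implementation: the label set `EMOTIONS`, the helper names, and the `alpha` default are assumptions made for illustration, and the interpolation is shown on plain vectors, whereas Manifold MixUp applies it to hidden representations inside a network.

```python
import math
import random
from collections import Counter

# Hypothetical label set, chosen only for this example.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def soft_target(votes, classes=EMOTIONS):
    """Turn per-annotator votes into a probability distribution over classes."""
    counts = Counter(votes)
    total = sum(counts[c] for c in classes)
    if total == 0:
        raise ValueError("no votes for any known class")
    return [counts[c] / total for c in classes]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target distribution and model logits."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def mixup(h_a, y_a, h_b, y_b, lam):
    """Interpolate two representation vectors and their soft targets."""
    h = [lam * a + (1 - lam) * b for a, b in zip(h_a, h_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return h, y


def sample_lambda(alpha=0.2, rng=random):
    """Draw the mixing weight from Beta(alpha, alpha); alpha is an assumed value."""
    return rng.betavariate(alpha, alpha)


if __name__ == "__main__":
    y_a = soft_target(["happy", "happy", "neutral"])
    y_b = soft_target(["sad", "sad", "sad"])
    h, y = mixup([1.0, 0.0], y_a, [3.0, 2.0], y_b, lam=sample_lambda())
    print(y, soft_cross_entropy([0.1, 1.2, 0.3, -0.5], y))
```

An utterance on which annotators disagree thus contributes a distribution rather than a single hard label, which is one way to encode emotion ambiguity in the training loss.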

Georgios Chatzichristodoulou, Despoina Kosmopoulou, Antonios Kritikos, Anastasia Poulopoulou, Efthymios Georgiou, Athanasios Katsamanis, Vassilis Katsouros, Alexandros Potamianos • 2025

Related benchmarks

Task                        Dataset                    Result                              Rank
Speech Emotion Recognition  MELD                       --                                  24
Speech Emotion Recognition  IEMOCAP                    UA 67.78                            22
Speech Emotion Recognition  MSP-Podcast 2.0 (test 3)   WAR 33.91                           8
Speech Emotion Recognition  MSP-Podcast 2.0 (test 1)   Weighted Accuracy Rate (WAR) 41.57  5
Speech Emotion Recognition  MSP-Podcast 2.0 (test)     Weighted Accuracy (WAR) 42.26       5
