MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions
About
Speech emotion recognition (SER) is a challenging task due to the subjective nature of human emotions and their uneven representation under naturalistic conditions. We propose MEDUSA, a multimodal framework with a four-stage training pipeline that effectively handles class imbalance and emotion ambiguity. The first two stages train an ensemble of classifiers built on DeepSER, a novel extension of a deep cross-modal transformer fusion mechanism operating on pretrained self-supervised acoustic and linguistic representations; Manifold MixUp is employed for further regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. The training recipe incorporates human annotation scores as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA ranked 1st in Task 1 (Categorical Emotion Recognition) of the Interspeech 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge.
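To make the pipeline description concrete, below is a minimal PyTorch sketch of the components named above. It is an illustration under assumptions, not the released implementation: `DeepSERBlock` is a simplified stand-in for DeepSER's deep fusion, the input features are random placeholders for pretrained self-supervised acoustic/linguistic representations, and all names, dimensions, and the Beta(0.4, 0.4) mixup coefficient are hypothetical.

```python
# Minimal sketch of a MEDUSA-style pipeline (illustrative assumptions throughout,
# not the authors' released code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepSERBlock(nn.Module):
    """Simplified cross-modal transformer fusion: each modality attends to the other."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.a2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, acoustic, text):
        # Acoustic tokens query linguistic tokens, and vice versa (residual updates).
        a = self.norm_a(acoustic + self.a2t(acoustic, text, text)[0])
        t = self.norm_t(text + self.t2a(text, acoustic, acoustic)[0])
        return a, t


class FusionClassifier(nn.Module):
    """One ensemble member (stages 1-2): stacked fusion blocks + pooled head."""

    def __init__(self, dim: int, num_classes: int, depth: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(DeepSERBlock(dim) for _ in range(depth))
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, acoustic, text, mixup_lam=None, perm=None, mixup_layer=0):
        for i, block in enumerate(self.blocks):
            if mixup_lam is not None and i == mixup_layer:
                # Manifold MixUp: interpolate *hidden* states of shuffled batch pairs.
                acoustic = mixup_lam * acoustic + (1 - mixup_lam) * acoustic[perm]
                text = mixup_lam * text + (1 - mixup_lam) * text[perm]
            acoustic, text = block(acoustic, text)
        pooled = torch.cat([acoustic.mean(1), text.mean(1)], dim=-1)
        return self.head(pooled)


def soft_target_loss(logits, annotator_dist):
    """KL divergence against the distribution of human annotation scores."""
    return F.kl_div(F.log_softmax(logits, dim=-1), annotator_dist, reduction="batchmean")


class MetaClassifier(nn.Module):
    """Stages 3-4: trainable combiner over concatenated ensemble predictions."""

    def __init__(self, num_members: int, num_classes: int):
        super().__init__()
        self.combine = nn.Linear(num_members * num_classes, num_classes)

    def forward(self, member_logits):  # list of (B, C) tensors
        return self.combine(torch.cat(member_logits, dim=-1))


# Toy forward/backward pass; pretrained encoders would normally produce the features.
B, Ta, Tt, D, C = 4, 50, 20, 256, 8
acoustic = torch.randn(B, Ta, D)                     # SSL speech features (placeholder)
text = torch.randn(B, Tt, D)                         # SSL text features (placeholder)
targets = torch.softmax(torch.randn(B, C), dim=-1)   # annotator vote distribution

lam = torch.distributions.Beta(0.4, 0.4).sample().item()
perm = torch.randperm(B)
model = FusionClassifier(D, C)
logits = model(acoustic, text, mixup_lam=lam, perm=perm)

# Targets are mixed with the same coefficient as the hidden states.
loss = lam * soft_target_loss(logits, targets) + (1 - lam) * soft_target_loss(logits, targets[perm])
loss.backward()

# Stacking: the meta-classifier consumes the (detached) member predictions.
meta = MetaClassifier(num_members=2, num_classes=C)
ensemble_logits = meta([logits.detach(), logits.detach()])
```

Two design points carry over from the abstract: interpolating hidden states rather than raw inputs is what distinguishes Manifold MixUp from input-level mixup, and training against the annotator score distribution preserves emotion ambiguity instead of collapsing it to a majority-vote label.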
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Speech Emotion Recognition | IEMOCAP | UA 67.78 | 22 |
| Speech Emotion Recognition | MELD | -- | 19 |
| Speech Emotion Recognition | MSP-Podcast 2.0 (test 3) | WAR 33.91 | 8 |
| Speech Emotion Recognition | MSP-Podcast 2.0 (test 1) | WAR 41.57 | 5 |
| Speech Emotion Recognition | MSP-Podcast 2.0 (test) | WAR 42.26 | 5 |

UA: unweighted accuracy; WAR: weighted accuracy rate.
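The two metrics differ only in how classes are weighted. Assuming the conventional SER definitions (WAR as recall weighted by class frequency, i.e. overall accuracy; UA as macro-averaged recall), a minimal sketch:

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    # WAR: per-class recall weighted by class frequency == plain accuracy.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def unweighted_accuracy(y_true, y_pred):
    # UA: per-class recall averaged with equal class weights (macro recall).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    recalls = [(y_pred[y_true == c] == c).mean() for c in classes]
    return float(np.mean(recalls))

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 0, 2]
print(weighted_accuracy(y_true, y_pred))    # 4/6 ~= 0.667
print(unweighted_accuracy(y_true, y_pred))  # mean(2/3, 1/2, 1/1) ~= 0.722
```

Under class imbalance the two diverge sharply, which is why imbalance-aware training (balanced sampling, soft targets) matters for the leaderboard numbers above.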