MEDUSA: A Multimodal Deep Fusion Multi-Stage Training Framework for Speech Emotion Recognition in Naturalistic Conditions
About
Speech emotion recognition (SER) is a challenging task due to the subjective nature of human emotions and their uneven representation under naturalistic conditions. We propose MEDUSA, a multimodal framework with a four-stage training pipeline that effectively handles class imbalance and emotion ambiguity. The first two stages train an ensemble of classifiers built on DeepSER, a novel extension of a deep cross-modal transformer fusion mechanism operating on pretrained self-supervised acoustic and linguistic representations; Manifold MixUp is employed for further regularization. The last two stages optimize a trainable meta-classifier that combines the ensemble predictions. The training recipe incorporates human annotation scores as soft targets, coupled with balanced data sampling and multitask learning. MEDUSA ranked 1st in Task 1 (Categorical Emotion Recognition) of the Interspeech 2025 Speech Emotion Recognition in Naturalistic Conditions Challenge.
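To make the pipeline description concrete, below is a minimal PyTorch sketch of the components named above. It is an illustration under assumptions, not the released implementation: `DeepSERBlock` is a simplified stand-in for DeepSER's deep fusion, the input features are random placeholders for pretrained self-supervised acoustic/linguistic representations, and all names, dimensions, and the Beta(0.4, 0.4) mixup coefficient are hypothetical.

```python
# Minimal sketch of a MEDUSA-style pipeline (illustrative assumptions throughout,
# not the authors' released code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepSERBlock(nn.Module):
    """Simplified cross-modal transformer fusion: each modality attends to the other."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.a2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, acoustic, text):
        # Acoustic tokens query linguistic tokens, and vice versa (residual updates).
        a = self.norm_a(acoustic + self.a2t(acoustic, text, text)[0])
        t = self.norm_t(text + self.t2a(text, acoustic, acoustic)[0])
        return a, t


class FusionClassifier(nn.Module):
    """One ensemble member (stages 1-2): stacked fusion blocks + pooled head."""

    def __init__(self, dim: int, num_classes: int, depth: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(DeepSERBlock(dim) for _ in range(depth))
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, acoustic, text, mixup_lam=None, perm=None, mixup_layer=0):
        for i, block in enumerate(self.blocks):
            if mixup_lam is not None and i == mixup_layer:
                # Manifold MixUp: interpolate *hidden* states of shuffled batch pairs.
                acoustic = mixup_lam * acoustic + (1 - mixup_lam) * acoustic[perm]
                text = mixup_lam * text + (1 - mixup_lam) * text[perm]
            acoustic, text = block(acoustic, text)
        pooled = torch.cat([acoustic.mean(1), text.mean(1)], dim=-1)
        return self.head(pooled)


def soft_target_loss(logits, annotator_dist):
    """KL divergence against the distribution of human annotation scores."""
    return F.kl_div(F.log_softmax(logits, dim=-1), annotator_dist, reduction="batchmean")


class MetaClassifier(nn.Module):
    """Stages 3-4: trainable combiner over concatenated ensemble predictions."""

    def __init__(self, num_members: int, num_classes: int):
        super().__init__()
        self.combine = nn.Linear(num_members * num_classes, num_classes)

    def forward(self, member_logits):  # list of (B, C) tensors
        return self.combine(torch.cat(member_logits, dim=-1))


# Toy forward/backward pass; pretrained encoders would normally produce the features.
B, Ta, Tt, D, C = 4, 50, 20, 256, 8
acoustic = torch.randn(B, Ta, D)                     # SSL speech features (placeholder)
text = torch.randn(B, Tt, D)                         # SSL text features (placeholder)
targets = torch.softmax(torch.randn(B, C), dim=-1)   # annotator vote distribution

lam = torch.distributions.Beta(0.4, 0.4).sample().item()
perm = torch.randperm(B)
model = FusionClassifier(D, C)
logits = model(acoustic, text, mixup_lam=lam, perm=perm)

# Targets are mixed with the same coefficient as the hidden states.
loss = lam * soft_target_loss(logits, targets) + (1 - lam) * soft_target_loss(logits, targets[perm])
loss.backward()

# Stacking: the meta-classifier consumes the (detached) member predictions.
meta = MetaClassifier(num_members=2, num_classes=C)
ensemble_logits = meta([logits.detach(), logits.detach()])
```

Two design points carry over from the abstract: interpolating hidden states rather than raw inputs is what distinguishes Manifold MixUp from input-level mixup, and training against the annotator score distribution preserves emotion ambiguity instead of collapsing it to a majority-vote label.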
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Speech Emotion Recognition | IEMOCAP | UA 67.78 | 22 |
| Speech Emotion Recognition | MELD | -- | 19 |
| Speech Emotion Recognition | MSP-Podcast 2.0 (test 3) | WAR 33.91 | 8 |
| Speech Emotion Recognition | MSP-Podcast 2.0 (test 1) | WAR 41.57 | 5 |
| Speech Emotion Recognition | MSP-Podcast 2.0 (test) | WAR 42.26 | 5 |

UA: unweighted accuracy; WAR: weighted accuracy rate.
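The two metrics differ only in how classes are weighted. Assuming the conventional SER definitions (WAR as recall weighted by class frequency, i.e. overall accuracy; UA as macro-averaged recall), a minimal sketch:

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    # WAR: per-class recall weighted by class frequency == plain accuracy.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def unweighted_accuracy(y_true, y_pred):
    # UA: per-class recall averaged with equal class weights (macro recall).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    recalls = [(y_pred[y_true == c] == c).mean() for c in classes]
    return float(np.mean(recalls))

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 0, 2]
print(weighted_accuracy(y_true, y_pred))    # 4/6 ~= 0.667
print(unweighted_accuracy(y_true, y_pred))  # mean(2/3, 1/2, 1/1) ~= 0.722
```

Under class imbalance the two diverge sharply, which is why imbalance-aware training (balanced sampling, soft targets) matters for the leaderboard numbers above.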