Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition
About
In this work, we introduce a paralinguistic supervision paradigm for low-resource multilingual speech emotion recognition (LRM-SER) that leverages non-verbal vocalizations to exploit prosody-centric emotion cues. Unlike conventional SER systems, which rely heavily on labeled verbal speech and suffer from poor cross-lingual transfer, our approach reformulates LRM-SER as non-verbal-to-verbal transfer: supervision from a labeled non-verbal source domain is adapted to unlabeled verbal speech across multiple target languages. To this end, we propose NOVA-ARC, a geometry-aware framework that models affective structure in the Poincaré ball, discretizes paralinguistic patterns via a hyperbolic vector-quantized prosody codebook, and captures emotion intensity through a hyperbolic emotion lens. For unsupervised adaptation, NOVA-ARC performs optimal-transport-based prototype alignment between source emotion prototypes and target utterances, inducing soft supervision for unlabeled speech that is stabilized through consistency regularization. Experiments show that NOVA-ARC delivers the strongest performance under both non-verbal-to-verbal adaptation and the complementary verbal-to-verbal transfer setting, consistently outperforming Euclidean counterparts and strong SSL baselines. To the best of our knowledge, this work is the first to move beyond verbal-speech-centric supervision by introducing a non-verbal-to-verbal transfer paradigm for SER.
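To make the adaptation step described above more concrete, the following is a minimal sketch of how optimal-transport-based prototype alignment in the Poincaré ball could produce soft labels for unlabeled target utterances. It is not the authors' implementation. The function names (`poincare_distance`, `sinkhorn_soft_labels`), the unit curvature, the uniform marginals, the entropic (Sinkhorn) solver, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import numpy as np


def poincare_distance(x, y, eps=1e-9):
    """Pairwise geodesic distance in the unit Poincare ball (curvature -1).

    x has shape (n, d), y has shape (k, d); all rows must have norm < 1.
    Returns an (n, k) matrix of distances.
    """
    x2 = np.sum(x * x, axis=1, keepdims=True)        # (n, 1) squared norms
    y2 = np.sum(y * y, axis=1, keepdims=True).T      # (1, k) squared norms
    diff2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=2)  # (n, k)
    denom = np.maximum((1.0 - x2) * (1.0 - y2), eps)
    return np.arccosh(1.0 + 2.0 * diff2 / denom)


def sinkhorn_soft_labels(cost, reg=0.5, n_iters=200):
    """Entropic optimal transport with uniform marginals (an assumption).

    cost has shape (n_targets, n_prototypes). The transport plan is
    row-normalised so each target gets a distribution over prototypes.
    """
    n, k = cost.shape
    a = np.full(n, 1.0 / n)          # uniform mass on target utterances
    b = np.full(k, 1.0 / k)          # uniform mass on emotion prototypes
    kernel = np.exp(-cost / reg)     # Gibbs kernel
    u = np.ones(n)
    v = np.ones(k)
    for _ in range(n_iters):         # alternate marginal projections
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    plan = u[:, None] * kernel * v[None, :]
    return plan / plan.sum(axis=1, keepdims=True)


# Toy example: two hypothetical source emotion prototypes and four
# hypothetical unlabeled target embeddings, all inside the unit ball.
prototypes = np.array([[0.5, 0.0], [-0.5, 0.0]])
targets = np.array([[0.4, 0.1], [-0.45, -0.05], [0.3, 0.0], [-0.3, 0.1]])

cost = poincare_distance(targets, prototypes)
soft_labels = sinkhorn_soft_labels(cost)
pseudo_labels = soft_labels.argmax(axis=1)
```

In this sketch each row of `soft_labels` is a probability distribution over prototypes and could serve as a soft training target. The uniform prototype marginal encodes a balanced-class assumption that may not hold for real target-language data.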
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speech Emotion Recognition | MESD | Accuracy | 90.67 | 16 |
| Speech Emotion Recognition | APD V | Accuracy | 92.4 | 8 |
| Speech Emotion Recognition | AESD APD NV | Accuracy | 84.39 | 8 |
| Speech Emotion Recognition | RVDS APD NV | Accuracy | 93.79 | 8 |
| Speech Emotion Recognition | EMDB APD NV | Accuracy | 92.46 | 8 |
| Speech Emotion Recognition | CRMD APD NV | Accuracy | 0.9132 | 8 |
| Speech Emotion Recognition | AESD APD V | Accuracy | 79.19 | 8 |
| Speech Emotion Recognition | RVDS APD V | Accuracy | 86.76 | 8 |
| Speech Emotion Recognition | EMDB APD V | Accuracy | 80.59 | 8 |
| Speech Emotion Recognition | CRMD APD V | Accuracy | 79.61 | 8 |