Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition

About

In this work, we introduce a paralinguistic supervision paradigm for low-resource multilingual speech emotion recognition (LRM-SER) that leverages non-verbal vocalizations to exploit prosody-centric emotion cues. Unlike conventional SER systems that rely heavily on labeled verbal speech and suffer from poor cross-lingual transfer, our approach reformulates LRM-SER as non-verbal-to-verbal transfer, where supervision from a labeled non-verbal source domain is adapted to unlabeled verbal speech across multiple target languages. To this end, we propose NOVA-ARC, a geometry-aware framework that models affective structure in the Poincaré ball, discretizes paralinguistic patterns via a hyperbolic vector-quantized prosody codebook, and captures emotion intensity through a hyperbolic emotion lens. For unsupervised adaptation, NOVA-ARC performs optimal-transport-based prototype alignment between source emotion prototypes and target utterances, inducing soft supervision for unlabeled speech while being stabilized through consistency regularization. Experiments show that NOVA-ARC delivers the strongest performance under both non-verbal-to-verbal adaptation and the complementary verbal-to-verbal transfer setting, consistently outperforming Euclidean counterparts and strong SSL baselines. To the best of our knowledge, this work is the first to move beyond verbal-speech-centric supervision by introducing a non-verbal-to-verbal transfer paradigm for SER.
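The abstract describes optimal-transport alignment between source emotion prototypes and target utterances in the Poincaré ball, but gives no implementation details. The following is only a minimal sketch of that general idea, not the authors' method. It assumes unit curvature (-1), uniform marginals, and an entropic Sinkhorn solver; the function names `poincare_distance`, `sinkhorn_plan`, and `soft_labels` and the `reg` parameter are hypothetical and chosen for illustration.

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Pairwise geodesic distance in the unit Poincare ball (curvature -1 assumed).

    x: (n, d) and y: (m, d), every row with Euclidean norm < 1.
    """
    sq = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    nx = 1.0 - np.sum(x ** 2, axis=-1)[:, None]
    ny = 1.0 - np.sum(y ** 2, axis=-1)[None, :]
    arg = 1.0 + 2.0 * sq / np.maximum(nx * ny, eps)
    return np.arccosh(np.maximum(arg, 1.0))

def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropic OT plan with uniform marginals (an assumption of this sketch)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # mass on each target utterance
    b = np.full(m, 1.0 / m)   # mass on each emotion prototype
    K = np.exp(-cost / reg)   # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def soft_labels(targets, prototypes, reg=0.1):
    """Row-normalize the transport plan into per-utterance soft pseudo-labels."""
    cost = poincare_distance(targets, prototypes)
    plan = sinkhorn_plan(cost, reg)
    return plan / plan.sum(axis=1, keepdims=True)

# Toy example: two prototypes and four unlabeled target embeddings in 2-D.
prototypes = np.array([[0.5, 0.0], [-0.5, 0.0]])
targets = np.array([[0.4, 0.05], [-0.45, 0.0], [0.3, -0.1], [-0.2, 0.1]])
pseudo = soft_labels(targets, prototypes)
```

Each row of `pseudo` is a distribution over prototypes that could serve as soft supervision for an unlabeled utterance. Uniform marginals implicitly assume balanced emotion classes in the target data; a real system would likely need to relax that.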

Girish, Mohd Mujtaba Akhtar, Muskaan Singh • 2026

Related benchmarks

Task	Dataset	Result
Speech Emotion Recognition	MESD	Accuracy 90.67	16
Speech Emotion Recognition	APD V	Accuracy 92.4	8
Speech Emotion Recognition	AESD APD NV	Accuracy 84.39	8
Speech Emotion Recognition	RVDS APD NV	Accuracy 93.79	8
Speech Emotion Recognition	EMDB APD NV	Accuracy 92.46	8
Speech Emotion Recognition	CRMD APD NV	Accuracy 0.9132	8
Speech Emotion Recognition	AESD APD V	Accuracy 79.19	8
Speech Emotion Recognition	RVDS APD V	Accuracy 86.76	8
Speech Emotion Recognition	EMDB APD V	Accuracy 80.59	8
Speech Emotion Recognition	CRMD APD V	Accuracy 79.61	8

