Prompt Amplification and Zero-Shot Late Fusion in Audio-Language Models for Speech Emotion Recognition

About

Audio-Language Models (ALMs) are making strides in understanding speech and non-speech audio. However, domain-specialist Foundation Models (FMs) remain the best for closed-ended speech processing tasks such as Speech Emotion Recognition (SER). Using ALMs for zero-shot SER is a popular choice, but their potential to work with specialists to achieve state-of-the-art (SOTA) performance remains unexplored. We propose ZS-Fuse, a late-fusion method that combines zero-shot emotion estimates from a dual-encoder ALM with specialist FMs. To handle ambiguity in emotions and sensitivity to prompt choice, we 1) use a simple prompt ensemble and 2) propose a novel technique called prompt amplification, which repeats audio and text queries to discover stronger zero-shot capabilities. We demonstrate the efficacy of our technique by evaluating ZS-Fuse with three dual-encoder ALMs and two FMs, and report improvements over SOTA baselines, such as WavLM-Large, on three SER datasets.
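
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of combining a prompt ensemble with late fusion, not the authors' ZS-Fuse implementation. It assumes the dual-encoder ALM has already produced one audio embedding and one text embedding per (prompt template, emotion class) pair, and that the specialist FM outputs class probabilities. The function names, the temperature value, the fusion weight, and the toy embeddings are all hypothetical; prompt amplification is not modeled because the abstract does not say how the repeated queries are encoded.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def zero_shot_probs(audio_emb, text_emb, temperature=0.07):
    """Prompt-ensemble zero-shot estimate (assumed formulation).

    audio_emb: shape (d,), one audio embedding from the ALM audio encoder.
    text_emb:  shape (n_prompts, n_classes, d), one text embedding per
               prompt template and emotion class.
    Returns class probabilities averaged over prompt templates.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    sims = t @ a                                  # (n_prompts, n_classes)
    per_prompt = softmax(sims / temperature, axis=-1)
    return per_prompt.mean(axis=0)                # ensemble over prompts


def late_fuse(zs_probs, specialist_probs, weight=0.5):
    """Late fusion as a convex combination of two probability vectors."""
    fused = weight * zs_probs + (1.0 - weight) * specialist_probs
    return fused / fused.sum()


# Toy data: 2 prompt templates, 3 emotion classes, 4-dimensional embeddings.
class_dirs = np.eye(3, 4)
text_emb = np.stack([class_dirs, class_dirs + 0.1])
audio_emb = class_dirs[1]                         # audio aligned with class 1
specialist_probs = np.array([0.2, 0.5, 0.3])      # hypothetical FM output

zs = zero_shot_probs(audio_emb, text_emb)
fused = late_fuse(zs, specialist_probs, weight=0.5)
print("zero-shot:", zs)
print("fused:    ", fused)
```

Averaging probabilities rather than raw similarities is one of several reasonable ensemble choices; the fusion weight would normally be tuned on held-out data.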

Saurabh Kataria, Xiao Hu • 2026

Related benchmarks

Task	Dataset	Result
Speech Emotion Recognition	RAVDESS	Unweighted Accuracy (UA) 87.11	43
Speech Emotion Recognition	IEMOCAP	UA 71.84	22
Speech Emotion Recognition	MSP-Podcast	UA 31.18	22
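
For readers unfamiliar with the metric, Unweighted Accuracy in SER work is commonly computed as the mean of per-class recalls, so that rare emotions count as much as frequent ones. The sketch below assumes that definition; the page itself does not state which formula was used, and the function name and toy labels are hypothetical.

```python
import numpy as np


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (assumed definition of UA)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Recall for class c: fraction of class-c samples predicted as c.
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))


# Imbalanced toy labels: three samples of class 0, one of class 1.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
ua = unweighted_accuracy(y_true, y_pred)
print(ua)
```

On this toy input the per-class recalls are 2/3 and 1, so UA is 5/6, whereas plain accuracy would be 3/4.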

