Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding
About
Human voice encodes both identity and paralinguistic cues, yet encoders in large audio-language models (LALMs) rarely balance both aspects. In this work, we present a study toward building a general-purpose voice encoder that captures nuanced voice cues. Through a comprehensive evaluation, we find that multi-task training yields the most balanced representations, whereas contrastive language-audio pretraining (CLAP) primarily improves retrieval without enhancing paralinguistic understanding. Our final encoder, Auden-Voice, also demonstrates strong performance when integrated with LLMs. The code and training recipes will be released with the audio understanding toolkit Auden.
Mingyue Huo, Wei-Cheng Tseng, Yiwen Shao, Hao Zhang, Dong Yu • 2025
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speech Emotion Recognition | RAVDESS | Weighted Accuracy | 32.4 | 19 |
| Emotion Recognition | CREMA-D | WA (Weighted Average) | 30.2 | 12 |
| Age Classification | CREMA-D | WA | 38.5 | 5 |
| Gender Classification | RAVDESS | Weighted Accuracy | 95.6 | 5 |