UniSRM: A Unified Speech Reward Model for Reasoning-Based Fine-grained Assessment
About
Evaluating speech generation still relies heavily on human judgments such as the Mean Opinion Score (MOS), which are expensive, subjective, and difficult to reproduce at scale. A few recent studies have begun to explore AudioLLM-based judge models, but existing efforts typically target a narrow set of scenarios (e.g., utterance-level quality or single-turn dialogue) and cover only a limited range of speech generation tasks and evaluation dimensions. In this work, we propose UniSRM, a unified speech reward model that provides multi-dimensional, interpretable reward signals backed by reliable reasoning. To support training and evaluation, we introduce UniSRM-Data and UniSRM-Bench, which cover speech evaluation tasks ranging from utterance-level quality to context-level coherence. Building on this data, we train UniSRM with a two-stage pipeline that enables reasoning-based fine-grained assessment. We further introduce Reasoning-Consistent Rewards to improve the reliability of the reasoning process. Experiments show that UniSRM delivers more reliable and human-aligned judgments across a broad range of speech evaluation tasks, offering a practical foundation for scalable, unified evaluation of speech quality.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| utterance-level pairwise preference judgement | UniSRM-BENCH T1 | Accuracy 65.06 | 12 |
| Speech Quality Assessment | BVCC | -- | 12 |
| multi-turn dialogue speech evaluation | UniSRM-BENCH T4 | Accuracy 88.89 | 10 |
| scenario-aware style consistency preference (Chinese) | UniSRM-BENCH T3-Zh | Accuracy 91.3 | 10 |
| scenario-aware style consistency preference (English) | UniSRM-BENCH T3-En | Accuracy 85.61 | 10 |
| fine-grained speech quality scoring | UniSRM-BENCH T2 | PCC 0.551 | 9 |
| Speech Quality Evaluation | SOMOS Clean | PCC 0.2612 | 5 |
| Speech Quality Evaluation | SOMOS (Full) | PCC 0.2347 | 5 |
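
The table reports two kinds of metric: Accuracy for the preference tasks and PCC (Pearson correlation coefficient) for the scoring tasks. The sketch below shows how such numbers are conventionally computed. It is not the benchmark's own evaluation code: the function names, the label format, and the choice to express accuracy as a percentage are assumptions made for illustration.

```python
from math import sqrt


def pairwise_accuracy(predicted, reference):
    """Percentage of pairs where the model's preferred item matches the human label.

    Assumption: each entry is a hashable label such as "A" or "B".
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * hits / len(predicted)


def pearson_cc(model_scores, human_scores):
    """Pearson correlation between model scores and human scores (e.g. MOS)."""
    n = len(model_scores)
    if n != len(human_scores) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(model_scores) / n
    mean_y = sum(human_scores) / n
    dx = [x - mean_x for x in model_scores]
    dy = [y - mean_y for y in human_scores]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: three of four preferences agree with the human label.
print(pairwise_accuracy(["A", "B", "A", "B"], ["A", "B", "B", "B"]))  # 75.0
print(pearson_cc([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Accuracy only checks whether the preferred item matches, so it suits the pairwise tasks. PCC measures how linearly the model's scores track human scores, which is why it is the usual choice for MOS-style scoring tasks.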