SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation
About
Generative speech technologies are progressing rapidly, but evaluating the perceptual quality of synthetic speech remains a core challenge. Existing methods typically rely on scalar scores or binary decisions, which offer little interpretability and generalize poorly across tasks and languages. We present SpeechLLM-as-Judges, a new paradigm that enables large language models (LLMs) to conduct structured, explanation-based speech quality evaluation. To support this direction, we introduce SpeechEval, a large-scale dataset containing 32,207 multilingual speech clips and 128,754 annotations spanning four tasks: quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. Building on this resource, we develop SQ-LLM, a speech-quality-aware LLM trained with chain-of-thought reasoning and reward optimization to strengthen its evaluation capability. Experimental results show that SQ-LLM delivers strong performance across tasks and languages, demonstrating the potential of this paradigm to advance speech quality evaluation. Relevant resources will be open-sourced.
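
To make the annotation structure concrete, the sketch below shows one plausible way to represent a SpeechEval record covering the four tasks. All class, field, and value names are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical layout of one SpeechEval annotation; field names are
# illustrative assumptions, not the dataset's actual schema.
@dataclass
class SpeechEvalRecord:
    clip_id: str                            # identifier of the annotated speech clip
    language: str                           # language code of the clip, e.g. "en"
    task: str                               # one of the four task names below
    reasoning: str                          # explanation / chain-of-thought text
    # Task-specific labels; only the field matching `task` is populated.
    quality_score: Optional[float] = None   # quality assessment (e.g. a MOS-like score)
    preferred_clip: Optional[str] = None    # pairwise comparison: id of the preferred clip
    suggestion: Optional[str] = None        # improvement suggestion text
    is_deepfake: Optional[bool] = None      # deepfake detection label

# The four tasks named in the abstract.
TASKS = (
    "quality_assessment",
    "pairwise_comparison",
    "improvement_suggestion",
    "deepfake_detection",
)
```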
Related benchmarks
| Task | Dataset | Accuracy | Rank |
|---|---|---|---|
| Preference Evaluation | SpeechEval | 0.579 | 15 |
| Preference Evaluation | TMHINT-QI | 0.557 | 15 |
| Preference Evaluation | SpeechJudge | 0.545 | 15 |
| Preference Evaluation | NISQA-P501 | 0.568 | 15 |
| Preference Evaluation | NISQA-FOR | 0.553 | 15 |
| Preference Evaluation | CHiME UDASE 7 (test) | 0.546 | 15 |
| Preference Evaluation | URGENT25-SQA | 0.550 | 15 |
| Preference Evaluation | SOMOS | 0.535 | 15 |
| Preference Evaluation | URGENT SQA 24 | 0.543 | 15 |
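
Assuming the accuracy values above denote pairwise preference accuracy, i.e. the fraction of comparison pairs where the judged winner matches the human-preferred clip, a minimal sketch of that computation is given below. The function and variable names are illustrative, not the benchmark's API.

```python
def preference_accuracy(predictions, gold):
    """Fraction of pairs where the predicted preferred clip matches the human label.

    Both arguments map a pair id to the id of the chosen clip; these names are
    illustrative assumptions, not part of the SpeechEval release.
    """
    shared = [pid for pid in gold if pid in predictions]
    if not shared:
        return 0.0
    correct = sum(predictions[pid] == gold[pid] for pid in shared)
    return correct / len(shared)

# Toy usage: the judge agrees with annotators on one of two pairs -> 0.5.
print(preference_accuracy({"p1": "clip_a", "p2": "clip_b"},
                          {"p1": "clip_a", "p2": "clip_a"}))
```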