
SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation

About

Generative speech technologies are progressing rapidly, but evaluating the perceptual quality of synthetic speech remains a core challenge. Existing methods typically rely on scalar scores or binary decisions, which lack interpretability and generalize poorly across tasks and languages. We present SpeechLLM-as-Judges, a new paradigm for enabling large language models (LLMs) to conduct structured, explanation-based speech quality evaluation. To support this direction, we introduce SpeechEval, a large-scale dataset containing 32,207 multilingual speech clips and 128,754 annotations spanning four tasks: quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. Based on this resource, we develop SQ-LLM, a speech-quality-aware LLM trained with chain-of-thought reasoning and reward optimization to improve its evaluation capability. Experimental results show that SQ-LLM delivers strong performance across tasks and languages, revealing the potential of this paradigm for advancing speech quality evaluation. Relevant resources will be open-sourced.

Hui Wang, Jinghua Zhao, Yifan Yang, Shujie Liu, Junyang Chen, Yanzhe Zhang, Shiwan Zhao, Jinyu Li, Jiaming Zhou, Haoqin Sun, Yan Lu, Yong Qin • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Preference Evaluation | SpeechEval | Acc@0.579 | 15 |
| Preference Evaluation | TMHINT-QI | Acc@0.557 | 15 |
| Preference Evaluation | SpeechJudge | Acc@0.545 | 15 |
| Preference Evaluation | NISQA-P501 | Acc@0.568 | 15 |
| Preference Evaluation | NISQA-FOR | Acc@0.553 | 15 |
| Preference Evaluation | CHiME UDASE 7 (test) | Acc@0.546 | 15 |
| Preference Evaluation | URGENT25-SQA | Acc@0.550 | 15 |
| Preference Evaluation | SOMOS | Acc@0.535 | 15 |
| Preference Evaluation | URGENT SQA 24 | Acc@0.543 | 15 |
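The Acc figures above score the pairwise-comparison (preference evaluation) task: the judge is shown two clips and must name the one listeners preferred. A minimal sketch of how such an accuracy is computed, with illustrative labels and a hypothetical `pairwise_accuracy` helper (neither is from the paper):

```python
def pairwise_accuracy(predicted, gold):
    """Fraction of pairs where the judge's preferred clip matches the gold label."""
    assert len(predicted) == len(gold) and gold, "need equal-length, non-empty label lists"
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Illustrative labels: "A" or "B" marks which clip in each pair is preferred.
judge_preds = ["A", "B", "A", "A", "B"]
gold_labels = ["A", "B", "B", "A", "B"]
print(pairwise_accuracy(judge_preds, gold_labels))  # 0.8
```

Since a random judge scores about 0.5 on balanced pairs, accuracies in the 0.53-0.58 range represent a modest but real margin over chance.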
