SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation
About
Generative speech technologies are progressing rapidly, but evaluating the perceptual quality of synthetic speech remains a core challenge. Existing methods typically rely on scalar scores or binary decisions, which offer little interpretability and generalize poorly across tasks and languages. We present SpeechLLM-as-Judges, a new paradigm that enables large language models (LLMs) to conduct structured, explanation-based speech quality evaluation. To support this direction, we introduce SpeechEval, a large-scale dataset containing 32,207 multilingual speech clips and 128,754 annotations spanning four tasks: quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. Building on this resource, we develop SQ-LLM, a speech-quality-aware LLM trained with chain-of-thought reasoning and reward optimization to strengthen its evaluation capability. Experimental results show that SQ-LLM delivers strong performance across tasks and languages, demonstrating the potential of this paradigm to advance speech quality evaluation. Relevant resources will be open-sourced.
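
To make the annotation structure concrete, the sketch below shows one plausible way to represent a SpeechEval record covering the four tasks. All class, field, and value names are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical layout of one SpeechEval annotation; field names are
# illustrative assumptions, not the dataset's actual schema.
@dataclass
class SpeechEvalRecord:
    clip_id: str                            # identifier of the annotated speech clip
    language: str                           # language code of the clip, e.g. "en"
    task: str                               # one of the four task names below
    reasoning: str                          # explanation / chain-of-thought text
    # Task-specific labels; only the field matching `task` is populated.
    quality_score: Optional[float] = None   # quality assessment (e.g. a MOS-like score)
    preferred_clip: Optional[str] = None    # pairwise comparison: id of the preferred clip
    suggestion: Optional[str] = None        # improvement suggestion text
    is_deepfake: Optional[bool] = None      # deepfake detection label

# The four tasks named in the abstract.
TASKS = (
    "quality_assessment",
    "pairwise_comparison",
    "improvement_suggestion",
    "deepfake_detection",
)
```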
Related benchmarks
| Task | Dataset | Accuracy | Rank |
|---|---|---|---|
| Preference Evaluation | SpeechEval | 0.579 | 15 |
| Preference Evaluation | TMHINT-QI | 0.557 | 15 |
| Preference Evaluation | SpeechJudge | 0.545 | 15 |
| Preference Evaluation | NISQA-P501 | 0.568 | 15 |
| Preference Evaluation | NISQA-FOR | 0.553 | 15 |
| Preference Evaluation | CHiME UDASE 7 (test) | 0.546 | 15 |
| Preference Evaluation | URGENT25-SQA | 0.550 | 15 |
| Preference Evaluation | SOMOS | 0.535 | 15 |
| Preference Evaluation | URGENT SQA 24 | 0.543 | 15 |
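
Assuming the accuracy values above denote pairwise preference accuracy, i.e. the fraction of comparison pairs where the judged winner matches the human-preferred clip, a minimal sketch of that computation is given below. The function and variable names are illustrative, not the benchmark's API.

```python
def preference_accuracy(predictions, gold):
    """Fraction of pairs where the predicted preferred clip matches the human label.

    Both arguments map a pair id to the id of the chosen clip; these names are
    illustrative assumptions, not part of the SpeechEval release.
    """
    shared = [pid for pid in gold if pid in predictions]
    if not shared:
        return 0.0
    correct = sum(predictions[pid] == gold[pid] for pid in shared)
    return correct / len(shared)

# Toy usage: the judge agrees with annotators on one of two pairs -> 0.5.
print(preference_accuracy({"p1": "clip_a", "p2": "clip_b"},
                          {"p1": "clip_a", "p2": "clip_a"}))
```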