SpeechJudge: Towards Human-Level Judgment for Speech Naturalness
About
Aligning large generative models with human feedback is a critical challenge. In speech synthesis it is particularly pronounced: the lack of a large-scale human preference dataset hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a comprehensive suite comprising a dataset, a benchmark, and a reward model, all centered on naturalness, one of the most fundamental subjective metrics for speech synthesis.

First, we present SpeechJudge-Data, a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models across a wide range of speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference.

From this corpus, we establish SpeechJudge-Eval, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task: the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, leaving substantial room for improvement.

To bridge this gap, we develop SpeechJudge-GRM, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales, followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (79.4% with inference-time scaling @10) versus 72.7% for a classic Bradley-Terry reward model. Furthermore, SpeechJudge-GRM can also be employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.
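For intuition, the sketch below illustrates two pieces referenced above: the classic Bradley-Terry pairwise loss used by the baseline reward model, and majority-vote inference-time scaling (the "@10" setting) over a generative judge. This is a minimal sketch under stated assumptions, not the paper's implementation; in particular, `judge_once` is a hypothetical stand-in for a sampled call to a GRM such as SpeechJudge-GRM, whose actual interface may differ.

```python
import torch
import torch.nn.functional as F
from collections import Counter
from typing import Callable

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the scalar reward of the
    human-preferred sample above that of the rejected sample, i.e.
    minimize -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def majority_vote_judgment(judge_once: Callable[[str, str], str],
                           audio_a: str, audio_b: str,
                           num_samples: int = 10) -> str:
    """Inference-time scaling for a generative reward model: sample the
    judge num_samples times (e.g. with temperature > 0, so verdicts can
    vary) and return the majority verdict, here 'A' or 'B'.
    judge_once is a hypothetical callable, not an API from the paper."""
    votes = Counter(judge_once(audio_a, audio_b) for _ in range(num_samples))
    return votes.most_common(1)[0][0]
```

In this framing, the 79.4% figure quoted above would correspond to `num_samples = 10`, while the single-pass 77.2% figure corresponds to one sampled judgment.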
Related benchmarks
| Task | Dataset | Accuracy | Rank |
|---|---|---|---|
| Preference Evaluation | SpeechJudge | 0.574 | 15 |
| Preference Evaluation | TMHINT-QI | 0.550 | 15 |
| Preference Evaluation | NISQA-P501 | 0.554 | 15 |
| Preference Evaluation | NISQA-FOR | 0.542 | 15 |
| Preference Evaluation | URGENT25-SQA | 0.533 | 15 |
| Preference Evaluation | SpeechEval | 0.548 | 15 |
| Preference Evaluation | CHiME UDASE 7 (test) | 0.530 | 15 |
| Preference Evaluation | SOMOS | 0.528 | 15 |
| Preference Evaluation | URGENT SQA 24 | 0.529 | 15 |