SpeechJudge: Towards Human-Level Judgment for Speech Naturalness

About

Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness, one of the most fundamental subjective metrics for speech synthesis. First, we present SpeechJudge-Data, a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models, covering a wide range of speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference. From this, we establish SpeechJudge-Eval, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task; the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, leaving substantial room for improvement. To bridge this gap, we develop SpeechJudge-GRM, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales, followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, the proposed SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (79.4% after inference-time scaling @10), compared to 72.7% for a classic Bradley-Terry reward model. Furthermore, SpeechJudge-GRM can also be employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.
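For intuition, below is a minimal Python sketch of how a generative reward model such as SpeechJudge-GRM could be applied as a pairwise naturalness judge with inference-time scaling. The `judge_once` callable and the majority-vote aggregation are illustrative assumptions, not the authors' implementation; the abstract only reports that scaling to 10 samples improves accuracy from 77.2% to 79.4%.

```python
# Minimal sketch (not the authors' code): use a generative reward model (GRM)
# as a pairwise naturalness judge, with inference-time scaling via repeated
# sampling and majority voting. `judge_once` is a hypothetical callable that
# samples one chain-of-thought judgment and returns "A", "B", or "tie".
from collections import Counter
from typing import Callable

def judge_with_scaling(
    judge_once: Callable[[str, bytes, bytes], str],
    text: str,
    speech_a: bytes,
    speech_b: bytes,
    num_samples: int = 10,
) -> str:
    """Sample the GRM's judgment several times and return the majority
    preference (the vote-based aggregation rule is an assumption)."""
    votes = Counter(judge_once(text, speech_a, speech_b) for _ in range(num_samples))
    return votes.most_common(1)[0][0]
```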

Xueyao Zhang, Chaoren Wang, Huan Liao, Ziniu Li, Yuancheng Wang, Li Wang, Dongya Jia, Yuanzhe Chen, Xiulin Li, Zhuo Chen, Zhizheng Wu • 2025

Related benchmarks

Task                  | Dataset               | Result (Acc@0.5) | Rank
----------------------|-----------------------|------------------|-----
Preference Evaluation | SpeechJudge           | 74               | 15
Preference Evaluation | TMHINT-QI             | 50               | 15
Preference Evaluation | NISQA-P501            | 54               | 15
Preference Evaluation | NISQA-FOR             | 42               | 15
Preference Evaluation | URGENT25-SQA          | 33               | 15
Preference Evaluation | SpeechEval            | 48               | 15
Preference Evaluation | CHiME UDASE 7 (test)  | 30               | 15
Preference Evaluation | SOMOS                 | 28               | 15
Preference Evaluation | URGENT SQA 24         | 29               | 15
