Distillation and Pruning for Scalable Self-Supervised Representation-Based Speech Quality Assessment
About
In this paper, we investigate distillation and pruning methods to reduce model size for non-intrusive speech quality assessment based on self-supervised representations. Our experiments build on XLS-R-SQA, a speech quality assessment model using wav2vec 2.0 XLS-R embeddings. We retrain this model on a large compilation of mean opinion score (MOS) datasets, encompassing over 100,000 labeled clips. For distillation, using this model as a teacher, we generate pseudo-labels on unlabeled degraded speech signals and train student models of varying sizes. For pruning, we use a data-driven strategy. While data-driven pruning performs better at larger model sizes, distillation on unlabeled data is more effective at smaller model sizes. Distillation can halve the gap between the baseline's correlation with ground-truth MOS labels and that of the XLS-R-based teacher, while reducing model size by two orders of magnitude relative to the teacher.
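To make the distillation setup concrete, below is a minimal sketch of the pseudo-label training loop described above: the teacher scores unlabeled degraded speech, and a smaller student regresses those scores. `StudentSQA`, `distill`, the architecture, and all hyperparameters are illustrative assumptions, not the paper's actual models or training configuration.

```python
# Minimal sketch of pseudo-label distillation for MOS prediction.
# StudentSQA is a hypothetical stand-in for both the XLS-R-based teacher
# and a smaller student; the real models and hyperparameters differ.
import torch
import torch.nn as nn


class StudentSQA(nn.Module):
    """Tiny illustrative model: maps a raw waveform to a scalar MOS estimate."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=10, stride=5),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time to a fixed-size embedding
            nn.Flatten(),
            nn.Linear(hidden, 1),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> (batch,) MOS estimates
        return self.net(wav.unsqueeze(1)).squeeze(-1)


def distill(teacher: nn.Module, student: nn.Module, unlabeled_batches, epochs: int = 1):
    """Train the student to regress the teacher's pseudo-MOS on unlabeled audio."""
    teacher.eval()
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for wav in unlabeled_batches:
            with torch.no_grad():
                pseudo_mos = teacher(wav)  # teacher prediction serves as the label
            loss = loss_fn(student(wav), pseudo_mos)
            opt.zero_grad()
            loss.backward()
            opt.step()


# Usage with random stand-ins for unlabeled degraded speech:
teacher = StudentSQA(hidden=128)  # placeholder for the XLS-R-SQA teacher
student = StudentSQA(hidden=32)   # much smaller student
batches = [torch.randn(4, 16000) for _ in range(3)]  # batches of 1 s @ 16 kHz
distill(teacher, student, batches)
```

Because the teacher's outputs stand in for human MOS labels, this loop needs no ground-truth ratings, which is what lets the student train on arbitrarily large unlabeled corpora.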
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Preference Evaluation | NISQA-P501 | Acc 0.575 | 15 |
| Preference Evaluation | CHiME UDASE 7 (test) | Acc 0.555 | 15 |
| Preference Evaluation | URGENT25-SQA | Acc 0.555 | 15 |
| Preference Evaluation | NISQA-FOR | Acc 0.569 | 15 |
| Preference Evaluation | SOMOS | Acc 0.548 | 15 |
| Preference Evaluation | TMHINT-QI | Acc 0.547 | 15 |
| Preference Evaluation | SpeechEval | Acc 0.564 | 15 |
| Preference Evaluation | URGENT SQA 24 | Acc 0.554 | 15 |
| Preference Evaluation | SpeechJudge | Acc 0.57 | 15 |
| Speech Quality Assessment | BC 19 | LCC 0.84 | 12 |