Distillation and Pruning for Scalable Self-Supervised Representation-Based Speech Quality Assessment
About
In this paper, we investigate distillation and pruning methods to reduce model size for non-intrusive speech quality assessment based on self-supervised representations. Our experiments build on XLS-R-SQA, a speech quality assessment model using wav2vec 2.0 XLS-R embeddings. We retrain this model on a large compilation of mean opinion score (MOS) datasets, encompassing over 100,000 labeled clips. For distillation, using this model as a teacher, we generate pseudo-labels on unlabeled degraded speech signals and train student models of varying sizes. For pruning, we use a data-driven strategy. While data-driven pruning performs better at larger model sizes, distillation on unlabeled data is more effective at smaller model sizes. Distillation can halve the gap between the baseline's correlation with ground-truth MOS labels and that of the XLS-R-based teacher, while reducing model size by two orders of magnitude relative to the teacher.
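To make the distillation setup concrete, below is a minimal sketch of the pseudo-label training loop described above: the teacher scores unlabeled degraded speech, and a smaller student regresses those scores. `StudentSQA`, `distill`, the architecture, and all hyperparameters are illustrative assumptions, not the paper's actual models or training configuration.

```python
# Minimal sketch of pseudo-label distillation for MOS prediction.
# StudentSQA is a hypothetical stand-in for both the XLS-R-based teacher
# and a smaller student; the real models and hyperparameters differ.
import torch
import torch.nn as nn


class StudentSQA(nn.Module):
    """Tiny illustrative model: maps a raw waveform to a scalar MOS estimate."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=10, stride=5),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time to a fixed-size embedding
            nn.Flatten(),
            nn.Linear(hidden, 1),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> (batch,) MOS estimates
        return self.net(wav.unsqueeze(1)).squeeze(-1)


def distill(teacher: nn.Module, student: nn.Module, unlabeled_batches, epochs: int = 1):
    """Train the student to regress the teacher's pseudo-MOS on unlabeled audio."""
    teacher.eval()
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for wav in unlabeled_batches:
            with torch.no_grad():
                pseudo_mos = teacher(wav)  # teacher prediction serves as the label
            loss = loss_fn(student(wav), pseudo_mos)
            opt.zero_grad()
            loss.backward()
            opt.step()


# Usage with random stand-ins for unlabeled degraded speech:
teacher = StudentSQA(hidden=128)  # placeholder for the XLS-R-SQA teacher
student = StudentSQA(hidden=32)   # much smaller student
batches = [torch.randn(4, 16000) for _ in range(3)]  # batches of 1 s @ 16 kHz
distill(teacher, student, batches)
```

Because the teacher's outputs stand in for human MOS labels, this loop needs no ground-truth ratings, which is what lets the student train on arbitrarily large unlabeled corpora.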
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Preference Evaluation | NISQA-P501 | Acc 0.575 | 15 |
| Preference Evaluation | CHiME UDASE 7 (test) | Acc 0.555 | 15 |
| Preference Evaluation | URGENT25-SQA | Acc 0.555 | 15 |
| Preference Evaluation | NISQA-FOR | Acc 0.569 | 15 |
| Preference Evaluation | SOMOS | Acc 0.548 | 15 |
| Preference Evaluation | TMHINT-QI | Acc 0.547 | 15 |
| Preference Evaluation | SpeechEval | Acc 0.564 | 15 |
| Preference Evaluation | URGENT SQA 24 | Acc 0.554 | 15 |
| Preference Evaluation | SpeechJudge | Acc 0.57 | 15 |
| Speech Quality Assessment | BC 19 | LCC 0.84 | 12 |