Selection of Layers from Self-supervised Learning Models for Predicting Mean-Opinion-Score of Speech
About
Self-supervised learning (SSL) models such as Wav2Vec2, HuBERT, and WavLM are widely used in speech processing. These transformer-based models consist of multiple layers, each capturing a different level of representation. While prior studies have explored their layer-wise representations for efficiency and performance, speech quality assessment (SQA) models predominantly rely on last-layer features, leaving intermediate layers underexamined. In this work, we systematically evaluate different layers of multiple SSL models for predicting mean opinion score (MOS). Features from each layer are fed into a lightweight regression network to assess their effectiveness. Our experiments consistently show that features from early layers outperform or match those from the last layer, yielding significant improvements over conventional approaches and state-of-the-art MOS prediction models. These findings highlight the advantages of early-layer selection, offering enhanced performance and reduced system complexity.
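The per-layer evaluation described above can be sketched as follows. This is a minimal, hypothetical illustration with synthetic features standing in for SSL hidden states (for a real model, e.g. Wav2Vec2 in HuggingFace `transformers`, one would request `output_hidden_states=True` and mean-pool each layer over time); the lightweight regression network is simplified to a least-squares linear head, and all sizes and data are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for utterance-level SSL features: one (n_utt, dim)
# matrix per transformer layer, already mean-pooled over time.
num_layers, dim, n_utt = 12, 8, 40

def fit_linear_head(X, y):
    """Fit a least-squares linear regression head (weights + bias)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

# Toy corpus: 40 utterances with MOS labels in [1, 5].
mos = rng.uniform(1.0, 5.0, size=n_utt)

# Random per-layer features; layer 2 is made MOS-informative on purpose
# so the layer-selection loop has something to find.
feats = [rng.normal(size=(n_utt, dim)) for _ in range(num_layers)]
feats[2][:, 0] = mos + rng.normal(scale=0.05, size=n_utt)

# Evaluate each layer independently: train the head, measure test MSE.
train, test = slice(0, 30), slice(30, None)
mse_per_layer = []
for X in feats:
    w = fit_linear_head(X[train], mos[train])
    err = predict(w, X[test]) - mos[test]
    mse_per_layer.append(float(np.mean(err ** 2)))

best = int(np.argmin(mse_per_layer))
print(f"best layer: {best}, test MSE: {mse_per_layer[best]:.3f}")
```

The design point mirrored here is that the SSL backbone stays frozen: only the small head is trained per layer, so comparing layers is cheap, and once the best layer is chosen the deeper layers can be dropped at inference time.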
Related benchmarks
| Task | Dataset | Result (MSE) | Rank |
|---|---|---|---|
| Non-intrusive Speech Quality Assessment | AudioMOS (test) | 0.282 (UTT) | 6 |
| Speech Quality Assessment | NISQA FOR (test) | -- | 5 |
| Speech Quality Assessment (MOS Prediction) | Tencent w/o R (test) | 0.751 | 4 |
| Speech Quality Assessment (MOS Prediction) | Tencent w R (test) | 0.421 | 4 |
| Speech Quality Assessment (MOS Prediction) | TCD-VoIP (test) | 0.615 | 4 |
| Speech Quality Assessment (MOS Prediction) | NISQA P501 (test) | 0.463 | 4 |
| Speech Quality Assessment (MOS Prediction) | NISQA LiveTalk (test) | 0.418 | 4 |