
Selection of Layers from Self-supervised Learning Models for Predicting Mean-Opinion-Score of Speech

About

Self-supervised learning (SSL) models such as Wav2Vec2, HuBERT, and WavLM have been widely used in speech processing. These transformer-based models consist of multiple layers, each capturing a different level of representation. While prior studies have explored their layer-wise representations for efficiency and performance, speech quality assessment (SQA) models predominantly rely on last-layer features, leaving intermediate layers underexamined. In this work, we systematically evaluate different layers of multiple SSL models for predicting the mean opinion score (MOS). Features from each layer are fed into a lightweight regression network to assess their effectiveness. Our experiments consistently show that early-layer features match or outperform those from the last layer, leading to significant improvements over conventional approaches and state-of-the-art MOS prediction models. These findings highlight the advantages of early-layer selection, offering enhanced performance and reduced system complexity.
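The per-layer evaluation described above can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: synthetic arrays stand in for the SSL hidden states (in practice these would come from a model like Wav2Vec2 with all layer outputs exposed), mean pooling stands in for utterance-level aggregation, and a closed-form ridge regressor stands in for the lightweight regression network. The injected "signal layer" and all dimensions are arbitrary choices for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for SSL features: one (T, D) frame-feature matrix per
# transformer layer, for N utterances, plus a ground-truth MOS per utterance.
num_layers, N, T, D = 12, 200, 50, 32
mos = rng.uniform(1.0, 5.0, size=N)

# Synthesize hidden states so that one early layer (index 2, an arbitrary
# choice for this demo) actually carries MOS-correlated information.
w_true = rng.normal(size=D)
hidden_states = []
for i in range(N):
    layers = rng.normal(size=(num_layers, T, D))
    layers[2] += mos[i] * w_true  # inject the MOS signal into layer 2
    hidden_states.append(layers)

def layer_mse(layer):
    """Pool one layer's frames per utterance, fit ridge regression to MOS,
    and return the mean squared error of the fit."""
    X = np.stack([h[layer].mean(axis=0) for h in hidden_states])  # (N, D)
    Xb = np.hstack([X, np.ones((N, 1))])  # append a bias column
    w = np.linalg.solve(Xb.T @ Xb + 1e-3 * np.eye(D + 1), Xb.T @ mos)
    pred = Xb @ w
    return float(np.mean((pred - mos) ** 2))

errors = {layer: layer_mse(layer) for layer in range(num_layers)}
best = min(errors, key=errors.get)
print(f"best layer: {best}, MSE: {errors[best]:.4f}")
```

Sweeping the layer index like this and comparing per-layer MSE is the core of the selection procedure; in this synthetic setup the informative early layer wins by construction, whereas the paper establishes the analogous result empirically on real SSL models.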

Xinyu Liang, Fredrik Cumlin, Victor Ungureanu, Chandan K. A. Reddy, Christian Schuldt, Saikat Chatterjee • 2025

Related benchmarks

Task | Dataset | Result | Rank
--- | --- | --- | ---
Non-intrusive speech quality assessment | AudioMOS (test) | UTT MSE 0.282 | 6
Speech Quality Assessment | NISQA FOR (test) | -- | 5
Speech Quality Assessment (MOS Prediction) | Tencent w/o R (test) | MSE 0.751 | 4
Speech Quality Assessment (MOS Prediction) | Tencent w R (test) | MSE 0.421 | 4
Speech Quality Assessment (MOS Prediction) | TCD-VoIP (test) | MSE 0.615 | 4
Speech Quality Assessment (MOS Prediction) | NISQA P501 (test) | MSE 0.463 | 4
Speech Quality Assessment (MOS Prediction) | NISQA LiveTalk (test) | MSE 0.418 | 4
