Selection of Layers from Self-supervised Learning Models for Predicting Mean-Opinion-Score of Speech
About
Self-supervised learning (SSL) models such as Wav2Vec2, HuBERT, and WavLM are widely used in speech processing. These transformer-based models consist of multiple layers, each capturing a different level of representation. While prior studies have explored their layer-wise representations for efficiency and performance, speech quality assessment (SQA) models predominantly rely on last-layer features, leaving intermediate layers underexamined. In this work, we systematically evaluate different layers of multiple SSL models for predicting mean opinion score (MOS). Features from each layer are fed into a lightweight regression network to assess their effectiveness. Our experiments consistently show that features from early layers outperform or match those from the last layer, yielding significant improvements over conventional approaches and state-of-the-art MOS prediction models. These findings highlight the advantages of early-layer selection, offering enhanced performance and reduced system complexity.
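The per-layer evaluation described above can be sketched as follows. This is a minimal, hypothetical illustration with synthetic features standing in for SSL hidden states (for a real model, e.g. Wav2Vec2 in HuggingFace `transformers`, one would request `output_hidden_states=True` and mean-pool each layer over time); the lightweight regression network is simplified to a least-squares linear head, and all sizes and data are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for utterance-level SSL features: one (n_utt, dim)
# matrix per transformer layer, already mean-pooled over time.
num_layers, dim, n_utt = 12, 8, 40

def fit_linear_head(X, y):
    """Fit a least-squares linear regression head (weights + bias)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

# Toy corpus: 40 utterances with MOS labels in [1, 5].
mos = rng.uniform(1.0, 5.0, size=n_utt)

# Random per-layer features; layer 2 is made MOS-informative on purpose
# so the layer-selection loop has something to find.
feats = [rng.normal(size=(n_utt, dim)) for _ in range(num_layers)]
feats[2][:, 0] = mos + rng.normal(scale=0.05, size=n_utt)

# Evaluate each layer independently: train the head, measure test MSE.
train, test = slice(0, 30), slice(30, None)
mse_per_layer = []
for X in feats:
    w = fit_linear_head(X[train], mos[train])
    err = predict(w, X[test]) - mos[test]
    mse_per_layer.append(float(np.mean(err ** 2)))

best = int(np.argmin(mse_per_layer))
print(f"best layer: {best}, test MSE: {mse_per_layer[best]:.3f}")
```

The design point mirrored here is that the SSL backbone stays frozen: only the small head is trained per layer, so comparing layers is cheap, and once the best layer is chosen the deeper layers can be dropped at inference time.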
Related benchmarks
| Task | Dataset | Result (MSE) | Rank |
|---|---|---|---|
| Non-intrusive Speech Quality Assessment | AudioMOS (test) | 0.282 (UTT) | 6 |
| Speech Quality Assessment | NISQA FOR (test) | -- | 5 |
| Speech Quality Assessment (MOS Prediction) | Tencent w/o R (test) | 0.751 | 4 |
| Speech Quality Assessment (MOS Prediction) | Tencent w R (test) | 0.421 | 4 |
| Speech Quality Assessment (MOS Prediction) | TCD-VoIP (test) | 0.615 | 4 |
| Speech Quality Assessment (MOS Prediction) | NISQA P501 (test) | 0.463 | 4 |
| Speech Quality Assessment (MOS Prediction) | NISQA LiveTalk (test) | 0.418 | 4 |