MSR-HuBERT: Self-supervised Pre-training for Adaptation to Multiple Sampling Rates
About
Self-supervised learning (SSL) has advanced speech processing. However, existing speech SSL methods typically assume a single sampling rate and struggle with mixed-rate data due to temporal resolution mismatch. To address this limitation, we propose MSR-HuBERT, a multi-sampling-rate adaptive pre-training method. Building on HuBERT, we replace its single-rate downsampling CNN with a multi-sampling-rate adaptive downsampling CNN that maps raw waveforms at different sampling rates to a shared temporal resolution without resampling. This design enables unified mixed-rate pre-training and fine-tuning. In experiments spanning 16 to 48 kHz, MSR-HuBERT outperforms HuBERT on speech recognition and full-band speech reconstruction, preserving high-frequency detail while modeling low-frequency semantic structure. Moreover, MSR-HuBERT retains HuBERT's mask-prediction objective and Transformer encoder, so existing analyses and improvements developed for HuBERT carry over directly.
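The key idea above, mapping waveforms at different sampling rates to one shared temporal resolution, can be sketched as rate-specific CNN stride schedules whose products all yield the same output frame rate. The sketch below is illustrative, not the released implementation: the 50 Hz target matches HuBERT's 20 ms frames, but the stride schedules for 22.05, 24, and 48 kHz are assumptions chosen only so the products work out.

```python
# Hypothetical sketch: per-sampling-rate downsampling schedules that map raw
# waveforms at different rates to a shared 50 Hz frame rate (HuBERT uses
# 20 ms frames at 16 kHz, i.e. a total stride of 320).
TARGET_FPS = 50  # shared output frame rate across all sampling rates

# Each schedule's stride product equals sample_rate / TARGET_FPS, so every
# rate lands on the same temporal resolution. Only the 16 kHz schedule is
# HuBERT's default; the others are illustrative assumptions.
STRIDES = {
    16000: [5, 2, 2, 2, 2, 2, 2],      # product 320
    22050: [7, 7, 3, 3],               # product 441
    24000: [5, 3, 2, 2, 2, 2, 2],      # product 480
    48000: [5, 3, 2, 2, 2, 2, 2, 2],   # product 960
}

def num_frames(num_samples: int, rate: int) -> int:
    """Output frames produced by the rate-specific downsampling CNN."""
    factor = 1
    for stride in STRIDES[rate]:
        factor *= stride
    # Sanity check: the schedule must reduce this rate to TARGET_FPS.
    assert factor * TARGET_FPS == rate, "stride product must match the rate"
    return num_samples // factor

# One second of audio yields 50 frames at every supported rate, so the
# Transformer encoder sees the same temporal resolution regardless of input rate.
for rate in STRIDES:
    print(rate, num_frames(rate, rate))
```

Because every schedule produces the same frame rate, the mask-prediction targets and the Transformer encoder need no rate-specific changes; only the convolutional front-end branches per rate.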
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | SUPERB 16 kHz | WER | 5.89 | 12 |
| Automatic Speech Recognition | SUPERB 22.05 kHz | WER | 2.9 | 12 |
| Automatic Speech Recognition | SUPERB 24 kHz | WER | 6.35 | 12 |
| Automatic Speech Recognition | SUPERB 48 kHz | WER | 5.56 | 12 |
| Speech Reconstruction | Full-band SR 16 kHz | STOI | 90.26 | 6 |
| Speech Reconstruction | Full-band SR 22.05 kHz | STOI | 94.38 | 6 |
| Speech Reconstruction | Full-band SR 24 kHz | STOI | 89.25 | 6 |
| Speech Reconstruction | Full-band SR 48 kHz | STOI | 85.79 | 6 |