
MSR-HuBERT: Self-supervised Pre-training for Adaptation to Multiple Sampling Rates

About

Self-supervised learning (SSL) has advanced speech processing. However, existing speech SSL methods typically assume a single sampling rate and struggle with mixed-rate data due to temporal resolution mismatch. To address this limitation, we propose MSR-HuBERT, a multi-sampling-rate adaptive pre-training method. Building on HuBERT, we replace its single-rate downsampling CNN with a multi-sampling-rate adaptive downsampling CNN that maps raw waveforms at different sampling rates to a shared temporal resolution without resampling. This design enables unified mixed-rate pre-training and fine-tuning. In experiments spanning 16 to 48 kHz, MSR-HuBERT outperforms HuBERT on speech recognition and full-band speech reconstruction, preserving high-frequency detail while modeling low-frequency semantic structure. Moreover, MSR-HuBERT retains HuBERT's mask-prediction objective and Transformer encoder, so existing analyses and improvements developed for HuBERT apply directly.
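The key idea of mapping different sampling rates to a shared temporal resolution can be illustrated with a small sketch. This is a hypothetical simplification, not the paper's exact architecture: HuBERT's CNN front-end downsamples 16 kHz audio by a total stride factor of 320, producing 50 frames per second; a multi-rate front-end can instead choose a per-rate total stride of rate / 50 so that every supported input rate yields the same 50 Hz frame resolution without resampling the waveform.

```python
# Hypothetical illustration of rate-adaptive downsampling factors.
# Assumption (not stated in the abstract): the shared temporal resolution
# is HuBERT's 50 Hz frame rate, i.e. a 320x stride at 16 kHz.

TARGET_FRAME_RATE_HZ = 50  # HuBERT's frame rate for 16 kHz input

def stride_product(sample_rate_hz: int) -> int:
    """Total downsampling factor needed to reach the shared frame rate."""
    if sample_rate_hz % TARGET_FRAME_RATE_HZ != 0:
        raise ValueError(
            f"{sample_rate_hz} Hz is not divisible by {TARGET_FRAME_RATE_HZ} Hz"
        )
    return sample_rate_hz // TARGET_FRAME_RATE_HZ

def num_frames(num_samples: int, sample_rate_hz: int) -> int:
    """Frames produced for a waveform, ignoring padding/edge effects."""
    return num_samples // stride_product(sample_rate_hz)

# One second of audio at every rate used in the paper maps to 50 frames:
for rate in (16_000, 22_050, 24_000, 48_000):
    print(rate, "Hz -> stride", stride_product(rate),
          "->", num_frames(rate, rate), "frames/s")
```

In practice such a front-end would realize each total stride as a product of per-layer convolution strides (e.g. 320 = 5·2·2·2·2·2 in HuBERT), with the rate-specific early layers feeding shared later layers; the sketch only checks the arithmetic that makes the shared resolution possible.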

Zikang Huang, Meng Ge, Tianrui Wang, Xuanchen Li, Xiaobao Wang, Longbiao Wang, Jianwu Dang • 2026

Related benchmarks

Task                          Dataset                 Result        Rank
Automatic Speech Recognition  SUPERB 16kHz            WER 5.89      12
Automatic Speech Recognition  SUPERB 22.05kHz         WER 2.9       12
Automatic Speech Recognition  SUPERB 24kHz            WER 6.35      12
Automatic Speech Recognition  SUPERB 48kHz            WER 5.56      12
Speech Reconstruction         Full-band SR 16kHz      STOI 90.26    6
Speech Reconstruction         Full-band SR 22.05kHz   STOI 94.38    6
Speech Reconstruction         Full-band SR 24kHz      STOI 89.25    6
Speech Reconstruction         Full-band SR 48kHz      STOI 85.79    6
