SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations

About

Self-supervised learning (SSL) has significantly advanced acoustic representation learning. However, most existing models are optimised for either speech or audio event understanding, resulting in a persistent gap between these two domains. We address this gap with SPEAR (SPEech and Audio Representations), a self-supervised framework that distils complementary knowledge from a speech-focused SSL teacher and a general-audio SSL teacher into a single unified model. SPEAR applies multi-codebook vector quantisation to continuous teacher representations to produce fine-grained discrete tokens that capture both semantic and acoustic information. To effectively integrate these heterogeneous representations, SPEAR jointly predicts them given a masked input with an asymmetric pre-training loss. We further improve robustness in complex sound scenes through a novel token mixing mechanism. Extensive experiments demonstrate that SPEAR consistently outperforms existing unified speech and audio models. SPEAR establishes a new state-of-the-art on the SUPERB benchmark, surpassing WavLM Large on 12 of 15 tasks, while achieving competitive performance on the HEAR benchmark. These results position SPEAR as a versatile foundation for general-purpose speech and audio representation learning. The code and pre-trained models will be released.

Xiaoyu Yang, Yifan Yang, Zengrui Jin, Ziyun Cui, Wen Wu, Baoxiang Li, Chao Zhang, Phil Woodland • 2025
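
The abstract describes turning continuous teacher representations into several discrete tokens per frame through multi-codebook vector quantisation. The following is a minimal sketch of that idea, not the authors' implementation: it swaps in a simple product-quantisation-style nearest-neighbour lookup, and the function name `multi_codebook_quantise`, the random codebooks, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np


def multi_codebook_quantise(features, codebooks):
    """Map each frame to one token per codebook (illustrative only).

    features:  (T, D) continuous teacher representations.
    codebooks: (N, K, D // N) array, i.e. N codebooks of K entries each.
    Returns:   (T, N) integer token indices.
    """
    num_frames, dim = features.shape
    num_codebooks, _, sub_dim = codebooks.shape
    assert dim == num_codebooks * sub_dim, "D must equal N * sub_dim"

    # Split the feature dimension into N contiguous sub-vectors: (T, N, sub_dim).
    chunks = features.reshape(num_frames, num_codebooks, sub_dim)

    # Squared Euclidean distance from each sub-vector to every entry
    # of its own codebook: (T, N, K).
    diffs = chunks[:, :, None, :] - codebooks[None, :, :, :]
    dists = np.sum(diffs ** 2, axis=-1)

    # The token for each codebook is the index of its nearest entry.
    return np.argmin(dists, axis=-1)


rng = np.random.default_rng(0)

# Hypothetical shapes: 50 frames, 16-dim features, 4 codebooks of 8 entries.
# Random arrays stand in for the speech and general-audio teacher outputs.
speech_feats = rng.normal(size=(50, 16))
audio_feats = rng.normal(size=(50, 16))
speech_codebooks = rng.normal(size=(4, 8, 4))
audio_codebooks = rng.normal(size=(4, 8, 4))

speech_tokens = multi_codebook_quantise(speech_feats, speech_codebooks)
audio_tokens = multi_codebook_quantise(audio_feats, audio_codebooks)
print(speech_tokens.shape, audio_tokens.shape)
```

In this sketch each frame yields four tokens per teacher, so a student model trained in the way the abstract outlines would have two sets of discrete targets to predict at masked positions. How the two prediction losses are weighted is not specified on this page and is left out here.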

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Automatic Speech Recognition | LibriSpeech 960h (test-other) | WER 2.9 | 88 |
| Automatic Speech Recognition | LibriSpeech 960h (test-clean) | WER 0.016 | 60 |
| Audio Representation Evaluation | HEAR (Holistic Evaluation of Audio Representations) | HEAR Average 83.41 | 47 |
| Audio Event Tagging | AudioSet AS-2M (full) | mAP 50 | 45 |
| Automatic Speech Recognition | LibriSpeech 100h (test-clean) | WER 2.4 | 43 |
| Speech Processing | SUPERB | KWS Acc 0.9812 | 24 |
| Audio Understanding | X-Ares | ASV2015 99.81 | 21 |
| Automatic Speech Recognition | LibriSpeech 100h (test-other) | WER 4.6 | 21 |
| Music Understanding | X-Ares | FMA Score 64.16 | 19 |
| Speech Understanding | X-Ares | CREMA-D Score 78.41 | 19 |
Showing 10 of 11 rows
