SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations

About

Self-supervised learning (SSL) has significantly advanced acoustic representation learning. However, most existing models are optimised for either speech or audio event understanding, resulting in a persistent gap between these two domains. We address this gap with SPEAR (SPEech and Audio Representations), a self-supervised framework that distils complementary knowledge from a speech-focused SSL teacher and a general-audio SSL teacher into a single unified model. SPEAR applies multi-codebook vector quantisation to continuous teacher representations to produce fine-grained discrete tokens that capture both semantic and acoustic information. To effectively integrate these heterogeneous representations, SPEAR jointly predicts them given a masked input with an asymmetric pre-training loss. We further improve robustness in complex sound scenes through a novel token mixing mechanism. Extensive experiments demonstrate that SPEAR consistently outperforms existing unified speech and audio models. SPEAR establishes a new state-of-the-art on the SUPERB benchmark, surpassing WavLM Large on 12 of 15 tasks, while achieving competitive performance on the HEAR benchmark. These results position SPEAR as a versatile foundation for general-purpose speech and audio representation learning. The code and pre-trained models will be released.

Xiaoyu Yang, Yifan Yang, Zengrui Jin, Ziyun Cui, Wen Wu, Baoxiang Li, Chao Zhang, Phil Woodland • 2025
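
The abstract says SPEAR turns continuous teacher representations into discrete tokens with multi-codebook vector quantisation, but it does not specify the quantiser. The following is a minimal sketch of the general idea only, not the authors' implementation. It assumes a simple product-quantisation layout in which each codebook covers one slice of the feature vector. The function names `nearest_codeword` and `multi_codebook_quantise` and the toy codebooks are hypothetical.

```python
def nearest_codeword(vector, codebook):
    """Return the index of the codeword closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, codeword in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def multi_codebook_quantise(frames, codebooks):
    """Map each frame to one token index per codebook.

    Assumption for this sketch: codebook n quantises the n-th slice of the
    frame, so a frame of dimension D is split into len(codebooks) equal parts.
    """
    sub_dim = len(codebooks[0][0])
    tokens = []
    for frame in frames:
        assert len(frame) == sub_dim * len(codebooks)
        tokens.append([
            nearest_codeword(frame[n * sub_dim:(n + 1) * sub_dim], codebook)
            for n, codebook in enumerate(codebooks)
        ])
    return tokens


# Toy data: two codebooks with two 2-dimensional codewords each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
# Two 4-dimensional "teacher" frames (made-up values).
frames = [
    [0.9, 1.1, 0.1, 0.8],
    [0.1, -0.2, 0.9, 0.2],
]
print(multi_codebook_quantise(frames, codebooks))  # [[1, 0], [0, 1]]
```

Each frame yields one token per codebook, so using several codebooks gives a finer-grained discrete target than a single index per frame. In the framework described above, tokens of this kind from both teachers would serve as the prediction targets for the masked input.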

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech 960h (test-other)	WER 2.9	98
Automatic Speech Recognition	LibriSpeech 960h (test-clean)	WER 0.016	70
Automatic Speech Recognition	LibriSpeech 100h (test-clean)	WER 2.4	64
Audio Representation Evaluation	HEAR (Holistic Evaluation of Audio Representations)	HEAR Average 83.41	59
Speech Processing	SUPERB	KWS Acc 0.9812	52
Audio Event Tagging	AudioSet AS-2M (full)	mAP 50	45
Automatic Speech Recognition	LibriSpeech 100h (test-other)	WER 4.6	42
Audio Event Tagging	AudioSet (AS-20K)	mAP 39.4	39
Audio Understanding	X-Ares	ASV2015 99.81	21
Music Understanding	X-Ares	FMA Score 64.16	19

Showing 10 of 17 rows
