SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations
About
Self-supervised learning (SSL) has significantly advanced acoustic representation learning. However, most existing models are optimised for either speech or audio event understanding, resulting in a persistent gap between these two domains. We address this gap with SPEAR (SPEech and Audio Representations), a self-supervised framework that distils complementary knowledge from a speech-focused SSL teacher and a general-audio SSL teacher into a single unified model. SPEAR applies multi-codebook vector quantisation to continuous teacher representations to produce fine-grained discrete tokens that capture both semantic and acoustic information. To effectively integrate these heterogeneous representations, SPEAR jointly predicts them given a masked input with an asymmetric pre-training loss. We further improve robustness in complex sound scenes through a novel token mixing mechanism. Extensive experiments demonstrate that SPEAR consistently outperforms existing unified speech and audio models. SPEAR establishes a new state-of-the-art on the SUPERB benchmark, surpassing WavLM Large on 12 of 15 tasks, while achieving competitive performance on the HEAR benchmark. These results position SPEAR as a versatile foundation for general-purpose speech and audio representation learning. The code and pre-trained models will be released.
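The step of turning continuous teacher representations into several discrete tokens per frame can be illustrated with a minimal sketch. It is a product-quantisation-style simplification and not the quantiser used in SPEAR, whose details the abstract does not give: the codebooks here are random rather than learned, and the names `nearest_code` and `multi_codebook_quantise`, the feature dimension (16), the number of codebooks (4), and the codebook size (8) are all assumptions chosen for illustration.

```python
import random


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def multi_codebook_quantise(frame, codebooks):
    """Map one continuous frame to one discrete token per codebook.

    Assumption: the frame is split into equal sub-vectors, one per codebook.
    """
    num_codebooks = len(codebooks)
    if len(frame) % num_codebooks != 0:
        raise ValueError("frame dimension must be divisible by the number of codebooks")
    sub_dim = len(frame) // num_codebooks
    tokens = []
    for n, codebook in enumerate(codebooks):
        sub_vector = frame[n * sub_dim:(n + 1) * sub_dim]
        tokens.append(nearest_code(sub_vector, codebook))
    return tokens


# Hypothetical stand-ins: 4 random codebooks of 8 entries each, and 50
# random 16-dimensional frames playing the role of teacher representations.
rng = random.Random(0)
codebooks = [[[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)] for _ in range(4)]
teacher_frames = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(50)]

tokens = [multi_codebook_quantise(frame, codebooks) for frame in teacher_frames]
```

Each frame yields one token per codebook, so a student model in this simplified setting would have four discrete targets to predict at every masked position.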
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech 960h (test-other) | WER | 2.9 | 88 |
| Automatic Speech Recognition | LibriSpeech 960h (test-clean) | WER | 0.016 | 60 |
| Audio Representation Evaluation | HEAR (Holistic Evaluation of Audio Representations) | HEAR Average | 83.41 | 47 |
| Audio Event Tagging | AudioSet AS-2M (full) | mAP | 50 | 45 |
| Automatic Speech Recognition | LibriSpeech 100h (test-clean) | WER | 2.4 | 43 |
| Speech Processing | SUPERB | KWS Acc | 0.9812 | 24 |
| Audio Understanding | X-Ares | ASV2015 | 99.81 | 21 |
| Automatic Speech Recognition | LibriSpeech 100h (test-other) | WER | 4.6 | 21 |
| Music Understanding | X-Ares | FMA Score | 64.16 | 19 |
| Speech Understanding | X-Ares | CREMA-D Score | 78.41 | 19 |