USAD: Universal Speech and Audio Representation via Distillation
About
Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types (speech, sound, and music) into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD delivers competitive performance across varied benchmarks and datasets, including frame-level and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on the SUPERB and HEAR benchmarks.
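To make the layer-to-layer distillation idea concrete, below is a minimal PyTorch sketch of a feature-regression loss between student and teacher hidden states. The class name, the per-layer linear projections, and the L1-plus-cosine objective are illustrative assumptions; this section does not specify USAD's exact loss, layer mapping, or teacher set.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerToLayerDistillationLoss(nn.Module):
    """Sketch of a layer-to-layer distillation objective.

    Selected student hidden states are linearly projected and regressed onto
    the hidden states of a frozen, domain-specific teacher at corresponding
    layers. Names and the exact loss form are assumptions for illustration,
    not the USAD implementation.
    """

    def __init__(self, student_dim: int, teacher_dim: int, num_layers: int):
        super().__init__()
        # One projection head per distilled student/teacher layer pair.
        self.proj = nn.ModuleList(
            nn.Linear(student_dim, teacher_dim) for _ in range(num_layers)
        )

    def forward(self, student_hiddens, teacher_hiddens):
        """Both arguments are lists of (batch, frames, dim) tensors, one per layer."""
        loss = 0.0
        for proj, s, t in zip(self.proj, student_hiddens, teacher_hiddens):
            p = proj(s)
            # L1 regression plus a cosine term is a common feature-distillation
            # choice; the paper's actual objective may differ.
            loss = loss + F.l1_loss(p, t) - F.cosine_similarity(p, t, dim=-1).mean()
        return loss / len(self.proj)


# Usage sketch: distill from two frozen teachers (e.g. a speech SSL model and a
# general-audio SSL model) into one student; the models here are placeholders.
# speech_loss = l2l_speech(student_layers, speech_teacher_layers)
# audio_loss = l2l_audio(student_layers, audio_teacher_layers)
# total_loss = speech_loss + audio_loss
```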
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Representation Evaluation | HEAR (Holistic Evaluation of Audio Representations) | HEAR Average | 79.36 | 47 |
| Audio Event Tagging | AudioSet AS-2M (full) | mAP | 48.6 | 45 |
| Automatic Speech Recognition | LibriSpeech 100h (test-clean) | WER | 4 | 43 |
| Speech Processing | SUPERB | KWS Acc | 0.971 | 24 |
| Automatic Speech Recognition | LibriSpeech 100h (test-other) | WER | 7.7 | 21 |
| Bioacoustic Analysis | Beans | wtkn | 86.1 | 20 |
| Audio Tagging | AudioSet balanced (AS-20k) | mAP | 38.9 | 14 |