
USAD: Universal Speech and Audio Representation via Distillation

About

Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types - speech, sound, and music - into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame- and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on the SUPERB and HEAR benchmarks.

Heng-Jui Chang, Saurabhchand Bhati, James Glass, Alexander H. Liu • 2025
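As a rough illustration of the layer-to-layer distillation idea described in the abstract - not the paper's exact formulation - a student model's hidden states can be matched to corresponding teacher layers through small projections and a regression loss. All dimensions, names, and the choice of mean-squared-error loss below are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: hypothetical, not taken from the paper.
T, D_STUDENT, D_TEACHER, N_LAYERS = 50, 256, 768, 4

# One linear projection per matched layer, mapping student features
# into the teacher's feature space.
projections = [rng.normal(scale=0.02, size=(D_STUDENT, D_TEACHER))
               for _ in range(N_LAYERS)]

def layer_to_layer_loss(student_layers, teacher_layers):
    """Mean squared error between projected student hidden states and
    teacher hidden states, averaged over the matched layers.
    (An assumed objective; actual distillation losses vary.)"""
    total = 0.0
    for s, t, W in zip(student_layers, teacher_layers, projections):
        total += np.mean((s @ W - t) ** 2)
    return total / len(projections)

# Random arrays standing in for SSL model activations of shape
# (time_steps, feature_dim) at each layer.
student = [rng.normal(size=(T, D_STUDENT)) for _ in range(N_LAYERS)]
teacher = [rng.normal(size=(T, D_TEACHER)) for _ in range(N_LAYERS)]

print(layer_to_layer_loss(student, teacher))
```

In a real setup this scalar would be minimized by gradient descent over the student's weights and the projections, with teacher activations kept frozen; training against multiple domain-specific teachers (speech, sound, music) is what would unify the domains in one student encoder.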

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Representation Evaluation | HEAR (Holistic Evaluation of Audio Representations) | HEAR Average | 79.36 | 47 |
| Audio Event Tagging | AudioSet AS-2M (full) | mAP | 48.6 | 45 |
| Automatic Speech Recognition | LibriSpeech 100h (test-clean) | WER | 4 | 43 |
| Speech Processing | SUPERB | KWS Acc | 0.971 | 24 |
| Automatic Speech Recognition | LibriSpeech 100h (test-other) | WER | 7.7 | 21 |
| Bioacoustic Analysis | BEANS | wtkn | 86.1 | 20 |
| Audio Tagging | AudioSet balanced (AS-20k) | mAP | 38.9 | 14 |
