USAD: Universal Speech and Audio Representation via Distillation
About
Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types (speech, sound, and music) into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD delivers competitive performance across varied benchmarks and datasets, including frame-level and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on the SUPERB and HEAR benchmarks.
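To make the layer-to-layer distillation idea concrete, below is a minimal PyTorch sketch of a feature-regression loss between student and teacher hidden states. The class name, the per-layer linear projections, and the L1-plus-cosine objective are illustrative assumptions; this section does not specify USAD's exact loss, layer mapping, or teacher set.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerToLayerDistillationLoss(nn.Module):
    """Sketch of a layer-to-layer distillation objective.

    Selected student hidden states are linearly projected and regressed onto
    the hidden states of a frozen, domain-specific teacher at corresponding
    layers. Names and the exact loss form are assumptions for illustration,
    not the USAD implementation.
    """

    def __init__(self, student_dim: int, teacher_dim: int, num_layers: int):
        super().__init__()
        # One projection head per distilled student/teacher layer pair.
        self.proj = nn.ModuleList(
            nn.Linear(student_dim, teacher_dim) for _ in range(num_layers)
        )

    def forward(self, student_hiddens, teacher_hiddens):
        """Both arguments are lists of (batch, frames, dim) tensors, one per layer."""
        loss = 0.0
        for proj, s, t in zip(self.proj, student_hiddens, teacher_hiddens):
            p = proj(s)
            # L1 regression plus a cosine term is a common feature-distillation
            # choice; the paper's actual objective may differ.
            loss = loss + F.l1_loss(p, t) - F.cosine_similarity(p, t, dim=-1).mean()
        return loss / len(self.proj)


# Usage sketch: distill from two frozen teachers (e.g. a speech SSL model and a
# general-audio SSL model) into one student; the models here are placeholders.
# speech_loss = l2l_speech(student_layers, speech_teacher_layers)
# audio_loss = l2l_audio(student_layers, audio_teacher_layers)
# total_loss = speech_loss + audio_loss
```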
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Representation Evaluation | HEAR (Holistic Evaluation of Audio Representations) | HEAR Average | 79.36 | 47 |
| Audio Event Tagging | AudioSet AS-2M (full) | mAP | 48.6 | 45 |
| Automatic Speech Recognition | LibriSpeech 100h (test-clean) | WER | 4 | 43 |
| Speech Processing | SUPERB | KWS Acc | 0.971 | 24 |
| Automatic Speech Recognition | LibriSpeech 100h (test-other) | WER | 7.7 | 21 |
| Bioacoustic Analysis | Beans | wtkn | 86.1 | 20 |
| Audio Tagging | AudioSet balanced (AS-20k) | mAP | 38.9 | 14 |