S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

About

General audio foundation models have recently achieved remarkable progress, enabling strong performance across diverse tasks. However, state-of-the-art models remain extremely large, often with hundreds of millions of parameters, leading to high inference costs and limited deployability on edge devices. Knowledge distillation is a proven strategy for model compression, but prior work in audio has mostly focused on supervised settings, relying on class logits, intermediate features, or architecture-specific techniques. Such assumptions exclude models that output only embeddings, such as self-supervised or metric-learning models. We introduce S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models), the first framework to distill general audio models using only their output embeddings. By avoiding the need for logits or layer-level alignment, S-SONDO is architecture-agnostic and broadly applicable to embedding-based teachers. We demonstrate its effectiveness by distilling two audio foundation models into three efficient students that are up to 61 times smaller while retaining up to 96% of teacher performance. We also provide practical insights on loss choice and clustering-based balanced data sampling. Code is available here: https://github.com/MedAliAdlouni/ssondo.
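To make the idea of distilling from output embeddings alone more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state which losses S-SONDO uses, so the cosine-distance and mean-squared-error objectives, the linear projection matrix `W`, and all function names (`cosine_distance`, `mse`, `project`, `distillation_loss`) are illustrative assumptions. The sketch only shows that a student can be trained against teacher embeddings without access to logits or intermediate layers.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against division by zero for all-zero vectors.
    return 1.0 - dot / max(norm_u * norm_v, 1e-12)


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def project(W, x):
    """Hypothetical linear head: maps a student embedding to the teacher's
    embedding dimension. W has one row per teacher dimension."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def distillation_loss(student_batch, teacher_batch, W, loss_fn=cosine_distance):
    """Average embedding-level loss over a batch.

    Only the teacher's output embeddings are needed; no logits or
    layer-wise features are used (assumed setup, for illustration only).
    """
    projected = [project(W, s) for s in student_batch]
    losses = [loss_fn(p, t) for p, t in zip(projected, teacher_batch)]
    return sum(losses) / len(losses)


# Toy example: 2-dimensional student embeddings, 3-dimensional teacher embeddings.
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
student = [[1.0, 0.0], [0.0, 2.0]]
teacher = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
print(distillation_loss(student, teacher, W))            # cosine-distance variant
print(distillation_loss(student, teacher, W, loss_fn=mse))  # MSE variant
```

In this toy batch the first student embedding already points in the teacher's direction and the second is orthogonal to it, so the two pairs contribute very differently to the average. Swapping `loss_fn` is one simple way to compare loss choices under this assumed setup.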

Mohammed Ali El Adlouni, Aurian Quelennec, Pierre Chouteau, Geoffroy Peeters, Slim Essid • 2026

Related benchmarks

Task                               Dataset   Metric          Result  Rank
Audio Classification               ESC-50    Accuracy        91.9    441
Musical Instrument Classification  NSynth    Accuracy        74.9    117
Music Genre Classification         GTZAN     Accuracy        85.6    62
Audio Classification               OpenMIC   mAP             84.8    22
Audio Classification               US8K      Top-1 Accuracy  86.2    19
Audio Tagging                      MTT       MTT AP          40.2    19
Audio Tagging                      FSD50K    mAP             48.6    11
