CED: Consistent ensemble distillation for audio tagging

About

Augmentation and knowledge distillation (KD) are well-established techniques in audio classification, used to improve performance and reduce model size on the widely used Audioset (AS) benchmark. Although both techniques are effective individually, their combined use, called consistent teaching, has not been explored before. This paper proposes CED, a simple training framework that distils student models from large teacher ensembles with consistent teaching. To achieve this, CED efficiently stores logits as well as the augmentation methods on disk, making it scalable to large-scale datasets. Central to CED's efficacy is its label-free nature: only the stored logits are used to optimize a student model, and storing them requires only 0.3% additional disk space for AS. The study trains various transformer-based models, including a 10M parameter model achieving a 49.0 mean average precision (mAP) on AS. Pretrained models and code are available at https://github.com/RicherMans/CED.
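
The idea in the abstract (store the teacher ensemble's logits together with the augmentation that produced them, then train the student on the replayed augmentation using only those logits) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the CED repository: the seed-based `augment` function, the averaging of teacher logits, the binary cross-entropy loss on sigmoid outputs, and all function names are hypothetical choices made for the example.

```python
import math
import random


def augment(waveform, seed):
    """Hypothetical augmentation (random gain + circular shift).

    Everything random is derived from `seed`, so storing the seed is
    enough to replay the exact same augmentation later.
    """
    rng = random.Random(seed)
    gain = rng.uniform(0.5, 1.5)
    shift = rng.randrange(len(waveform))
    shifted = waveform[shift:] + waveform[:shift]
    return [gain * x for x in shifted]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ensemble_logits(teachers, augmented):
    """Average per-class logits over all teachers (assumed ensembling rule)."""
    outputs = [teacher(augmented) for teacher in teachers]
    return [sum(per_class) / len(outputs) for per_class in zip(*outputs)]


def build_store(dataset, teachers, base_seed=0):
    """Teacher pass: keep (seed, ensemble logits) per clip. No labels are stored."""
    store = {}
    for index, (clip_id, waveform) in enumerate(dataset.items()):
        seed = base_seed + index
        logits = ensemble_logits(teachers, augment(waveform, seed))
        store[clip_id] = {"seed": seed, "logits": logits}
    return store


def distillation_loss(student_logits, teacher_logits):
    """Binary cross-entropy with the teacher's sigmoid outputs as soft targets."""
    total = 0.0
    for s, t in zip(student_logits, teacher_logits):
        p, q = sigmoid(t), sigmoid(s)
        total -= p * math.log(q) + (1.0 - p) * math.log(1.0 - q)
    return total / len(student_logits)


def student_step(student, store, dataset, clip_id):
    """Student pass: replay the stored augmentation, then match the stored logits."""
    record = store[clip_id]
    replayed = augment(dataset[clip_id], record["seed"])
    return distillation_loss(student(replayed), record["logits"])
```

In this sketch a teacher or student is any callable that maps a list of samples to a list of per-class logits. The point is that the student sees exactly the input the teachers saw, which is what makes the stored logits valid targets for that augmented clip.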

Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang • 2023

Related benchmarks

Task                               Dataset                  Result          Rank
Audio Classification               ESC-50                   Accuracy 97.3   325
Audio Classification               AudioSet 20K             mAP 44          128
Audio Classification               Urbansound8K             Accuracy 87.8   116
Audio Classification               AudioSet 2M              mAP 50          79
Musical Instrument Classification  NSynth                   Accuracy 75.6   75
Audio Classification               SPC V2                   Accuracy 89     65
Audio Classification               GTZAN                    Accuracy 42.3   54
Speech Classification              VF                       Accuracy 94.8   47
Emotion Recognition                CRM-D                    Accuracy 66.1   39
Audio Event Tagging                AudioSet AS-2M (full)    mAP 50          33
Showing 10 of 13 rows
