CED: Consistent ensemble distillation for audio tagging

About

Augmentation and knowledge distillation (KD) are well-established techniques in audio classification, used to improve performance and reduce model size on the widely recognized Audioset (AS) benchmark. Although both techniques are effective individually, their combined use, called consistent teaching, has not been explored before. This paper proposes CED, a simple training framework that distils student models from large teacher ensembles with consistent teaching. To achieve this, CED efficiently stores logits as well as the augmentation methods on disk, making it scalable to large-scale datasets. Central to CED's efficacy is its label-free nature: only the stored logits are used to optimize a student model, which requires only 0.3% additional disk space for AS. The study trains various transformer-based models, including a 10M parameter model achieving a 49.0 mean average precision (mAP) on AS. Pretrained models and code are available at https://github.com/RicherMans/CED.

Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang • 2023
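
The consistent-teaching idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). Everything here is an assumption made for illustration: the function names, the toy seed-driven augmentation, the in-memory dict standing in for disk storage, averaging as the way to combine ensemble logits, and binary cross-entropy on sigmoid outputs as the distillation loss (chosen because audio tagging is multi-label). The point it shows is that the teachers' logits are stored together with the augmentation parameters, so the student can later be trained on exactly the same augmented view without any ground-truth labels.

```python
import math
import random


def augment(clip, seed):
    """Hypothetical augmentation: circular time shift plus gain.

    The result is fully determined by `seed`, so storing the seed is
    enough to reproduce the exact view later.
    """
    rng = random.Random(seed)
    shift = rng.randrange(len(clip))
    gain = rng.uniform(0.5, 1.5)
    return [gain * x for x in clip[shift:] + clip[:shift]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ensemble_logits(teachers, view):
    """Average the per-class logits of all teachers (an assumed combination rule)."""
    outputs = [teacher(view) for teacher in teachers]
    return [sum(column) / len(outputs) for column in zip(*outputs)]


def build_store(clips, teachers, seeds):
    """Teacher pass: record (augmentation seed, ensemble logits) per clip.

    A dict stands in for the on-disk storage mentioned in the abstract.
    """
    store = {}
    for clip_id, clip in clips.items():
        seed = seeds[clip_id]
        store[clip_id] = (seed, ensemble_logits(teachers, augment(clip, seed)))
    return store


def distill_loss(student, clip, record, eps=1e-7):
    """Label-free loss: the student sees the same view the teachers saw."""
    seed, teacher_logits = record
    view = augment(clip, seed)  # replay the stored augmentation
    loss = 0.0
    for s, t in zip(student(view), teacher_logits):
        p = min(max(sigmoid(s), eps), 1.0 - eps)  # student probability
        q = sigmoid(t)                            # soft target from the teachers
        loss -= q * math.log(p) + (1.0 - q) * math.log(1.0 - p)
    return loss / len(teacher_logits)


# Toy usage: two "teachers" and one "student", each mapping a clip to two logits.
clips = {"clip0": [0.1, -0.2, 0.3, 0.4]}
teachers = [lambda v: [sum(v), v[0]], lambda v: [sum(v) + 1.0, v[0] - 1.0]]
store = build_store(clips, teachers, seeds={"clip0": 7})
student = lambda v: [0.0, 0.0]
print(distill_loss(student, clips["clip0"], store["clip0"]))
```

In this sketch only a seed and one logit vector are kept per clip, which is why the storage overhead stays small relative to the audio itself; the exact storage format used by CED is not specified here.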

Related benchmarks

Task	Dataset	Result
Audio Classification	ESC-50	Accuracy 97.3	441
Audio Classification	AudioSet 20K	mAP 44	147
Audio Classification	Urbansound8K	Accuracy 87.8	126
Musical Instrument Classification	NSynth	Accuracy 75.6	117
Audio Classification	AudioSet 2M	mAP 50	98
Audio Classification	SPC V2	Accuracy 89	65
Audio Classification	GTZAN	Accuracy 42.3	59
Speech Classification	VF	Accuracy 94.8	47
Audio Event Tagging	AudioSet AS-2M (full)	mAP 50	45
Emotion Recognition	CRM-D	Accuracy 66.1	39

Showing 10 of 15 rows
