Effective Pre-Training of Audio Transformers for Sound Event Detection

About

We propose a pre-training pipeline for audio spectrogram transformers for frame-level sound event detection tasks. On top of common pre-training steps, we add a meticulously designed training routine on AudioSet frame-level annotations. This includes a balanced sampler, aggressive data augmentation, and ensemble knowledge distillation. For five transformers, we obtain a substantial performance improvement over previously available checkpoints both on AudioSet frame-level predictions and on frame-level sound event detection downstream tasks, confirming our pipeline's effectiveness. We publish the resulting checkpoints that researchers can directly fine-tune to build high-performance models for sound event detection tasks.

Florian Schmid, Tobias Morocutti, Francesco Foscarin, Jan Schlüter, Paul Primus, Gerhard Widmer • 2024
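
The abstract names ensemble knowledge distillation on frame-level annotations as one ingredient of the pipeline. The sketch below illustrates one common way such an objective can be formed: average the per-frame, per-class probabilities of several teachers and mix a loss against those soft targets with a loss against the hard frame-level labels. It is not taken from the paper. The function names, the `kd_weight` parameter and its default, and the choice of binary cross-entropy are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce(p, target, eps=1e-7):
    """Binary cross-entropy for one probability and one (soft or hard) target."""
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def ensemble_soft_targets(teacher_logits):
    """Average teacher probabilities; input is indexed [teacher][frame][class]."""
    n_teachers = len(teacher_logits)
    n_frames = len(teacher_logits[0])
    n_classes = len(teacher_logits[0][0])
    return [
        [
            sum(sigmoid(teacher_logits[t][f][c]) for t in range(n_teachers)) / n_teachers
            for c in range(n_classes)
        ]
        for f in range(n_frames)
    ]


def distillation_loss(student_logits, hard_labels, soft_targets, kd_weight=0.9):
    """Mean over frames and classes of a weighted hard-label and soft-target BCE.

    kd_weight is a hypothetical mixing factor, not a value from the paper.
    """
    hard_total, soft_total, count = 0.0, 0.0, 0
    for f, frame in enumerate(student_logits):
        for c, logit in enumerate(frame):
            p = sigmoid(logit)
            hard_total += bce(p, hard_labels[f][c])
            soft_total += bce(p, soft_targets[f][c])
            count += 1
    return ((1.0 - kd_weight) * hard_total + kd_weight * soft_total) / count


# Toy example: 2 teachers, 2 frames, 2 classes.
teachers = [
    [[2.0, -2.0], [-1.0, 1.0]],
    [[3.0, -3.0], [-2.0, 2.0]],
]
labels = [[1.0, 0.0], [0.0, 1.0]]
targets = ensemble_soft_targets(teachers)
student = [[2.5, -2.5], [-1.5, 1.5]]
loss = distillation_loss(student, labels, targets)
```

In a real training setup the same computation would normally be written with tensor operations in a deep learning framework; plain Python is used here only to keep the sketch self-contained.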

Related benchmarks

Task	Dataset	Result	Rank
Sound Event Detection	AudioSet Strongly-labeled (test)	--	18
Sound Event Detection	AudioSet Strong (407 classes)	PSDS1A 0.47	12
