Effective Pre-Training of Audio Transformers for Sound Event Detection

About

We propose a pre-training pipeline for audio spectrogram transformers for frame-level sound event detection tasks. On top of common pre-training steps, we add a meticulously designed training routine on AudioSet frame-level annotations. This includes a balanced sampler, aggressive data augmentation, and ensemble knowledge distillation. For five transformers, we obtain a substantial performance improvement over previously available checkpoints both on AudioSet frame-level predictions and on frame-level sound event detection downstream tasks, confirming our pipeline's effectiveness. We publish the resulting checkpoints that researchers can directly fine-tune to build high-performance models for sound event detection tasks.
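Two of the ingredients named above, the balanced sampler and ensemble knowledge distillation, can be illustrated with a minimal sketch. This is not the authors' implementation. The weighting scheme (summed inverse class frequency per clip), the mixing weight `lam`, the use of binary cross-entropy for both hard and soft targets, and all function names are assumptions chosen for illustration only.

```python
import math


def balanced_sample_weights(clip_labels):
    """Return one sampling weight per clip (assumed scheme).

    Each clip's weight is the sum of the inverse frequencies of its labels,
    so clips containing rare classes are drawn more often.
    """
    counts = {}
    for labels in clip_labels:
        for c in labels:
            counts[c] = counts.get(c, 0) + 1
    return [sum(1.0 / counts[c] for c in labels) for labels in clip_labels]


def bce(p, y, eps=1e-7):
    """Binary cross-entropy between a probability p and a target y in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def ensemble_targets(teacher_probs):
    """Average frame-level probabilities over teachers.

    teacher_probs is indexed as [teacher][frame][class].
    """
    n = len(teacher_probs)
    frames = len(teacher_probs[0])
    classes = len(teacher_probs[0][0])
    return [
        [sum(t[f][c] for t in teacher_probs) / n for c in range(classes)]
        for f in range(frames)
    ]


def distillation_loss(student, hard, soft, lam=0.5):
    """Mix the loss on annotated labels with the loss on ensemble targets.

    student, hard, and soft are all indexed as [frame][class].
    lam is a hypothetical mixing weight, not a value from the paper.
    """
    total, count = 0.0, 0
    for s_f, h_f, t_f in zip(student, hard, soft):
        for s, h, t in zip(s_f, h_f, t_f):
            total += lam * bce(s, h) + (1.0 - lam) * bce(s, t)
            count += 1
    return total / count
```

In a real training loop the weights would typically feed a weighted random sampler, and the losses would be computed on batched tensors rather than nested lists.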

Florian Schmid, Tobias Morocutti, Francesco Foscarin, Jan Schlüter, Paul Primus, Gerhard Widmer • 2024

Related benchmarks

Task                   Dataset                           Result       Rank
Sound Event Detection  AudioSet Strongly-labeled (test)  --           18
Sound Event Detection  AudioSet Strong (407 classes)     PSDS1A 0.47  12
