Effective Pre-Training of Audio Transformers for Sound Event Detection
About
We propose a pre-training pipeline for audio spectrogram transformers for frame-level sound event detection tasks. On top of common pre-training steps, we add a meticulously designed training routine on AudioSet frame-level annotations. This includes a balanced sampler, aggressive data augmentation, and ensemble knowledge distillation. For five transformers, we obtain a substantial performance improvement over previously available checkpoints both on AudioSet frame-level predictions and on frame-level sound event detection downstream tasks, confirming our pipeline's effectiveness. We publish the resulting checkpoints that researchers can directly fine-tune to build high-performance models for sound event detection tasks.
Florian Schmid, Tobias Morocutti, Francesco Foscarin, Jan Schlüter, Paul Primus, Gerhard Widmer • 2024
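
The abstract names ensemble knowledge distillation on frame-level annotations as one ingredient of the pipeline but does not spell out the loss. The following is a minimal sketch of one common formulation, not the authors' implementation. It assumes the ensemble's soft targets are the mean of the teachers' per-frame sigmoid probabilities and that the student loss is a weighted sum of binary cross-entropy against the hard labels and against those soft targets. The function names, array shapes, and the `kd_weight` parameter are hypothetical and chosen for illustration only.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over all frames and classes."""
    probs = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    losses = -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
    return float(np.mean(losses))


def ensemble_distillation_loss(student_logits, teacher_logits, labels, kd_weight=0.5):
    """Combine a hard-label loss with a loss against ensemble soft targets.

    student_logits: (frames, classes)
    teacher_logits: (teachers, frames, classes)
    labels:         (frames, classes), multi-hot frame-level annotations
    kd_weight:      hypothetical mixing weight in [0, 1] (an assumption)
    """
    # Assumption: the ensemble prediction is the mean of teacher probabilities.
    soft_targets = sigmoid(teacher_logits).mean(axis=0)
    student_probs = sigmoid(student_logits)
    label_loss = bce(student_probs, labels)
    kd_loss = bce(student_probs, soft_targets)
    return (1.0 - kd_weight) * label_loss + kd_weight * kd_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames, classes, teachers = 250, 407, 3  # 407 classes as in AudioSet Strong
    student = rng.normal(size=(frames, classes))
    ensemble = rng.normal(size=(teachers, frames, classes))
    labels = (rng.random((frames, classes)) < 0.02).astype(float)
    print(ensemble_distillation_loss(student, ensemble, labels))
```

Setting `kd_weight=0.0` in this sketch reduces it to ordinary supervised training on the frame-level labels, which makes it easy to compare against a run without distillation.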
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Sound Event Detection | AudioSet Strong (407 classes) | PSDS1A 0.47 | 12 |