Can Masked Autoencoders Also Listen to Birds?

About

Masked Autoencoders (MAEs) learn rich semantic representations in audio classification through an efficient self-supervised reconstruction task. However, general-purpose models fail to generalize well when applied directly to fine-grained audio domains. Specifically, bird-sound classification requires distinguishing subtle inter-species differences and managing high intra-species acoustic variability, revealing the performance limitations of general-domain Audio-MAEs. This work demonstrates that bridging this domain gap requires full-pipeline adaptation, not just domain-specific pretraining data. We systematically revisit and adapt the pretraining recipe, fine-tuning methods, and frozen feature utilization to bird sounds using BirdSet, a large-scale bioacoustic dataset comparable to AudioSet. Our resulting Bird-MAE achieves new state-of-the-art results on BirdSet's multi-label classification benchmark. Additionally, we introduce the parameter-efficient prototypical probing, enhancing the utility of frozen MAE representations and closely approaching fine-tuning performance in low-resource settings. Bird-MAE's prototypical probes outperform linear probing by up to 37 percentage points in mean average precision and narrow the gap to fine-tuning across BirdSet downstream tasks. Bird-MAE also demonstrates robust few-shot capabilities with prototypical probing in our newly established few-shot benchmark on BirdSet, highlighting the potential of tailored self-supervised learning pipelines for fine-grained audio domains.

Lukas Rauch, René Heinrich, Ilyass Moummad, Alexis Joly, Bernhard Sick, Christoph Scholz • 2025

Related benchmarks

Task                        Dataset                    Result               Rank
Bioacoustic Analysis        Vocal Repertoire           ROC AUC 81.2         20
Bioacoustic Classification  Beans                      Probe Accuracy 76.6  20
Bioacoustic Detection       BEANS Detection            Probe mAP 35.4       20
Bioacoustic Identification  Individual ID              Probe Accuracy 40.4  20
Audio Classification        AudioSet 20k (train test)  mAP 21.89            19
Bioacoustic Detection       BirdSet                    mAP (Probe) 16.8     19
Audio Classification        FSD50K (train/test)        mAP 49.65            9
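
Several of the results above, and the 37-percentage-point gap quoted in the abstract, are reported as mean average precision (mAP). As a rough illustration only, the sketch below computes a macro-averaged mAP for multi-label predictions with NumPy. It is not the evaluation code used by the authors or by any of the listed benchmarks. The function names `average_precision` and `mean_average_precision`, the toy labels and scores, the order-based tie handling, and the choice to skip classes without positives are all assumptions made for this example. The sketch returns a fraction in [0, 1], whereas the figures listed here appear to be on a 0 to 100 scale.

```python
import numpy as np


def average_precision(y_true, y_score):
    """AP for one class: mean of the precision values at each positive's rank.

    Assumption: tied scores are ranked by input order (stable sort).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)
    order = np.argsort(-y_score, kind="stable")  # highest score first
    hits = y_true[order]
    n_pos = hits.sum()
    if n_pos == 0:
        return float("nan")  # AP is undefined without any positive label
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / n_pos)


def mean_average_precision(Y_true, Y_score):
    """Macro mAP: unweighted mean of per-class AP over an (n_samples, n_classes) matrix.

    Assumption: classes with no positive label are ignored.
    """
    Y_true = np.asarray(Y_true)
    Y_score = np.asarray(Y_score)
    aps = [average_precision(Y_true[:, c], Y_score[:, c])
           for c in range(Y_true.shape[1])]
    return float(np.nanmean(aps))


# Toy data (hypothetical): 4 clips, 2 species, multi-label targets.
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.5], [0.3, 0.6]]

print(round(mean_average_precision(Y_true, Y_score), 4))  # 0.9167
```

In the toy data, the first species is ranked perfectly (AP = 1.0), while the second has one negative clip scored above a positive one (AP = (1 + 2/3) / 2), so the macro average is 11/12.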