AudioProtoPNet: An interpretable deep learning model for bird sound classification

About

Deep learning models have significantly advanced acoustic bird monitoring by being able to recognize numerous bird species based on their vocalizations. However, traditional deep learning models are black boxes that provide no insight into their underlying computations, limiting their usefulness to ornithologists and machine learning engineers. Explainable models could facilitate debugging, knowledge discovery, trust, and interdisciplinary collaboration. This study introduces AudioProtoPNet, an adaptation of the Prototypical Part Network (ProtoPNet) for multi-label bird sound classification. It is an inherently interpretable model that uses a ConvNeXt backbone to extract embeddings, with the classification layer replaced by a prototype learning classifier trained on these embeddings. The classifier learns prototypical patterns of each bird species' vocalizations from spectrograms of training instances. During inference, audio recordings are classified by comparing them to the learned prototypes in the embedding space, providing explanations for the model's decisions and insights into the most informative embeddings of each bird species. The model was trained on the BirdSet training dataset, which consists of 9,734 bird species and over 6,800 hours of recordings. Its performance was evaluated on the seven test datasets of BirdSet, covering different geographical regions. AudioProtoPNet outperformed the state-of-the-art model Perch, achieving an average AUROC of 0.90 and a cmAP of 0.42, with relative improvements of 7.1% and 16.7% over Perch, respectively. These results demonstrate that even for the challenging task of multi-label bird sound classification, it is possible to develop powerful yet inherently interpretable deep learning models that provide valuable insights for ornithologists and machine learning engineers.
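The inference step described in the abstract, comparing embeddings to learned prototypes, can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the pooling, or the final layer, so the cosine similarity, max-pooling over patches, and sigmoid output used here are assumptions. The function names and toy values are hypothetical and are not taken from the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def prototype_scores(patches, prototypes):
    """For each prototype, keep its best match over all embedding patches."""
    return [max(cosine(p, proto) for p in patches) for proto in prototypes]


def classify(patches, prototypes, weights, bias):
    """Multi-label output: one independent sigmoid per species (assumption)."""
    scores = prototype_scores(patches, prototypes)
    probs = []
    for w_row, b in zip(weights, bias):
        logit = sum(w * s for w, s in zip(w_row, scores)) + b
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return scores, probs


# Toy example: three embedding patches from one spectrogram, two prototypes.
# Prototype 0 is tied to hypothetical species A, prototype 1 to species B.
patches = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
prototypes = [[1.0, 0.0], [-1.0, 0.0]]
weights = [[4.0, 0.0], [0.0, 4.0]]  # rows: species, columns: prototypes
bias = [-2.0, -2.0]

scores, probs = classify(patches, prototypes, weights, bias)
print(scores, probs)
```

In this sketch the per-prototype scores double as the explanation: a high score for a prototype points to the patch of the recording that resembles the learned pattern, and the weights show how much that match contributes to each species prediction.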

René Heinrich, Lukas Rauch, Bernhard Sick, Christoph Scholz • 2024

Related benchmarks

Task	Dataset	Result
Audio Deepfake Detection	WaveFake MelGAN (test)	EER 0.00e+0	63
Multi-label bioacoustic classification	BirdSet POW	cmAP 52	57
Multi-label bioacoustic classification	BirdSet PER	cmAP 30	57
Multi-label bioacoustic classification	BirdSet HSN	cmAP 55	57
Audio Deepfake Detection	WaveFake Average (test)	aEER 0.6	21
Audio Deepfake Detection	WaveFake MelGAN (L) (test)	EER 0.00e+0	21
Audio Deepfake Detection	WaveFake HiFi-GAN (test)	EER 0.00e+0	21
Audio Deepfake Detection	WaveFake PWG (test)	EER 0.00e+0	21
Audio Deepfake Detection	WaveFake WaveGlow (test)	EER 0.00e+0	21
Multi-label bioacoustic classification	BirdSet UHH	cmAP 32	3

