AVES: Animal Vocalization Encoder based on Self-Supervision

About

The lack of annotated training data in bioacoustics hinders the use of large-scale neural network models trained in a supervised way. To leverage large amounts of unannotated audio data, we propose AVES (Animal Vocalization Encoder based on Self-Supervision), a self-supervised, transformer-based audio representation model for encoding animal vocalizations. We pretrain AVES on a diverse set of unannotated audio datasets and fine-tune it for downstream bioacoustics tasks. Comprehensive experiments with a suite of classification and detection tasks show that AVES outperforms all the strong baselines and even the supervised "topline" models trained on annotated audio classification datasets. The results also suggest that curating a small training subset related to downstream tasks is an efficient way to train high-quality audio representation models. We open-source our models at https://github.com/earthspecies/aves.

Masato Hagiwara • 2022

Related benchmarks

Task	Dataset	Metric	Result	
Acoustic Classification	DeepShip	Accuracy	67.9	25
Bioacoustic Analysis	Beans	wtkn	87.9	20
Bioacoustic Detection	BEANS Detection	Probe mAP	34	20
Bioacoustic Identification	Individual ID	Probe Accuracy	40.2	20
Bioacoustic Classification	Beans	Probe Accuracy	70.5	20
Bioacoustic Analysis	Vocal Repertoire	ROC AUC	72.6	20
Passive Sonar Classification	ShipsEar	Accuracy	65.1	19
Bioacoustic Detection	BirdSet	mAP (Probe)	9.2	19
Bioacoustic Monitoring	BEANS Acoustic Beehive Monitoring	ROC-AUC (BSTS)	90.48	17
Soundscape Classification	BEsound	Anthropic Score	54.6	13

Showing 10 of 18 rows
