A Benchmark of Generative Methods for Zero-Shot Environmental Sound Classification

About

Zero-shot learning enables models to generalise to unseen classes using semantic information, bridging the gap between training classes and previously unseen test classes. While widely studied in computer vision, its application to environmental audio remains underexplored, and generative approaches have received little attention. This work presents the first benchmark of generative methods for zero-shot environmental sound classification. Four approaches spanning variational, adversarial, diffusion-based, and denoising paradigms are evaluated. The benchmark includes CADA-VAE and LisGAN, adapted from computer vision, together with two embedding-generation methods introduced in this work: one based on a denoising diffusion probabilistic model (DDPM) and the other on a conditional generative denoising network (CGDN). Experiments on five environmental audio datasets (ESC-50, ARCA23K-FSD, FSC22, UrbanSound8K, and TAU Urban Acoustic Scenes 2019) and one music dataset (GTZAN) show that generative methods are competitive with established compatibility-based approaches. Among the evaluated generative methods, CGDN achieves the highest average accuracy and is the only one to significantly outperform both the DDPM- and GAN-based methods, while remaining statistically indistinguishable from the strong ALE baseline. These findings suggest that optimisation stability is an important factor in generative zero-shot learning for environmental audio.
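The generative zero-shot idea described above can be illustrated with a toy sketch: learn a mapping from class semantic vectors to audio embeddings on seen classes, synthesise embeddings for unseen classes, and classify real unseen samples against them. This is not any of the benchmarked methods (CADA-VAE, LisGAN, DDPM, CGDN). The "generator" here is a plain least-squares map plus Gaussian noise, and all data, dimensions, and names (`sample_real`, `generate`, `true_map`) are hypothetical and simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all values are assumptions): 10 classes, 4-d semantic vectors,
# 8-d audio embeddings. Classes 0-7 are seen, classes 8-9 are unseen.
n_classes, sem_dim, emb_dim = 10, 4, 8
semantic = rng.normal(size=(n_classes, sem_dim))
true_map = rng.normal(size=(sem_dim, emb_dim))  # used only to simulate data
seen, unseen = list(range(8)), [8, 9]


def sample_real(cls, n):
    """Simulate n 'real' audio embeddings for one class."""
    return semantic[cls] @ true_map + 0.1 * rng.normal(size=(n, emb_dim))


# 1. Fit a stand-in generator on seen classes only:
#    a least-squares map from semantic vector to audio embedding.
X = np.repeat(semantic[seen], 50, axis=0)
Y = np.vstack([sample_real(c, 50) for c in seen])
W, *_ = np.linalg.lstsq(X, Y, rcond=None)


def generate(cls, n):
    """Synthesise n embeddings for a class from its semantic vector."""
    return semantic[cls] @ W + 0.1 * rng.normal(size=(n, emb_dim))


# 2. Synthesise embeddings for unseen classes and summarise each class
#    by the centroid of its synthetic samples.
centroids = np.stack([generate(c, 100).mean(axis=0) for c in unseen])

# 3. Classify real unseen-class samples by nearest synthetic centroid.
test_x = np.vstack([sample_real(c, 30) for c in unseen])
test_y = np.repeat(unseen, 30)
dists = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
pred = np.array(unseen)[dists.argmin(axis=1)]
accuracy = (pred == test_y).mean()
```

Running `print(accuracy)` afterwards reports the zero-shot accuracy on the simulated unseen classes. In a real system the least-squares map would be replaced by a trained generative model, and the nearest-centroid step by a classifier trained on the generated embeddings.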

Ysobel Sims, Alexandre Mendes, Stephan Chalup • 2024

Related benchmarks

Task	Dataset	Result
Music Genre Classification	GTZAN	Accuracy 58.03	68
Environmental Sound Classification	UrbanSound8K	Accuracy 49.61	16
Environmental Sound Classification	ESC-50 (fold 2)	Accuracy 31.1	4
Environmental Sound Classification	ARCA23K-FSD (fold 0)	Accuracy 28.79	4
Environmental Sound Classification	ARCA23K-FSD (fold 1)	Accuracy 34.18	4
Environmental Sound Classification	ARCA23K-FSD (fold 2)	Accuracy 48.19	4
Environmental Sound Classification	ARCA23K-FSD (fold 4)	Accuracy 28.33	4
Environmental Sound Classification	ARCA23K-FSD (fold 5)	Accuracy 43.35	4
Environmental Sound Classification	FSC22 (val)	Accuracy 32.69	4
Environmental Sound Classification	TAU 2019	Accuracy 48.57	4

Showing 10 of 16 rows
