
Embedding-Space Diffusion for Zero-Shot Environmental Sound Classification

About

Zero-shot learning enables models to generalise to unseen classes by leveraging semantic information, bridging the gap between training and testing sets with non-overlapping classes. While much research has focused on zero-shot learning in computer vision, the application of these methods to environmental audio remains underexplored, and existing studies report poor performance. Generative methods, which have demonstrated success in computer vision, are notably absent from zero-shot environmental sound classification studies. To address this gap, this work investigates generative methods for zero-shot learning in environmental audio. Two successful generative models from computer vision are adapted: a cross-aligned and distribution-aligned variational autoencoder (CADA-VAE) and a leveraging invariant side generative adversarial network (LisGAN). Additionally, we introduce a novel diffusion model conditioned on class auxiliary data. Synthetic embeddings generated by the diffusion model are combined with seen class embeddings to train a classifier. Experiments are conducted on five environmental audio datasets (ESC-50, ARCA23K-FSD, FSC22, UrbanSound8k and TAU Urban Acoustics 2019) and one music classification dataset (GTZAN). Results show that the diffusion model outperforms all baseline methods on average across the six audio datasets. This work establishes the diffusion model as a promising approach for zero-shot learning and introduces the first benchmark of generative methods for zero-shot environmental sound classification, providing a foundation for future research.
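As a rough illustration of the data flow described in the abstract (synthetic unseen-class embeddings combined with seen-class embeddings to train a classifier), the following Python sketch may help. It is not the authors' implementation. The generator here is a deliberately simple stand-in (a linear projection of the class auxiliary vector plus Gaussian noise), not a diffusion model, and the classifier is a nearest-centroid rule chosen for brevity. All names, dimensions, and the toy data (`placeholder_generator`, `build_training_set`, `EMB_DIM`, `AUX_DIM`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 16  # audio embedding size (hypothetical)
AUX_DIM = 8   # class auxiliary vector size (hypothetical)


def placeholder_generator(aux_vec, n_samples, projection, rng, noise_scale=0.1):
    """Stand-in for a trained conditional generator (NOT a diffusion model).

    Projects a class auxiliary vector into embedding space and adds
    Gaussian noise to produce n_samples synthetic embeddings.
    """
    centre = aux_vec @ projection
    noise = rng.standard_normal((n_samples, centre.shape[0]))
    return centre + noise_scale * noise


def build_training_set(seen_x, seen_y, unseen_aux, n_per_class, projection, rng):
    """Combine real seen-class embeddings with synthetic unseen-class ones."""
    xs, ys = [seen_x], [seen_y]
    for label, aux_vec in unseen_aux.items():
        xs.append(placeholder_generator(aux_vec, n_per_class, projection, rng))
        ys.append(np.full(n_per_class, label))
    return np.concatenate(xs), np.concatenate(ys)


def fit_centroids(x, y):
    """Minimal classifier: one mean embedding per class."""
    labels = np.unique(y)
    centroids = np.stack([x[y == c].mean(axis=0) for c in labels])
    return labels, centroids


def predict(x, labels, centroids):
    """Assign each embedding to the class with the nearest centroid."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return labels[dists.argmin(axis=1)]


# Toy setup: four classes, two seen and two unseen (non-overlapping).
projection = rng.standard_normal((AUX_DIM, EMB_DIM))
aux = {c: rng.standard_normal(AUX_DIM) for c in range(4)}
seen_classes, unseen_classes = [0, 1], [2, 3]

# Toy "real" embeddings for the seen classes (also produced by the stand-in).
seen_x = np.concatenate(
    [placeholder_generator(aux[c], 20, projection, rng) for c in seen_classes]
)
seen_y = np.repeat(seen_classes, 20)

# Train on seen embeddings plus synthetic embeddings for the unseen classes.
train_x, train_y = build_training_set(
    seen_x, seen_y, {c: aux[c] for c in unseen_classes}, 20, projection, rng
)
labels, centroids = fit_centroids(train_x, train_y)

# Evaluate on toy test embeddings drawn from the unseen classes only.
test_x = np.concatenate(
    [placeholder_generator(aux[c], 10, projection, rng) for c in unseen_classes]
)
test_y = np.repeat(unseen_classes, 10)
accuracy = (predict(test_x, labels, centroids) == test_y).mean()
```

Because the toy test data come from the same stand-in generator, the resulting accuracy says nothing about real audio. The sketch only shows where a conditional generative model would slot into a zero-shot pipeline.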

Ysobel Sims, Alexandre Mendes, Stephan Chalup • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Music Genre Classification | GTZAN | Accuracy 58.03 | 62 |
| Environmental Sound Classification | ESC-50 (fold 2) | Accuracy 31.1 | 4 |
| Environmental Sound Classification | ARCA23K-FSD (fold 0) | Accuracy 28.79 | 4 |
| Environmental Sound Classification | ARCA23K-FSD (fold 1) | Accuracy 34.18 | 4 |
| Environmental Sound Classification | ARCA23K-FSD (fold 2) | Accuracy 48.19 | 4 |
| Environmental Sound Classification | ARCA23K-FSD (fold 4) | Accuracy 28.33 | 4 |
| Environmental Sound Classification | ARCA23K-FSD (fold 5) | Accuracy 43.35 | 4 |
| Environmental Sound Classification | FSC22 (val) | Accuracy 32.69 | 4 |
| Environmental Sound Classification | TAU 2019 | Accuracy 48.57 | 4 |
| Environmental Sound Classification | ESC-50 (fold 1) | Accuracy 46.65 | 4 |
Showing 10 of 16 rows
