ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event Classification

About

Transformers, which were originally developed for natural language processing, have recently generated significant interest in the computer vision and audio communities due to their flexibility in learning long-range relationships. Constrained by the data hungry nature of transformers and the limited amount of labelled data, most transformer-based models for audio tasks are finetuned from ImageNet pretrained models, despite the huge gap between the domain of natural images and audio. This has motivated the research in self-supervised pretraining of audio transformers, which reduces the dependency on large amounts of labeled data and focuses on extracting concise representations of audio spectrograms. In this paper, we propose \textbf{L}ocal-\textbf{G}lobal \textbf{A}udio \textbf{S}pectrogram v\textbf{I}sion \textbf{T}ransformer, namely ASiT, a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation. We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification. We further conduct comprehensive ablation studies, including evaluations of different pretraining strategies. The proposed ASiT framework significantly boosts the performance on all tasks and sets a new state-of-the-art performance in five audio and speech classification tasks, outperforming recent methods, including the approaches that use additional datasets for pretraining.

Sara Atito, Muhammad Awais, Wenwu Wang, Mark D Plumbley, Josef Kittler• 2022

Related benchmarks

Task	Dataset	Result
Audio Classification	ESC-50	Accuracy95.3	441
Audio Classification	AudioSet 20K	mAP38.6	147
Audio Classification	AudioSet 2M	mAP47.5	98
Audio Classification	Speech Commands V2 (test)	Accuracy98.9	46
Audio Event Tagging	AudioSet AS-2M (full)	mAP48	45
Audio Recognition	Speech Commands V2	Accuracy98.9	43
Keyword Spotting	Speech Commands KS1 v1	Accuracy98.2	24
Audio Event Tagging	AudioSet (AS-20K)	mAP38.6	24
Keyword Spotting	Speech Commands KS2 v2	Accuracy98.9	23
Classification	AudioSet AS-2M	mAP (%)48	21

Showing 10 of 15 rows

Other info

Follow for update

@wizwand_team Discord