Efficient Training of Audio Transformers with Patchout

About

The great success of transformer-based models in natural language processing (NLP) has led to various attempts at adapting these architectures to other domains such as vision and audio. Recent work has shown that transformers can outperform Convolutional Neural Networks (CNNs) on vision and audio tasks. However, one of the main shortcomings of transformer models, compared to the well-established CNNs, is the computational complexity. In transformers, the compute and memory complexity is known to grow quadratically with the input length. Therefore, there has been extensive work on optimizing transformers, but often at the cost of degrading predictive performance. In this work, we propose a novel method to optimize and regularize transformers on audio spectrograms. Our proposed models achieve a new state-of-the-art performance on Audioset and can be trained on a single consumer-grade GPU. Furthermore, we propose a transformer model that outperforms CNNs in terms of both performance and training speed. Source code: https://github.com/kkoutini/PaSST
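The abstract does not spell out how the method works. As a rough illustration only, the sketch below assumes the simplest (unstructured) reading of Patchout: a fixed number of spectrogram patches is dropped at random from the input sequence during training. Because self-attention cost grows quadratically with sequence length, keeping half of the patches cuts the attention-matrix cost to about a quarter. The function names and the patch counts are hypothetical and are not taken from the PaSST code base.

```python
import random


def patchout_indices(num_patches, num_dropped, seed=None):
    """Return the sorted indices of the patches kept after random dropping.

    Assumption: unstructured patch dropping, i.e. each training step
    discards `num_dropped` patches chosen uniformly at random.
    """
    if not 0 <= num_dropped < num_patches:
        raise ValueError("num_dropped must be in [0, num_patches)")
    rng = random.Random(seed)
    kept = rng.sample(range(num_patches), num_patches - num_dropped)
    return sorted(kept)


def attention_cost_ratio(num_patches, num_kept):
    """Relative size of the pairwise attention matrix, which scales as n^2."""
    return (num_kept / num_patches) ** 2


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
num_patches = 1000
kept = patchout_indices(num_patches, num_dropped=500, seed=0)
ratio = attention_cost_ratio(num_patches, len(kept))
print(len(kept), ratio)  # 500 0.25
```

In this reading the shortened sequence is used only during training, so the dropping also acts as a regularizer; at inference time the full sequence would be passed to the model.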

Khaled Koutini, Jan Schlüter, Hamid Eghbal-zadeh, Gerhard Widmer • 2021

Related benchmarks

Task	Dataset	Result
Audio Classification	ESC-50	Accuracy 97	441
Audio Classification	Urbansound8K	Accuracy 89.1	126
Musical Instrument Classification	NSynth	Accuracy 72.9	117
Audio Classification	AudioSet 2M	mAP 47.1	98
Environmental Sound Classification	FSD50K	mAP 65.6	91
Audio Classification	ESC-50 (test)	Accuracy 96.8	87
Audio Classification	ESC50	Top-1 Acc 95.5	64
Audio Classification	GTZAN	Accuracy 87.4	59
Classification	AudioSet (test)	mAP 49.6	57
Audio Representation Evaluation	HEAR (Holistic Evaluation of Audio Representations)	--	47

Showing 10 of 22 rows
