HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

About

Audio classification is the important task of mapping audio samples to their corresponding labels. Recently, transformer models with self-attention mechanisms have been adopted in this field. However, existing audio transformers require large amounts of GPU memory and long training times, and they rely on pretrained vision models to achieve high performance, which limits their scalability in audio tasks. To address these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure that reduces the model size and training time. It is further combined with a token-semantic module that maps the final outputs into class featuremaps, enabling the model to perform audio event detection (i.e., localization in time). We evaluate HTS-AT on three audio classification datasets, where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50 and equals the SOTA on Speech Commands V2. It also achieves better event localization performance than previous CNN-based models. Moreover, HTS-AT requires only 35% of the model parameters and 15% of the training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.

Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov • 2022
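
The abstract says that class featuremaps let the model both classify a clip and localize events in time. The sketch below illustrates that idea only; it is not the authors' implementation. It assumes a featuremap given as a list of frames, each holding one score in [0, 1] per class. The function names `clip_scores` and `localize`, the 0.5 threshold, and the one-second frame length are hypothetical choices for illustration.

```python
from typing import Dict, List, Tuple

Featuremap = List[List[float]]  # shape: frames x classes (assumed layout)


def clip_scores(featuremap: Featuremap) -> List[float]:
    """Average each class's frame scores over time: one score per class."""
    n_frames = len(featuremap)
    n_classes = len(featuremap[0])
    return [
        sum(frame[c] for frame in featuremap) / n_frames
        for c in range(n_classes)
    ]


def localize(
    featuremap: Featuremap, frame_sec: float, threshold: float = 0.5
) -> Dict[int, List[Tuple[float, float]]]:
    """Return, per class, the (start_sec, end_sec) spans whose score
    stays at or above the threshold (hypothetical decision rule)."""
    n_classes = len(featuremap[0])
    events: Dict[int, List[Tuple[float, float]]] = {}
    for c in range(n_classes):
        spans: List[Tuple[float, float]] = []
        start = None  # index of the first frame of the current active run
        for t, frame in enumerate(featuremap):
            active = frame[c] >= threshold
            if active and start is None:
                start = t
            elif not active and start is not None:
                spans.append((start * frame_sec, t * frame_sec))
                start = None
        if start is not None:  # run still active at the end of the clip
            spans.append((start * frame_sec, len(featuremap) * frame_sec))
        events[c] = spans
    return events


# Toy featuremap: 6 frames x 2 classes (made-up numbers, not model output).
toy = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.7, 0.1],
    [0.9, 0.2],
    [0.8, 0.1],
    [0.1, 0.1],
]
scores = clip_scores(toy)              # clip-level score per class
events = localize(toy, frame_sec=1.0)  # time spans per class
```

Averaging over time gives the clip-level prediction, while the per-frame scores retain the timing information that a single pooled vector would discard.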

Related benchmarks

Task	Dataset	Result
Audio Classification	ESC-50	Accuracy 97	441
Audio Classification	Urbansound8K	Accuracy 85.3	126
Musical Instrument Classification	NSynth	Accuracy 73.3	117
Audio Classification	AudioSet 2M	mAP 47.1	98
Environmental Sound Classification	FSD50K	mAP 59.4	91
Audio Classification	ESC-50 (test)	Accuracy 97	87
Audio Classification	SPC V2	Accuracy 98	65
Music Genre Classification	GTZAN	Accuracy 85.1	62
Keyword Spotting	Speech Commands V2	Accuracy 98	61
Audio Classification	GTZAN	Accuracy 85.9	59

