Slow-Fast Auditory Streams For Audio Recognition

About

We propose a two-stream convolutional network for audio recognition that operates on time-frequency spectrogram inputs. Following similar success in visual recognition, we learn Slow-Fast auditory streams with separable convolutions and multi-level lateral connections. The Slow pathway has high channel capacity, while the Fast pathway operates at a fine-grained temporal resolution. We demonstrate the importance of our two-stream proposal on two diverse datasets, VGG-Sound and EPIC-KITCHENS-100, and achieve state-of-the-art results on both.
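The relationship between the two pathways can be illustrated with a little shape bookkeeping. The sketch below is not the authors' implementation: the temporal stride `alpha`, the channel ratio `beta`, the base width `slow_channels`, and the choice to fuse by channel concatenation after time-strided downsampling are all assumptions made for illustration, and every function name is hypothetical. It only shows how a Slow pathway can trade temporal resolution for channels while a Fast pathway does the opposite.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class FeatureShape:
    channels: int
    frames: int     # temporal axis of the spectrogram
    freq_bins: int  # frequency axis of the spectrogram


def strided_len(length: int, stride: int) -> int:
    """Number of items kept when taking every `stride`-th item."""
    return (length + stride - 1) // stride


def split_input(
    spectrogram: Sequence[Sequence[float]], alpha: int = 4
) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (slow_input, fast_input) from a [frames][freq_bins] spectrogram.

    Assumption: Fast sees every frame, Slow sees every alpha-th frame.
    """
    fast_input = [list(frame) for frame in spectrogram]
    slow_input = fast_input[::alpha]
    return slow_input, fast_input


def pathway_shapes(
    frames: int,
    freq_bins: int,
    slow_channels: int = 64,  # hypothetical base width
    alpha: int = 4,           # hypothetical temporal stride of Slow
    beta: int = 8,            # hypothetical channel ratio Slow/Fast
) -> Tuple[FeatureShape, FeatureShape]:
    """Slow: many channels, few frames. Fast: few channels, all frames."""
    slow = FeatureShape(slow_channels, strided_len(frames, alpha), freq_bins)
    fast = FeatureShape(slow_channels // beta, frames, freq_bins)
    return slow, fast


def lateral_fuse(slow: FeatureShape, fast: FeatureShape, alpha: int = 4) -> FeatureShape:
    """Shape of Slow after one lateral connection from Fast.

    Assumption: Fast features are downsampled in time by `alpha` and then
    concatenated onto Slow along the channel axis.
    """
    if strided_len(fast.frames, alpha) != slow.frames:
        raise ValueError("temporal sizes do not match after striding")
    if fast.freq_bins != slow.freq_bins:
        raise ValueError("frequency sizes do not match")
    return FeatureShape(slow.channels + fast.channels, slow.frames, slow.freq_bins)


# Usage: a dummy 400-frame, 128-bin spectrogram.
spec = [[0.0] * 128 for _ in range(400)]
slow_in, fast_in = split_input(spec)
slow, fast = pathway_shapes(len(spec), len(spec[0]))
fused = lateral_fuse(slow, fast)
print(len(slow_in), len(fast_in), fused)
```

With these assumed ratios, the Fast pathway stays lightweight because it carries only a fraction of the Slow pathway's channels, even though it processes every frame.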

Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen • 2021

Related benchmarks

Task	Dataset	Result
Action Recognition	EPIC-KITCHENS 100 (test)	Top-1 Verb Acc: 46.5	101
Audio Classification	VGG-Sound	Top-1 Accuracy: 0.524	83
Audio-Visual Classification	VGGSound	--	37
Action Recognition	Epic Kitchens 100	Top-1 Acc: 15.4	26
Audio Recognition	VGG-Sound (test)	Top-1 Acc: 52.46	22
Action Classification	Epic Kitchens 100	--	22
Action Recognition	EPIC-SOUNDS	Top-1 Accuracy: 53.8	17
Noun recognition	Epic Kitchens 100	Top-1 Acc: 22.8	13
Verb recognition	Epic Kitchens 100	Top-1 Acc: 46.5	13
Audio Recognition	Epic-Kitchens-100 (val)	Overall Top-1 Verb Acc: 46.05	7

Showing 10 of 13 rows
