
Slow-Fast Auditory Streams For Audio Recognition

About

We propose a two-stream convolutional network for audio recognition that operates on time-frequency spectrogram inputs. Following similar success in visual recognition, we learn Slow-Fast auditory streams with separable convolutions and multi-level lateral connections. The Slow pathway has high channel capacity, while the Fast pathway operates at a fine-grained temporal resolution. We demonstrate the importance of our two-stream proposal on two diverse datasets, VGG-Sound and EPIC-KITCHENS-100, and achieve state-of-the-art results on both.
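
The two-stream layout described above can be illustrated with simple shape bookkeeping. This is a minimal sketch, not the authors' implementation: the names two_stream_shapes, lateral_fuse and conv_weights are hypothetical, and the values alpha=4, beta_inv=8, the 400x128 spectrogram and the lateral channel count are assumptions chosen for illustration. It also assumes that "separable" means factorising a 2D kernel into a frequency-only convolution followed by a time-only one, and that a lateral connection is a time-strided projection of Fast features concatenated onto the Slow features.

```python
from dataclasses import dataclass


@dataclass
class StreamShape:
    """Feature-map shape of one pathway: channels x time frames x frequency bins."""
    channels: int
    time: int
    freq: int


def two_stream_shapes(time, freq, slow_channels, alpha, beta_inv):
    """Return (slow, fast) shapes for one spectrogram.

    alpha    : assumed temporal stride applied to the Slow pathway input.
    beta_inv : assumed channel ratio; Fast gets slow_channels // beta_inv channels.
    """
    # Slow: many channels, every alpha-th time frame.
    slow = StreamShape(slow_channels, len(range(0, time, alpha)), freq)
    # Fast: few channels, full temporal resolution.
    fast = StreamShape(slow_channels // beta_inv, time, freq)
    return slow, fast


def lateral_fuse(slow, fast, alpha, lateral_channels):
    """Sketch of a Fast-to-Slow lateral connection: stride the Fast time axis
    by alpha so it matches Slow, then concatenate along channels."""
    projected_time = len(range(0, fast.time, alpha))
    if projected_time != slow.time:
        raise ValueError("time axes do not align after striding")
    return StreamShape(slow.channels + lateral_channels, slow.time, slow.freq)


def conv_weights(c_in, c_out, k_freq, k_time, separable):
    """Weight count (biases ignored) of a full 2D convolution versus a
    frequency-then-time factorisation with c_out intermediate channels."""
    if separable:
        return c_in * c_out * k_freq + c_out * c_out * k_time
    return c_in * c_out * k_freq * k_time


# Illustrative values only; none of them are taken from the paper.
slow, fast = two_stream_shapes(time=400, freq=128, slow_channels=64, alpha=4, beta_inv=8)
fused = lateral_fuse(slow, fast, alpha=4, lateral_channels=2 * fast.channels)
print(slow, fast, fused)
print(conv_weights(64, 64, 3, 3, separable=False),
      conv_weights(64, 64, 3, 3, separable=True))
```

Under these assumed values, the Slow stream sees a quarter of the time frames with eight times the channels of the Fast stream, and the factorised 3x3 convolution needs fewer weights than the full one.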

Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen • 2021

Related benchmarks

Task                         Dataset                   Result                        Rank
Action Recognition           EPIC-KITCHENS 100 (test)  Top-1 Verb Acc 46.5           101
Audio Classification         VGG-Sound                 Top-1 Accuracy 0.524          50
Action Recognition           Epic Kitchens 100         Top-1 Acc 15.4                26
Audio-Visual Classification  VGGSound                  --                            24
Audio Recognition            VGG-Sound (test)          Top-1 Acc 52.46               22
Action Classification        Epic Kitchens 100         --                            22
Action Recognition           EPIC-SOUNDS               Top-1 Accuracy 53.8           17
Noun recognition             Epic Kitchens 100         Top-1 Acc 22.8                13
Verb recognition             Epic Kitchens 100         Top-1 Acc 46.5                13
Audio Recognition            Epic-Kitchens-100 (val)   Overall Top-1 Verb Acc 46.05  7
Showing 10 of 13 rows
