Slow-Fast Auditory Streams For Audio Recognition
About
We propose a two-stream convolutional network for audio recognition that operates on time-frequency spectrogram inputs. Following similar success in visual recognition, we learn Slow-Fast auditory streams with separable convolutions and multi-level lateral connections. The Slow pathway has high channel capacity, while the Fast pathway operates at a fine-grained temporal resolution. We showcase the importance of our two-stream proposal on two diverse datasets, VGG-Sound and EPIC-KITCHENS-100, and achieve state-of-the-art results on both.
Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, Dima Damen • 2021
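The two-stream split described in the abstract can be illustrated with a shape-only sketch. It is not the authors' implementation. The function names, the temporal stride `alpha`, the channel ratio `beta`, the base width `slow_channels`, and the 400 x 128 spectrogram size are all illustrative assumptions, and the lateral connection is simplified to a time-strided concatenation that keeps the Fast channel count.

```python
# Shape-only sketch of a Slow-Fast two-stream split on a spectrogram.
# alpha, beta and slow_channels are illustrative assumptions, not values
# taken from the abstract.

def slow_fast_shapes(num_frames, num_freq_bins, alpha=4, beta=8, slow_channels=64):
    """Return (channels, time, frequency) shapes for the Slow and Fast pathways."""
    # Fast pathway: sees every frame, but with 1/beta of the Slow channels.
    fast = (slow_channels // beta, num_frames, num_freq_bins)
    # Slow pathway: sees every alpha-th frame, with full channel capacity.
    slow_frames = len(range(0, num_frames, alpha))
    slow = (slow_channels, slow_frames, num_freq_bins)
    return slow, fast


def lateral_fuse(slow, fast, alpha=4):
    """Shape after concatenating time-strided Fast features onto the Slow pathway.

    Simplifying assumption: the lateral connection only strides over time
    and leaves the Fast channel count unchanged before concatenation.
    """
    slow_c, slow_t, slow_f = slow
    fast_c, fast_t, fast_f = fast
    # Striding the Fast time axis by alpha aligns it with the Slow time axis.
    fused_t = len(range(0, fast_t, alpha))
    if fused_t != slow_t or fast_f != slow_f:
        raise ValueError("pathway shapes are not aligned for fusion")
    return (slow_c + fast_c, slow_t, slow_f)


if __name__ == "__main__":
    slow, fast = slow_fast_shapes(num_frames=400, num_freq_bins=128)
    print("slow:", slow)
    print("fast:", fast)
    print("fused:", lateral_fuse(slow, fast))
```

Under these assumed settings the Slow pathway holds eight times as many channels over a quarter of the time steps, which is the capacity versus temporal resolution trade-off the abstract describes.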
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Action Recognition | EPIC-KITCHENS 100 (test) | Top-1 Verb Acc: 46.5 | 101 |
| Audio Classification | VGG-Sound | Top-1 Accuracy: 0.524 | 50 |
| Action Recognition | Epic Kitchens 100 | Top-1 Acc: 15.4 | 26 |
| Audio-Visual Classification | VGGSound | -- | 24 |
| Audio Recognition | VGG-Sound (test) | Top-1 Acc: 52.46 | 22 |
| Action Classification | Epic Kitchens 100 | -- | 22 |
| Action Recognition | EPIC-SOUNDS | Top-1 Accuracy: 53.8 | 17 |
| Noun recognition | Epic Kitchens 100 | Top-1 Acc: 22.8 | 13 |
| Verb recognition | Epic Kitchens 100 | Top-1 Acc: 46.5 | 13 |
| Audio Recognition | Epic-Kitchens-100 (val) | Overall Top-1 Verb Acc: 46.05 | 7 |
Showing 10 of 13 rows