End-to-End Audio Strikes Back: Boosting Augmentations Towards An Efficient Audio Classification Network
About
While efficient architectures and a plethora of augmentations for end-to-end image classification tasks have been suggested and heavily investigated, state-of-the-art techniques for audio classification still rely on numerous representations of the audio signal together with large architectures, fine-tuned from large datasets. By utilizing the inherent lightweight nature of audio and novel audio augmentations, we present an efficient end-to-end network with strong generalization ability. Experiments on a variety of sound classification sets demonstrate the effectiveness and robustness of our approach, which achieves state-of-the-art results in various settings. Public code is available at: https://github.com/Alibaba-MIIL/AudioClassfication
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy 96.3 | 325 |
| Audio Classification | ESC-50 (test) | Accuracy 96.3 | 84 |
| Audio Classification | AudioSet 2M | mAP 42.6 | 79 |
| Keyword Spotting | Speech Commands V2 | Accuracy 98.15 | 61 |
| Audio Recognition | Speech Commands V2 | Accuracy 98.15 | 43 |
| Sound classification | AudioSet (evaluation) | mAP 42.6 | 39 |
| Audio Classification | UrbanSound8K (official 10 fold split) | Accuracy (%) 90 | 10 |
| Audio Classification | Speech Commands 35 classes V2 (evaluation) | Accuracy 98.15 | 3 |