Ambient Sound Provides Supervision for Visual Learning
About
The sound of crashing waves, the roar of fast-moving cars -- sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds.
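The prediction target described above is a fixed-length statistical summary of the audio accompanying a frame. As an illustrative sketch only (the paper's actual targets are richer sound-texture statistics; the helper name, band count, and parameters below are assumptions, not the authors' implementation), one simple summary is the mean log energy in a handful of coarse frequency bands:

```python
import numpy as np

def sound_summary(audio, n_fft=512, hop=256, n_bands=8):
    """Coarse statistical summary of a mono audio clip.

    Hypothetical stand-in for the sound-texture statistics the CNN is
    trained to predict: mean log energy in n_bands frequency bands.
    """
    # Short-time power spectrogram via framed, windowed FFTs.
    frames = []
    for start in range(0, len(audio) - n_fft + 1, hop):
        window = np.hanning(n_fft) * audio[start:start + n_fft]
        frames.append(np.abs(np.fft.rfft(window)) ** 2)
    spec = np.array(frames)                       # shape: (time, freq)
    # Pool adjacent frequency bins into n_bands coarse bands,
    # then summarize each band by its mean log energy.
    bands = np.array_split(spec, n_bands, axis=1)
    return np.array([np.log(b.mean() + 1e-8) for b in bands])

# One second of synthetic 16 kHz audio: a 440 Hz tone plus light noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clip = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(16000)
target = sound_summary(clip)
print(target.shape)  # fixed-length vector a CNN could regress from the frame
```

Because the summary has a fixed length regardless of clip duration, it can serve directly as the regression (or, after clustering, classification) target for a convolutional network that sees only the video frame.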
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | PASCAL VOC 2007 (test) | mAP | 44 | 821 |
| Classification | PASCAL VOC 2007 (test) | mAP (%) | 61.3 | 217 |
| Scene Classification | Places 205 categories (test) | Top-1 Acc | 0.321 | 150 |
| Object Detection | PASCAL VOC 2007 | mAP | 44 | 49 |
| Image Classification | Places 205-way (test) | -- | -- | 38 |
| Classification | Pascal VOC | mAP | 61.3 | 27 |
| Event Classification | AudioSet and VGGSound (test) | mAP | 37.1 | 8 |