
SoundNet: Learning Sound Representations from Unlabeled Video

About

We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. We leverage the natural synchronization between vision and sound to learn an acoustic representation using two million unlabeled videos. Unlabeled video has the advantage that it can be economically acquired at massive scales, yet contains useful signals about natural sound. We propose a student-teacher training procedure which transfers discriminative visual knowledge from well-established visual recognition models into the sound modality, using unlabeled video as a bridge. Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification. Visualizations suggest that some high-level semantics automatically emerge in the sound network, even though it is trained without ground-truth labels.
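The student-teacher idea described above can be illustrated with a small sketch. It assumes the transfer is done by making the sound network (student) match the class distribution that a visual model (teacher) predicts for the same video, measured with a KL-divergence loss. The abstract does not state the loss, so the loss choice, the function names, and the random stand-in data are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_probs, student_logits, eps=1e-12):
    """Mean KL(teacher || student) over a batch.

    teacher_probs:  (batch, classes) distribution from a visual model
                    applied to the video frames (hypothetical input).
    student_logits: (batch, classes) raw outputs of the sound network
                    for the matching audio (hypothetical input).
    """
    student_probs = softmax(student_logits)
    kl = teacher_probs * (np.log(teacher_probs + eps) - np.log(student_probs + eps))
    return kl.sum(axis=-1).mean()

# Random stand-ins for real network outputs: 4 clips, 10 visual classes.
rng = np.random.default_rng(0)
teacher_probs = softmax(rng.normal(size=(4, 10)))
student_logits = rng.normal(size=(4, 10))

# An untrained student disagrees with the teacher, so the loss is positive.
loss_random = distillation_loss(teacher_probs, student_logits)

# A student that reproduces the teacher exactly drives the loss to zero.
loss_matched = distillation_loss(teacher_probs, np.log(teacher_probs))
```

No sound labels appear anywhere in this loss: the only supervision is the teacher's output on the visual stream, which is what lets unlabeled video act as the bridge between modalities.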

Yusuf Aytar, Carl Vondrick, Antonio Torralba • 2016

Related benchmarks

Task | Dataset | Result | Rank
Audio Classification | ESC-50 | Accuracy 74.2 | 325
Environmental Sound Classification | ESC-50 (5-fold cross-validation) | Accuracy 74.2 | 33
Sound classification | DCASE | Accuracy 88 | 15
Audio event classification | DCASE (official) | Top-1 Accuracy 88 | 9
Sound Recognition | DCASE 2014 (test) | Top-1 Accuracy 88 | 8
Liquid mass estimation | Wilson (test) | MAE (Plastic, Semiconical, Water) 3.2 | 7
Audio Classification | DCASE 2014 | Accuracy 88 | 6
