Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data

About

Recognizing sounds is a key aspect of computational audio scene analysis and machine perception. In this paper, we argue that sound recognition is inherently a multi-modal audiovisual task, in that it is easier to differentiate sounds using both the audio and visual modalities than using either one alone. We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings. The proposed fusion model uses an attention mechanism to dynamically combine the outputs of the individual audio and visual models. Experiments on the large-scale sound events dataset AudioSet demonstrate the efficacy of the proposed model, which outperforms the single-modal models as well as state-of-the-art fusion and multi-modal models. We achieve a mean Average Precision (mAP) of 46.16 on AudioSet, outperforming the prior state of the art by approximately 4.35 mAP (relative: 10.4%).

Haytham M. Fayek, Anurag Kumar • 2020
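
The abstract describes an attention mechanism that dynamically combines the outputs of the audio and visual models. The following is a minimal sketch of that general idea, not the authors' architecture: for each sound class, two attention scores are normalized with a softmax, and the resulting weights blend the audio and visual probabilities. The function names `softmax` and `attention_fuse` are hypothetical, and the attention scores are passed in directly as an assumption; in a trained model they would presumably come from learned layers.

```python
import math


def softmax(scores):
    """Normalize a list of scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(audio_probs, visual_probs, audio_scores, visual_scores):
    """Blend per-class audio and visual probabilities.

    Assumption: audio_scores and visual_scores are per-class attention
    scores supplied by the caller (stand-ins for learned attention layers).
    """
    fused = []
    for p_a, p_v, s_a, s_v in zip(audio_probs, visual_probs,
                                  audio_scores, visual_scores):
        # Per-class weights over the two modalities.
        w_a, w_v = softmax([s_a, s_v])
        fused.append(w_a * p_a + w_v * p_v)
    return fused


if __name__ == "__main__":
    # Two classes: audio is confident about the first, video about the second.
    audio = [0.9, 0.2]
    visual = [0.1, 0.8]
    # Hypothetical scores that favor audio for class 0 and video for class 1.
    print(attention_fuse(audio, visual, [2.0, -1.0], [-1.0, 2.0]))
```

With equal scores for both modalities the fusion reduces to a plain average; as one score dominates, the fused output approaches that modality's prediction, which is what lets the weighting vary per class and per recording.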

Related benchmarks

Task	Dataset	Result
Audio Classification	AudioSet	mAP 38.4	60
Sound classification	AudioSet (evaluation)	mAP 46.16	39
Audio-visual Zero-Shot Classification	VGGSound GZSL (test)	S Score 14.13	38
Acoustic event detection	AudioSet (test)	mAP 0.462	34
Classification	AudioSet AS-2M	--	21
Audio-visual Zero-Shot Classification	UCF GZSL cls (test)	S (Seen Accuracy) 39.34	19
Audio-visual Zero-Shot Classification	ActivityNet GZSL cls (test)	S (Seen) 11.15	19
Audio-Visual Event Classification	AudioSet 2M	mAP (Audio-only) 38.4	16
Audio-Visual Classification	AudioSet (test)	mAP (Audio Only) 38.4	6
Audiovisual Classification	AudioSet	mAP 46.2	6

