Audio-Visual Event Localization in Unconstrained Videos
About
In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect an Audio-Visual Event (AVE) dataset to systematically investigate three temporal localization tasks: supervised and weakly-supervised audio-visual event localization, and cross-modality localization. We develop an audio-guided visual attention mechanism to explore audio-visual correlations, propose a dual multimodal residual network (DMRN) to fuse information over the two modalities, and introduce an audio-visual distance learning network to handle the cross-modality localization. Our experiments support the following findings: joint modeling of auditory and visual modalities outperforms independent modeling, the learned attention can capture semantics of sounding objects, temporal alignment is important for audio-visual fusion, the proposed DMRN is effective in fusing audio-visual features, and strong correlations between the two modalities enable cross-modality localization.
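The abstract names an audio-guided visual attention mechanism but does not spell out its form. The sketch below is only an illustration of the general idea, not the paper's formulation: it assumes the audio and visual region features share one dimension, scores each region with a plain dot product, and pools the regions with softmax weights. The names `audio_guided_attention`, `audio_vec`, and `region_feats` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_guided_attention(audio_vec, region_feats):
    """Pool visual region features with weights derived from the audio vector.

    audio_vec:    list of d floats (hypothetical audio embedding of one segment).
    region_feats: list of k region features, each a list of d floats.
    Returns (weights, attended); attended is a list of d floats.
    """
    # Assumption: relevance of a region is its dot product with the audio vector.
    scores = [
        sum(a * v for a, v in zip(audio_vec, region))
        for region in region_feats
    ]
    weights = softmax(scores)
    dim = len(audio_vec)
    # Weighted sum of the region features, one coordinate at a time.
    attended = [
        sum(w * region[i] for w, region in zip(weights, region_feats))
        for i in range(dim)
    ]
    return weights, attended


# Toy input: the first region aligns with the audio vector, the others do not.
audio = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights, attended = audio_guided_attention(audio, regions)
```

In this toy input the first region receives the largest weight because it is the only one correlated with the audio vector. A learned model would typically project both modalities through trainable layers before scoring.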
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio-Visual Event Localization | AVE (test) | Accuracy | 74 | 37 |
| Audio-Visual Event Localization | AVE | Accuracy | 74 | 35 |
| Audio-Visual Video Parsing | LLP 1.0 (test) | Segment-level Audio | 47.2 | 13 |
| Audio-Visual Video Parsing | LLP (test) | Audio Segment Score | 49.9 | 11 |
| Open-Vocabulary Audio-Visual Event Localization | AVEBench OV (seen) | Accuracy | 76.6 | 8 |
| Open-Vocabulary Audio-Visual Event Localization | OV-AVEBench (total) | Accuracy | 53.8 | 8 |
| Open-Vocabulary Audio-Visual Event Localization | OV-AVEBench (unseen) | Accuracy | 44.6 | 8 |
| Image Guided Audio Temporal Localization | LLP (test) | F1 Score | 35.47 | 5 |
| Image Guided Audio Temporal Localization | AudioSet Strong (test) | F1 Score | 37.42 | 5 |
| Audio localization from visual segment query | AVE | V2A | 34.8 | 4 |