Audio-Visual Event Localization in Unconstrained Videos

About

In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect an Audio-Visual Event (AVE) dataset to systematically investigate three temporal localization tasks: supervised and weakly-supervised audio-visual event localization, and cross-modality localization. We develop an audio-guided visual attention mechanism to explore audio-visual correlations, propose a dual multimodal residual network (DMRN) to fuse information over the two modalities, and introduce an audio-visual distance learning network to handle the cross-modality localization. Our experiments support the following findings: joint modeling of auditory and visual modalities outperforms independent modeling, the learned attention can capture semantics of sounding objects, temporal alignment is important for audio-visual fusion, the proposed DMRN is effective in fusing audio-visual features, and strong correlations between the two modalities enable cross-modality localization.
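
The idea of audio-guided visual attention can be illustrated with a minimal sketch: an audio feature vector scores each visual region, the scores are normalized with a softmax, and the regions are averaged using those weights. This is only an illustration under stated assumptions. The function names (`softmax`, `audio_guided_attention`), the toy feature values, and the plain dot-product scoring are all hypothetical stand-ins; the abstract does not specify the scoring function, and the paper's mechanism presumably uses learned projections rather than a raw dot product.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_guided_attention(audio, regions):
    """Weight visual regions by their similarity to an audio feature.

    audio:   list of floats, one audio feature vector for a segment
    regions: list of visual region feature vectors (same dimension as audio)

    ASSUMPTION: similarity is a plain dot product. This is a simplification
    chosen for illustration, not the scoring function used in the paper.
    """
    # One relevance score per visual region.
    scores = [sum(a * r for a, r in zip(audio, region)) for region in regions]
    # Normalize scores into attention weights that sum to 1.
    weights = softmax(scores)
    # Attention-weighted sum of region features.
    dim = len(regions[0])
    attended = [
        sum(w * region[d] for w, region in zip(weights, regions))
        for d in range(dim)
    ]
    return weights, attended


# Toy example: the first region aligns with the audio feature,
# so it receives the largest attention weight.
audio = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights, attended = audio_guided_attention(audio, regions)
```

In this toy setup the region most correlated with the audio dominates the attended visual feature, which is the intuition behind attention that "can capture semantics of sounding objects".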

Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu • 2018

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Audio-Visual Event Localization | AVE (test) | Accuracy: 74 | 37 |
| Audio-Visual Event Localization | AVE | Accuracy: 74 | 35 |
| Audio-Visual Video Parsing | LLP 1.0 (test) | Segment-level Audio: 47.2 | 13 |
| Audio-Visual Video Parsing | LLP (test) | Audio Segment Score: 49.9 | 11 |
| Open-Vocabulary Audio-Visual Event Localization | AVEBench OV (seen) | Accuracy: 76.6 | 8 |
| Open-Vocabulary Audio-Visual Event Localization | OV-AVEBench (total) | Accuracy: 53.8 | 8 |
| Open-Vocabulary Audio-Visual Event Localization | OV-AVEBench (unseen) | Accuracy: 44.6 | 8 |
| Image Guided Audio Temporal Localization | LLP (test) | F1 Score: 35.47 | 5 |
| Image Guided Audio Temporal Localization | AudioSet Strong (test) | F1 Score: 37.42 | 5 |
| Audio localization from visual segment query | AVE | V2A: 34.8 | 4 |