Audio-Visual Event Localization in Unconstrained Videos

About

In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect an Audio-Visual Event (AVE) dataset to systematically investigate three temporal localization tasks: supervised and weakly-supervised audio-visual event localization, and cross-modality localization. We develop an audio-guided visual attention mechanism to explore audio-visual correlations, propose a dual multimodal residual network (DMRN) to fuse information over the two modalities, and introduce an audio-visual distance learning network to handle the cross-modality localization. Our experiments support the following findings: joint modeling of auditory and visual modalities outperforms independent modeling, the learned attention can capture semantics of sounding objects, temporal alignment is important for audio-visual fusion, the proposed DMRN is effective in fusing audio-visual features, and strong correlations between the two modalities enable cross-modality localization.
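
The audio-guided visual attention mentioned above can be illustrated with a minimal sketch. It assumes a generic additive attention form (project both modalities into a shared space, score each visual region with a tanh layer, normalize with softmax); the abstract does not specify the actual formulation, so the function name, weight names, and all dimensions below are hypothetical and this should not be read as the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def audio_guided_attention(audio, visual, W_a, W_v, w_f):
    """Weight visual regions by their relevance to one audio segment.

    audio:  (d_a,)    audio feature for one segment (assumed shape)
    visual: (k, d_v)  features for k spatial regions of the same segment
    W_a, W_v, w_f:    hypothetical projection weights, normally learned
    Returns the attended visual feature (d_v,) and the weights (k,).
    """
    a_proj = audio @ W_a                      # (d,)
    v_proj = visual @ W_v                     # (k, d)
    # Broadcasting adds the audio projection to every region.
    scores = np.tanh(v_proj + a_proj) @ w_f   # (k,)
    weights = softmax(scores)                 # sums to 1 over regions
    attended = weights @ visual               # (d_v,)
    return attended, weights

# Toy usage with random inputs; sizes are illustrative only.
rng = np.random.default_rng(0)
d_a, d_v, d, k = 128, 512, 64, 49
audio = rng.standard_normal(d_a)
visual = rng.standard_normal((k, d_v))
W_a = rng.standard_normal((d_a, d)) * 0.1
W_v = rng.standard_normal((d_v, d)) * 0.1
w_f = rng.standard_normal(d) * 0.1
attended, weights = audio_guided_attention(audio, visual, W_a, W_v, w_f)
```

In this sketch, regions whose projected features align with the audio projection receive larger weights, which is one way attention could come to highlight sounding objects.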

Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu • 2018

Related benchmarks

Task	Dataset	Result
Audio-Visual Video Parsing	LLP (test)	Audio Segment Score 49.9	89
Audio-Visual Event Localization	AVE (test)	Accuracy 74	54
Audio-Visual Event Localization	AVE	Accuracy 74	52
Action Recognition	KS	Accuracy 77.5	15
Open-Vocabulary Audio-Visual Event Localization	OV-AVEBench (unseen)	Accuracy 44.6	14
Audio-Visual Video Parsing	LLP 1.0 (test)	Segment-level Audio 47.2	13
Audio-Visual Event Localization	OV-AVEBench (Seen)	Accuracy 76.6	12
Open-Vocabulary Audio-Visual Event Localization	AVEBench OV (seen)	Accuracy 76.6	8
Open-Vocabulary Audio-Visual Event Localization	OV-AVEBench (total)	Accuracy 53.8	8
Image Guided Audio Temporal Localization	LLP (test)	F1 Score 35.47	5

Showing 10 of 13 rows
