A Simple Baseline for Audio-Visual Scene-Aware Dialog

About

The recently proposed audio-visual scene-aware dialog task paves the way to a more data-driven way of learning virtual assistants, smart speakers and car navigation systems. However, very little is known to date about how to effectively extract meaningful information from the plethora of sensors that pound the computational engine of those devices. Therefore, in this paper, we provide and carefully analyze a simple baseline for audio-visual scene-aware dialog which is trained end-to-end. Our method uses an attention mechanism to differentiate useful signals from distracting ones in a data-driven manner. We evaluate the proposed approach on the recently introduced and challenging audio-visual scene-aware dataset, and demonstrate the key features that allow it to outperform the current state-of-the-art by more than 20% on CIDEr.

Idan Schwartz, Alexander Schwing, Tamir Hazan • 2019

Related benchmarks

Task	Dataset	Result
Audio-Visual Question Answering	MUSIC-AVQA 1.0 (test)	AV Localis Accuracy 74.53	96
Audio-Visual Question Answering	MUSIC-AVQA (test)	Acc (Avg) 67.4	94
Audio Question Answering	MUSIC-AVQA 1.0 (test)	Counting Accuracy 72.41	43
Audio-Visual Question Answering	MUSIC-AVQA-R (test)	Audio QA Count (Head) 54	41
Overall Audio-Visual Question Answering	MUSIC-AVQA (test)	Overall Accuracy 67.44	21
Audio Question Answering	MUSIC-AVQA (test)	Accuracy (Avg) 68.52	17
Visual Question Answering	MUSIC-AVQA v1.0 (test)	Accuracy (Count) 0.6739	16
Audio-Visual Question Answering	MUSIC-AVQA	Audio Count Acc 72.47	14
Visual Question Answering	MUSIC-AVQA (test)	Accuracy (Counting) 67.39	12
Audio-Visual Scene-Aware Dialog	AVSD (test)	CIDEr 0.905	11

