
A Simple Baseline for Audio-Visual Scene-Aware Dialog

About

The recently proposed audio-visual scene-aware dialog task paves the way to a more data-driven approach to learning virtual assistants, smart speakers and car navigation systems. However, very little is known to date about how to effectively extract meaningful information from the plethora of sensors that pound the computational engine of those devices. Therefore, in this paper, we provide and carefully analyze a simple baseline for audio-visual scene-aware dialog which is trained end-to-end. Our method differentiates, in a data-driven manner, useful signals from distracting ones using an attention mechanism. We evaluate the proposed approach on the recently introduced and challenging audio-visual scene-aware dataset, and demonstrate the key features that allow it to outperform the current state-of-the-art by more than 20% on CIDEr.
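The abstract says the method uses an attention mechanism to separate useful signals from distracting ones. As a rough illustration of that general idea only, the sketch below applies generic dot-product softmax attention over a few modality feature vectors. It is not the architecture from the paper: the function names, the modality names, and the toy feature values are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, features):
    """Weight each modality by its dot-product relevance to the query.

    query:    list of floats (e.g. an encoded question; hypothetical)
    features: dict mapping modality name -> feature vector of the same length
    Returns (per-modality weights, attention-weighted fused vector).
    """
    names = list(features.keys())
    scores = [sum(q * f for q, f in zip(query, features[n])) for n in names]
    weights = softmax(scores)
    fused = [0.0] * len(query)
    for w, n in zip(weights, names):
        for i, value in enumerate(features[n]):
            fused[i] += w * value
    return dict(zip(names, weights)), fused

# Toy, made-up feature vectors for three input signals.
query = [1.0, 0.0]
features = {
    "audio": [0.1, 0.9],
    "video": [2.0, 0.1],
    "dialog_history": [1.0, 1.0],
}
weights, fused = attend(query, features)
print(weights)  # the modality most aligned with the query gets the largest weight
```

In this toy setup the weights sum to one, so a signal that is poorly aligned with the query contributes little to the fused vector. In a trained model the scores would come from learned parameters rather than raw dot products.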

Idan Schwartz, Alexander Schwing, Tamir Hazan • 2019

Related benchmarks

Task | Dataset | Result | Rank
Audio-Visual Question Answering | MUSIC-AVQA 1.0 (test) | AV Localis Accuracy 74.53 | 96
Audio-Visual Question Answering | MUSIC-AVQA (test) | Acc (Avg) 67.4 | 59
Audio Question Answering | MUSIC-AVQA 1.0 (test) | Counting Accuracy 72.41 | 43
Overall Audio-Visual Question Answering | MUSIC-AVQA (test) | Overall Accuracy 67.44 | 21
Audio Question Answering | MUSIC-AVQA (test) | Accuracy (Avg) 68.52 | 17
Visual Question Answering | MUSIC-AVQA v1.0 (test) | Accuracy (Count) 0.6739 | 16
Audio-Visual Question Answering | MUSIC-AVQA-R (test) | Audio QA Count (Head) 54 | 13
Visual Question Answering | MUSIC-AVQA (test) | Accuracy (Counting) 67.39 | 12
Audio-Visual Scene-Aware Dialog | AVSD (test) | CIDEr 0.905 | 11
Audio-Visual Question Answering | AVQA (val) | Existence Accuracy 81.61 | 9