A Simple Baseline for Audio-Visual Scene-Aware Dialog
About
The recently proposed audio-visual scene-aware dialog task paves the way to a more data-driven approach to learning virtual assistants, smart speakers and car navigation systems. However, very little is known to date about how to effectively extract meaningful information from the plethora of sensors that pound the computational engine of those devices. Therefore, in this paper, we provide and carefully analyze a simple baseline for audio-visual scene-aware dialog which is trained end-to-end. Using an attention mechanism, our method differentiates useful signals from distracting ones in a data-driven manner. We evaluate the proposed approach on the recently introduced and challenging audio-visual scene-aware dataset, and demonstrate the key features that allow it to outperform the current state of the art by more than 20% on CIDEr.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio-Visual Question Answering | MUSIC-AVQA 1.0 (test) | AV Localis Accuracy | 74.53 | 96 |
| Audio-Visual Question Answering | MUSIC-AVQA (test) | Acc (Avg) | 67.4 | 59 |
| Audio Question Answering | MUSIC-AVQA 1.0 (test) | Counting Accuracy | 72.41 | 43 |
| Overall Audio-Visual Question Answering | MUSIC-AVQA (test) | Overall Accuracy | 67.44 | 21 |
| Audio Question Answering | MUSIC-AVQA (test) | Accuracy (Avg) | 68.52 | 17 |
| Visual Question Answering | MUSIC-AVQA v1.0 (test) | Accuracy (Count) | 0.6739 | 16 |
| Audio-Visual Question Answering | MUSIC-AVQA-R (test) | Audio QA Count (Head) | 54 | 13 |
| Visual Question Answering | MUSIC-AVQA (test) | Accuracy (Counting) | 67.39 | 12 |
| Audio-Visual Scene-Aware Dialog | AVSD (test) | CIDEr | 0.905 | 11 |
| Audio-Visual Question Answering | AVQA (val) | Existence Accuracy | 81.61 | 9 |