
Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering

About

This paper focuses on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions derived from untrimmed audible videos. To generate accurate answers, an AVQA model is expected to find the most informative audio-visual clues relevant to the given question. In this paper, we propose to explicitly consider fine-grained visual objects in video frames (object-level clues) and to explore the multi-modal relations (i.e., among object, audio, and question) in terms of feature interaction and model optimization. For feature interaction, we present an end-to-end object-oriented network that adopts a question-conditioned clue discovery module to concentrate the audio and visual modalities on their respective keywords of the question, and a modality-conditioned clue collection module to highlight closely associated audio segments or visual objects. For model optimization, we propose an object-aware adaptive-positivity learning strategy that selects the most semantically matched multi-modal pairs as positives. Specifically, we design two object-aware contrastive loss functions to identify the highly relevant question-object pairs and audio-object pairs, respectively. These selected pairs are constrained to have larger similarity values than the mismatched pairs. The positive-selection process is adaptive, as the positive pairs selected in each video frame may differ. These two object-aware objectives help the model understand which objects are relevant to the question and which are making sounds. Extensive experiments on the MUSIC-AVQA dataset demonstrate that the proposed method is effective in finding favorable audio-visual clues and achieves new state-of-the-art question-answering performance.
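To make the adaptive-positivity idea more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not give the loss formula, so the InfoNCE-style form, the top-1 selection rule, the use of same-frame objects as the mismatched pairs, the temperature value, and the names `cosine` and `adaptive_positive_loss` are all assumptions made for illustration. The query vector can stand for either a question feature (question-object pairs) or an audio feature (audio-object pairs).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def adaptive_positive_loss(query, objects, temperature=0.1):
    """Return (loss, positive_index) for the objects of one video frame.

    Assumptions (not taken from the paper): the positive is the single
    object most similar to the query, the remaining objects in the same
    frame act as mismatched pairs, and the loss is InfoNCE-style.
    """
    sims = [cosine(query, obj) for obj in objects]
    # Adaptive selection: the positive is chosen per frame from the
    # current similarities rather than from a fixed label.
    pos = max(range(len(sims)), key=sims.__getitem__)
    logits = [s / temperature for s in sims]
    # Numerically stable log-sum-exp over all objects in the frame.
    peak = max(logits)
    log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Minimising this pushes the positive pair's similarity above the rest.
    loss = log_denom - logits[pos]
    return loss, pos


# Hypothetical 2-D features: the same query picks a different positive
# object in each frame because the frames contain different objects.
query = [1.0, 0.1]
frame_a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
frame_b = [[0.0, 1.0], [0.7, 0.7], [1.0, 0.0]]
loss_a, pos_a = adaptive_positive_loss(query, frame_a)
loss_b, pos_b = adaptive_positive_loss(query, frame_b)
```

In this toy setup the selected positive is object 0 in the first frame and object 2 in the second, which mirrors the statement that the selected pairs may differ from frame to frame.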

Zhangbin Li, Dan Guo, Jinxing Zhou, Jing Zhang, Meng Wang • 2023

Related benchmarks

Task                              Dataset                   Result                        Rank
Audio-Visual Question Answering   MUSIC-AVQA 1.0 (test)     AV Localis Accuracy: 66.68    96
Audio-Visual Question Answering   MUSIC-AVQA (test)         Acc (Avg): 70.96              59
Audio Question Answering          MUSIC-AVQA 1.0 (test)     Counting Accuracy: 82.4       43
Visual Question Answering         MUSIC-AVQA v1.0 (test)    Accuracy (Count): 0.7652      16
