Object-aware Adaptive-Positivity Learning for Audio-Visual Question Answering

About

This paper focuses on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions derived from untrimmed audible videos. To generate accurate answers, an AVQA model is expected to find the most informative audio-visual clues relevant to the given questions. In this paper, we propose to explicitly consider fine-grained visual objects in video frames (object-level clues) and explore the multi-modal relations (i.e., among the object, audio, and question) in terms of feature interaction and model optimization. For feature interaction, we present an end-to-end object-oriented network that adopts a question-conditioned clue discovery module to concentrate the audio/visual modalities on the respective keywords of the question, and a modality-conditioned clue collection module to highlight closely associated audio segments or visual objects. For model optimization, we propose an object-aware adaptive-positivity learning strategy that selects the highly semantically matched multi-modal pairs as positivity. Specifically, we design two object-aware contrastive loss functions to identify the highly relevant question-object pairs and audio-object pairs, respectively. These selected pairs are constrained to have larger similarity values than the mismatched pairs. The positivity-selecting process is adaptive, as the positivity pairs selected in each video frame may be different. These two object-aware objectives help the model understand which objects are relevant to the question and which are making sounds. Extensive experiments on the MUSIC-AVQA dataset demonstrate that the proposed method is effective in finding favorable audio-visual clues and achieves new state-of-the-art question-answering performance.
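The adaptive positive selection described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the loss formula, so the cosine similarity, the top-1 selection rule, the InfoNCE-style softmax form, the temperature value, and the function names `cosine` and `adaptive_positive_loss` are all assumptions made here for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def adaptive_positive_loss(query, objects, temperature=0.1):
    """InfoNCE-style loss for a single video frame (illustrative only).

    query:   a question (or audio) feature vector.
    objects: feature vectors of the objects detected in the frame.
    Assumption of this sketch: the object most similar to the query is
    taken as the positive, and all remaining objects act as negatives.
    Returns (loss, index_of_selected_positive).
    """
    sims = [cosine(query, obj) / temperature for obj in objects]
    # The positive is chosen per frame, so it can differ between frames.
    pos = max(range(len(sims)), key=sims.__getitem__)
    # Numerically stable log-sum-exp over all objects in the frame.
    m = max(sims)
    log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
    return log_denom - sims[pos], pos

# Toy features: the same question selects a different object in each frame.
question = [1.0, 0.0]
frame_a = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
frame_b = [[0.0, 1.0], [0.8, 0.2]]
loss_a, pos_a = adaptive_positive_loss(question, frame_a)
loss_b, pos_b = adaptive_positive_loss(question, frame_b)
```

Minimizing this quantity pushes the similarity of the selected pair above that of the mismatched pairs in the same frame. The same function could be applied to audio-object pairs by passing an audio feature as `query`.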

Zhangbin Li, Dan Guo, Jinxing Zhou, Jing Zhang, Meng Wang • 2023

Related benchmarks

Task	Dataset	Result
Audio-Visual Question Answering	MUSIC-AVQA 1.0 (test)	AV Localis Accuracy: 66.68	96
Audio-Visual Question Answering	MUSIC-AVQA (test)	Acc (Avg): 70.96	94
Audio Question Answering	MUSIC-AVQA 1.0 (test)	Counting Accuracy: 82.4	43
Visual Question Answering	MUSIC-AVQA v1.0 (test)	Accuracy (Count): 0.7652	16

