Progressive Spatio-temporal Perception for Audio-Visual Question Answering

About

The Audio-Visual Question Answering (AVQA) task aims to answer questions about different visual objects, sounds, and their associations in videos. Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, most of which may be unrelated to the given question or may even interfere with answering questions about the content of interest. Conversely, focusing only on the question-aware audio-visual content removes this interference and enables the model to answer more efficiently. In this paper, we propose a Progressive Spatio-Temporal Perception Network (PSTP-Net), which contains three modules that progressively identify key spatio-temporal regions w.r.t. questions. Specifically, a temporal segment selection module is first introduced to select the audio-visual segments most relevant to the given question. Then, a spatial region selection module is utilized to choose the regions most relevant to the question from the selected temporal segments. To further refine the selection of features, an audio-guided visual attention module is employed to perceive the association between audio and the selected spatial regions. Finally, the spatio-temporal features from these modules are integrated for answering the question. Extensive experimental results on the public MUSIC-AVQA and AVQA datasets provide compelling evidence of the effectiveness and efficiency of PSTP-Net. Code is available at: https://github.com/GeWu-Lab/PSTP-Net
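The three-stage selection described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the dot-product relevance scores, the hard top-k selection, the softmax attention, and every function and parameter name (progressive_select, k_segments, k_regions) are assumptions chosen only to make the progressive idea concrete. PSTP-Net itself uses learned modules; see the linked repository for the actual code.

```python
import math


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_indices(scores, k):
    """Indices of the k largest scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(ranked)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def progressive_select(question, segments, audio, k_segments, k_regions):
    """Toy progressive spatio-temporal selection (illustrative assumption).

    question: question feature vector of dimension d
    segments: one entry per temporal segment, each a list of region vectors
    audio:    one audio feature vector per temporal segment
    Returns the kept segment indices and one attended feature per kept segment.
    """
    # Stage 1: temporal segment selection. Score each segment by how well its
    # mean visual feature and its audio feature match the question.
    seg_scores = []
    for regions, a in zip(segments, audio):
        mean_visual = [sum(col) / len(regions) for col in zip(*regions)]
        seg_scores.append(dot(question, mean_visual) + dot(question, a))
    kept_segments = top_k_indices(seg_scores, k_segments)

    fused = []
    for t in kept_segments:
        regions = segments[t]
        # Stage 2: spatial region selection inside the kept segment.
        region_scores = [dot(question, r) for r in regions]
        kept_regions = top_k_indices(region_scores, k_regions)
        # Stage 3: audio-guided attention over the kept regions.
        weights = softmax([dot(audio[t], regions[i]) for i in kept_regions])
        attended = [
            sum(w * regions[i][j] for w, i in zip(weights, kept_regions))
            for j in range(len(question))
        ]
        fused.append(attended)
    return kept_segments, fused
```

In this sketch each stage only sees what the previous stage kept, so the per-segment cost of region scoring and attention is paid for k_segments segments rather than for the whole video, which is the efficiency argument the abstract makes.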

Guangyao Li, Wenxuan Hou, Di Hu • 2023

Related benchmarks

Task	Dataset	Result
Audio-Visual Question Answering	MUSIC-AVQA 1.0 (test)	AV Localis Accuracy 71.8	96
Audio-Visual Question Answering	AVQA	Accuracy 90.2	85
Audio-Visual Question Answering	MUSIC-AVQA (test)	Acc (Avg) 73.52	76
Audio Question Answering	MUSIC-AVQA 1.0 (test)	Counting Accuracy 73.97	43
Audio-Visual Question Answering	AVQA (test)	Total Accuracy 90.2	36
Visual Question Answering	MUSIC-AVQA v1.0 (test)	Accuracy (Count) 0.7715	16
Audio-Visual Question Answering	MUSIC-AVQA	Audio Count Acc 73.97	14
Audio-Visual Question Answering	AVQA 69 (test)	Accuracy 90.2	5
Audio-Visual Question Answering	MUSIC-AVQA	Accuracy (Audio) 70.91	5
