Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

About

In recent years, the deployment of large-scale pre-trained models in audio-visual downstream tasks has yielded remarkable outcomes. However, these models, primarily trained on single-modality unconstrained datasets, still encounter challenges in feature extraction for multi-modal tasks, leading to suboptimal performance. This limitation arises due to the introduction of irrelevant modality-specific information during encoding, which adversely affects the performance of downstream tasks. To address this challenge, this paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism. This mechanism leverages audio and visual modalities as soft prompts to dynamically adjust the parameters of pre-trained models based on the current multi-modal input features. Specifically, the DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders, allowing adaptive extraction of crucial information from the current modality across spatial, channel, and temporal dimensions, while preserving the frozen parameters of large-scale pre-trained models. Experimental evaluations demonstrate that our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, our model exhibits promising performance in challenging few-shot and zero-shot scenarios. The source code and pre-trained models are available at https://github.com/haoyi-duan/DG-SCT.
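
The abstract describes the mechanism only at a high level. As a rough illustration of the idea, and not the authors' implementation (that is in the linked repository), the sketch below shows one modality (audio) acting as a prompt that reweights frozen visual features along the channel, spatial, and temporal axes. The function name `audio_guided_sct`, the tensor shapes, the gating formulas, and the scalar `alpha` are all assumptions made for illustration. The trainable projection layers a real module would have are omitted, and the audio features are assumed to already share the visual channel dimension.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def audio_guided_sct(visual, audio, alpha=0.5):
    """Hypothetical sketch, not the DG-SCT reference code.

    visual: [T][S][C] features from a frozen visual encoder.
    audio:  [T][C] audio features used as the guiding prompt.
    alpha:  stand-in for the trainable part; alpha=0 returns the input.
    """
    T, S, C = len(visual), len(visual[0]), len(visual[0][0])
    scale = math.sqrt(C)

    # Per-frame mean over spatial positions, shape [T][C].
    frame_means = [
        [sum(visual[t][s][c] for s in range(S)) / S for c in range(C)]
        for t in range(T)
    ]

    # Temporal attention: one weight per frame from audio-visual agreement.
    temporal = softmax([dot(frame_means[t], audio[t]) / scale for t in range(T)])

    out = []
    for t in range(T):
        # Channel attention: gate each channel using the audio feature.
        channel = [sigmoid(audio[t][c] * frame_means[t][c]) for c in range(C)]
        # Spatial attention: softmax over positions of audio-visual similarity.
        spatial = softmax([dot(visual[t][s], audio[t]) / scale for s in range(S)])
        frame = []
        for s in range(S):
            w = spatial[s] * temporal[t]
            # Residual form: the frozen feature stays as the base signal.
            frame.append([
                visual[t][s][c] + alpha * w * channel[c] * visual[t][s][c]
                for c in range(C)
            ])
        out.append(frame)
    return out


# Toy input: T=2 frames, S=2 positions, C=2 channels.
visual = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.2, 0.8]]]
audio = [[1.0, 0.0], [0.0, 1.0]]
adapted = audio_guided_sct(visual, audio, alpha=0.5)
```

The residual form is chosen here so that the pre-trained features pass through unchanged when the added term is zero, which mirrors the abstract's point that the large model's parameters stay frozen while only the added interaction layers are trained. Per the abstract, the actual module is dual-guided, so a symmetric visual-to-audio path would also be needed.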

Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Audio-Visual Question Answering | MUSIC-AVQA 1.0 (test) | AV Localis Accuracy | 65.91 | 96 |
| Audio-Visual Question Answering | MUSIC-AVQA (test) | Acc (Avg) | 74.8 | 59 |
| Audio-Visual Event Localization | AVE (test) | Accuracy | 82.2 | 37 |
| Audio-Visual Event Localization | AVE | Accuracy | 64.7 | 35 |
| Audio-Visual Segmentation | AVSBench MS3 (test) | Jaccard Index (IoU) | 53.5 | 30 |
| Audio-Visual Segmentation | AVSBench S4 (test) | MJ | 80.9 | 16 |
| Audio-Visual Video Parsing | LLP (test) | Audio Segment Score | 59 | 11 |
| Audio-Visual Segmentation | AVSBench MS3 setting (test) | MJ Score | 53.5 | 6 |
| Audio-Visual Question Answering | MUSIC-AVQA 2.0 (test) | Accuracy (Audio, Count) | 83.13 | 4 |