iQuery: Instruments as Queries for Audio-Visual Sound Separation

About

Current audio-visual separation methods share a standard architecture in which an audio encoder-decoder network is fused with visual encoding features at the encoder bottleneck. This design confounds the learning of multi-modal feature encoding with robust sound decoding for audio separation. To generalize to a new instrument, one must fine-tune the entire visual and audio network for all musical instruments. We reformulate the visual-sound separation task and propose Instrument as Query (iQuery) with a flexible query expansion mechanism. Our approach ensures cross-modal consistency and cross-instrument disentanglement. We use "visually named" queries to initiate the learning of audio queries, and use cross-modal attention to remove potential sound source interference in the estimated waveforms. To generalize to a new instrument or event class, drawing inspiration from text-prompt design, we insert an additional query as an audio prompt while freezing the attention mechanism. Experimental results on three benchmarks demonstrate that iQuery improves audio-visual sound source separation performance.
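The query expansion idea described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single-head cross-attention with identity projections, no self-attention among queries, and plain Python lists instead of tensors. The names `cross_attend`, `expand_queries`, `query_bank`, and the instrument labels are hypothetical. Under those assumptions, appending a query for a new instrument leaves the outputs of the existing queries untouched, so only the new query would need training.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Single-head cross-attention with identity projections (an assumption).

    queries:  list of d-dim vectors, one per instrument
    features: list of d-dim vectors, e.g. audio feature tokens
    Returns one d-dim output vector per query.
    """
    d = len(features[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every feature.
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(d) for f in features]
        weights = softmax(scores)
        # Weighted sum of the features (values are the features themselves).
        outputs.append([sum(w * f[j] for w, f in zip(weights, features)) for j in range(d)])
    return outputs


def expand_queries(query_bank, name, init_vector):
    """Return a copy of the bank with one extra query, plus the trainable names.

    Existing queries and the attention itself are treated as frozen;
    only the newly inserted query is marked trainable.
    """
    expanded = dict(query_bank)
    expanded[name] = list(init_vector)
    trainable = {name}
    return expanded, trainable


rng = random.Random(0)
d = 4
features = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(6)]

# Hypothetical query bank: one vector per known instrument.
query_bank = {
    "guitar": [rng.gauss(0, 1) for _ in range(d)],
    "violin": [rng.gauss(0, 1) for _ in range(d)],
}

before = cross_attend(list(query_bank.values()), features)

# Insert an additional query for a new instrument class.
new_query = [rng.gauss(0, 1) for _ in range(d)]
expanded, trainable = expand_queries(query_bank, "flute", new_query)
after = cross_attend(list(expanded.values()), features)

# Outputs for the original instruments do not change after expansion.
print(after[: len(before)] == before)
print(sorted(trainable))
```

In a decoder where queries also attend to one another, this independence would no longer hold exactly, which is one reason the sketch should be read as an illustration of the idea rather than a description of the model.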

Jiaben Chen, Renrui Zhang, Dongze Lian, Jiaqi Yang, Ziyao Zeng, Jianbo Shi • 2022

Related benchmarks

Task	Dataset	Result
Sound Separation	MUSIC-clean+	CLAPt 5.5	18
Audio Source Separation	MUSIC (test)	SDR 11.17	16
Source Separation	VGGSound Clean (test)	IS 12.82	10
Source Separation	MUSIC (test)	IS 3.08	10
Image Query Sound Separation	VGGSOUND clean+	Mean SDR 6.2	5
