
Self-Supervised Visual Acoustic Matching

About

Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. Existing methods assume access to paired training data, where the audio is observed in both source and target environments, but this limits the diversity of training data or requires the use of simulated data or heuristics to create paired samples. We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio -- without acoustically mismatched source audio for reference. Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric that quantifies the level of residual acoustic information in the de-biased audio. Training with either in-the-wild web data or simulated data, we demonstrate it outperforms the state-of-the-art on multiple challenging datasets and a wide variety of real-world audio and environments.

Arjun Somayazulu, Changan Chen, Kristen Grauman • 2023

Related benchmarks

Task                      Dataset                                        RTE (s)  Rank
Visual Acoustic Matching  SoundSpaces-Speech unseen environments (test)  0.079    5
Visual Acoustic Matching  AVSpeech-Rooms unseen environments (test)      0.071    5
Visual Acoustic Matching  LibriSpeech unseen environments (test)         0.21     5
Visual Acoustic Matching  SoundSpaces-Speech (seen environments)         0.06     3
Visual Acoustic Matching  AVSpeech-Rooms (seen environments)             0.067    3
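The benchmark results are reported as RTE in seconds. The page does not define the metric; the sketch below assumes RTE is the mean absolute difference between the reverberation time (RT60) of the re-synthesized audio and that of the target audio. The function name `rt60_error` and the sample values are hypothetical, and the RT60 values are taken as already estimated by some external tool.

```python
from statistics import mean


def rt60_error(predicted_rt60, target_rt60):
    """Mean absolute RT60 difference in seconds (assumed definition of RTE)."""
    if len(predicted_rt60) != len(target_rt60):
        raise ValueError("predicted and target lists must have the same length")
    if not predicted_rt60:
        raise ValueError("at least one RT60 pair is required")
    # Average the per-clip absolute gap between generated and target RT60.
    return mean(abs(p - t) for p, t in zip(predicted_rt60, target_rt60))


# Hypothetical per-clip RT60 estimates in seconds, for illustration only.
predicted = [0.42, 0.61, 0.35]
target = [0.40, 0.70, 0.30]
print(f"RTE: {rt60_error(predicted, target):.3f} s")
```

Under this reading, lower values mean the generated audio's reverberation is closer to that of the target environment.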
