Visual Acoustic Matching
About
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. In addition, we devise a self-supervised training objective that can learn acoustic matching from in-the-wild Web videos, despite their lack of acoustically mismatched audio. We demonstrate that our approach successfully translates human speech to a variety of real-world environments depicted in images, outperforming both traditional acoustic matching and more heavily supervised baselines.
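The core architectural idea above is audio-visual attention: audio features query visual features of the target environment so acoustic properties implied by the image can be injected into the audio representation. The paper does not publish this exact code; the sketch below is a minimal, single-head, projection-free illustration of that cross-modal attention step (function name `cross_modal_attention` and all shapes are illustrative assumptions, not the authors' implementation).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio_feats, visual_feats):
    """Let audio frames attend over visual patch embeddings.

    audio_feats:  (T, d) audio query features (T time frames)
    visual_feats: (P, d) visual key/value features (P image patches)
    Returns a (T, d) visual context per audio frame, which a full model
    would fuse back into the audio stream (e.g., by residual addition).
    """
    d = audio_feats.shape[-1]
    scores = audio_feats @ visual_feats.T / np.sqrt(d)  # (T, P) similarity
    weights = softmax(scores, axis=-1)                  # attention over patches
    return weights @ visual_feats                       # (T, d) visual context

# Toy example: 4 audio frames, 6 image patches, 8-dim features.
rng = np.random.default_rng(0)
audio = rng.standard_normal((4, 8))
visual = rng.standard_normal((6, 8))
out = cross_modal_attention(audio, visual)
print(out.shape)  # (4, 8)
```

In the actual model this attention sits inside a transformer with learned query/key/value projections and multiple heads; the sketch keeps only the scaled dot-product core to show how visual properties condition each audio frame.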
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Novel-view Sound Synthesis | Soundspace-Ambient (Unseen Scenes) | STFT | 4.936 | 15 |
| Novel-view Sound Synthesis | Soundspace-Ambient (Seen Scenes) | STFT | 5.224 | 15 |
| Novel View Acoustic Synthesis | SoundSpaces-NVAS Single Environment | Mag | 0.161 | 12 |
| Binaural audio synthesis | N2S (test) | STFT | 1.972 | 9 |
| Novel-view Sound Synthesis | N2S Benchmark real-world scene | STFT Error | 1.972 | 9 |
| Novel View Acoustic Synthesis | SoundSpaces-NVAS (Novel Environment) | Magnitude Score | 0.235 | 6 |
| Visual Acoustic Matching | SoundSpaces-Speech unseen environments (test) | RTE (s) | 0.08 | 5 |
| Visual Acoustic Matching | AVSpeech-Rooms unseen environments (test) | RTE (s) | 0.136 | 5 |
| Visual Acoustic Matching | LibriSpeech unseen environments (test) | RTE (s) | 0.239 | 5 |
| Visual Acoustic Matching | SoundSpaces-Speech (seen environments) | RTE (s) | 0.062 | 3 |