
Visual Acoustic Matching

About

We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. In addition, we devise a self-supervised training objective that can learn acoustic matching from in-the-wild Web videos, despite their lack of acoustically mismatched audio. We demonstrate that our approach successfully translates human speech to a variety of real-world environments depicted in images, outperforming both traditional acoustic matching and more heavily supervised baselines.
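The audio-visual attention described above can be pictured as audio tokens querying visual patch features of the target-environment image, so that room properties suggested by the image are injected into the audio representation before re-synthesis. The block below is a minimal sketch of that idea only; the module names, dimensions, and layer layout are illustrative assumptions, not the paper's exact architecture, which also includes audio encoding/decoding and the self-supervised training objective.

```python
import torch
import torch.nn as nn

class AudioVisualAttentionBlock(nn.Module):
    """One transformer block in which audio tokens attend to visual tokens (illustrative)."""

    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, audio_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # audio_tokens:  [B, Ta, dim] -- e.g. frames of the source-speech features
        # visual_tokens: [B, Tv, dim] -- e.g. patch embeddings of the target-room image
        x = audio_tokens
        a = self.norm1(x)
        x = x + self.self_attn(a, a, a)[0]                            # audio self-attention
        q = self.norm2(x)
        x = x + self.cross_attn(q, visual_tokens, visual_tokens)[0]   # inject visual room cues
        return x + self.ffn(self.norm3(x))

# Example: 100 audio frames attending to a 14x14 grid of image patches.
audio = torch.randn(2, 100, 256)
image_patches = torch.randn(2, 196, 256)
out = AudioVisualAttentionBlock()(audio, image_patches)  # -> [2, 100, 256]
```

Stacking several such blocks lets later layers refine how strongly each audio frame draws on different image regions; the output tokens would then be decoded back to a waveform that carries the target room acoustics.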

Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman • 2022

Related benchmarks

Task | Dataset | Metric | Result | Rank
Novel-view Sound Synthesis | Soundspace-Ambient (Unseen Scenes) | STFT | 4.936 | 15
Novel-view Sound Synthesis | Soundspace-Ambient (Seen Scenes) | STFT | 5.224 | 15
Novel View Acoustic Synthesis | SoundSpaces-NVAS Single Environment | Mag | 0.161 | 12
Binaural audio synthesis | N2S (test) | STFT | 1.972 | 9
Novel-view Sound Synthesis | N2S Benchmark real-world scene | STFT Error | 1.972 | 9
Novel View Acoustic Synthesis | SoundSpaces-NVAS (Novel Environment) | Magnitude Score | 0.235 | 6
Visual Acoustic Matching | SoundSpaces-Speech unseen environments (test) | RTE (s) | 0.08 | 5
Visual Acoustic Matching | AVSpeech-Rooms unseen environments (test) | RTE (s) | 0.136 | 5
Visual Acoustic Matching | LibriSpeech unseen environments (test) | RTE (s) | 0.239 | 5
Visual Acoustic Matching | SoundSpaces-Speech (seen environments) | RTE (s) | 0.062 | 3
Showing 10 of 11 rows
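
The STFT and magnitude entries above are spectrogram-distance errors between predicted and reference audio (lower is better), while RTE reports a reverberation-time error in seconds. As a rough illustration only, and not the benchmarks' official scoring code, a spectrogram-magnitude distance of this kind can be computed as below; the n_fft and hop_length values are illustrative assumptions.

```python
import torch

def stft_mag_distance(pred: torch.Tensor, ref: torch.Tensor,
                      n_fft: int = 512, hop_length: int = 128) -> torch.Tensor:
    """Mean absolute gap between STFT magnitudes of two waveforms shaped [B, T]."""
    window = torch.hann_window(n_fft, device=pred.device)
    def mag(x: torch.Tensor) -> torch.Tensor:
        return torch.stft(x, n_fft=n_fft, hop_length=hop_length,
                          window=window, return_complex=True).abs()
    return (mag(pred) - mag(ref)).abs().mean()
```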
