Visual Acoustic Matching
About
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal transformer model that uses audio-visual attention to inject visual properties into the audio and generate realistic audio output. In addition, we devise a self-supervised training objective that can learn acoustic matching from in-the-wild Web videos, despite their lack of acoustically mismatched audio. We demonstrate that our approach successfully translates human speech to a variety of real-world environments depicted in images, outperforming both traditional acoustic matching and more heavily supervised baselines.
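The core architectural idea above is audio-visual attention: audio features query visual features of the target environment so acoustic properties implied by the image can be injected into the audio representation. The paper does not publish this exact code; the sketch below is a minimal, single-head, projection-free illustration of that cross-modal attention step (function name `cross_modal_attention` and all shapes are illustrative assumptions, not the authors' implementation).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio_feats, visual_feats):
    """Let audio frames attend over visual patch embeddings.

    audio_feats:  (T, d) audio query features (T time frames)
    visual_feats: (P, d) visual key/value features (P image patches)
    Returns a (T, d) visual context per audio frame, which a full model
    would fuse back into the audio stream (e.g., by residual addition).
    """
    d = audio_feats.shape[-1]
    scores = audio_feats @ visual_feats.T / np.sqrt(d)  # (T, P) similarity
    weights = softmax(scores, axis=-1)                  # attention over patches
    return weights @ visual_feats                       # (T, d) visual context

# Toy example: 4 audio frames, 6 image patches, 8-dim features.
rng = np.random.default_rng(0)
audio = rng.standard_normal((4, 8))
visual = rng.standard_normal((6, 8))
out = cross_modal_attention(audio, visual)
print(out.shape)  # (4, 8)
```

In the actual model this attention sits inside a transformer with learned query/key/value projections and multiple heads; the sketch keeps only the scaled dot-product core to show how visual properties condition each audio frame.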
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Novel-view Sound Synthesis | Soundspace-Ambient (Unseen Scenes) | STFT | 4.936 | 15 |
| Novel-view Sound Synthesis | Soundspace-Ambient (Seen Scenes) | STFT | 5.224 | 15 |
| Novel View Acoustic Synthesis | SoundSpaces-NVAS Single Environment | Mag | 0.161 | 12 |
| Binaural audio synthesis | N2S (test) | STFT | 1.972 | 9 |
| Novel-view Sound Synthesis | N2S Benchmark real-world scene | STFT Error | 1.972 | 9 |
| Novel View Acoustic Synthesis | SoundSpaces-NVAS (Novel Environment) | Magnitude Score | 0.235 | 6 |
| Visual Acoustic Matching | SoundSpaces-Speech unseen environments (test) | RTE (s) | 0.08 | 5 |
| Visual Acoustic Matching | AVSpeech-Rooms unseen environments (test) | RTE (s) | 0.136 | 5 |
| Visual Acoustic Matching | LibriSpeech unseen environments (test) | RTE (s) | 0.239 | 5 |
| Visual Acoustic Matching | SoundSpaces-Speech (seen environments) | RTE (s) | 0.062 | 3 |