Few-Shot Audio-Visual Learning of Environment Acoustics
About
Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics. Whereas traditional methods to estimate RIRs assume dense geometry and/or sound measurements throughout the environment, we explore how to infer RIRs based on a sparse set of images and echoes observed in the space. Towards that goal, we introduce a transformer-based method that uses self-attention to build a rich acoustic context, then predicts RIRs of arbitrary query source-receiver locations through cross-attention. Additionally, we design a novel training objective that improves the match in the acoustic signature between the RIR predictions and the targets. In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs, outperforming state-of-the-art methods and -- in a major departure from traditional methods -- generalizing to novel environments in a few-shot manner. Project: http://vision.cs.utexas.edu/projects/fs_rir.
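The pipeline described above — self-attention that fuses a sparse set of image/echo observations into an acoustic context, then cross-attention from an arbitrary query source-receiver location to predict its RIR — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: all shapes, the random projection used as a stand-in "decoder", and the single-head dot-product attention are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # single-head scaled dot-product attention (simplified stand-in
    # for the transformer blocks described in the abstract)
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

d = 32        # embedding dimension (hypothetical)
n_obs = 4     # few-shot: sparse set of image+echo observations
rng = np.random.default_rng(0)

# fused image/echo embeddings of the observed locations (hypothetical features)
context = rng.normal(size=(n_obs, d))
# embedding of the query source-receiver pose (hypothetical)
query = rng.normal(size=(1, d))

# self-attention builds a rich acoustic context from the sparse observations
ctx = attention(context, context, context)
# cross-attention: the query pose attends into the acoustic context
fused = attention(query, ctx, ctx)
# decode the fused feature to an RIR waveform; here a random linear
# projection stands in for the learned decoder (1 s at 16 kHz, assumed)
rir = fused @ rng.normal(size=(d, 16000))
print(rir.shape)  # (1, 16000)
```

In the actual method, the context and query embeddings would come from learned visual/audio encoders and positional encodings, and the decoder would be trained with the acoustic-signature matching objective mentioned above; the sketch only shows how self- and cross-attention divide the work.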
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Novel-view Sound Synthesis | Soundspace-Ambient (Unseen Scenes) | STFT | 5.457 | 15 |
| Novel-view Sound Synthesis | Soundspace-Ambient (Seen Scenes) | STFT | 5.937 | 15 |
| Room Impulse Response (RIR) Prediction | Matterport3D (Seen environments) | STFT | 1.1 | 9 |
| Room Impulse Response (RIR) Prediction | Matterport3D (Unseen environments) | STFT | 1.22 | 9 |
| Binaural audio synthesis | N2S (test) | STFT | 1.765 | 9 |
| Novel-view Sound Synthesis | N2S Benchmark real-world scene | STFT Error | 1.765 | 9 |
| Depth Estimation | Environments (unseen) | DPE | 1.45 | 7 |
| Sound Source Localization | Environments (unseen) | SLE | 64.6 | 7 |
| Sound Source Localization | Environments (seen) | SLE | 50.3 | 6 |
| Depth Estimation | Environments (seen) | DPE | 135 | 6 |