OmniAudio: Generating Spatial Audio from 360-Degree Video
About
Traditional video-to-audio generation techniques focus primarily on perspective video and non-spatial audio, and so often miss the spatial cues needed to represent sound sources accurately in 3D environments. To address this limitation, we introduce a novel task, 360V2SA, which generates spatial audio from 360-degree videos. Specifically, it produces First-order Ambisonics (FOA), a standard format for 3D spatial audio that captures sound directionality and enables realistic 3D audio reproduction. We first create Sphere360, a novel dataset tailored for this task and curated from real-world data. We also design an efficient semi-automated pipeline for collecting and cleaning paired video-audio data. To generate spatial audio from 360-degree video, we propose OmniAudio, a novel framework that leverages self-supervised pre-training on both spatial audio data (in FOA format) and large-scale non-spatial data. Furthermore, OmniAudio uses a dual-branch design that takes both panoramic and perspective video inputs to capture comprehensive local and global information from 360-degree videos. Experimental results demonstrate that OmniAudio achieves state-of-the-art performance on both objective and subjective metrics on Sphere360. Code and datasets are available at https://github.com/liuhuadai/OmniAudio. The project website is available at https://OmniAudio-360V2SA.github.io.
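To make the FOA format mentioned above concrete, the following minimal sketch encodes a mono signal arriving from a single direction into four first-order channels. It is a generic textbook encoding, not OmniAudio's method. The function name `encode_foa` and the example signal are hypothetical, and the AmbiX convention used here (ACN channel order W, Y, Z, X with SN3D normalization) is an assumption, since the abstract does not state which convention Sphere360 uses.

```python
import math

def encode_foa(samples, azimuth_deg, elevation_deg):
    """Encode a mono signal as first-order Ambisonics.

    Assumes the AmbiX convention: ACN channel order (W, Y, Z, X) and
    SN3D normalization. Azimuth is measured counterclockwise from the
    front; elevation is measured upward from the horizontal plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    gains = (
        1.0,                          # W: omnidirectional component
        math.sin(az) * math.cos(el),  # Y: left-right component
        math.sin(el),                 # Z: up-down component
        math.cos(az) * math.cos(el),  # X: front-back component
    )
    # One output channel per gain, each a scaled copy of the mono input.
    return [[g * s for s in samples] for g in gains]

# Hypothetical mono signal placed directly to the listener's left.
mono = [0.0, 0.5, 1.0, -0.5]
w, y, z, x = encode_foa(mono, azimuth_deg=90.0, elevation_deg=0.0)
```

For a source directly to the left, the signal appears at full scale in W and Y, while X and Z are (numerically) zero. The relative levels across the four channels are what carry the direction of arrival.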
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 3D Audio-Visual Scene Generation | SONOSCENE360 | D-CLAP Score (R Component) | 39.7 | 6 |
| First Order Ambisonics (FOA) generation | M2G-360 MoveSources (test) | MOS (Spatial Quality) | 3.92 | 6 |
| First Order Ambisonics (FOA) generation | M2G-360 Multi-Source (test) | MOS (Spatial Quality) | 4.01 | 6 |
| First Order Ambisonics (FOA) generation | M2G-360 Geometry (test) | MOS (Spatial Quality) | 3.61 | 6 |
| FOA Generation | Dyn360 Geometry | MOS (Spatial Quality) | 3.61 | 6 |
| FOA Generation | Dyn360 MoveSource | MOS-SQ | 3.92 | 6 |
| FOA Generation | Dyn360 MultiSource | MOS-SQ | 4.01 | 6 |
| Spatial Audio Generation | Mixed panoramic video-FOA dataset (YT360) (test) | wCS | 41 | 6 |
| Spatial Audio Synthesis | Sphere360 (test) | MOS (Spatial Quality) | 3.96 | 6 |
| Text-to-spatial audio generation | Spatial audio caption (test) | MOS (Spatial Quality) | 4.11 | 6 |