ViSAGe: Video-to-Spatial Audio Generation

About

Spatial audio is essential for enhancing the immersiveness of audio-visual experiences, yet its production typically demands complex recording systems and specialized expertise. In this work, we address the novel problem of generating first-order ambisonics, a widely used spatial audio format, directly from silent videos. To support this task, we introduce YT-Ambigen, a dataset comprising 102K 5-second YouTube video clips paired with corresponding first-order ambisonics. We also propose new evaluation metrics to assess the spatial aspect of generated audio based on audio energy maps and saliency metrics. Furthermore, we present Video-to-Spatial Audio Generation (ViSAGe), an end-to-end framework that generates first-order ambisonics from silent video frames by leveraging CLIP visual features and autoregressive neural audio codec modeling with both directional and visual guidance. Experimental results demonstrate that ViSAGe produces plausible and coherent first-order ambisonics, outperforming two-stage approaches consisting of video-to-audio generation followed by audio spatialization. Qualitative examples further illustrate that ViSAGe generates temporally aligned, high-quality spatial audio that adapts to viewpoint changes.

Jaeyeon Kim, Heeseung Yun, Gunhee Kim • 2025
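
For readers unfamiliar with the output format named in the abstract, the following is a minimal sketch of what a first-order ambisonics (FOA) signal looks like: four channels that encode a sound together with its direction. It is not ViSAGe's generation pipeline, nor the energy-map metric the paper proposes. The function names `encode_foa` and `intensity_direction` are hypothetical, and the AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization) is an assumption, since the abstract does not state which convention YT-Ambigen uses.

```python
import math


def encode_foa(samples, azimuth_deg, elevation_deg):
    """Encode a mono signal as 4-channel FOA for a single plane-wave source.

    Assumes the AmbiX convention: ACN order (W, Y, Z, X), SN3D normalization,
    azimuth measured counter-clockwise from the front, elevation upward.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    gains = (
        1.0,                          # W: omnidirectional component
        math.sin(az) * math.cos(el),  # Y: left-right axis (positive left)
        math.sin(el),                 # Z: up-down axis (positive up)
        math.cos(az) * math.cos(el),  # X: front-back axis (positive front)
    )
    return [[g * s for s in samples] for g in gains]


def intensity_direction(foa):
    """Estimate the dominant source direction from summed W*X, W*Y, W*Z products."""
    w, y, z, x = foa
    ix = sum(a * b for a, b in zip(w, x))
    iy = sum(a * b for a, b in zip(w, y))
    iz = sum(a * b for a, b in zip(w, z))
    azimuth = math.degrees(math.atan2(iy, ix))
    elevation = math.degrees(math.atan2(iz, math.hypot(ix, iy)))
    return azimuth, elevation


# A 10 ms, 440 Hz test tone sampled at 16 kHz, placed 30 degrees left and 20 degrees up.
mono = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
foa = encode_foa(mono, azimuth_deg=30.0, elevation_deg=20.0)
azimuth, elevation = intensity_direction(foa)
```

For a single source, the direction recovered from the channel products matches the direction used for encoding. Spatial evaluation of generated FOA builds on this kind of directional energy, though the paper's exact metric definitions are not given in the abstract.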

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Spatial Audio Generation | Mixed panoramic video-FOA dataset (YT360) (test) | wCS | 35 | 6 |
| Video-to-spatial audio generation | Hybrid (test) | MOS (Subjective Quality) | 3.82 | 6 |
| 3D Audio-Visual Scene Generation | SONOSCENE360 | D-CLAP Score (R Component) | 22.1 | 6 |
| First Order Ambisonics (FOA) generation | M2G-360 MoveSources (test) | MOS (Spatial Quality) | 2.67 | 6 |
| First Order Ambisonics (FOA) generation | M2G-360 Multi-Source (test) | MOS (Spatial Quality) | 2.64 | 6 |
| First Order Ambisonics (FOA) generation | M2G-360 Geometry (test) | MOS (Spatial Quality) | 2.56 | 6 |
| FOA Generation | Dyn360 Geometry | MOS (Spatial Quality) | 2.56 | 6 |
| FOA Generation | Dyn360 MoveSource | MOS-SQ | 2.67 | 6 |
| FOA Generation | Dyn360 MultiSource | MOS-SQ | 2.64 | 6 |
| Spatial Audio Synthesis | Sphere360 (test) | MOS (Spatial Quality) | 2.62 | 6 |
Showing 10 of 11 rows
