SIREN: Spatially-Informed Reconstruction of Binaural Audio with Vision

About

Binaural audio delivers spatial cues essential for immersion, yet most consumer videos are monaural due to capture constraints. We introduce SIREN, a visually guided mono-to-binaural framework that explicitly predicts left and right channels. A ViT-based encoder learns dual-head self-attention to produce a shared scene map and end-to-end L/R attention, replacing hand-crafted masks. A soft, annealed spatial prior gently biases early L/R grounding, and a two-stage, confidence-weighted waveform-domain fusion (guided by mono reconstruction and interaural phase consistency) suppresses crosstalk when aggregating multi-crop and overlapping windows. Evaluated on FAIR-Play and MUSIC-Stereo, SIREN yields consistent gains on time-frequency and phase-sensitive metrics with competitive SNR. The design is modular and generic, requires no task-specific annotations, and integrates with standard audio-visual pipelines.
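
The abstract does not spell out the fusion step, so the following is only a minimal sketch of what a confidence-weighted, waveform-domain aggregation of overlapping windows could look like, not the authors' implementation. The function names (`mono_consistency_confidence`, `fuse_windows`), the `temperature` parameter, the exponential confidence, and the convention that the mono input equals L + R are all assumptions made for illustration. The interaural phase consistency term and the second fusion stage are omitted.

```python
import numpy as np


def mono_consistency_confidence(window, mono, temperature=0.1):
    """Hypothetical confidence score for one predicted binaural window.

    window: array of shape (2, win_len) holding predicted L and R waveforms.
    mono:   array of shape (win_len,) holding the matching mono input.
    Assumes (not confirmed by the source) that mono = L + R, so a window
    whose channels sum back to the mono input scores close to 1.
    """
    err = np.mean((window[0] + window[1] - mono) ** 2)
    return float(np.exp(-err / temperature))


def fuse_windows(windows, confidences, hop, total_len, eps=1e-8):
    """Confidence-weighted overlap-add of per-window L/R predictions.

    windows:     array of shape (n_windows, 2, win_len).
    confidences: array of shape (n_windows,), non-negative weights.
    hop:         number of samples between consecutive window starts.
    Returns an array of shape (2, total_len).
    """
    windows = np.asarray(windows, dtype=np.float64)
    confidences = np.asarray(confidences, dtype=np.float64)
    n_windows, n_channels, win_len = windows.shape

    fused = np.zeros((n_channels, total_len))
    weight_sum = np.zeros(total_len)

    for i in range(n_windows):
        start = i * hop
        end = min(start + win_len, total_len)
        if end <= start:
            break
        seg = end - start
        # Accumulate the weighted waveform and the weights separately,
        # then normalize so each sample is a weighted average.
        fused[:, start:end] += confidences[i] * windows[i, :, :seg]
        weight_sum[start:end] += confidences[i]

    return fused / np.maximum(weight_sum, eps)
```

In this sketch, samples covered by several windows become a weighted average in which low-confidence windows contribute less, which is one simple way to limit the influence of predictions that disagree with the mono input.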

Mingyeong Song, Seoyeon Ko, Junhyug Noh • 2026

Related benchmarks

Task	Dataset	Result	Rank
Visually Guided Mono-to-Binaural Audio Generation	FAIR-Play (10-split)	STFT Error 0.82	5
Visually Guided Mono-to-Binaural Audio Generation	MUSIC Stereo	STFT Score 41.7	5

