SIREN: Spatially-Informed Reconstruction of Binaural Audio with Vision
About
Binaural audio delivers spatial cues essential for immersion, yet most consumer videos are monaural due to capture constraints. We introduce SIREN, a visually guided mono to binaural framework that explicitly predicts left and right channels. A ViT-based encoder learns dual-head self-attention to produce a shared scene map and end-to-end L/R attention, replacing hand-crafted masks. A soft, annealed spatial prior gently biases early L/R grounding, and a two-stage, confidence-weighted waveform-domain fusion (guided by mono reconstruction and interaural phase consistency) suppresses crosstalk when aggregating multi-crop and overlapping windows. Evaluated on FAIR-Play and MUSIC-Stereo, SIREN yields consistent gains on time-frequency and phase-sensitive metrics with competitive SNR. The design is modular and generic, requires no task-specific annotations, and integrates with standard audio-visual pipelines.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Visually Guided Mono-to-Binaural Audio Generation | FAIR-Play (10-split) | STFT Error0.82 | 5 | |
| Visually Guided Mono-to-Binaural Audio Generation | MUSIC Stereo | STFT Score41.7 | 5 |